Recent comments in /f/MachineLearning
AvailablePresent1113 t1_j6cf8e9 wrote
Reply to [D] CVPR Reviews are out by banmeyoucoward
I really want to know the minimum scores needed after rebuttal to get accepted. Since after rebuttal the ratings should only be A, WA, WR, or R, will 3 WA secure acceptance, or does 2 WA + 1 WR stand any chance?
trnka t1_j6ceex5 wrote
Reply to comment by ant9zzzzzzzzzz in [D] Simple Questions Thread by AutoModerator
I think curriculum learning is the name. Here's a recent survey. I've seen it in NLP tasks where it can help to do early epochs on short inputs. Kinda like starting kids with short sentences.
I haven't heard of anyone adjusting the labels at each stage of curriculum learning though.
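For concreteness, a minimal sketch of that length-based curriculum (the length caps, epoch counts, and the `train_one_epoch` hook are illustrative placeholders, not from the survey):

```python
# Curriculum-learning sketch: early epochs see only short inputs, later stages
# progressively admit longer examples. Labels are left unchanged at every stage.
def curriculum_stages(examples, length_caps=(16, 64, None)):
    """Yield progressively larger training subsets, shortest inputs first."""
    for cap in length_caps:
        if cap is None:
            yield examples                                   # final stage: full data
        else:
            yield [ex for ex in examples if len(ex["tokens"]) <= cap]

def train_with_curriculum(model, examples, train_one_epoch, epochs_per_stage=2):
    for subset in curriculum_stages(examples):
        for _ in range(epochs_per_stage):
            train_one_epoch(model, subset)                   # your existing training loop
    return model
```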
trnka t1_j6ce4td wrote
Reply to comment by eltorrido23 in [D] Simple Questions Thread by AutoModerator
I try not to think of it as right and wrong, but more about risk. If you have a big data set, do EDA over the full thing before splitting off test data, and intend to build a model, then yes, you're learning a little about the test data, but it probably won't bias your findings.
If you have a small data set and do EDA over the full thing, there's more risk that your modeling decisions are influenced by data that hasn't yet been held out.
In real-world problems though, ideally you're getting more data over time so your testing data will change and it won't be as risky.
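As a concrete pattern, a minimal pandas/scikit-learn sketch (the file and columns are hypothetical): split first, then explore only the training portion.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("my_data.csv")                      # hypothetical dataset

# Hold out the test set before any exploration so EDA can't leak into it.
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# EDA, correlations, or exploratory regressions happen on train_df only.
print(train_df.describe())
print(train_df.corr(numeric_only=True))
```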
andreichiffa t1_j6c9xf1 wrote
Reply to comment by visarga in Few questions about scalability of chatGPT [D] by besabestin
That’s a very bold claim that flies in the face of pretty much all the research on the subject to date.
Surely you have extraordinary evidence to support such extraordinary claims?
eltorrido23 t1_j6c4bwq wrote
Reply to [D] Simple Questions Thread by AutoModerator
I’m currently starting to pick up ML with a quant-focused social scientist background. I am wondering what I am allowed to do in EDA (on the whole data set) and what not, to avoid „data leakage“ or information gain that might eventually ruin my predictive model.
Specifically, I am wondering about running linear regressions in the data inspection phase (as this is what I would often do in my previous work, which was more about hypothesis testing and not prediction-oriented). From what I read and understand, one shouldn’t really do that, because too much information might be obtained, which might lead me to change my model in a way that ruins predictive power. However, in the course I am doing (Jose Portilla’s DS Masterclass) they regularly look at the correlations before separating train/test samples. But essentially linear regressions are also just (multiple/corrected) correlations, so I am a bit confused about where to draw the line in EDA. Thanks!
visarga t1_j6c3eg2 wrote
Reply to comment by MfDoomer222 in [D] Microsoft ChatGPT investment isn't about Bing but about Cortana by fintechSGNYC
The water levels were lower in the past and there was a land bridge. Today you can cross via the Channel Tunnel, and a few migrants have sneaked in at Calais to walk to Dover along the train tracks.
visarga t1_j6c1rmo wrote
Reply to comment by currentscurrents in [N] OpenAI has 1000s of contractors to fine-tune codex by yazriel0
Humans are harder to scale, and it took billions of years for evolution to get here, with enormous resource and energy usage. A brain trained by evolution is already fit for the environmental niche it has to inhabit. An AI model has none of that, no evolution selecting its internal structure to be optimal, so it has to compensate by learning these things from tons of raw data. We are great at some tasks that relate to our survival but bad at others, even worse than other animals or AIs - we are not generally intelligent either.
Also, most AIs don't have real-time interaction with the world. They only have restricted text interfaces or APIs, no robotic bodies, and no way to do interventions to distinguish causal relations from correlations. When an AI has a feedback loop with the environment, it gets much better at solving tasks.
visarga t1_j6c0o3e wrote
Reply to comment by golongandprosper in Few questions about scalability of chatGPT [D] by besabestin
I very much doubt they do this in real time. The model is responding too fast for that.
They are probably used for RLHF model alignment: to keep it polite, helpful, and harmless, and to generate more examples of tasks being solved, whether by vetting our ChatGPT interaction logs, by using the model from the console like we do to solve tasks, or by writing the answers themselves where the model fails.
visarga t1_j6c0e8m wrote
Reply to comment by besabestin in Few questions about scalability of chatGPT [D] by besabestin
They might use a second model to flag abuse, not once every token, but once every line or phrase. Their models are already trained to avoid being abused, but this second model is like insurance in case the main one doesn't work.
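A toy sketch of that pattern (the blocklist scorer is only a stand-in for a real second classifier; the point is that the check runs once per completed line rather than once per token):

```python
# Toy illustration: a cheap "second model" check runs on each completed line of
# the streamed reply instead of on every token.
BLOCKLIST = {"badword1", "badword2"}             # placeholder terms

def flagged(line: str) -> bool:
    return any(word in line.lower() for word in BLOCKLIST)

def stream_with_moderation(token_stream):
    buffer = []
    for tok in token_stream:
        buffer.append(tok)
        if tok.endswith("\n"):                   # a line just finished
            line = "".join(buffer)
            yield "[withheld]\n" if flagged(line) else line
            buffer = []
    if buffer:                                   # flush any trailing partial line
        yield "".join(buffer)
```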
visarga t1_j6c01ua wrote
Reply to comment by vivehelpme in Few questions about scalability of chatGPT [D] by besabestin
> But yeah there's really no secret sauce to it.
Of course there is - it's data. They keep their mix of primary training sets with organic text, multi-task fine-tuning, code training, and RLHF secret. We know only in general terms what they are doing, but the details matter. How much code did they train on? It matters. How many tasks? 1,800 like FLAN-T5, or far more, like 10,000? We have no idea. Do they reuse the prompts to generate more training data? Possibly. Others don't have their API logs because they had no demo.
visarga t1_j6bzixy wrote
Reply to comment by andreichiffa in Few questions about scalability of chatGPT [D] by besabestin
> without increasing the dataset, bigger model do nothing better
Wrong. Bigger models are better than small models even when both are trained on exactly the same data, and bigger models reach the same accuracy using fewer examples. Sometimes using a bigger model is the solution to having too little data.
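A toy way to run that kind of comparison yourself (synthetic data and small scikit-learn MLPs; it illustrates the setup of training two model sizes on identical data, not the large-LM result itself):

```python
# Train a narrow and a wide MLP on exactly the same data and compare held-out
# accuracy. Synthetic data; which size wins will vary by task and budget.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=40, n_informative=20,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

for width in (8, 256):
    clf = MLPClassifier(hidden_layer_sizes=(width, width), max_iter=500,
                        random_state=0)
    clf.fit(X_tr, y_tr)
    print(f"width={width}: test accuracy {clf.score(X_te, y_te):.3f}")
```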
visarga t1_j6bz9e7 wrote
Reply to comment by binheap in Few questions about scalability of chatGPT [D] by besabestin
Model security is the security of Google's revenues if they release the model. ChatGPT is very insecure for their ad clicks; it will crash their income. /s
DadSnare t1_j6bwxss wrote
Reply to comment by [deleted] in [R] InstructPix2Pix: Learning to Follow Image Editing Instructions by Illustrious_Row_9971
Check my post history for some ways I’m using it.
currentscurrents t1_j6btqta wrote
Reply to comment by visarga in [N] OpenAI has 1000s of contractors to fine-tune codex by yazriel0
Frankly though, there's got to be a way to do this with less data. The typical human brain hears maybe a million words of English and sees about 8000 hrs of video per year of life (and that's assuming dreams are generative training data somehow - halve that if you only get to count the waking world).
We need something beyond transformers. They were a great breakthrough in 2018, but we're not going to get to AGI just by scaling them up.
StoicBatman t1_j6bsdgk wrote
Reply to [P] Launching my first ever open-source project and it might make your ChatGPT answers better by Vegetable-Skill-9700
I'm new here. How is it helpful in making ChatGPT answers better?
VirtualHat t1_j6bi3xk wrote
Reply to comment by visarga in [N] OpenAI has 1000s of contractors to fine-tune codex by yazriel0
Video and audio might be the next frontier, although I'm not too sure how useful it would be. YouTube receives over 500 hours of uploads per minute, providing an essentially unlimited pipe of training data.
Illustrious_Row_9971 OP t1_j6bhakx wrote
txhwind t1_j6bgwnj wrote
"Artificial" Intelligence
Gody_Godee t1_j6ayw0r wrote
Reply to comment by bo_peng in [P] RWKV 14B Language Model & ChatRWKV : pure RNN (attention-free), scalable and parallelizable like Transformers by bo_peng
your idea looks like this one from 3 years ago: https://arxiv.org/abs/2006.16236
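For anyone not opening the link: that paper's point is that attention with a kernel feature map turns into an RNN-style running sum, which is what invites the comparison to a pure-RNN model. A minimal numpy sketch of the causal recurrence (feature map per the paper, everything else simplified):

```python
import numpy as np

def causal_linear_attention(Q, K, V):
    """Causal linear attention as in arXiv:2006.16236 ("Transformers are RNNs").
    Q, K: (T, d); V: (T, d_v). Computed as a running sum over time steps."""
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))    # elu(x) + 1 feature map
    Qf, Kf = phi(Q), phi(K)
    S = np.zeros((K.shape[1], V.shape[1]))                  # running sum of k_t v_t^T
    z = np.zeros(K.shape[1])                                 # running sum of k_t
    out = np.zeros_like(V)
    for t in range(Q.shape[0]):
        S += np.outer(Kf[t], V[t])
        z += Kf[t]
        out[t] = (Qf[t] @ S) / (Qf[t] @ z + 1e-6)
    return out

T, d = 5, 4
print(causal_linear_attention(np.random.randn(T, d),
                              np.random.randn(T, d),
                              np.random.randn(T, d)).shape)  # (5, 4)
```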
kirlandwater t1_j6artbc wrote
Reply to comment by Acceptable-Cress-374 in [P] Launching my first ever open-source project and it might make your ChatGPT answers better by Vegetable-Skill-9700
Not much, what's up with you?
whilneville t1_j6ar901 wrote
The consistency is so stable. It would be amazing to use a video as a reference; not interested in a 360 turntable tho.
londons_explorer t1_j6al3tb wrote
Reply to comment by mocny-chlapik in [N] OpenAI has 1000s of contractors to fine-tune codex by yazriel0
>They were not able to find significant improvements with scaling anymore.
GPT-3 has a window size of 2048 tokens; ChatGPT has a window size of 8192 tokens. The compute cost is superlinear, so I suspect the compute required for ChatGPT is a minimum of 10x what GPT-3 used. And GPT-3 cost ~12M USD at market rates (I assume they got a deep discount).
So I suspect they did scale compute as much as they could afford.
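A back-of-envelope check on where a "10x minimum" could come from, assuming the quadratic self-attention term dominates (the MLP blocks only scale linearly with the window, and total training cost also depends on parameter count and tokens seen, so this is rough):

```python
# Self-attention FLOPs per layer grow with the square of the context window,
# so going from a 2048-token window to an 8192-token window multiplies that
# term by (8192 / 2048) ** 2 = 16. Treat this as a rough upper bound.
old_window, new_window = 2048, 8192
print((new_window / old_window) ** 2)  # 16.0
```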
JustOneAvailableName t1_j6cfdmr wrote
Reply to comment by albertzeyer in [D] Why are there no End2End Speech Recognition models using the same Encoder-Decoder learning process as BART (no CTC) ? by KarmaCut132
I worked with Wav2vec a year ago. WER on Dutch was (noticeably) better when fine-tuned than it was with GCP or Azure, and we didn't use any labeled data of our own. I used CTC mainly because it didn't reduce WER, hugely improved CER, and made inference a lot simpler. Inference cost was also a fraction (less than a cent per hour, assuming the GPU is fully utilized) of the paid services. I kinda assumed others reached the same conclusions I did back then, but these were my own conclusions, so there's plenty I could have done wrong.
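For reference, a minimal greedy-CTC transcription sketch with a fine-tuned Wav2Vec2 checkpoint via HuggingFace transformers (the checkpoint id and the silent placeholder audio are assumptions, not the setup described above):

```python
import numpy as np
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

model_id = "jonatasgrosman/wav2vec2-large-xlsr-53-dutch"    # illustrative checkpoint
processor = Wav2Vec2Processor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id).eval()

audio = np.zeros(16_000, dtype=np.float32)    # stand-in for 1 s of 16 kHz mono audio
inputs = processor(audio, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits   # (batch, time, vocab)
pred_ids = logits.argmax(dim=-1)                 # greedy CTC decoding
print(processor.batch_decode(pred_ids)[0])
```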
Whisper offers this performance level practically out of the box, although with much higher inference costs. Sadly, I haven't had the time yet to fine-tune it, nor have I found the time to optimize inference costs.
> E.g. it does not work well for streaming (getting instant recognition results, usually within 100ms, or 500ms, or max 1sec)
If you're okay with intermediate results being revised later, this is doable, although at a multiple of the cost. Offline works like a charm though.
> Also, I'm quite sure it has some strange failure cases, as AED models tend to have, like repeating some labels, or skipping to the end of a sequence (or just chunk) when it got confused.
True that.