Recent comments in /f/MachineLearning

JustOneAvailableName t1_j6cfdmr wrote

I worked with Wav2vec a year ago. WER on Dutch was noticeably better when fine-tuned than with GCP or Azure, and we didn't use any labeled data of our own. I used CTC mainly because it didn't reduce WER, hugely improved CER, and made inference a lot simpler. Inference cost was also a fraction of the paid services (less than a cent per hour, assuming the GPU is fully utilized). I kinda assumed others reached the same conclusions I did back then, but these are my own conclusions, so there's plenty I could have done wrong.
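For what it's worth, the inference side with CTC really is simple. A minimal sketch with HuggingFace transformers (the Dutch checkpoint name and the `audio` array are placeholders, not the exact setup I used):

```python
# Minimal greedy CTC decoding sketch with a wav2vec 2.0 checkpoint.
# The checkpoint name is illustrative; `audio` is a 16 kHz mono float waveform.
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

model_name = "jonatasgrosman/wav2vec2-large-xlsr-53-dutch"  # placeholder checkpoint
processor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2ForCTC.from_pretrained(model_name)

inputs = processor(audio, sampling_rate=16_000, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(inputs.input_values).logits

pred_ids = torch.argmax(logits, dim=-1)           # greedy CTC decoding
transcription = processor.batch_decode(pred_ids)  # collapses repeats and blanks
```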

Whisper offers this performance level practically out of the box, albeit with much higher inference costs. Sadly, I haven't had the time yet to fine-tune it, nor have I found the time to optimize inference costs.

> E.g. it does not work well for streaming (getting instant recognition results, usually within 100ms, or 500ms, or max 1sec)

If you're okay with intermediate results that get improved later, this is doable, although at a multiple of the cost. Offline works like a charm though.

> Also, I'm quite sure it has some strange failure cases, as AED models tend to have, like repeating some labels, or skipping to the end of a sequence (or just chunk) when it got confused.

True that.

1

trnka t1_j6ce4td wrote

I try not to think of it as right and wrong, but more about risk. If you have a big data set, do EDA over the full thing before splitting off the test data, and intend to build a model, then yes, you're learning a little about the test data, but it probably won't bias your findings.

If you have a small data set and do EDA over the full thing, there's more risk of your conclusions being affected by the not-yet-held-out data.

In real-world problems, though, you're ideally getting more data over time, so your test data will change and it won't be as risky.
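A minimal sketch of the "hold out first, explore on the rest" workflow (the file name, test size, and column handling are illustrative, not a specific recommendation):

```python
# Illustrative sketch: split off the test set before doing any EDA.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("data.csv")  # placeholder file name
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# Do EDA (summary stats, correlations, plots) on train_df only, so nothing
# you learn here can leak information from the held-out test set.
print(train_df.describe())
print(train_df.corr(numeric_only=True))
```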

1

eltorrido23 t1_j6c4bwq wrote

I’m currently starting to pick up ML with a quant-focused social scientist background. I am wondering what I am allowed to do in EDA (on the whole data set) and what not, to avoid "data leakage" or information gain that might eventually ruin my predictive model. Specifically, I am wondering about running linear regressions in the data-inspection phase (as this is what I would often do in my previous work, which was more about hypothesis testing and not prediction-oriented). From what I read and understand, one shouldn't really do that, because too much information might be obtained, which might lead me to change my model in a way that ruins predictive power?

However, in the course I am doing (Jose Portilla's DS Masterclass) they regularly look at the correlations before separating train/test samples. But essentially linear regressions are also just (multiple/corrected) correlations, so I am a bit confused where to draw the line in EDA. Thanks!

1

visarga t1_j6c1rmo wrote

Humans are harder to scale, and it took billions of years for evolution to get here, with enormous resource and energy usage. A brain trained by evolution is already fit for the environmental niche it has to inhabit. But an AI model has none of that: no evolution has selected its internal structure to be optimal. So it has to compensate by learning these things from tons of raw data. We are great at some tasks that relate to our survival, but bad at other tasks, even worse than other animals or AIs; we are not generally intelligent either.

Also, most AIs don't have real-time interaction with the world. They only have restricted text interfaces or APIs, no robotic bodies, and no way to do interventions to distinguish causal relations from correlations. When an AI has a feedback loop from the environment, it gets much better at solving tasks.

3

visarga t1_j6c0o3e wrote

I very much doubt they do this in real time. The model is responding too fast for that.

They are probably used for RLHF model alignment: to keep it polite, helpful, and harmless, and to generate more samples of tasks being solved, whether by vetting our ChatGPT interaction logs, by using the model from the console like the rest of us to solve tasks, or by effectively writing the answers themselves where the model fails.

1

visarga t1_j6c01ua wrote

> But yeah there's really no secret sauce to it.

Of course there is: it's the data. They keep their mix of primary training sets with organic text, multi-task fine-tuning, code training, and RLHF secret. We know only in broad strokes what they are doing, but the details matter. How much code did they train on? It matters. How many tasks: 1,800 like FLAN-T5, or many more, like 10,000? We have no idea. Do they reuse the prompts to generate more training data? Possibly. Others don't have their API logs because they had no demo.

1

visarga t1_j6bzixy wrote

> without increasing the dataset, bigger model do nothing better

Wrong: bigger models are better than small ones even when both are trained on exactly the same data. Bigger models reach the same accuracy using fewer examples. Sometimes using a bigger model is the solution to having too little data.

0

currentscurrents t1_j6btqta wrote

Frankly though, there's got to be a way to do this with less data. The typical human brain has heard maybe a million words of English and about 8,000 hours of video per year of life (and that's assuming dreams somehow count as generative training data; halve that if you only get to count the waking world).

We need something beyond transformers. They were a great breakthrough in 2017, but we're not going to get to AGI just by scaling them up.

6

londons_explorer t1_j6al3tb wrote

>They were not able to find significant improvements with scaling anymore.

GPT-3 has a window size of 2048 tokens; ChatGPT has a window size of 8192 tokens. The compute cost is superlinear, so I suspect the compute required for ChatGPT is a minimum of 10x what GPT-3 used. And GPT-3 cost ~$12M USD (at market rates; I assume they got a deep discount).
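A rough back-of-the-envelope check of that factor, taking the window sizes above at face value (self-attention cost grows roughly with the square of the window, while the feed-forward layers grow roughly linearly, so the true factor sits somewhere in between):

```python
# Hypothetical back-of-the-envelope numbers, not OpenAI's actual figures.
gpt3_window = 2048
chatgpt_window = 8192

ratio = chatgpt_window / gpt3_window  # 4x longer window
attention_factor = ratio ** 2         # ~16x on the quadratic attention term
feedforward_factor = ratio            # ~4x on the roughly linear MLP term

print(attention_factor, feedforward_factor)  # 16.0 4.0 -> "minimum of 10x" is plausible
```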

So I suspect they did scale compute as much as they could afford.

4