Recent comments in /f/deeplearning
suflaj t1_j9swm0z wrote
Reply to comment by junetwentyfirst2020 in Why bigger transformer models are better learners? by begooboi
I'm not sure what you mean. I'm using the usual definition of noise.
junetwentyfirst2020 t1_j9rghm4 wrote
Reply to comment by suflaj in Why bigger transformer models are better learners? by begooboi
You’re being very loose with the word noise here.
OnceReturned t1_j9rb03o wrote
Reply to comment by suflaj in Why bigger transformer models are better learners? by begooboi
>We're talking about a physically impossible number of parameters here, which will require solutions radically different from simple matrix multiplication and nonlinear activations.
Solutions for what, exactly? Memorizing the entire internet (or entire training set, but still)?
suflaj t1_j9qectx wrote
Reply to comment by levand in Why bigger transformer models are better learners? by begooboi
That is peanuts for the datasets they are trained on. We're talking about datasets on the order of terabytes, and the model usually doesn't even iterate over more than 10% of that. So you can't really overfit unless you're dealing with duplicates, because you never even go through the whole dataset.
Even if the model had 1 trillion parameters and iterated over the whole dataset, it would still be too small for the number of relations contained in a dataset of 1 trillion+ bytes. AND THAT'S IF THE RELATIONS WERE LINEAR, which they (usually) are NOT.
So there is a large overhead: you need multiple sets of parameters to capture just one type of relation. Not to mention that some of these models are trained on data pairs, which means the SQUARE of that number of relations. We're talking about a physically impossible number of parameters here, which will require solutions radically different from simple matrix multiplication and nonlinear activations.
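A rough back-of-envelope (all numbers here are made-up illustrations, not measurements):

```python
# Back-of-envelope: parameter count vs. pairwise relations in the data.
# Both figures below are illustrative assumptions.

params = 175e9            # e.g. a GPT-3-scale model
dataset_bytes = 1e12      # ~1 TB of training text

# If every pair of items in the data could in principle encode a relation,
# the pair count grows quadratically with dataset size:
pairwise_relations = dataset_bytes * (dataset_bytes - 1) / 2

print(f"parameters:          {params:.1e}")
print(f"pairwise relations:  {pairwise_relations:.1e}")
print(f"relations per param: {pairwise_relations / params:.1e}")
```

Even under these crude assumptions the model has trillions of candidate relations per parameter, which is the point: it cannot simply memorize.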
levand t1_j9qe7ev wrote
Reply to comment by suflaj in Why bigger transformer models are better learners? by begooboi
> These models are too small to truly overfit on their datasets.
I thought we were talking about 175 billion parameters, literally some of the biggest models in existence? Although it is true that at some point models get big enough that they become less prone to overfitting (and it's not clear why): https://openai.com/blog/deep-double-descent/
Moderatecat t1_j9qcd1k wrote
We don't know
artsybashev t1_j9puq9o wrote
Reply to comment by levand in Why bigger transformer models are better learners? by begooboi
It is in a way the same phenomenon. If you think about the information in images, overfitting means the model starts to learn even the noise patterns in the images. If your training data does not have enough real information to fill the model's capacity, the model will start to learn noise and overfit to your data.
suflaj t1_j9puhj0 wrote
Reply to comment by levand in Why bigger transformer models are better learners? by begooboi
Overfit on what? These models are too small to truly overfit on their datasets.
They overfit on noise, which seems to be one reason for their good performance - so it's actually something you want. Learning what the noise is helps generalization: once the model starts figuring out the noise, it can go beyond what the data would usually allow in terms of generalization.
EDIT: Also, a larger model is more easily internally restructured. Overparametrized models are sort of like very big rooms. It's easier to rearrange the same furniture in a larger room than in a smaller one.
Dropkickmurph512 t1_j9pnws1 wrote
NTK theory kinda looks into this, but for the more general case. The math be wildin' though. Real answer is that no one knows the real reason.
levand t1_j9pmtz7 wrote
Reply to comment by Appropriate_Ant_4629 in Why bigger transformer models are better learners? by begooboi
Well, that's not the whole story; a bigger model is also more prone to overfitting, depending on the training data.
weightedflowtime t1_j9phj57 wrote
It's a great question!
Appropriate_Ant_4629 t1_j9p4z0e wrote
A bigger array holds more information than a smaller one.
^(You'd need to refine your question. It's obvious that a bigger model could outperform a smaller one -- simply by noticing that it could be identical to the smaller one by just setting the rest of it weights to zero. For every single one of those weights, if there's any value better than zero, the larger model would be better.)
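The zero-padding argument can be checked directly with a toy net (numpy sketch, all shapes made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny 1-hidden-layer ReLU net.
def forward(x, W1, b1, W2):
    return np.maximum(x @ W1 + b1, 0) @ W2

x = rng.normal(size=(3, 4))      # batch of 3 inputs, dim 4
W1 = rng.normal(size=(4, 8))     # "small" net: 8 hidden units
b1 = rng.normal(size=8)
W2 = rng.normal(size=(8, 1))

# Embed it in a "bigger" net with 32 hidden units by zero-padding.
# The 24 extra units output relu(0) = 0 and are multiplied by zero
# rows of W2, so they contribute nothing.
W1_big = np.hstack([W1, np.zeros((4, 24))])
b1_big = np.concatenate([b1, np.zeros(24)])
W2_big = np.vstack([W2, np.zeros((24, 1))])

small_out = forward(x, W1, b1, W2)
big_out = forward(x, W1_big, b1_big, W2_big)
assert np.allclose(small_out, big_out)  # bigger net reproduces the smaller one exactly
```

So the larger model's hypothesis space strictly contains the smaller one's, which is the sense in which it "could" always do at least as well.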
No-Celebration6994 OP t1_j9n0chj wrote
Reply to comment by junetwentyfirst2020 in Entry to a career in deep learning by No-Celebration6994
Thanks!
shawon-ashraf-93 t1_j9khu9r wrote
Reply to comment by suflaj in Bummer: nVidia stopping support for multi-gpu peer to peer access with 4090s by mosalreddit
Okay. Have a nice day good sir. :)
suflaj t1_j9khr8s wrote
Reply to comment by shawon-ashraf-93 in Bummer: nVidia stopping support for multi-gpu peer to peer access with 4090s by mosalreddit
The burden of proof is on you, since you initially claimed there would be benefits.
CKtalon t1_j9kfuhx wrote
Reply to comment by buzzz_buzzz_buzzz in Bummer: nVidia stopping support for multi-gpu peer to peer access with 4090s by mosalreddit
Funny how the RTX 6000 ADA doesn’t have NVLink as well
lovehopemisery t1_j9k4ql9 wrote
Have you considered the cost of training/ running your models on the cloud compared to this? It seems like a big investment to rush into
shawon-ashraf-93 t1_j9k3j42 wrote
Reply to comment by suflaj in Bummer: nVidia stopping support for multi-gpu peer to peer access with 4090s by mosalreddit
Post the benchmarks :) I’m not the one gaslighting here.
suflaj t1_j9k3bqn wrote
Reply to comment by shawon-ashraf-93 in Bummer: nVidia stopping support for multi-gpu peer to peer access with 4090s by mosalreddit
The 300 GB/s, which was its theoretical upper limit in a MULTI-GPU workload, did not show a significant difference in benchmarks. Please do not gaslight people into believing it did.
shawon-ashraf-93 t1_j9jwos2 wrote
Reply to comment by suflaj in Bummer: nVidia stopping support for multi-gpu peer to peer access with 4090s by mosalreddit
NVLink doesn't have to be gaming-specific. Anything that requires high-bandwidth, low-latency data transfer will benefit from it. There's no point in 24 GB of VRAM if you can't transfer data between GPUs faster in a multi-GPU setting.
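Back-of-envelope on why link speed matters for gradient syncing (the link speeds and gradient size below are placeholder assumptions; check your actual hardware):

```python
# Rough per-step transfer cost for syncing gradients in data-parallel
# training, at two illustrative (assumed) link bandwidths.

grad_bytes = 12e9   # say, gradients filling ~half of a 24 GB card
links = {
    "PCIe 4.0 x16 (~32 GB/s, assumed)": 32e9,
    "NVLink (~112 GB/s, assumed)": 112e9,
}
for name, bw in links.items():
    ms = grad_bytes / bw * 1000
    print(f"{name}: {ms:.0f} ms per full gradient transfer")
```

Whether that difference is visible end-to-end depends on how much of it overlaps with compute, which is what the benchmark disagreement in this thread is really about.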
suflaj t1_j9jwfc9 wrote
Reply to comment by shawon-ashraf-93 in Bummer: nVidia stopping support for multi-gpu peer to peer access with 4090s by mosalreddit
OK, and now an actual argument?
Or are you one of those people who unironically believe NVLink enabled memory pooling or things like that?
shawon-ashraf-93 t1_j9jw9vv wrote
Reply to comment by suflaj in Bummer: nVidia stopping support for multi-gpu peer to peer access with 4090s by mosalreddit
Don’t embarrass yourself over things you don’t know.
SpareAnywhere8364 t1_j9jvayc wrote
Unless you have a giant case and very powerful cooling, those cards are gonna cook each other and throttle like a motherfucker.
jakecoolguy t1_j9jd1hg wrote
I would seriously consider also getting a lot of RAM and a beast of a CPU with a lot of cores if you're going multi-GPU. A lot of machine learning applications (especially less optimized ones) need work done on the CPU in parallel with the GPU, and you don't want to get bottlenecked.
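The basic pattern is to prefetch batches in the background so the GPU never waits on CPU-side preprocessing. A minimal sketch (the `prepare_batch`/`train_step` functions are stand-ins, not a real training loop):

```python
import queue
import threading
import time

def prepare_batch(i):
    time.sleep(0.01)          # pretend CPU-side preprocessing
    return [i] * 4

def train_step(batch):
    time.sleep(0.01)          # pretend GPU compute
    return sum(batch)

def producer(q, n_batches):
    # Runs on a background thread: prepares the next batch
    # while the main loop is busy "training" on the current one.
    for i in range(n_batches):
        q.put(prepare_batch(i))
    q.put(None)               # sentinel: no more batches

q = queue.Queue(maxsize=2)    # small prefetch buffer
threading.Thread(target=producer, args=(q, 5), daemon=True).start()

losses = []
while (batch := q.get()) is not None:
    losses.append(train_step(batch))

print(losses)   # → [0, 4, 8, 12, 16]
```

Real frameworks do this with multiple worker processes (e.g. a data loader's worker count), which is where the core count actually pays off.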
Dropkickmurph512 t1_j9sxa1j wrote
Reply to comment by suflaj in Why bigger transformer models are better learners? by begooboi
Agree about the overparametrized models, but learning the noise definitely doesn't help. It's mostly measurement error, quantization, and other stuff that is not in the vector space of the signals you care about. That's why early stopping can be useful and actually acts as a regularizer. For a good example, look into the denoising properties of deep image prior: it can remove noise by training on a single image and stopping before it learns the image completely.
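The early-stopping-as-regularizer effect shows up even in plain overparametrized least squares (toy synthetic data below; the typical pattern is that held-out error improves early and degrades as the noise gets fit, though the exact curve depends on the seed):

```python
import numpy as np

rng = np.random.default_rng(0)

# Overparametrized (d > n) least squares fit by gradient descent on
# noisy targets. Tracking held-out error over iterations shows why
# stopping early can act as a regularizer: the fit captures the signal
# in early steps and the noise only later.
n, d = 20, 50
X = rng.normal(size=(n, d)) / np.sqrt(d)
X_test = rng.normal(size=(500, d)) / np.sqrt(d)
w_true = rng.normal(size=d)
y = X @ w_true + 1.0 * rng.normal(size=n)   # noisy training targets
y_test = X_test @ w_true                    # clean test targets

w = np.zeros(d)
lr = 0.5
for step in range(1, 2001):
    w -= lr * X.T @ (X @ w - y) / n         # gradient step on train MSE
    if step in (10, 100, 2000):
        train = np.mean((X @ w - y) ** 2)
        test = np.mean((X_test @ w - y_test) ** 2)
        print(f"step {step:5d}  train {train:.3f}  test {test:.3f}")
```

Since d > n, gradient descent eventually interpolates the noisy targets (train error near zero), which is exactly the regime where the test error stops benefiting.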