Recent comments in /f/deeplearning
suflaj t1_j9swm0z wrote
Reply to comment by junetwentyfirst2020 in Why bigger transformer models are better learners? by begooboi
I'm not sure what you mean. I'm using the usual definition of noise.
junetwentyfirst2020 t1_j9rghm4 wrote
Reply to comment by suflaj in Why bigger transformer models are better learners? by begooboi
You’re being very loose with the word noise here.
OnceReturned t1_j9rb03o wrote
Reply to comment by suflaj in Why bigger transformer models are better learners? by begooboi
>We're talking about a physically impossible number of parameters here, which will require solutions radically different from simple matrix multiplication and nonlinear activations.
Solutions for what, exactly? Memorizing the entire internet (or entire training set, but still)?
suflaj t1_j9qectx wrote
Reply to comment by levand in Why bigger transformer models are better learners? by begooboi
That is peanuts for the datasets they are trained on. We're talking about datasets on the order of terabytes, and the model usually doesn't even iterate over more than 10% of that. So you can't really overfit unless you're dealing with duplicates, because you never even go through the whole dataset.
Even if the model had 1 trillion parameters and iterated over the whole dataset, it would still be too small for the number of relations contained in a dataset of 1 trillion+ bytes. AND THAT'S IF THE RELATIONS WERE LINEAR, which they (usually) are NOT.
So there is a large overhead: you need multiple sets of parameters to capture just one type of relation. Not to mention that some of these models are trained on data pairs, which means the SQUARE of that number of relations. We're talking about a physically impossible number of parameters here, which will require solutions radically different from simple matrix multiplication and nonlinear activations.
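A rough back-of-envelope (all numbers here are made-up illustrations, not measurements):

```python
# Back-of-envelope: parameter count vs. pairwise relations in the data.
# Both figures below are illustrative assumptions.

params = 175e9            # e.g. a GPT-3-scale model
dataset_bytes = 1e12      # ~1 TB of training text

# If every pair of items in the data could in principle encode a relation,
# the pair count grows quadratically with dataset size:
pairwise_relations = dataset_bytes * (dataset_bytes - 1) / 2

print(f"parameters:          {params:.1e}")
print(f"pairwise relations:  {pairwise_relations:.1e}")
print(f"relations per param: {pairwise_relations / params:.1e}")
```

Even under these crude assumptions the model has trillions of candidate relations per parameter, which is the point: it cannot simply memorize.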
levand t1_j9qe7ev wrote
Reply to comment by suflaj in Why bigger transformer models are better learners? by begooboi
> These models are too small to truly overfit on their datasets.
I thought we were talking about 175 billion parameters, literally some of the biggest models in existence? Although it is true that at some point models get big enough that they become less prone to overfitting (and it's not clear why): https://openai.com/blog/deep-double-descent/
Moderatecat t1_j9qcd1k wrote
We don't know
artsybashev t1_j9puq9o wrote
Reply to comment by levand in Why bigger transformer models are better learners? by begooboi
It is in a way the same phenomenon. If you think about the information in images, overfitting means the model starts to learn even the noise patterns in the images. If your training data does not have enough real information to fill the model's capacity, the model will start to learn noise and overfit to your data.
suflaj t1_j9puhj0 wrote
Reply to comment by levand in Why bigger transformer models are better learners? by begooboi
Overfit on what? These models are too small to truly overfit on their datasets.
They overfit on noise, which seems to be one reason for their good performance - so it's actually something you want. Learning what the noise is helps generalization: once the model starts figuring out the noise, it can go beyond what the data would usually allow in terms of generalization.
EDIT: Also, a larger model is more easily internally restructured. Overparametrized models are sort of like very big rooms. It's easier to rearrange the same furniture in a larger room than in a smaller one.
Dropkickmurph512 t1_j9pnws1 wrote
NTK theory kinda looks into this, but for the more general case. The math be wildin' though. Real answer is that no one knows the real reason.
levand t1_j9pmtz7 wrote
Reply to comment by Appropriate_Ant_4629 in Why bigger transformer models are better learners? by begooboi
Well, that's not the whole story; a bigger model is also more prone to overfitting, depending on the training data.
weightedflowtime t1_j9phj57 wrote
It's a great question!
Appropriate_Ant_4629 t1_j9p4z0e wrote
A bigger array holds more information than a smaller one.
^(You'd need to refine your question. It's obvious that a bigger model could outperform a smaller one -- simply by noticing that it could be identical to the smaller one by just setting the rest of it weights to zero. For every single one of those weights, if there's any value better than zero, the larger model would be better.)
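The zero-padding argument can be checked directly with a toy net (numpy sketch, all shapes made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny 1-hidden-layer ReLU net.
def forward(x, W1, b1, W2):
    return np.maximum(x @ W1 + b1, 0) @ W2

x = rng.normal(size=(3, 4))      # batch of 3 inputs, dim 4
W1 = rng.normal(size=(4, 8))     # "small" net: 8 hidden units
b1 = rng.normal(size=8)
W2 = rng.normal(size=(8, 1))

# Embed it in a "bigger" net with 32 hidden units by zero-padding.
# The 24 extra units output relu(0) = 0 and are multiplied by zero
# rows of W2, so they contribute nothing.
W1_big = np.hstack([W1, np.zeros((4, 24))])
b1_big = np.concatenate([b1, np.zeros(24)])
W2_big = np.vstack([W2, np.zeros((24, 1))])

small_out = forward(x, W1, b1, W2)
big_out = forward(x, W1_big, b1_big, W2_big)
assert np.allclose(small_out, big_out)  # bigger net reproduces the smaller one exactly
```

So the larger model's hypothesis space strictly contains the smaller one's, which is the sense in which it "could" always do at least as well.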
No-Celebration6994 OP t1_j9n0chj wrote
Reply to comment by junetwentyfirst2020 in Entry to a career in deep learning by No-Celebration6994
Thanks!
shawon-ashraf-93 t1_j9khu9r wrote
Reply to comment by suflaj in Bummer: nVidia stopping support for multi-gpu peer to peer access with 4090s by mosalreddit
Okay. Have a nice day good sir. :)
suflaj t1_j9khr8s wrote
Reply to comment by shawon-ashraf-93 in Bummer: nVidia stopping support for multi-gpu peer to peer access with 4090s by mosalreddit
The burden of proof is on you, since you initially claimed there would be benefits.
CKtalon t1_j9kfuhx wrote
Reply to comment by buzzz_buzzz_buzzz in Bummer: nVidia stopping support for multi-gpu peer to peer access with 4090s by mosalreddit
Funny how the RTX 6000 ADA doesn’t have NVLink as well
lovehopemisery t1_j9k4ql9 wrote
Have you considered the cost of training/ running your models on the cloud compared to this? It seems like a big investment to rush into
shawon-ashraf-93 t1_j9k3j42 wrote
Reply to comment by suflaj in Bummer: nVidia stopping support for multi-gpu peer to peer access with 4090s by mosalreddit
Post the benchmarks :) I’m not the one gaslighting here.
suflaj t1_j9k3bqn wrote
Reply to comment by shawon-ashraf-93 in Bummer: nVidia stopping support for multi-gpu peer to peer access with 4090s by mosalreddit
The 300 GB/s, which was its theoretical upper limit in a MULTI-GPU workload, did not show a significant difference in benchmarks. Please do not gaslight people into believing it did.
shawon-ashraf-93 t1_j9jwos2 wrote
Reply to comment by suflaj in Bummer: nVidia stopping support for multi-gpu peer to peer access with 4090s by mosalreddit
NVLink doesn't have to be gaming-specific. Anything that requires high-bandwidth, low-latency data transfer will benefit from it. There's no point in 24 GB of VRAM if you can't transfer data between GPUs faster in a multi-GPU setting.
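Back-of-envelope on why link speed matters for gradient syncing (the link speeds and gradient size below are placeholder assumptions; check your actual hardware):

```python
# Rough per-step transfer cost for syncing gradients in data-parallel
# training, at two illustrative (assumed) link bandwidths.

grad_bytes = 12e9   # say, gradients filling ~half of a 24 GB card
links = {
    "PCIe 4.0 x16 (~32 GB/s, assumed)": 32e9,
    "NVLink (~112 GB/s, assumed)": 112e9,
}
for name, bw in links.items():
    ms = grad_bytes / bw * 1000
    print(f"{name}: {ms:.0f} ms per full gradient transfer")
```

Whether that difference is visible end-to-end depends on how much of it overlaps with compute, which is what the benchmark disagreement in this thread is really about.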
suflaj t1_j9jwfc9 wrote
Reply to comment by shawon-ashraf-93 in Bummer: nVidia stopping support for multi-gpu peer to peer access with 4090s by mosalreddit
OK, and now an actual argument?
Or are you one of those people who unironically believe NVLink enabled memory pooling or things like that?
shawon-ashraf-93 t1_j9jw9vv wrote
Reply to comment by suflaj in Bummer: nVidia stopping support for multi-gpu peer to peer access with 4090s by mosalreddit
Don’t embarrass yourself over things you don’t know.
SpareAnywhere8364 t1_j9jvayc wrote
Unless you have a giant case and very powerful cooling, those cards are gonna cook each other and throttle like a motherfucker.
jakecoolguy t1_j9jd1hg wrote
I would seriously consider also getting a lot of RAM and a beast of a CPU with a lot of cores if you're going multi-GPU. A lot of machine learning applications (especially less optimized ones) need work done on the CPU in parallel with the GPU, and you don't want to get bottlenecked.
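The basic pattern is to prefetch batches in the background so the GPU never waits on CPU-side preprocessing. A minimal sketch (the `prepare_batch`/`train_step` functions are stand-ins, not a real training loop):

```python
import queue
import threading
import time

def prepare_batch(i):
    time.sleep(0.01)          # pretend CPU-side preprocessing
    return [i] * 4

def train_step(batch):
    time.sleep(0.01)          # pretend GPU compute
    return sum(batch)

def producer(q, n_batches):
    # Runs on a background thread: prepares the next batch
    # while the main loop is busy "training" on the current one.
    for i in range(n_batches):
        q.put(prepare_batch(i))
    q.put(None)               # sentinel: no more batches

q = queue.Queue(maxsize=2)    # small prefetch buffer
threading.Thread(target=producer, args=(q, 5), daemon=True).start()

losses = []
while (batch := q.get()) is not None:
    losses.append(train_step(batch))

print(losses)   # → [0, 4, 8, 12, 16]
```

Real frameworks do this with multiple worker processes (e.g. a data loader's worker count), which is where the core count actually pays off.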
Dropkickmurph512 t1_j9sxa1j wrote
Reply to comment by suflaj in Why bigger transformer models are better learners? by begooboi
Agree about the overparametrized models, but learning the noise definitely doesn't help. It's mostly measurement error, quantization, and other stuff that is not in the vector space of the signals you care about. That's why early stopping can be useful and actually acts as a regularizer. For a good example, look into the denoising properties of deep image prior: it can remove noise by training on a single image and stopping before it learns the image completely.
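The early-stopping-as-regularizer effect shows up even in plain overparametrized least squares (toy synthetic data below; the typical pattern is that held-out error improves early and degrades as the noise gets fit, though the exact curve depends on the seed):

```python
import numpy as np

rng = np.random.default_rng(0)

# Overparametrized (d > n) least squares fit by gradient descent on
# noisy targets. Tracking held-out error over iterations shows why
# stopping early can act as a regularizer: the fit captures the signal
# in early steps and the noise only later.
n, d = 20, 50
X = rng.normal(size=(n, d)) / np.sqrt(d)
X_test = rng.normal(size=(500, d)) / np.sqrt(d)
w_true = rng.normal(size=d)
y = X @ w_true + 1.0 * rng.normal(size=n)   # noisy training targets
y_test = X_test @ w_true                    # clean test targets

w = np.zeros(d)
lr = 0.5
for step in range(1, 2001):
    w -= lr * X.T @ (X @ w - y) / n         # gradient step on train MSE
    if step in (10, 100, 2000):
        train = np.mean((X @ w - y) ** 2)
        test = np.mean((X_test @ w - y_test) ** 2)
        print(f"step {step:5d}  train {train:.3f}  test {test:.3f}")
```

Since d > n, gradient descent eventually interpolates the noisy targets (train error near zero), which is exactly the regime where the test error stops benefiting.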