Recent comments in /f/MachineLearning
Delacroid t1_jb4c3xt wrote
Reply to comment by jobeta in [R] [N] Dropout Reduces Underfitting - Liu et al. by radi-cho
I don't think so. If you look at the figure and check the angle between whole dataset backprop and minibatch backprop, increasing the learning rate wouldn't change that angle. Only the scale of the vectors.
Also, dropout does not (only) introduce noise; it prevents co-adaptation of neurons. In the same way that in a random forest each tree is trained on a subset of the data (bootstrapping, I think it's called), the same happens for neurons when you use dropout.
I haven't read the paper, but my intuition says that the merit of dropout in the early stages of training could be that the bootstrapping is reducing the bias of the model. That's why the direction of optimization is closer to the whole-dataset training.
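The gradient-angle point can be made concrete: the angle between the minibatch direction and the whole-dataset direction is just a cosine similarity, which is invariant to the vector scaling a learning-rate change would cause. A rough numpy sketch of that measurement (toy linear-regression data of my own, not the paper's setup; inverted dropout applied to inputs for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression problem; w is the current parameter vector.
X = rng.normal(size=(1000, 20))
y = X @ rng.normal(size=20) + 0.1 * rng.normal(size=1000)
w = np.zeros(20)

def grad(Xb, yb, keep=1.0):
    # MSE gradient, with inverted dropout on the inputs when keep < 1.
    mask = (rng.random(Xb.shape) < keep) / keep
    Xd = Xb * mask
    return 2 * Xd.T @ (Xd @ w - yb) / len(yb)

def cosine(a, b):
    # Cosine of the angle between two gradient directions;
    # unaffected by rescaling either vector (e.g. by a learning rate).
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

g_full = grad(X, y)  # whole-dataset gradient, no dropout
# Mean alignment of dropout minibatch gradients with the full gradient.
align = np.mean([cosine(grad(X[i:i + 50], y[i:i + 50], keep=0.8), g_full)
                 for i in range(0, 1000, 50)])
print(align)
```

The paper's claim is about how this alignment evolves over real training; this snippet only shows how the angle itself would be measured.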
iloveintuition t1_jb4bp1o wrote
Reply to [D] Best way to run LLMs in the cloud? by QTQRQD
Using vast.ai for running flan-xl, works pretty well. Haven't tested at LLaMA scale.
askljof t1_jb4bkf0 wrote
Reply to comment by deekaire in [R] [N] Dropout Reduces Underfitting - Liu et al. by radi-cho
Amazing reply
PassionatePossum t1_jb4977c wrote
Reply to comment by speyside42 in [R] [N] Dropout Reduces Underfitting - Liu et al. by radi-cho
Agreed. Sometimes theoretical analysis doesn't transfer to the real world. And sometimes it is also valuable to see a complete system. Because the whole training process is important.
However, since my days in academia are over, I am much less interested in getting the next 0.5% of performance out of some benchmark dataset. In industry you are way more interested in a well-working solution that you can produce quickly instead of the best-performing solution. So, I am way more interested in a tool set of ideas that generally work well and ideally a knowledge of what the limitations are.
And yes, while papers about applications can provide practical validation of these ideas, very few of these papers conduct proper ablation studies. And in most cases it is also too much to ask. Pretty much any application is a complex system with an elaborate pre-processing and training procedure. You cannot practically evaluate the influence of every single step and parameter. You just twiddle around with the parameters you deem to be most important and that is your ablation study.
Philpax t1_jb471z4 wrote
Reply to comment by I_will_delete_myself in [R] RWKV (100% RNN) can genuinely model ctx4k+ documents in Pile, and RWKV model+inference+generation in 150 lines of Python by bo_peng
There's information about this in the README, but I'll admit that it's a little too technical and doesn't have a high-level description of the ideas. Looking forward to the paper!
cantfindaname2take t1_jb42qee wrote
Reply to comment by ggdupont in To RL or Not to RL? [D] by vidul7498
Isn't it extensively used in robotics??
speyside42 t1_jb425d4 wrote
Reply to comment by PassionatePossum in [R] [N] Dropout Reduces Underfitting - Liu et al. by radi-cho
A good mixture is key. Independent applied research will show whether the claims of slight improvements hold in general. A counter example where "this kind of research" has failed us are novel optimizers.
pyonsu2 t1_jb3y5ps wrote
Reply to [D] Best way to run LLMs in the cloud? by QTQRQD
maybe, Colab Pro+?
WandererXZZ t1_jb3wyfc wrote
Reply to comment by alterframe in [R] [N] Dropout Reduces Underfitting - Liu et al. by radi-cho
It's actually, for every block in the ResNet, dropping everything except the residual connection with a probability p. See the paper "Deep Networks with Stochastic Depth".
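For reference, the stochastic depth idea (Huang et al., "Deep Networks with Stochastic Depth") can be sketched in a few lines; here `f` is a hypothetical stand-in for the block's conv/MLP transform, not code from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def residual_block_stochastic(x, f, survival_prob, training=True):
    """Stochastic depth: during training, drop the whole residual
    branch with probability 1 - survival_prob, keeping only the
    identity (skip) connection."""
    if training:
        if rng.random() < survival_prob:
            return x + f(x)  # block survives: usual residual update
        return x             # block dropped: identity only
    # At test time, scale the branch by its survival probability.
    return x + survival_prob * f(x)

# Toy usage: f scales its input by 0.1.
f = lambda x: 0.1 * x
out = residual_block_stochastic(np.ones(4), f, survival_prob=0.8,
                                training=False)
print(out)  # each entry is 1 + 0.8 * 0.1 = 1.08
```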
rpnewc t1_jb3uwx9 wrote
Reply to comment by ComputerAttny in [D] Ethics of minecraft stable diffusion by NoLifeGamer2
Good to know.
Mr_Smartypants t1_jb3tlu5 wrote
> We begin our investigation into dropout training dynamics by making an intriguing observation on gradient norms, which then leads us to a key empirical finding: during the initial stages of training, dropout reduces gradient variance across mini-batches and allows the model to update in more consistent directions. These directions are also more aligned with the entire dataset's gradient direction (Figure 1).
Interesting. Has anyone looked at optimally controlling the gradient variance by other means, e.g. minibatch size?
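Minibatch size does directly control cross-minibatch gradient variance (roughly as 1/batch_size for i.i.d. samples). A quick numpy check on toy regression data (my own setup, nothing from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy regression: measure per-minibatch gradient variance at a fixed w.
X = rng.normal(size=(2000, 10))
y = X @ rng.normal(size=10) + rng.normal(size=2000)
w = np.zeros(10)

def minibatch_grads(batch_size):
    # MSE gradient for each disjoint minibatch of the given size.
    grads = []
    for i in range(0, len(X), batch_size):
        Xb, yb = X[i:i + batch_size], y[i:i + batch_size]
        grads.append(2 * Xb.T @ (Xb @ w - yb) / len(yb))
    return np.array(grads)

# Variance of gradient components across minibatches shrinks as batches grow.
var_small = minibatch_grads(20).var(axis=0).mean()
var_large = minibatch_grads(200).var(axis=0).mean()
print(var_small > var_large)  # larger batches -> lower gradient variance
```

The interesting part of the paper's claim is that dropout reduces this variance at a *fixed* batch size; the snippet just shows the batch-size knob the comment asks about.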
[deleted] t1_jb3t91m wrote
Reply to [D] Best way to run LLMs in the cloud? by QTQRQD
[removed]
ComputerAttny t1_jb3lk7d wrote
Reply to comment by rpnewc in [D] Ethics of minecraft stable diffusion by NoLifeGamer2
Also worth noting that if you are an individual (i.e. not deep pockets) they'll bring suit for an injunction. An injunction just demands you stop doing something; it won't be for money. So you'll just lose time/effort, not $$$.
tysam_and_co t1_jb3i6eq wrote
Reply to comment by amhotw in [R] [N] Dropout Reduces Underfitting - Liu et al. by radi-cho
Right, right, right, though I don't see how dropout introduces bias into the network. Sure, we're subsampling the network in general, but overall the information integrated with respect to a minibatch should be less on the whole due to gradient noise, right? So the bias should be less and as a result we have more uncertainty, then more steps equals more integration time of course and on we go from there towards that elusive less-biased estimator.
I guess the sticking point is _how_ they're saying that dropout induces bias. I feel like fitting quickly in a non-regularized setting has more bias by default, because I believe the 0-centered noise should end up diluting the loss signal. I think. Right? I find this all very strange.
jobeta t1_jb3hh74 wrote
This is cool and I haven't finished reading it yet, but intuitively, isn't that roughly equivalent to having a higher learning rate in the beginning? You make the learning algorithm purposefully imprecise at the beginning to quickly explore the loss landscape, and later on, once a rough approximation of a minimum has been found, you are able to explore more carefully to look for a deeper minimum or something? Like, the dropout introduces noise, doesn't it?
ilyakuzovkin t1_jb3faxj wrote
Reply to comment by growqx in To RL or Not to RL? [D] by vidul7498
Point taken :) Not the best example; what I was aiming for was an example of a problem that is clearly best solved with some computational framework other than RL.
amhotw t1_jb38ai5 wrote
Reply to comment by tysam_and_co in [R] [N] Dropout Reduces Underfitting - Liu et al. by radi-cho
Based on what you copied: they are saying that dropout introduces bias. Hence, it reduces the variance.
Here is why it might be bothering you: the bias-variance trade-off makes sense if you are on the efficient frontier, i.e. the Cramér-Rao bound should hold with equality for the trade-off to make sense. You can always have a model with a higher bias AND a higher variance; introducing bias doesn't necessarily reduce the variance.
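The last point is easy to demonstrate with a toy Monte Carlo (my own example, not from the thread): an estimator that is worse than the sample mean on both axes at once.

```python
import numpy as np

rng = np.random.default_rng(0)

# Estimate a mean mu from n samples, over many repeated trials.
# Estimator A: the sample mean (unbiased).
# Estimator B: the sample mean plus a constant shift (bias) AND extra
# noise (variance) -- worse on both axes, so adding bias clearly does
# not buy you lower variance by itself.
mu, n, trials = 3.0, 50, 20000
data = rng.normal(mu, 1.0, size=(trials, n))

est_a = data.mean(axis=1)                         # sample mean
est_b = est_a + 0.5 + rng.normal(0, 1.0, trials)  # biased AND noisier

bias_a, var_a = est_a.mean() - mu, est_a.var()
bias_b, var_b = est_b.mean() - mu, est_b.var()
print(abs(bias_b) > abs(bias_a) and var_b > var_a)  # True
```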
BrotherAmazing t1_jb37vx3 wrote
Reply to comment by Chadssuck222 in [R] [N] Dropout Reduces Underfitting - Liu et al. by radi-cho
It's sort of a "clickbait" title that I didn't like myself, even if it's a potentially interesting paper.
Usually we assume dropout helps prevent overfitting, not help with underfitting, but the thing I don't like about the title is that it makes it sound like dropout helps with underfitting in general. It does not, and they don't even claim it does; even by the time you finish reading their abstract you can tell that they're only saying dropout has been observed to help with underfitting in certain circumstances, when used in certain ways.
I can come up with low-dimensional counter-examples where dropout won't help you when you're underfitting, and is in fact the cause of the underfitting.
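One such counter-example (my own construction, not the commenter's) is a single-weight linear model with inverted dropout on its one input: minimizing the expected dropout loss E[(y - w*m*x)^2] with m ~ Bernoulli(p)/p shrinks the optimum to w* = p * w_ols, so dropout itself produces the underfit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noiseless 1-D data y = 2x; a plain least-squares fit recovers w = 2.
p = 0.5  # keep probability for inverted dropout
x = rng.normal(size=5000)
y = 2.0 * x

# SGD on the dropout objective, fresh mask each step.
w, lr = 0.0, 0.05
for _ in range(2000):
    m = (rng.random(x.shape) < p) / p           # inverted dropout mask
    grad = np.mean(2 * (w * m * x - y) * m * x)  # d/dw of the masked MSE
    w -= lr * grad
print(w)  # ~ 1.0 (= p * 2.0), well short of the true weight 2.0
```

The E[m^2] = 1/p term in the objective acts like a ridge penalty on w, which is exactly the bias that causes the underfit here.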
BrotherAmazing t1_jb36ydw wrote
Not a fan of the title they chose for this paper, as itâs really âDropout can reduce underfittingâ and not that it does in general.
Otherwise it may be interesting if this is re-produced/verified.
ThaGooInYaBrain t1_jb342rd wrote
Reply to comment by ggdupont in To RL or Not to RL? [D] by vidul7498
> "In October 2022, DeepMind unveiled a new version of AlphaZero, called AlphaTensor, in a paper published in Nature. The version discovered a faster way to perform matrix multiplication - one of the most fundamental tasks in computing - using reinforcement learning."
Matrix multiplication is a pretty damn practical real life application, no?
I_will_delete_myself t1_jb33bmz wrote
Reply to [D] Best way to run LLMs in the cloud? by QTQRQD
Use a spot instance. If you're testing it out, your wallet will thank you later. Look at my previous post on here about running stuff in the cloud before you do it.
[deleted] t1_jb32vlo wrote
Reply to comment by I_will_delete_myself in [R] RWKV (100% RNN) can genuinely model ctx4k+ documents in Pile, and RWKV model+inference+generation in 150 lines of Python by bo_peng
[deleted]
I_will_delete_myself t1_jb32fo5 wrote
Reply to [R] RWKV (100% RNN) can genuinely model ctx4k+ documents in Pile, and RWKV model+inference+generation in 150 lines of Python by bo_peng
What's the reason to use this over a transformer? Transformers allow transfer learning and parallelize more easily. I just saw your Zhihu. What company are you going to work at?
itsnotmeyou t1_jb2zfbq wrote
Reply to comment by itsnotmeyou in [D] Best way to run LLMs in the cloud? by QTQRQD
On a side note, SageMaker was not supporting shm-size, so it might not work for large LMs.
kryptoklob t1_jb4c5f8 wrote
Reply to [P] diffground - A simplistic Android UI to access ControlNet and instruct-pix2pix. by radi-cho
Amazing. What base model are you using?