Recent comments in /f/MachineLearning

Delacroid t1_jb4c3xt wrote

I don't think so. If you look at the figure and check the angle between whole-dataset backprop and minibatch backprop, increasing the learning rate wouldn't change that angle, only the scale of the vectors.
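A quick numerical sketch of that scaling argument (with random stand-in vectors, since the real gradients are whatever the model produces): cosine similarity is invariant to rescaling either vector, so the learning rate can't move the angle.

```python
import numpy as np

rng = np.random.default_rng(0)
g_full = rng.normal(size=100)  # stand-in for the whole-dataset gradient
g_mini = rng.normal(size=100)  # stand-in for one minibatch gradient

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Scaling the minibatch step by any learning rate leaves the angle unchanged.
for lr in (0.01, 0.1, 1.0):
    print(f"lr={lr}: cos={cosine(lr * g_mini, g_full):.6f}")
```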

Also, dropout does not (only) introduce noise; it prevents co-adaptation of neurons. In the same way that each tree in a random forest is trained on a subset of the data (bootstrapping, I think it's called), the same happens to neurons when you use dropout.
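To make the analogy concrete, here's a minimal PyTorch sketch (my own toy example, not from the paper): in training mode each forward pass through a dropout layer zeroes out a different random subset of units, so each minibatch effectively trains a different sub-network, much like each tree seeing a different bootstrap sample.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
drop.train()  # dropout is only active in training mode

x = torch.ones(8)  # eight "neuron activations"
for step in range(3):
    # Each call samples a fresh mask: a different sub-network per minibatch,
    # with survivors scaled by 1/(1-p) to keep the expected activation constant.
    print(drop(x))
```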

I haven't read the paper, but my intuition says that the merit of dropout in the early stages of training could be that this bootstrapping reduces the bias of the model. That's why the direction of optimization is closer to that of whole-dataset training.

3

PassionatePossum t1_jb4977c wrote

Agreed. Sometimes theoretical analysis doesn't transfer to the real world, and sometimes it is also valuable to see a complete system, because the whole training process is important.

However, since my days in academia are over, I am much less interested in squeezing the next 0.5% of performance out of some benchmark dataset. In industry you are far more interested in a well-working solution that you can produce quickly than in the best-performing solution. So I am much more interested in a toolset of ideas that generally work well, and ideally in knowing what their limitations are.

And yes, while papers about applications can provide practical validation of these ideas, very few of them conduct proper ablation studies. In most cases that is also too much to ask: pretty much any application is a complex system with an elaborate pre-processing and training procedure, and you cannot practically evaluate the influence of every single step and parameter. You just twiddle the parameters you deem most important, and that is your ablation study.

5

Mr_Smartypants t1_jb3tlu5 wrote

> We begin our investigation into dropout training dynamics by making an intriguing observation on gradient norms, which then leads us to a key empirical finding: during the initial stages of training, dropout reduces gradient variance across mini-batches and allows the model to update in more consistent directions. These directions are also more aligned with the entire dataset’s gradient direction (Figure 1).

Interesting. Has anyone looked at optimally controlling the gradient variance by other means, e.g. minibatch size?
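For what it's worth, a toy experiment along those lines is easy to run (my own sketch, using a fixed linear-regression gradient): per-minibatch gradient variance should shrink roughly like 1/batch_size.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.5 * rng.normal(size=n)
w = np.zeros(d)  # evaluate every gradient at the same fixed point

def minibatch_grad(batch_size):
    idx = rng.choice(n, size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    return 2.0 * Xb.T @ (Xb @ w - yb) / batch_size  # MSE gradient

for bs in (16, 64, 256, 1024):
    grads = np.stack([minibatch_grad(bs) for _ in range(500)])
    print(f"batch={bs:4d}  mean grad variance={grads.var(axis=0).mean():.4f}")
```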

2

tysam_and_co t1_jb3i6eq wrote

Right, right, right, though I don't see how dropout introduces bias into the network. Sure, we're subsampling the network in general, but overall the information integrated with respect to a minibatch should be less on the whole due to gradient noise, right? So the bias should be less, and as a result we have more uncertainty; more steps then means more integration time, and on we go from there towards that elusive less-biased estimator.

I guess the sticking point is _how_ they're saying that dropout induces bias. I feel like fitting quickly in a non-regularized setting has more bias by default, because I believe the 0-centered noise should end up diluting the loss signal. I think. Right? I find this all very strange.

6

jobeta t1_jb3hh74 wrote

This is cool and I haven’t finished reading it yet, but intuitively, isn’t that roughly equivalent to having a higher learning rate at the beginning? You make the learning algorithm purposefully imprecise at first to explore the loss landscape quickly, and later on, once a rough approximation of a minimum has been found, you can explore more carefully to look for a deeper minimum, or something? Like, dropout introduces noise, doesn’t it?
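If I understand the idea, the crude version of that schedule would look something like this (my own hypothetical sketch, not the paper's recipe): keep dropout on for the first chunk of steps for noisy exploration, then switch it off.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(32, 64), nn.ReLU(),
                    nn.Dropout(p=0.1), nn.Linear(64, 10))
opt = torch.optim.SGD(net.parameters(), lr=0.1)
X, y = torch.randn(512, 32), torch.randint(0, 10, (512,))

for step in range(2000):
    net[2].p = 0.1 if step < 500 else 0.0  # noisy early phase, precise later
    opt.zero_grad()
    F.cross_entropy(net(X), y).backward()
    opt.step()
```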

6

amhotw t1_jb38ai5 wrote

Based on what you copied: they are saying that dropout introduces bias. Hence, it reduces the variance.

Here is why it might be bothering you: the bias-variance trade-off only makes sense if you are on the efficient frontier, i.e. the Cramér-Rao bound should hold with equality for the trade-off to make sense. You can always have a model with higher bias AND higher variance; introducing bias doesn't necessarily reduce the variance.
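A concrete toy case (my own numbers): estimating the mean of a Gaussian. The sample mean is unbiased and efficient; an estimator that adds an offset plus extra noise has higher bias AND higher variance, so buying bias clearly doesn't sell variance once you're off the frontier.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, n, trials = 3.0, 50, 20_000
samples = rng.normal(mu, 1.0, size=(trials, n))

mean_est = samples.mean(axis=1)  # unbiased, attains the Cramer-Rao bound
bad_est = mean_est + rng.normal(0.5, 1.0, size=trials)  # offset + extra noise

for name, est in (("sample mean", mean_est), ("mean + offset + noise", bad_est)):
    print(f"{name:22s} bias={est.mean() - mu:+.3f}  variance={est.var():.4f}")
```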

17

BrotherAmazing t1_jb37vx3 wrote

It’s sort of a “clickbait” title that I didn’t like myself, even if it’s a potentially interesting paper.

Usually we assume dropout helps prevent overfitting, not underfitting. What I don’t like about the title is that it makes it sound like dropout helps with underfitting in general. It does not, and they don’t even claim it does: by the time you finish reading their abstract you can tell they’re only saying dropout has been observed to help with underfitting in certain circumstances, when used in certain ways.

I can come up with low-dimensional counter-examples where dropout won’t help you when you’re underfitting, and where it is in fact the cause of the underfitting.
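For instance, something like this (a toy sketch of the kind of counter-example I mean, with hypothetical numbers): a small MLP fits y = x easily with no dropout, but with aggressive dropout the training signal gets noisy enough that the same network can end up underfitting even at evaluation time.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.linspace(-1, 1, 256).unsqueeze(1)
y = X.clone()  # trivially learnable target

def eval_mse(p, steps=2000):
    net = nn.Sequential(nn.Linear(1, 16), nn.ReLU(),
                        nn.Dropout(p), nn.Linear(16, 1))
    opt = torch.optim.Adam(net.parameters(), lr=1e-2)
    for _ in range(steps):
        opt.zero_grad()
        ((net(X) - y) ** 2).mean().backward()
        opt.step()
    net.eval()  # dropout disabled at eval time
    with torch.no_grad():
        return ((net(X) - y) ** 2).mean().item()

for p in (0.0, 0.5, 0.95):
    print(f"dropout p={p}: eval MSE={eval_mse(p):.4f}")
```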

6

ThaGooInYaBrain t1_jb342rd wrote

Reply to comment by ggdupont in To RL or Not to RL? [D] by vidul7498

> "In October 2022, DeepMind unveiled a new version of AlphaZero, called AlphaTensor, in a paper published in Nature. The version discovered a faster way to perform matrix multiplication – one of the most fundamental tasks in computing – using reinforcement learning."

Matrix multiplication is a pretty damn practical real-life application, no?
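For context on what “faster” means there, the classic example is Strassen’s scheme, which multiplies two 2x2 blocks with 7 scalar multiplications instead of the naive 8; AlphaTensor searched for decompositions of this kind with even fewer multiplies (47 vs. Strassen’s 49 for 4x4 matrices in modular arithmetic, if I remember the headline result right). A quick sketch of the Strassen version:

```python
import numpy as np

def strassen_2x2(A, B):
    # 7 multiplications (p1..p7) instead of the naive 8
    (a, b), (c, d) = A
    (e, f), (g, h) = B
    p1 = (a + d) * (e + h)
    p2 = (c + d) * e
    p3 = a * (f - h)
    p4 = d * (g - e)
    p5 = (a + b) * h
    p6 = (c - a) * (e + f)
    p7 = (b - d) * (g + h)
    return np.array([[p1 + p4 - p5 + p7, p3 + p5],
                     [p2 + p4, p1 - p2 + p3 + p6]])

A, B = np.array([[1., 2.], [3., 4.]]), np.array([[5., 6.], [7., 8.]])
assert np.allclose(strassen_2x2(A, B), A @ B)
```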

3