Recent comments in /f/MachineLearning

itsnotmeyou t1_jb2z8z7 wrote

Are you using these as part of a system? For just experimenting around, EC2 is a good option, but you would either need to install the right drivers or use the latest Deep Learning AMI. Another option is a custom Docker setup on SageMaker. I like that setup for inference as it's super easy to deploy and it separates the model from the inference code, though it's costlier and is only reachable through the SageMaker runtime.

The third option would be to over-engineer the whole thing and set up your own cluster service.

In general, if you want to deploy multiple LLMs quickly, go for SageMaker.
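Something like this (rough sketch with the SageMaker Python SDK; the image URI, S3 path, IAM role, and instance type are placeholders you'd swap for your own):

```python
# Rough sketch: one model behind a SageMaker endpoint using a custom
# inference container. The image URI, S3 path, role, and instance type
# below are placeholders -- swap in your own.
import json

import boto3
from sagemaker.model import Model

model = Model(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-llm-inference:latest",
    model_data="s3://my-bucket/my-llm/model.tar.gz",
    role="arn:aws:iam::123456789012:role/MySageMakerRole",
)

# Spins up the endpoint; pick a GPU instance that actually fits the model.
predictor = model.deploy(initial_instance_count=1, instance_type="ml.g5.2xlarge")

# Calls go through the SageMaker runtime, which is the "costlier but clean" part.
runtime = boto3.client("sagemaker-runtime")
response = runtime.invoke_endpoint(
    EndpointName=predictor.endpoint_name,
    ContentType="application/json",
    Body=json.dumps({"inputs": "Hello"}),
)
print(response["Body"].read().decode())
```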

0

tysam_and_co t1_jb2kgo2 wrote

Hold on a minute. On reading through the paper again, this section stood out to me:

>Bias-variance tradeoff. This analysis at early training can be viewed through the lens of the bias-variance tradeoff. For no-dropout models, an SGD mini-batch provides an unbiased estimate of the whole-dataset gradient because the expectation of the mini-batch gradient is equal to the whole-dataset gradient. However, with dropout, the estimate becomes more or less biased, as the mini-batch gradients are generated by different sub-networks, whose expected gradient may not match the full network's gradient. Nevertheless, the gradient variance is significantly reduced, leading to a reduction in gradient error. Intuitively, this reduction in variance and error helps prevent the model from overfitting to specific batches, especially during the early stages of training when the model is undergoing significant changes.

Isn't this backwards? It's because of dropout that we should receive _less_ information from each iteration update, which means that we should be _increasing_ the variance of the model with respect to the data, not decreasing it. We've seen in the past that dropout greatly increases the norm of the gradients over training -- more variance. And we can't possibly add more bias to our training data with random I.I.D. noise, right? Shouldn't this effectively slow down the optimization of the network during the critical period, allowing it to integrate over _more_ data, so now it is a better estimator of the underlying dataset?

I'm very confused right now.
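Here's a toy numpy sketch of the quantity I think the paper means (how far mini-batch gradients land from the whole-dataset gradient, with and without dropout). It's just a linear model, nothing like their setup, but it pins down what "variance" refers to here:

```python
# Toy check: mean squared error between mini-batch gradients and the
# whole-dataset gradient, with and without (input) dropout.
# Linear regression only -- illustrative, not the paper's setup.
import numpy as np

rng = np.random.default_rng(0)
n, d, batch = 4096, 64, 32
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)
w = np.zeros(d)  # "early training" weights

def grad(Xb, yb, w, p_drop=0.0):
    """Mean-squared-error gradient, optionally with inverted input dropout."""
    if p_drop > 0:
        mask = (rng.random(Xb.shape) > p_drop) / (1 - p_drop)
        Xb = Xb * mask
    return Xb.T @ (Xb @ w - yb) / len(yb)

g_full = grad(X, y, w)  # whole-dataset gradient, no dropout

for p in (0.0, 0.5):
    errs = []
    for _ in range(500):
        idx = rng.choice(n, batch, replace=False)
        errs.append(np.sum((grad(X[idx], y[idx], w, p) - g_full) ** 2))
    print(f"p_drop={p}: mean squared gradient error {np.mean(errs):.3f}")
```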

17

bdambrosio94563 t1_jb2ct4n wrote

I've spent the last week exploring gpt-3.5-turbo and went back to text-davinci. (1) gpt-3.5-turbo is incredibly heavily censored. For example, good luck getting anything medical out of it other than 'consult your local medical professional'. It's also much more reluctant to play a role. (2) As is well documented, it is much more resistant to few-shot prompting. Since I use it in several roles, including Google-search information extraction and response composition, I find it very disappointing.
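For what it's worth, here's roughly what that few-shot difference looks like with the (pre-v1) openai Python client; the model names and the toy extraction prompt are just placeholders:

```python
# Sketch only (openai python client circa early 2023): the same few-shot
# extraction task phrased for a completion model vs the chat model.
import openai

few_shot = (
    "Extract the city from the sentence.\n"
    "Sentence: I flew to Paris last week. City: Paris\n"
    "Sentence: The meeting is in Osaka. City:"
)

# Completion-style model: the few-shot examples live in one prompt string.
davinci = openai.Completion.create(
    model="text-davinci-003", prompt=few_shot, max_tokens=5
)

# Chat model: the same examples get packed into messages, and the
# system prompt / RLHF tuning tends to override them more easily.
turbo = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You extract the city from a sentence."},
        {"role": "user", "content": "Sentence: I flew to Paris last week."},
        {"role": "assistant", "content": "Paris"},
        {"role": "user", "content": "Sentence: The meeting is in Osaka."},
    ],
)

print(davinci["choices"][0]["text"], turbo["choices"][0]["message"]["content"])
```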

Luckily, my use case is as my personal companion / advisor / coach, so my usage is low enough that I can afford text-davinci. Sure wish there were a middle ground, though.

1

xx14Zackxx t1_jb1zk8v wrote

It depends on the context length. Since attention scales as n^2 and an RNN scales as n, the speed-up is a factor of n in the document length. Now, there are also some slowdowns: I'm certain his RNN solution here has to do some tricks that are more complex than a plain RNN. But the longer the context, the bigger the speed-up relative to a transformer, so 100x on a large doc is not necessarily impossible (at least at inference time).
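Back-of-envelope version of that scaling argument (ignoring constants and the n*d^2 projection/FFN terms that both architectures pay):

```python
# Back-of-envelope only: per-layer cost of the quadratic attention term
# vs an RNN-style recurrence. Constants and the shared n*d^2 projection
# and FFN costs are ignored, so treat the ratio as an upper bound.
d = 1024                              # hidden size (assumed)
for n in (1_024, 8_192, 65_536):      # context length in tokens
    attn_pairwise = n * n * d         # QK^T and attention-weighted sum: O(n^2 * d)
    rnn_recurrence = n * d            # one state update per token: O(n * d)
    print(f"n={n:>6}: attention / recurrence ~ {attn_pairwise // rnn_recurrence:,}x")
```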

I have a hard time believing the memory claims as well, though. Again, I really wish the author would write a paper about it, because as far as I can see, if he's using standard backpropagation through time to train, the memory requirements should be quite dramatic. But again, I think he's doing something special with his RNN; I just don't know what it is.

3

bo_peng OP t1_jb1z3an wrote

5 is the number of hidden states per block (4 for ATT = xx aa bb pp, 1 for FFN = xx).

TimeMixing is RWKV.

ChannelMixing is your usual FFN (squared ReLU, as in the Primer paper) with an extra R-gate (novel; I find it helps).

Parallelization is due to https://github.com/BlinkDL/RWKV-LM/raw/main/RWKV-formula.png.
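If it helps, a minimal numpy sketch of that ChannelMixing idea (squared-ReLU FFN gated by sigmoid(R)); token-shift and the other details in the actual repo are left out:

```python
# Minimal sketch of ChannelMixing as described above: a squared-ReLU FFN
# whose output is gated by sigmoid(R). Token-shift and other details
# from the full RWKV implementation are omitted.
import numpy as np

def channel_mixing(x, Wk, Wv, Wr):
    k = np.maximum(x @ Wk, 0.0) ** 2      # squared ReLU (Primer-style)
    r = 1.0 / (1.0 + np.exp(-(x @ Wr)))   # R-gate
    return r * (k @ Wv)

d, d_ff = 512, 2048
rng = np.random.default_rng(0)
x = rng.normal(size=(1, d))
out = channel_mixing(
    x,
    Wk=rng.normal(size=(d, d_ff)) * 0.02,
    Wv=rng.normal(size=(d_ff, d)) * 0.02,
    Wr=rng.normal(size=(d, d)) * 0.02,
)
print(out.shape)  # (1, 512)
```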

4

_Arsenie_Boca_ t1_jb1wjfi wrote

It does help, but it certainly doesn't make everything clear. I'm confident I could run inference on it, but my interest is more academic than practical.

What is the magic number 5 all about? It seems to appear all over the code without explanation.

Are the time mixing and channel mixing operations novel or were they introduced by a citable work?

How does the parallelization during training work?

5

WikiSummarizerBot t1_jb1uuaw wrote

Dilution (neural networks)

>Dilution and dropout (also called DropConnect) are regularization techniques for reducing overfitting in artificial neural networks by preventing complex co-adaptations on training data. They are an efficient way of performing model averaging with neural networks. Dilution refers to thinning weights, while dropout refers to randomly "dropping out", or omitting, units (both hidden and visible) during the training process of a neural network. Both trigger the same type of regularization.


1