Recent comments in /f/MachineLearning
vzq t1_jb2mdvz wrote
Reply to comment by Ronny_Jotten in [D] Open source recommendations for a conversational AI by habilkantur
Yeah sorry typo
tysam_and_co t1_jb2kgo2 wrote
Reply to comment by tysam_and_co in [R] [N] Dropout Reduces Underfitting - Liu et al. by radi-cho
Hold on a minute. On reading through the paper again, this section stood out to me:
>Bias-variance tradeoff. This analysis at early training can be viewed through the lens of the bias-variance tradeoff. For no-dropout models, an SGD mini-batch provides an unbiased estimate of the whole-dataset gradient because the expectation of the mini-batch gradient is equal to the whole-dataset gradient. However, with dropout, the estimate becomes more or less biased, as the mini-batch gradients are generated by different sub-networks, whose expected gradient may not match the full network's gradient. Nevertheless, the gradient variance is significantly reduced, leading to a reduction in gradient error. Intuitively, this reduction in variance and error helps prevent the model from overfitting to specific batches, especially during the early stages of training when the model is undergoing significant changes.
Isn't this backwards? It's because of dropout that we should receive _less_ information from each iteration update, which means that we should be _increasing_ the variance of the model with respect to the data, not decreasing it. We've seen in the past that dropout greatly increases the norm of the gradients over training -- more variance. And we can't possibly add more bias to our training data with random I.I.D. noise, right? Shouldn't this effectively slow down the optimization of the network during the critical period, allowing it to integrate over _more_ data, so now it is a better estimator of the underlying dataset?
I'm very confused right now.
bdambrosio94563 t1_jb2ct4n wrote
Reply to [D] OpenAI introduces ChatGPT and Whisper APIs (ChatGPT API is 1/10th the cost of GPT-3 API) by minimaxir
I've spent the last week exploring gpt-3.5-turbo, then went back to text-davinci. (1) gpt-3.5-turbo is incredibly heavily censored. For example, good luck getting anything medical out of it other than 'consult your local medical professional'. It is also much more reluctant to play a role. (2) As is well documented, it is much more resistant to few-shot training. Since I use it in several roles, including Google search information extraction and response composition, I find it very disappointing.
Luckily, my use case is as my personal companion / advisor / coach, so my usage is low enough I can afford text-davinci. Sure wish there was a middle-ground, though.
Superschlenz t1_jb27s3e wrote
Reply to comment by ninjasaid13 in Did you get access to Meta AI's LLAMA? [Discussion] by WittyBananaPeel
It got hijacked by Tesla's Optimus ;-)
Just kidding. It's the last two options starting with No.
RSchaeffer t1_jb26p98 wrote
Lucas Beyer made a relevant comment: https://twitter.com/giffmana/status/1631601390962262017
"""
The main reason highlighted is minibatch gradient variance (see screenshot).
This immediately asks for experiments that can validate or nullify the hypothesis, none of which I found in the paper
"""
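The missing experiments Beyer asks for could start very small. Here is a hypothetical sketch of measuring minibatch gradient variance with and without dropout on a toy linear model; the model, sizes, and dropout placement (on the inputs) are all my own illustrative choices, not anything from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression problem: y = X @ w_true + noise.
n, d = 2048, 32
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def minibatch_grad(w, batch_idx, drop_p=0.0):
    """MSE gradient on one minibatch for a linear model, with optional
    (inverted) dropout applied to the inputs."""
    Xb, yb = X[batch_idx], y[batch_idx]
    if drop_p > 0:
        mask = rng.random(Xb.shape) >= drop_p
        Xb = Xb * mask / (1.0 - drop_p)
    err = Xb @ w - yb
    return 2.0 * Xb.T @ err / len(yb)

def grad_variance(drop_p, n_batches=200, batch_size=64):
    """Mean per-coordinate variance of minibatch gradients at a fixed w."""
    w = np.zeros(d)
    grads = np.stack([
        minibatch_grad(w, rng.choice(n, batch_size, replace=False), drop_p)
        for _ in range(n_batches)
    ])
    return grads.var(axis=0).mean()

print("gradient variance without dropout:", grad_variance(0.0))
print("gradient variance with dropout   :", grad_variance(0.5))
```

A toy linear model obviously can't settle the claim about deep networks early in training, but the same measurement applied to the paper's actual setups would.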
godaspeg t1_jb24g8c wrote
Reply to comment by plunki in [D] Open source recommendations for a conversational AI by habilkantur
Yes, there are torrents. However, to build a conversational AI, a good LLM (plus the hardware to run it) alone probably isn't enough; it has to be finetuned on chat-like conversations to get something like ChatGPT.
plunki t1_jb224v0 wrote
Have the weights for llama been posted anywhere yet?
Ronny_Jotten t1_jb20aiy wrote
Reply to comment by vzq in [D] Open source recommendations for a conversational AI by habilkantur
*not as good
xx14Zackxx t1_jb1zk8v wrote
Reply to comment by royalemate357 in [R] RWKV (100% RNN) can genuinely model ctx4k+ documents in Pile, and RWKV model+inference+generation in 150 lines of Python by bo_peng
It depends on the context length. Since attention scales as n², while an RNN scales as n, the speedup over a transformer grows with document length by a factor of n. Now, there are also some slowdowns: I am certain his RNN solution here has to do some tricks which are more complex than just a simple RNN. But the longer the context, the larger the speedup relative to a transformer, so 100x on a long document is not necessarily impossible (at least at inference time).
I have a hard time believing the memory claims as well, though. Again, I really wish the author would write a paper about it, because as far as I can see, if he's using standard backpropagation through time to train, the memory requirements should be quite dramatic. But again, I think he's doing something special with his RNN; I just don't know what it is.
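The n² vs n argument above can be sketched with a back-of-the-envelope count of generation cost, ignoring constant factors and the hidden-size term (a deliberate simplification, not a real benchmark):

```python
def transformer_decode_cost(n):
    # Generating token t attends over all t previous tokens (even with a
    # KV cache), so total work over n tokens grows as sum(t) ~ n^2 / 2.
    return sum(t for t in range(1, n + 1))

def rnn_decode_cost(n):
    # An RNN carries a fixed-size state, so each token costs O(1) updates.
    return n

for n in (1_000, 10_000):
    ratio = transformer_decode_cost(n) / rnn_decode_cost(n)
    print(f"context {n}: attention/RNN cost ratio ~ {ratio:.0f}")
```

The ratio grows roughly as n/2, so order-of-100x gaps at inference time appear already at modest context lengths under this crude model.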
bo_peng OP t1_jb1z3an wrote
Reply to comment by _Arsenie_Boca_ in [R] RWKV (100% RNN) can genuinely model ctx4k+ documents in Pile, and RWKV model+inference+generation in 150 lines of Python by bo_peng
5 is the number of hidden states per block (4 for ATT = xx aa bb pp, 1 for FFN = xx).
TimeMixing is RWKV.
ChannelMixing is your usual FFN (squared ReLU, as in the Primer paper) with an extra R-gate (novel; I find it helps).
Parallelization is due to https://github.com/BlinkDL/RWKV-LM/raw/main/RWKV-formula.png.
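A rough sketch, under my own assumptions, of the ChannelMixing idea described above (squared-ReLU FFN with a sigmoid R-gate). The weight names are made up, and the real RWKV code also token-shifts its inputs (the xx state), which is omitted here:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_mixing(x, Wr, Wk, Wv):
    """FFN with squared ReLU plus a sigmoid R-gate (illustrative names)."""
    r = sigmoid(x @ Wr)               # gate in (0, 1)
    k = np.maximum(x @ Wk, 0.0) ** 2  # squared ReLU (Primer-style)
    return r * (k @ Wv)               # gated FFN output

d, h = 8, 32
rng = np.random.default_rng(0)
x = rng.normal(size=(4, d))
out = channel_mixing(x,
                     rng.normal(size=(d, d)),   # Wr: gate projection
                     rng.normal(size=(d, h)),   # Wk: up projection
                     rng.normal(size=(h, d)))   # Wv: down projection
print(out.shape)
```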
godaspeg t1_jb1z1ie wrote
Wait for Open Assistant to be trained.
TheCockatoo t1_jb1x9k4 wrote
Reply to comment by szidahou in [R] [N] Dropout Reduces Underfitting - Liu et al. by radi-cho
Found reviewer #2. Read the paper!
_Arsenie_Boca_ t1_jb1wjfi wrote
Reply to comment by bo_peng in [R] RWKV (100% RNN) can genuinely model ctx4k+ documents in Pile, and RWKV model+inference+generation in 150 lines of Python by bo_peng
It does help but certainly doesn't make everything clear. I am confident I could run inference on it, but my interest is more academic than practical.
What is the magic number 5 all about? It seems to appear all over the code without explanation.
Are the time mixing and channel mixing operations novel or were they introduced by a citable work?
How does the parallelization during training work?
WikiSummarizerBot t1_jb1uuaw wrote
Reply to comment by JEFFREY_EPSTElN in [R] [N] Dropout Reduces Underfitting - Liu et al. by radi-cho
>Dilution and dropout (also called DropConnect) are regularization techniques for reducing overfitting in artificial neural networks by preventing complex co-adaptations on training data. They are an efficient way of performing model averaging with neural networks. Dilution refers to thinning weights, while dropout refers to randomly "dropping out", or omitting, units (both hidden and visible) during the training process of a neural network. Both trigger the same type of regularization.
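The "randomly dropping out units" described above is commonly implemented as inverted dropout; a minimal sketch (not any particular library's implementation):

```python
import numpy as np

def dropout(x, p, rng, training=True):
    """Inverted dropout: zero each unit with probability p during training
    and rescale the survivors by 1/(1-p), so the expected activation matches
    evaluation mode, where the layer is the identity."""
    if not training or p == 0.0:
        return x
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

rng = np.random.default_rng(0)
x = np.ones((2, 4))
print(dropout(x, 0.5, rng))                   # entries are either 0.0 or 2.0
print(dropout(x, 0.5, rng, training=False))   # unchanged at eval time
```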
JEFFREY_EPSTElN t1_jb1usu4 wrote
Reply to comment by xXWarMachineRoXx in [R] [N] Dropout Reduces Underfitting - Liu et al. by radi-cho
-
Research, News
-
A regularization technique for training neural networks https://en.wikipedia.org/wiki/Dilution_(neural_networks)
vzq t1_jb1uo6n wrote
I think the only candidate is GPT-J.
It’s NOT as good as the OpenAI stuff, but it’s better than nothing.
EDIT: inserted NOT
Quazar_omega t1_jb1uf1u wrote
Reply to comment by 2blazen in [P] LazyShell - GPT based autocomplete for zsh by rumovoice
Lmao yeah, good point
alterframe t1_jb1txte wrote
Reply to comment by jobeta in [R] [N] Dropout Reduces Underfitting - Liu et al. by radi-cho
Early stochastic depth. That's where you take a ResNet and randomly drop residual connections so that the effective depth of the network randomly changes.
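The residual-dropping idea above can be sketched as follows (illustrative names; the eval-time rescaling by the survival probability follows the common convention):

```python
import numpy as np

def stochastic_depth_block(x, f, survival_p, rng, training=True):
    """Stochastic depth on a residual block y = x + f(x): during training
    the residual branch is dropped entirely with probability 1 - survival_p,
    so the block collapses to the identity and the effective depth varies."""
    if training:
        if rng.random() < survival_p:
            return x + f(x)
        return x  # branch dropped: identity shortcut only
    # At eval time, keep the branch but scale it by its survival probability.
    return x + survival_p * f(x)

rng = np.random.default_rng(0)
x = np.ones(3)
branch = lambda z: z  # stand-in for the block's conv/MLP branch
print(stochastic_depth_block(x, branch, 0.5, rng))                  # 1s or 2s
print(stochastic_depth_block(x, branch, 0.5, rng, training=False))  # 1.5s
```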
Nextil t1_jb1sg1c wrote
Reply to comment by Art10001 in [R] RWKV (100% RNN) can genuinely model ctx4k+ documents in Pile, and RWKV model+inference+generation in 150 lines of Python by bo_peng
I think they mean with offloading/streaming you need 3GB minimum, but it's much slower.
itsnotmeyou t1_jb2z8z7 wrote
Reply to [D] Best way to run LLMs in the cloud? by QTQRQD
Are you using these as part of a system? For just experimenting around, EC2 is a good option, but you would either need to install the right drivers or use the latest Deep Learning AMI. Another option could be a custom Docker setup on SageMaker. I like that setup for inference, as it's super easy to deploy and separates the model from the inference code, though it's costlier and is only available through the SageMaker runtime.
The third option, probably over-engineering, would be setting up your own cluster service.
In general, if you want to deploy multiple LLMs quickly, go for SageMaker.