itsnotmeyou t1_jb2z8z7 wrote on March 6, 2023 at 1:08 AM

Reply to [D] Best way to run LLMs in the cloud? by QTQRQD

Are you using these as in a system? For just experimenting around, ec2 is good option. But you would either need to install right drivers or use latest deep learning ami. Another option could be using a custom docker setup on sagemaker. I like that setup for inference as it’s super easy to deploy and separates model from inference code. Though it’s costlier and would be available through sagemaker runtime.

Third would be whole over engineering via setting up your own cluster service.

In general if you want to deploy multiple llm quickly go for sagemaker

[deleted] t1_jb2yygm wrote on March 6, 2023 at 1:06 AM

Reply to [R] [N] Dropout Reduces Underfitting - Liu et al. by radi-cho

[removed]

vzq t1_jb2mdvz wrote on March 5, 2023 at 11:26 PM

Reply to comment by Ronny_Jotten in [D] Open source recommendations for a conversational AI by habilkantur

Yeah sorry typo

tysam_and_co t1_jb2kgo2 wrote on March 5, 2023 at 11:12 PM

Reply to comment by tysam_and_co in [R] [N] Dropout Reduces Underfitting - Liu et al. by radi-cho

Hold on a minute. On reading through the paper again, this section stood out to me:

>Bias-variance tradeoff. This analysis at early training can be viewed through the lens of the bias-variance tradeoff. For no-dropout models, an SGD mini-batch provides an unbiased estimate of the whole-dataset gradient because the expectation of the mini-batch gradient is equal to the whole-dataset gradient. However, with dropout, the estimate becomes more or less biased, as the mini-batch gradients are generated by different sub-networks, whose expected gradi- ent may not match the full network’s gradient. Nevertheless, the gradient variance is significantly reduced, leading to a reduction in gradient error. Intuitively, this reduction in variance and error helps prevent the model from overfitting to specific batches, especially during the early stages of training when the model is undergoing significant changes

Isn't this backwards? It's because of dropout that we should receive _less_ information from each iteration update, which means that we should be _increasing_ the variance of the model with respect to the data, not decreasing it. We've seen in the past that dropout greatly increases the norm of the gradients over training -- more variance. And we can't possibly add more bias to our training data with random I.I.D. noise, right? Shouldn't this effectively slow down the optimization of the network during the critical period, allowing it to integrate over _more_ data, so now it is a better estimator of the underlying dataset?

I'm very confused right now.

bdambrosio94563 t1_jb2ct4n wrote on March 5, 2023 at 10:16 PM

Reply to [D] OpenAI introduces ChatGPT and Whisper APIs (ChatGPT API is 1/10th the cost of GPT-3 API) by minimaxir

I've spent the last week exploring gpt-3.5-turbo. Went back to text-davinci. (1) gpt-3.5-turbo is incredibly heavily censored. For example, good luck getting anything medical out of it other than 'consult your local medical professional'. It also is much more reluctant to play a role. (2) As is well documented, it is much more resistant to few-shot training. Since I use it in several roles, including google search information extraction and response-composition, I find it very dissappointing.

Luckily, my use case is as my personal companion / advisor / coach, so my usage is low enough I can afford text-davinci. Sure wish there was a middle-ground, though.

[deleted] t1_jb2acxm wrote on March 5, 2023 at 9:58 PM

Reply to comment by ElleLeonne in [R] [N] Dropout Reduces Underfitting - Liu et al. by radi-cho

[deleted]

Superschlenz t1_jb27s3e wrote on March 5, 2023 at 9:40 PM

Reply to comment by ninjasaid13 in Did you get access to Meta AI's LLAMA? [Discussion] by WittyBananaPeel

It got hijacked by Tesla's Optimus ;-)

Just kidding. It's the last two options starting with No.

RSchaeffer t1_jb26p98 wrote on March 5, 2023 at 9:33 PM

Reply to [R] [N] Dropout Reduces Underfitting - Liu et al. by radi-cho

Lucas Beyer made a relevant comment: https://twitter.com/giffmana/status/1631601390962262017

"""

The main reason highlighted is minibatch gradient variance (see screenshot).

This immediately asks for experiments that can validate or nullify the hypothesis, none of which I found in the paper

"""

godaspeg t1_jb24g8c wrote on March 5, 2023 at 9:17 PM

Reply to comment by plunki in [D] Open source recommendations for a conversational AI by habilkantur

Yes, there are torrents. However, to build a conversational AI just a good LLM (+ the hardware to run it) is probably not satisfying, it has to be finetuned for chat-like conversations to get sth like chatgpt.

plunki t1_jb224v0 wrote on March 5, 2023 at 9:01 PM

Reply to [D] Open source recommendations for a conversational AI by habilkantur

Have the weights for llama been posted anywhere yet?

Ronny_Jotten t1_jb20aiy wrote on March 5, 2023 at 8:48 PM

Reply to comment by vzq in [D] Open source recommendations for a conversational AI by habilkantur

*not as good

xx14Zackxx t1_jb1zk8v wrote on March 5, 2023 at 8:43 PM

Reply to comment by royalemate357 in [R] RWKV (100% RNN) can genuinely model ctx4k+ documents in Pile, and RWKV model+inference+generation in 150 lines of Python by bo_peng

It depends on the context length. Since Attention scales n^2, and rnn scales in n, based on document length the speed up is a factor of n. Now, there are also some slow downs. I am certain his RNN solution here has to do some tricks which are more complex than just a simple rnn. But the longer the context, the faster the speed up relative to a transformer. So 100x on a large doc is not necessarily impossible (at least at inference time).

I have a hard time believing the memory claims as well though. Again, I really wish the author would write a paper about it. Because as far as I can see, if he’s using standard back propagation through time to train, the memory requirements should likely be quite dramatic. But again, I think he’s doing something special with his RNN, I just don’t know what it is.

bo_peng OP t1_jb1z3an wrote on March 5, 2023 at 8:40 PM

Reply to comment by _Arsenie_Boca_ in [R] RWKV (100% RNN) can genuinely model ctx4k+ documents in Pile, and RWKV model+inference+generation in 150 lines of Python by bo_peng

5 is the number of hidden states per block (4 for ATT = xx aa bb pp, 1 for FFN = xx).

TimeMixing is RWKV.

ChannelMixing is your usual FFN (sqReLU as in Primer paper) with an extra R-gate (Novel. I find it helps).

Parallelization is due to https://github.com/BlinkDL/RWKV-LM/raw/main/RWKV-formula.png.

godaspeg t1_jb1z1ie wrote on March 5, 2023 at 8:40 PM

Reply to [D] Open source recommendations for a conversational AI by habilkantur

Wait for Open Assistant to be trained.

[deleted] t1_jb1xdiw wrote on March 5, 2023 at 8:28 PM

Reply to comment by JEFFREY_EPSTElN in [R] [N] Dropout Reduces Underfitting - Liu et al. by radi-cho

[removed]

[deleted] t1_jb1xcfg wrote on March 5, 2023 at 8:28 PM

Reply to comment by WikiSummarizerBot in [R] [N] Dropout Reduces Underfitting - Liu et al. by radi-cho

[removed]

TheCockatoo t1_jb1x9k4 wrote on March 5, 2023 at 8:27 PM

Reply to comment by szidahou in [R] [N] Dropout Reduces Underfitting - Liu et al. by radi-cho

Found reviewer #2. Read the paper!

_Arsenie_Boca_ t1_jb1wjfi wrote on March 5, 2023 at 8:22 PM

Reply to comment by bo_peng in [R] RWKV (100% RNN) can genuinely model ctx4k+ documents in Pile, and RWKV model+inference+generation in 150 lines of Python by bo_peng

It does help but certainly doesnt make everything clear. I am confident I could run inference on it, but my interest is rather academic than practical.

What is the magic number 5 all about? It seems to appear all over the code without explanation.

Are the time mixing and channel mixing operations novel or were they introduced by a citable work?

How does the parallelization during training work?

[deleted] t1_jb1vdtd wrote on March 5, 2023 at 8:14 PM

Reply to comment by Chadssuck222 in [R] [N] Dropout Reduces Underfitting - Liu et al. by radi-cho

[removed]

WikiSummarizerBot t1_jb1uuaw wrote on March 5, 2023 at 8:10 PM

Reply to comment by JEFFREY_EPSTElN in [R] [N] Dropout Reduces Underfitting - Liu et al. by radi-cho

Dilution (neural networks)

>Dilution and dropout (also called DropConnect) are regularization techniques for reducing overfitting in artificial neural networks by preventing complex co-adaptations on training data. They are an efficient way of performing model averaging with neural networks. Dilution refers to thinning weights, while dropout refers to randomly "dropping out", or omitting, units (both hidden and visible) during the training process of a neural network. Both trigger the same type of regularization.

^([ )^(F.A.Q)^( | )^(Opt Out)^( | )^(Opt Out Of Subreddit)^( | )^(GitHub)^( ] Downvote to remove | v1.5)

JEFFREY_EPSTElN t1_jb1usu4 wrote on March 5, 2023 at 8:10 PM

Reply to comment by xXWarMachineRoXx in [R] [N] Dropout Reduces Underfitting - Liu et al. by radi-cho

Research, News
A regularization technique for training neural networks https://en.wikipedia.org/wiki/Dilution_(neural_networks)

vzq t1_jb1uo6n wrote on March 5, 2023 at 8:09 PM

Reply to [D] Open source recommendations for a conversational AI by habilkantur

I think the only candidate is GPT-J.

It’s NOT as good as the OpenAI stuff, but it’s better than nothing.

EDIT: inserted NOT

Quazar_omega t1_jb1uf1u wrote on March 5, 2023 at 8:07 PM

Reply to comment by 2blazen in [P] LazyShell - GPT based autocomplete for zsh by rumovoice

Lmao yeah, good point

alterframe t1_jb1txte wrote on March 5, 2023 at 8:04 PM

Reply to comment by jobeta in [R] [N] Dropout Reduces Underfitting - Liu et al. by radi-cho

Early stochastic depth. That's where you take a ResNet and randomly drop residual connections so that the effective depth of the network randomly changes.

Nextil t1_jb1sg1c wrote on March 5, 2023 at 7:54 PM

Reply to comment by Art10001 in [R] RWKV (100% RNN) can genuinely model ctx4k+ documents in Pile, and RWKV model+inference+generation in 150 lines of Python by bo_peng

I think they mean with offloading/streaming you need 3GB minimum, but it's much slower.

Recent comments in /f/MachineLearning