Recent comments in /f/MachineLearning
Select_Beautiful8 t1_jbqyth8 wrote
Reply to comment by KerfuffleV2 in [P] ChatRWKV v2 (can run RWKV 14B with 3G VRAM), RWKV pip package, and finetuning to ctx16K by bo_peng
Thanks. I'm actually using the oobabooga text-generation-webui from GitHub.
KerfuffleV2 t1_jbqtx6j wrote
Reply to comment by Select_Beautiful8 in [P] ChatRWKV v2 (can run RWKV 14B with 3G VRAM), RWKV pip package, and finetuning to ctx16K by bo_peng
Note: I'm just a random person on the internet, no affiliation to OP. I also don't really know what I'm doing here, so follow my advice at your own risk.
cuda fp16i8 *16 -> cpu fp32 as the strategy means: use 16 CUDA layers in fp16i8 format and then put the rest on the CPU (as fp32). So if you want to reduce how many layers go to the GPU, you'd reduce the "16" there.
Assuming we're talking about the same thing, you'd have the ChatRWKV repo checked out and be editing v2/chat.py
There should be a line like:
args.strategy = 'cuda fp16i8 *16 -> cpu fp32'
Either make sure the other lines setting args.strategy in that area are commented out, or make sure the one with the setting you want to use is the last one. (Otherwise the other assignment statements would override what you added.)
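For example, dropping from 16 to 14 GPU layers might look roughly like this (just a sketch; the exact surrounding lines in v2/chat.py may differ in your checkout, and the layer count is only an illustration):

    # args.strategy = 'cuda fp16'                     # other presets left commented out
    # args.strategy = 'cuda fp16i8 *16 -> cpu fp32'   # original setting
    args.strategy = 'cuda fp16i8 *14 -> cpu fp32'     # 14 quantized layers on the GPU, rest on the CPU as fp32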
appenz t1_jbqsu7k wrote
Reply to [D] What's the Time and Space Complexity of Transformer Models Inference? by Smooth-Earth-9897
Both of the answers above are correct, and if you care about the structure of the transformer (i.e. depth, number of layers, etc.) it gets complicated.
If you only care about scaling with the number of weights, most transformers scale as O(weights), and a generative transformer like GPT needs roughly 2*weights operations per generated token.
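As a rough back-of-the-envelope sketch of that 2*weights estimate (the parameter count below is only illustrative):

    def approx_flops_per_token(n_params):
        # Each weight contributes roughly one multiply and one add per generated token.
        return 2 * n_params

    # Illustrative only: a 175B-parameter model works out to ~3.5e11 FLOPs per token.
    print(approx_flops_per_token(175e9))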
Select_Beautiful8 t1_jbqpd5x wrote
Reply to comment by KerfuffleV2 in [P] ChatRWKV v2 (can run RWKV 14B with 3G VRAM), RWKV pip package, and finetuning to ctx16K by bo_peng
>How do I reduce the CUDA layers?
KerfuffleV2 t1_jbqo9qh wrote
Reply to comment by Select_Beautiful8 in [P] ChatRWKV v2 (can run RWKV 14B with 3G VRAM), RWKV pip package, and finetuning to ctx16K by bo_peng
You might have to reduce the CUDA layers by 1-3, but with only 16GB RAM you're probably going to have trouble.
If you still run out of CUDA memory trying to load it, then maybe you're not setting the strategy correctly. How are you trying to change it?
ilovekungfuu t1_jbql8h9 wrote
Reply to comment by hcarlens in [R] Analysis of 200+ ML competitions in 2022 by hcarlens
Thank you, you cleared my doubt!
Hostilis_ t1_jbqh1fm wrote
Reply to [D] What's the Time and Space Complexity of Transformer Models Inference? by Smooth-Earth-9897
In terms of layer width, all operations within a single transformer layer are O(n^2), with n the width of the largest matrix in the layer. The architectures are sequential, so the contribution to complexity from depth is given by multiplying by d for depth. Finally, they are quadratic in context length c. So in total: O(n^2 d c^2).
There is generally not much difference between different transformer architectures in terms of the computational complexity.
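A toy sketch of that scaling claim (counting abstract units of work, not actual FLOPs; the dimensions are made up):

    def transformer_cost(n, d, c):
        # O(n^2 * d * c^2): n = width of the largest matrix, d = depth, c = context length
        return n**2 * d * c**2

    # Doubling the context length quadruples the estimated cost:
    print(transformer_cost(4096, 32, 2048) / transformer_cost(4096, 32, 1024))  # -> 4.0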
Select_Beautiful8 t1_jbq9m13 wrote
Reply to comment by KerfuffleV2 in [P] ChatRWKV v2 (can run RWKV 14B with 3G VRAM), RWKV pip package, and finetuning to ctx16K by bo_peng
No, I wasn't able to load the 7B model, it still says CUDA out of memory :(
kevindamm t1_jbq3w44 wrote
Reply to [D] What's the Time and Space Complexity of Transformer Models Inference? by Smooth-Earth-9897
The analysis isn't as straightforward as that, for a few reasons. Transformer architectures are typically a series of alternating Multi-Head Attention (MHA) and Multi-Layer Perceptron (MLP) networks, and the MHA may merge the heads from multiple MLPs. Each layer in the network is dominated by a matrix multiply, and if it were all being computed on a CPU then a reasonable upper bound would be O(n^3), where n is the widest layer.
But the bottleneck isn't how many multiplies a CPU would have to do, because we are typically using a GPU or TPU to process it, and these can parallelize many of the additions and multiplications in the matrix ops. The real bottleneck is often the memory copies going to and from the GPU or TPU, and this will vary greatly with model size, GPU memory limits, batch size, etc.
You're better off profiling performance for a particular model and hardware combination.
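For example, something along these lines with torch.profiler (a sketch only; the tiny TransformerEncoder and random input are placeholders to keep it self-contained, so swap in your own model and a realistic batch/context length):

    import torch
    from torch.profiler import profile, ProfilerActivity

    # Placeholder model and input, just so the example runs as-is.
    model = torch.nn.TransformerEncoder(
        torch.nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
        num_layers=6,
    ).eval()
    inputs = torch.randn(1, 1024, 512)  # (batch, context length, d_model)

    activities = [ProfilerActivity.CPU]
    if torch.cuda.is_available():
        model, inputs = model.cuda(), inputs.cuda()
        activities.append(ProfilerActivity.CUDA)

    with torch.no_grad(), profile(activities=activities) as prof:
        model(inputs)

    # Shows where the time actually goes: kernels vs. memory copies, CPU vs. GPU.
    print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))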
potatoandleeks t1_jbpu6m1 wrote
Reply to comment by LetMeGuessYourAlts in [D] Is it possible to train LLaMa? by New_Yak1645
Good point
djaym7 t1_jbpnn87 wrote
Reply to [D] Why isn't everyone using RWKV if it's so much better than transformers? by ThePerson654321
No Paper is the blocker
Select_Beautiful8 t1_jbp7qq7 wrote
Reply to comment by KerfuffleV2 in [P] ChatRWKV v2 (can run RWKV 14B with 3G VRAM), RWKV pip package, and finetuning to ctx16K by bo_peng
Thanks. I have a laptop 3060 and 16GB of RAM, and I successfully ran the 3B one; I will try with the 7B one.
ortegaalfredo OP t1_jbov7dl wrote
Reply to comment by SpaceCockatoo in [R] Created a Discord server with LLaMA 13B by ortegaalfredo
I tried the 8-bit version; 4-bit doesn't work for me yet for some reason.
Problem is, those are very, very slow, about 1 token/sec, compared with the roughly 100 tokens/s I'm getting with 13B.
KerfuffleV2 t1_jboquv7 wrote
Reply to comment by Select_Beautiful8 in [P] ChatRWKV v2 (can run RWKV 14B with 3G VRAM), RWKV pip package, and finetuning to ctx16K by bo_peng
If it helps, I was able to get the 7B model going on a GTX 1060 with 6GB VRAM also. The strategy I used was cuda fp16i8 *16 -> cpu fp32. Starting out with about 1.2G of VRAM already in use from other programs and the desktop environment, it went up to about 5.6G, which works out to about 0.275G/layer. So on a 6GB card with fp16i8, it seems like even with totally free VRAM you could load 21, maybe 22 layers at the maximum, and half that for the normal fp16 format. This model: RWKV-4-Pile-7B-20230109-ctx4096
It generates a token every 2-3 sec, which is too slow for interactive use but still pretty impressive considering the model size and how old the hardware is (my CPU is just a Ryzen 5 1600). It's also running half the layers on the CPU. By the way, it also uses about 14GB of RAM to run, so you'll need a decent amount of system memory available as well.
Tagging /u/bo_peng also in case this information is helpful for them. (One interesting thing I noticed is the GPU was only being used about 50% of the time, I guess while the CPU inference was run. I don't know if it's possible, but if there was some way to do both in parallel it seems like it would roughly double the speed of token generation.)
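In case it's useful, a quick sketch of the arithmetic above (assuming the ~0.275G per fp16i8 layer figure, which is just my observed estimate on this card):

    def max_gpu_layers(total_vram_gb, already_used_gb, gb_per_layer=0.275):
        # How many fp16i8 layers might fit in the remaining VRAM.
        return int((total_vram_gb - already_used_gb) / gb_per_layer)

    print(max_gpu_layers(6.0, 1.2))  # ~17 layers with 1.2G already in use
    print(max_gpu_layers(6.0, 0.0))  # ~21 layers with totally free VRAM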
LetMeGuessYourAlts t1_jboc0o3 wrote
Reply to comment by QTQRQD in [D] Is it possible to train LLaMa? by New_Yak1645
You're right, they have FB Marketplace, so why would they use CL?
hpoddar2810 t1_jbo8i4l wrote
Reply to [D] I’m a Machine Learning Engineer for FAANG companies. What are some places looking for freelance / contract work for ML? by doctorjuice
Hi, I am an MLE with 1.5 YOE. I am also looking for some side gigs. Hit me up if you need anyone.
hcarlens OP t1_jbnrg6z wrote
Reply to comment by scaldingpotato in [R] Analysis of 200+ ML competitions in 2022 by hcarlens
Good point! Edited to clarify.
hcarlens OP t1_jbnreef wrote
Reply to comment by ilovekungfuu in [R] Analysis of 200+ ML competitions in 2022 by hcarlens
Hi! I'm not sure I fully understand your question, but if you're asking whether the rate of progress in competitive ML is slowing down, I think probably not. A lot of the key areas of debate (gbdt vs nn in tabular data, cnn vs transformers in vision) are seeing a lot of research still and I expect the competitive ML community to adopt new advances when they happen. Also in NLP there's a move towards more efficient models, which would also be very useful.
UnusualClimberBear t1_jbnoo8n wrote
Reply to comment by potatoandleeks in [D] Is it possible to train LLaMa? by New_Yak1645
You can rent some (but not thousands) on vast.ai for around $1.50 an hour.
QTQRQD t1_jbnmcv7 wrote
Reply to comment by potatoandleeks in [D] Is it possible to train LLaMa? by New_Yak1645
you really think Meta spent 30 million on GPUs and then sold them on craigslist?
potatoandleeks t1_jbnl6se wrote
Reply to comment by UnusualClimberBear in [D] Is it possible to train LLaMa? by New_Yak1645
Wow, they cost $15k apiece. So that's $30 million just for the GPUs! But since you only need them for 21 days, you can probably sell them on Craigslist later.
[deleted] t1_jbnk1y2 wrote
Reply to [D] Is it possible to train LLaMa? by New_Yak1645
[deleted]
czl t1_jbnh9p6 wrote
Reply to comment by Dendriform1491 in [D] chatGPT and AI ethics by [deleted]
> ChatGPT is unethical, because it can always be tricked to do the wrong thing despite any instruction it is given to it.
Unethical means "not morally correct."
The term you likely want is amoral, which means lacking a moral sense; unconcerned with the rightness or wrongness of something.
czl t1_jbnh0dw wrote
Reply to comment by currentscurrents in [D] chatGPT and AI ethics by [deleted]
> I think #2 is intractable. People have already been arguing about ethics for millenia, and the existence of AI doesn't make it any easier.
Long arguments over many things have been settled by research. Is there any objective reason this may not happen to arguments about ethics?
My POV as to why machines running simulations may help us improve ethics: https://reddit.com/comments/11nenyo/comment/jbn6rys
Life is complex, but more and more we can use machines to model aspects of it, make predictions, and from those pick changes that lead to desirable outcomes.
multiverseportalgun t1_jbr55gh wrote
Reply to comment by Hostilis_ in [D] What's the Time and Space Complexity of Transformer Models Inference? by Smooth-Earth-9897
Quadratic 🤢