Recent comments in /f/MachineLearning

KerfuffleV2 t1_jbqtx6j wrote

Note: I'm just a random person on the internet, no affiliation to OP. I also don't really know what I'm doing here, so follow my advice at your own risk.

Using cuda fp16i8 *16 -> cpu fp32 as the strategy means: run 16 layers on CUDA in fp16i8 format and put the rest on the CPU (as fp32). So if you want to reduce how many layers go to the GPU, you'd reduce the "16" there.

Assuming we're talking about the same thing, you'd have the ChatRWKV repo checked out and be editing v2/chat.py.

There should be a line like:

args.strategy = 'cuda fp16i8 *16 -> cpu fp32'

Either make sure the other lines setting args.strategy in that area are commented out, or make sure the one with the setting you want to use is the last one. (Otherwise a later assignment would override the one you added.)
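
For example (just a sketch; the exact presets and surrounding lines in v2/chat.py vary between versions, so treat the commented-out lines here as illustrative), that part of the file might end up looking like:

# args.strategy = 'cpu fp32'
# args.strategy = 'cuda fp16'
args.strategy = 'cuda fp16i8 *16 -> cpu fp32'  # 16 layers on the GPU as fp16i8, the rest on the CPU as fp32

With a layout like that, only the last uncommented assignment takes effect.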

2

Hostilis_ t1_jbqh1fm wrote

In terms of layer width, all operations within a single transformer layer are O(n^2), with n the width of the largest matrix in the layer. The architectures are sequential, so depth contributes a multiplicative factor d. Finally, they are quadratic in context length c. So in total: O(n^2 d c^2).
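
Spelled out as a toy cost function (just restating the scaling estimate above, not an exact FLOP count):

def transformer_cost(n, d, c):
    # n: width of the largest matrix in a layer, d: depth, c: context length
    # Each layer is O(n^2) in width and O(c^2) in context; depth multiplies by d.
    return (n ** 2) * d * (c ** 2)  # proportional to n^2 * d * c^2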

There is generally not much difference between different transformer architectures in terms of the computational complexity.

3

kevindamm t1_jbq3w44 wrote

The analysis isn't as straightforward as that, for a few reasons. Transformer architectures are typically a series of alternating Multi-Head Attention (MHA) and Multi-Layer Perceptron (MLP) networks. The MHA may merge the heads from multiple MLPs. Each layer in the network is dominated by a matrix multiply, and if it were all being computed on a CPU, a reasonable upper bound would be O(n^3), where n is the width of the widest layer.

But the bottleneck isn't based on how many multiplies a CPU would have to do, because we are typically using a GPU or TPU to process it, and these can parallelize a lot of the additions and multiplies in the matrix ops. The real bottleneck is often the memory copies going to and from the GPU or TPU, and this will vary greatly based on the model size, GPU memory limits, batch processing size, etc.

You're better off profiling performance for a particular model and hardware combination.
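
If it's useful, here's a minimal sketch of what that kind of profiling could look like with PyTorch's built-in profiler (the model and input here are just placeholders; swap in whatever model and hardware combination you actually care about):

import torch
from torch.profiler import profile, ProfilerActivity

# Placeholder model and batch; substitute the real ones you want to measure.
model = torch.nn.TransformerEncoderLayer(d_model=512, nhead=8).eval()
x = torch.randn(128, 4, 512)  # (sequence length, batch, d_model)

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    model, x = model.cuda(), x.cuda()
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities, profile_memory=True) as prof:
    with torch.no_grad():
        model(x)

# Sort by whatever you suspect dominates: CPU time, CUDA time, or memory.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))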

19

KerfuffleV2 t1_jboquv7 wrote

If it helps, I was able to get the 7B model going on a GTX 1060 with 6GB VRAM also. The strategy I used was cuda fp16i8 *16 -> cpu fp32. Starting out with about 1.2G of VRAM already in use from other programs and the desktop environment, it went up to about 5.6G, which works out to about 0.275G/layer. So on a 6GB card with fp16i8, it seems like even with totally free VRAM you could load 21, maybe 22 layers at most, and half that for the normal fp16 format. This model: RWKV-4-Pile-7B-20230109-ctx4096
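
For anyone who wants to redo that back-of-the-envelope math for a different card (rough numbers only; actual usage depends on the model and whatever else is using the GPU):

vram_used_by_layers = 5.6 - 1.2        # GB taken by the 16 fp16i8 layers
per_layer = vram_used_by_layers / 16   # ~0.275 GB per layer
max_layers = 6.0 / per_layer           # ~21.8, hence 21, maybe 22 layers on an empty 6GB card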

It generates a token every 2-3 seconds, which is too slow for interactive use but still pretty impressive considering the model size and how old the hardware is (my CPU is just a Ryzen 5 1600), especially since half the layers are running on the CPU. By the way, it also uses about 14GB of RAM to run, so you'll need a decent amount of system memory available as well.

Tagging /u/bo_peng as well, in case this information is helpful for them. (One interesting thing I noticed is the GPU was only being used about 50% of the time, presumably while the CPU layers were running. I don't know if it's possible, but if there were some way to run both in parallel, it seems like it would roughly double the speed of token generation.)

2

hcarlens OP t1_jbnreef wrote

Hi! I'm not sure I fully understand your question, but if you're asking whether the rate of progress in competitive ML is slowing down, I think probably not. A lot of the key areas of debate (GBDTs vs neural nets on tabular data, CNNs vs transformers in vision) are still seeing a lot of research, and I expect the competitive ML community to adopt new advances when they happen. In NLP there's also a move towards more efficient models, which would be very useful.

2

czl t1_jbnh9p6 wrote

> ChatGPT is unethical, because it can always be tricked into doing the wrong thing despite any instruction given to it.

Unethical means "not morally correct."

The term you likely want is amoral, which means "lacking a moral sense; unconcerned with the rightness or wrongness of something."

1

czl t1_jbnh0dw wrote

> I think #2 is intractable. People have already been arguing about ethics for millennia, and the existence of AI doesn't make it any easier.

Long arguments over many things have been settled by research. Is there any objective reason this may not happen to arguments about ethics?

My POV as to why machines running simulations may help us improve ethics: https://reddit.com/comments/11nenyo/comment/jbn6rys

Life is complex, but more and more we can use machines to model aspects of it, make predictions, and from those pick the changes that lead to desirable outcomes.

1