Kinexity t1_jbznlup wrote on March 12, 2023 at 11:24 PM

Reply to comment by remghoost7 in [P] Discord Chatbot for LLaMA 4-bit quantized that runs 13b in <9 GiB VRAM by Amazing_Painter_7692

There is a repo for CPU interference written in pure C++: https://github.com/ggerganov/llama.cpp

30B model can run on just over 20GB of RAM and take 1.2sec per token on my i7 8750H. Though actual Windows support has yet to arrive and as of right now the output is garbage for some reason.

Edit: fp16 version works. It's 4 bit quantisation that returns garbage.

remghoost7 t1_jbzmfku wrote on March 12, 2023 at 11:15 PM

Reply to comment by Amazing_Painter_7692 in [P] Discord Chatbot for LLaMA 4-bit quantized that runs 13b in <9 GiB VRAM by Amazing_Painter_7692

Super neat. Thanks for the reply. I'll try that.

Also, do you know if there's a local interface for it....?

I know it's not quite the scope of the post, but it'd be neat to interact with it through a simple python interface (or something like how Gradio is used for A1111's Stable Diffusion) rather than piping it all through Discord.

MurlocXYZ t1_jbzk75t wrote on March 12, 2023 at 10:59 PM

Reply to comment by MinaKovacs in [D] Is anyone trying to just brute force intelligence with enormous model sizes and existing SOTA architectures? Are there technical limitations stopping us? by hebekec256

> A binary computer is nothing more than an abacus

I could say the same thing about the human brain. It's just a complex abacus.

Dendriform1491 t1_jbzj7zu wrote on March 12, 2023 at 10:52 PM

Reply to comment by ML4Bratwurst in [P] Discord Chatbot for LLaMA 4-bit quantized that runs 13b in <9 GiB VRAM by Amazing_Painter_7692

Wait until you hear about the 1/2 bit.

Co0k1eGal3xy t1_jbzi8wc wrote on March 12, 2023 at 10:45 PM

Reply to comment by TemperatureAmazing67 in [D] Is anyone trying to just brute force intelligence with enormous model sizes and existing SOTA architectures? Are there technical limitations stopping us? by hebekec256

Double Decent, more parameters are MORE data efficient.
Most of these LLMs barely complete 1 epoch, so there is no concern about overfitting currently.

poppear OP t1_jbzhgfh wrote on March 12, 2023 at 10:39 PM

Reply to comment by SpaceCockatoo in [P] vanilla-llama an hackable plain-pytorch implementation of LLaMA that can be run on any system (if you have enough resources) by poppear

I was thinking about it. It shouldn't be so hard, i will probably git it a try as soon as I will have some spare time 😀

-Rizhiy- t1_jbzfsqt wrote on March 12, 2023 at 10:27 PM

Reply to comment by f_max in [D] Is anyone trying to just brute force intelligence with enormous model sizes and existing SOTA architectures? Are there technical limitations stopping us? by hebekec256

> human level ai is probably worth more than all of big tech combined

What makes you say that? Where is the economic reasoning? For the vast majority of jobs human labour costs ~$10/hour, a 100T model will most likely cost much more to run. There is a lot of uncertainty with whether the current LLMs can be profitable.

I would say that actually the main reason stopping training of even larger LLMs, is that the economic model is not figured out yet.

stefanof93 t1_jbzeots wrote on March 12, 2023 at 10:19 PM

Reply to [P] Discord Chatbot for LLaMA 4-bit quantized that runs 13b in <9 GiB VRAM by Amazing_Painter_7692

Anyone evaluate all the quantized versions and compare them against smaller models yet? How many bits can you throw away before you're better of picking a smaller version?

f_max t1_jbze2pl wrote on March 12, 2023 at 10:15 PM

Reply to [D] Is anyone trying to just brute force intelligence with enormous model sizes and existing SOTA architectures? Are there technical limitations stopping us? by hebekec256

Speaking as someone working on scaling beyond gpt3 sizes, I think if there was proof of existence of human level ai at 100T parameters, then people would put down the money today to do it. It’s roughly $10m to train a 100B model. With rough scaling of cost with param size, it’s $10B to train this hypothetical 100T param ai. That’s the cost of buying a large tech startup. But a human level ai is probably worth more than all of big tech combined. The main thing stopping people is no one knows if the scaling curves will bend and we’ll hit a plateau in improvement with scale, so no one has the guts to put the money down.

Username912773 t1_jbze0ug wrote on March 12, 2023 at 10:14 PM

Reply to comment by TemperatureAmazing67 in [D] Is anyone trying to just brute force intelligence with enormous model sizes and existing SOTA architectures? Are there technical limitations stopping us? by hebekec256

That’s not a solution. That doesn’t make LLMs sentient it just makes them a cog in a larger machine.

Logic and task performance and sentience are different.

Zepb t1_jbzdpqf wrote on March 12, 2023 at 10:12 PM

Reply to [D] What's the mathematical notation for "top k argmax"? by fullgoopy_alchemist

You could use something like (x_1, i_1), (x_2, i_2), ..., (x_k,i_k), ... (first k tuples of value, index) with x_n >= x_m for every n < m (tuples must be ordered by value)

than use the i index numbers from the tuples

edit: just saw there is a similar approach to this with sets in another comment.

TemperatureAmazing67 t1_jbzcn6a wrote on March 12, 2023 at 10:05 PM

Reply to comment by hebekec256 in [D] Is anyone trying to just brute force intelligence with enormous model sizes and existing SOTA architectures? Are there technical limitations stopping us? by hebekec256

>extensions of LLMs (like
>
>PALM-E
>
>) are a heck of a lot more than an abacus. I wonder what would happen if Google just said, "screw it", and scaled it from 500B to 50T parameters. I'm guessing there are reasons in the architecture that it would

The problem is that we have scaling laws for NN. We just do not have the data for 50T parameters. We need somehow to get these data. The answer on this question costs a lot.

TemperatureAmazing67 t1_jbzc8cc wrote on March 12, 2023 at 10:02 PM

Reply to comment by Username912773 in [D] Is anyone trying to just brute force intelligence with enormous model sizes and existing SOTA architectures? Are there technical limitations stopping us? by hebekec256

'require input to generate an output and do not have initiative' - use random or other's network output.

Also, the argument about next token is skrewed up. For a lot of task everything you need is perfectly predicted next token.

TywinASOIAF t1_jbzbm13 wrote on March 12, 2023 at 9:57 PM

Reply to comment by Toilet_Assassin in [D] Statsmodels ARIMA model predict function not working by ng_guardian

You have to make very weird code in Pandas to handle data on hour intervals (15:00, 16:00, 17:00 etc) with statmodels. In R this is no issue. Convert to tsibble and you are good to go.

Amazing_Painter_7692 OP t1_jbzbcmi wrote on March 12, 2023 at 9:55 PM

Reply to comment by remghoost7 in [P] Discord Chatbot for LLaMA 4-bit quantized that runs 13b in <9 GiB VRAM by Amazing_Painter_7692

Should work fine with the 7b param model: https://huggingface.co/decapoda-research/llama-7b-hf-int4

pyepyepie t1_jbzax1m wrote on March 12, 2023 at 9:52 PM

Reply to comment by ShowerVagina in [N] AtMan could solve the biggest problem of ChatGPT by Number_5_alive

Happy to help :)

ShowerVagina t1_jbzafks wrote on March 12, 2023 at 9:49 PM

Reply to comment by pyepyepie in [N] AtMan could solve the biggest problem of ChatGPT by Number_5_alive

That makes sense. Thank you for the write up!

Taenk t1_jbzaeau wrote on March 12, 2023 at 9:49 PM

Reply to comment by kkg_scorpio in [P] Discord Chatbot for LLaMA 4-bit quantized that runs 13b in <9 GiB VRAM by Amazing_Painter_7692

Isn't 1-bit quantisation qualitatively different as you can do optimizations only available if the parameters are fully binary?

remghoost7 t1_jbz96lt wrote on March 12, 2023 at 9:40 PM

Reply to [P] Discord Chatbot for LLaMA 4-bit quantized that runs 13b in <9 GiB VRAM by Amazing_Painter_7692

><9 GiB VRAM

So does that mean my 1060 6GB can run it....? haha.

I doubt it, but I'll give it a shot later just in case.

pyepyepie t1_jbz9363 wrote on March 12, 2023 at 9:40 PM

Reply to comment by ShowerVagina in [N] AtMan could solve the biggest problem of ChatGPT by Number_5_alive

The TLDR of XAI is that you can "see" (or think you see) how features influence the decisions of your models. For example, if you have a sentence "buy this pill to get skinny!!!!!" and you try to classify if it's spam, the "!!!" might be marked as very spammy. You often find it by masking the "!!!" and seeing that now the message is maybe not classified as spam (often you look at the output dist). Of course, there are many more sophisticated methods to do so and there is a lot of impressive work, but it's the TLDR.

There are many explainability methods, it's a very hot topic. It might be yet another paper, or not. The title makes no sense at all, there are gazillion explainability methods for transformers. I am sorry, I did not read all of the paper so I should probably not talk too much. It just looks very similar to things I already saw.

Generally speaking, you should start using XAI if you do ML, if you do NLP - look into the proven methods, e.g. SHAP and LIME first. If you work with trees, look into TreeSHAP. If you work with vision, look into what I shared here. Sorry if my preceding comments were inaccurate but I hope I still provide some value here :).

kkg_scorpio t1_jbz91de wrote on March 12, 2023 at 9:39 PM

Reply to comment by Upstairs_Suit_9464 in [P] Discord Chatbot for LLaMA 4-bit quantized that runs 13b in <9 GiB VRAM by Amazing_Painter_7692

Check out the terms "quantization aware training" and "post training quantization".

8-bit, 4-bit, 2-bit, hell even 1-bit inference are scenarios which are extremely relevant for edge devices.

[deleted] t1_jbz8xgk wrote on March 12, 2023 at 9:39 PM

Reply to [P] Discord Chatbot for LLaMA 4-bit quantized that runs 13b in <9 GiB VRAM by Amazing_Painter_7692

[removed]

Upstairs_Suit_9464 t1_jbz8dyt wrote on March 12, 2023 at 9:35 PM

Reply to comment by ML4Bratwurst in [P] Discord Chatbot for LLaMA 4-bit quantized that runs 13b in <9 GiB VRAM by Amazing_Painter_7692

I have to ask… is this a joke or are people actually working on digitizing trained networks?

fastglow t1_jbz854s wrote on March 12, 2023 at 9:33 PM

Reply to [N] AtMan could solve the biggest problem of ChatGPT by Number_5_alive

This post does not specify the problem with ChatGPT as it purports to, nor does it solve anything.

KerfuffleV2 t1_jbz7yfk wrote on March 12, 2023 at 9:32 PM

Reply to comment by Select_Beautiful8 in [P] ChatRWKV v2 (can run RWKV 14B with 3G VRAM), RWKV pip package, and finetuning to ctx16K by bo_peng

I've been playing with this for a bit and I actually haven't found any case where fp16i8 worked better than halving the layers and using fp16.

If you haven't already tried it, give something like cuda fp16 *7 -> cuda fp16 *0+ -> cpu fp32 *1 a try and see what happens. It's around twice as fast as cuda fp16i8 *16 -> cpu fp32 for me, which is surprising.

That one will use 7 fp16 layers on the GPU, and stream all the rest except the very last as fp16 on the GPU also. The 33rd layer gets run on the CPU. Not sure if that last part makes a big difference.

Recent comments in /f/MachineLearning