Recent comments in /f/MachineLearning

Striking-Travel-6649 t1_jbjkstq wrote

I think you're on the money. Once we develop more novel network and system structures that are really good at what they do while still generalizing, it will be game over. I think the current models that ML engineers have created are not complex or nuanced enough to extract the kind of value that humans can out of a "small" number of tokens. The human brain is great at having centralized control, coordination across systems, and effective interconnection, and each subsystem can do its "tasks" extremely well and can generalize across tasks too. With that in mind, we are going to need much more complex systems to achieve AGI.

−2

harharveryfunny t1_jbjk9nb wrote

Just to follow up, the reason the "interact with the world" approach is so much more efficient is that it's largely curiosity driven - we proactively try to fill gaps in our knowledge rather than just read a set of encyclopedias and hope they cover what we need to know. We learn in a much more targeted fashion.

10

bivouac0 t1_jbjk79f wrote

Truthfully, this has not been sufficiently researched, and looking into it might yield improvements to LLMs. However, it's also not completely surprising. Consider...

For humans, something like 80% of a conversation is non-verbal (there are actual studies on this). This means that people get the meaning of words through other cues such as expression, tone, etc., and thus our conversational inputs are much "richer" than simply a bunch of tokens.

You also need to consider that our verbal communication is augmented by a lot of other sensory input (e.g. visual). You learn what a "ball" is largely by seeing it, not hearing about it.

Also realize that LLMs generally use a very low learning rate (e.g. 1e-3), so a large number of tokens must be presented. It's not completely clear how this works in people, but we do completely memorize some inputs (effectively LR=1) and almost completely ignore others. This in itself could be an entire area of research. It would be good to understand why some phrases are "catchy" and others are forgettable. Obviously, AI today doesn't do this.
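
As a toy illustration of the learning-rate point (my own made-up example in Python, with numbers that have nothing to do with real LLM training): a low learning rate means the same input has to be presented many times, while an effective LR of 1 amounts to one-shot memorization.

```python
# Toy example: one scalar weight pulled toward a target by gradient descent
# on the loss 0.5 * (w - target)^2, whose gradient is (w - target).
target = 1.0

# Low learning rate: each presentation of the "fact" nudges the weight only a
# little, so it has to be shown many times before it sinks in.
w = 0.0
for _ in range(1000):
    w -= 1e-3 * (w - target)
print(f"after 1000 low-LR presentations: {w:.3f}")   # ~0.632, still well short of 1.0

# "LR = 1" on the same loss: a single presentation lands exactly on the
# target, i.e. one-shot memorization.
w = 0.0
w -= 1.0 * (w - target)
print(f"after 1 presentation at LR=1:    {w:.3f}")   # 1.000
```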

I'd also point out that LLMs are not exactly memorizing information. Studies have demonstrated their ability to learn facts, but this is not purposeful knowledge retention. People have a better ability to do this, and I suspect AI needs to develop a method to separate knowledge retention from language pattern modeling. Think about learning the state capitals. A person quickly learns to say "the capital of X is Y" and then can substitute in different memorized facts. AI learns the facts and the sentence patterns all in the same manner.
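
Something like the following hypothetical sketch (the names and structure are mine, not an existing system) captures the separation I mean: memorized facts live in one place and the sentence pattern in another, so new facts slot in without relearning the pattern, whereas an LLM learns both with the same token-level updates.

```python
capitals = {          # knowledge retention: facts memorized once, verbatim
    "France": "Paris",
    "Japan": "Tokyo",
}

def state_capital(country: str) -> str:
    # language pattern modeling: one template reused for every fact
    return f"The capital of {country} is {capitals[country]}."

print(state_capital("France"))   # The capital of France is Paris.
print(state_capital("Japan"))    # The capital of Japan is Tokyo.

# An LLM, by contrast, learns the facts and the sentence pattern with the same
# gradient updates over tokens -- there is no separate fact store to edit.
```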

People can also use "thought" (i.e. search, hypothesis, etc.) to understand the meaning of sentences and form responses. Let's face it, at this point LLMs are just brute-force pattern matchers. There's nothing "intelligent" here.

8

ThePerson654321 OP t1_jbjisn7 wrote

  1. Sure. RWKV 7B came out 7 months ago, but the concept has been promoted by the developer for much longer. Compared to, say, DALL-E 2 (which has exploded) and which only came out 9 months ago, it still feels like some organization would have picked up RWKV by now if it were as useful as the developer claims.

  2. This might actually be a problem. But the code is public, so it shouldn't be that difficult to understand.

  3. Not necessarily. Google, OpenAI, DeepMind etc. test things that don't work out all the time.

  4. Does not matter. If your idea is truly good, you will get attention sooner or later anyway.


I don't buy the argument that it's too new or hard to understand. Some researcher at, for example, Deepmind would have been able to understand it.

I personally have two potential explanations for my question:

  1. It does not work as well as the developer claims, or it has some other flaw that makes it hard to scale, for example (time will be the judge of this)
  2. The community is just really slow to embrace it for some unknown reason.

I am leaning towards the first one.

5

harharveryfunny t1_jbjhmif wrote

Humans don't learn by locking themselves in a room at birth with a set of encyclopedias, or a print-out of the internet. We learn by interaction with the world - perceive/generalize/theorize/experiment, learn from feedback, etc.

It's impressive how well these LLMs perform given what is really a very tough task - build an accurate world model given only "predict next word" feedback - but it's hardly surprising that they need massive amounts of data to compensate for the task being so tough.
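
For what it's worth, here's a minimal sketch of that "predict next word" feedback (toy sizes and random tensors just to show the shape of the objective, not any particular model's training loop):

```python
import torch
import torch.nn.functional as F

# Toy next-token prediction objective: the only training signal is the
# cross-entropy between the model's predicted distribution at position t
# and the token that actually appears at position t+1.
vocab_size, seq_len = 100, 8
tokens = torch.randint(0, vocab_size, (1, seq_len))    # a stand-in token sequence
logits = torch.randn(1, seq_len - 1, vocab_size)       # stand-in model outputs

targets = tokens[:, 1:]                                 # shifted by one position
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
print(loss.item())   # the entire "world model" has to be squeezed out of this scalar
```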

23

IntelArtiGen t1_jbjg4wk wrote

>Humans need substantially fewer tokens than transformer language models.

We don't use tokens the same way. In theory you could build a model with a vocabulary of 10,000 billion tokens, including one for each number up to some limit. Obviously humans can't and don't do that. We're probably closer to a model which does "characters of a word => embedding". Some models do that, but they also do "token => embedding" because it improves results and is easier for the models to learn. The people who build these models may not really care about model size if they have the right machine to train it and just want the best results on a task, without any constraint on size efficiency.

Most NLP models aren't efficient with respect to their size, though I'm not sure there is currently a way to keep getting the best possible results without doing things like this. If I ask you "what happened in 2018?", you need to have an idea of what "2018" means, and that it's not just a random number. Either (1) you know it's a year because you've learned this number like all other tokens (and you have to do that for many numbers / weird words, so you have a big model), or (2) you treat it as a random number, you don't need one token per number, your model is much smaller, but you can't answer these questions precisely, or (3) you can re-build an embedding for 2018 knowing it's 2-0-1-8, because you have an accurate "characters => embedding" model.

I don't think we have a perfect solution for (3), so we usually do (1) & (3). But doing just (3) is the way to go for smaller NLP models... or putting much more weight on (3) and much less on (1).
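
For concreteness, here's a rough sketch of what option (3) could look like (toy sizes, and the GRU encoder is an arbitrary choice of mine, not a claim about how any particular model does it):

```python
import torch
import torch.nn as nn

class CharToEmbedding(nn.Module):
    """Build a word embedding from its characters instead of a per-word table row."""
    def __init__(self, n_chars=128, char_dim=32, out_dim=256):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)   # tiny table: one row per character
        self.encoder = nn.GRU(char_dim, out_dim, batch_first=True)

    def forward(self, word: str) -> torch.Tensor:
        ids = torch.tensor([[min(ord(c), 127) for c in word]])
        chars = self.char_emb(ids)        # (1, len(word), char_dim)
        _, h = self.encoder(chars)        # final hidden state summarizes the word
        return h.squeeze(0).squeeze(0)    # (out_dim,) word embedding

model = CharToEmbedding()
vec_2018 = model("2018")   # an embedding for "2018" with no dedicated "2018" token
print(vec_2018.shape)      # torch.Size([256])
```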

So the size of NLP models doesn't really mean anything: you could build a model with 100,000B parameters, but 99.999% of those parameters won't improve the results much and are only needed to answer very specific questions. If we care about the size of NLP models, we should focus on building better "characters => embedding" models and on ways to compress word embeddings (easier said than done).

1

LetterRip t1_jbjfiyg wrote

  1. The larger models (3B, 7B, 14B) have only been released quite recently

  2. Information about the design has been fairly scarce/hard to track down because no paper has been written on it and submitted

  3. People want to know that it actually scales before investing work into it.

  4. Mostly, people are learning about it from release links posted to Reddit, and those posts haven't been written in a way that attracts interest.

13

etesian_dusk t1_jbioscf wrote

Comparing package size to pure source-code size is kind of misleading. The PyTorch codebase by itself isn't 1 GB.

Also, in most use cases, I'd rather have PyTorch's versatility than be able to brag about <1000 lines.

1

lifesthateasy t1_jbim6l5 wrote

Look, it's really hard to argue with you when I present my findings and you're like "well I've never read anything of the like so it mustn't be true". Feel free to check this article: if you look closely, you'll find evidence that so-called "emergent abilities" only look emergent because we choose the wrong evaluation metrics; once we choose metrics that better describe the results and are not biased toward usefulness to humans, you can see that the original metrics simply don't account for gradual improvement, and that's the only reason the abilities seem "emergent". If you look at a holistic view of something like GPT-3 and its aggregate performance across benchmarks, you find the accuracy is smooth with scale. Truly emergent abilities would show a sudden, exponential jump. https://www.assemblyai.com/blog/emergent-abilities-of-large-language-models/ Since I can't post images here, check the image with the text "Aggregate performance across benchmarks for GPT-3 is smooth" in the above article, which supports this notion.
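
To make the metric point concrete, here's a toy illustration (my own construction, not taken from the article): if per-token accuracy improves smoothly with scale but we only score exact matches on a 10-token answer, the exact-match curve looks like a sudden jump.

```python
import numpy as np

p = np.linspace(0.5, 0.99, 12)   # smooth, gradual per-token accuracy improvement
exact_match = p ** 10            # answer only counts if all 10 tokens are right

for pi, em in zip(p, exact_match):
    print(f"per-token acc {pi:.2f} -> exact-match {em:.3f}")
# Per-token accuracy climbs steadily, but exact match sits near zero and then
# shoots up late: a "sudden" ability that is really just a hard threshold
# applied to smooth underlying progress.
```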

So even *if* emergent abilities were a thing, and you'd argue consciousness is an emergent ability, there's data showing there's nothing emergent about GPT's abilities, so consciousness could not have emerged either.

Yes, GPT3 is the third round, and I'm saying GPT3 is static in its weights. It doesn't matter that they're making a GPT4, because I'm saying these models don't learn like we do. And they don't. GPT4 is a separate entity. Even *if* GPT3 had a consciousness, it would have no connection to GPT4, as they're separate entities on separate hardware, while human consciousness evolves within the same "hardware" and never stops learning. The brain even adds new connections until the end of our lives, which GPT3 doesn't (and yes, you're severely misinformed on that 25-year age barrier, that's an antiquated notion. To prevent you from going "well I've never read that" again, here's an article, with plenty more to support it if you can google: https://cordis.europa.eu/article/id/123279-trending-science-do-our-brain-cells-die-as-we-age-researchers-now-say-no: "New research shows that older adults can still grow new brain cells." ). You can't even compare GPT3 to 4 in brain/human-consciousness terms, because GPT4 will have a different architecture and will quite likely be trained on different data. So it's not like GPT3 learns and evolves, no: GPT3 is set, and GPT4 will be a separate thing - *completely unlike* human consciousness.

About determinism, I don't know if you're misunderstanding me on purpose, but what I'm saying is that an artificial neuron in an NN has one activation function, one input and one output (even though the output can be, and often is, a vector or a matrix). At best it's bidirectional, but even bidirectionality is handled with separate pathways that go backwards; the activation functions themselves are feedforward, and for the same input they always give the same output. Brain cells, however, are not only multidirectional without extra backwards connections, but they can also keep some residual electric charge that changes the output (both its direction and strength). This residual activation can have a number of effects on the neuron's firing behavior, including increasing the strength of subsequent firing events and influencing the direction and timing of firing.
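
To spell out what I mean by deterministic, here's a minimal sketch of such an artificial neuron (plain NumPy, arbitrary weights of my choosing): fixed weights, one feedforward activation, no internal state, so the same input always gives the same output.

```python
import numpy as np

def neuron(x: np.ndarray, w: np.ndarray, b: float) -> float:
    # one weighted sum, one activation (ReLU here), purely feedforward
    return float(np.maximum(0.0, w @ x + b))

w, b = np.array([0.5, -1.0, 2.0]), 0.1
x = np.array([1.0, 0.2, 0.5])

print(neuron(x, w, b))   # ~1.4
print(neuron(x, w, b))   # ~1.4 again: no residual charge, no timing effects
```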

Since I can't be arsed to type any more, here's someone else who can explain to you why brain neurons and artificial neurons are fundamentally different: https://towardsdatascience.com/the-differences-between-artificial-and-biological-neural-networks-a8b46db828b7 Even this article has some omissions, and I want to highlight how in the past we thought neurons would always fire when getting a stimulus and stop firing when they stopped getting the stimulus (as artificial neurons do), but there have been new discoveries showing that human neurons also exhibit persistent activity: neural firing that continues after the triggering stimulus goes away.

1

eyeofthephysics t1_jbhu9d4 wrote

First, I would say there exist versions of FinBERT which aren't just tuned for sentiment analysis. There are two groups who developed models they called FinBERT: https://arxiv.org/abs/1908.10063 and https://arxiv.org/abs/2006.08097. The first paper's model can be found here and is tuned for sentiment analysis, but the second model, found here, was pre-trained using masked language modelling on general financial text. So that one can be fine-tuned for other tasks.

Since you're interested in text embeddings, you may also be interested in this paper https://arxiv.org/pdf/2111.00526.pdf. The focus of that paper is sentiment analysis, but the general idea of using a sentence-BERT model to get better textual embeddings (as opposed to using vanilla BERT) should hold more generally.
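
If it helps, here's a minimal sketch of that general idea (the model name is just a common sentence-transformers checkpoint I'm using as an example, not the one from the paper):

```python
from sentence_transformers import SentenceTransformer, util

# Use a sentence-BERT model to embed whole sentences directly, rather than
# pooling vanilla BERT token outputs yourself.
model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = [
    "The company reported better-than-expected quarterly earnings.",
    "Shares fell sharply after the profit warning.",
]
embeddings = model.encode(sentences)               # (2, embedding_dim) numpy array
print(util.cos_sim(embeddings[0], embeddings[1]))  # similarity between the two texts
```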

2

jobeta t1_jbhg8xy wrote

Right. But to be able to assess this you need to define a task and evaluate your model's performance on that task. Embedding accuracy cannot be discussed completely in the ether. Even the most general comments you will read about one model beating another will refer to the new model performing better on specific tasks on benchmark datasets.

It would be a lot easier to help you if you explained what you are trying to accomplish that requires "higher accuracy".
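
As a hypothetical sketch of what "defining the task" looks like in practice (embed_a / embed_b, texts and labels are placeholders for your own embedding models and labeled data): embedding quality becomes the cross-validated accuracy of a simple classifier trained on those embeddings.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def score_embeddings(embed_fn, texts, labels):
    # Score an embedding model by how well a simple downstream classifier
    # does on your labeled data when fed its embeddings.
    X = [embed_fn(t) for t in texts]
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X, labels, cv=5).mean()

# acc_a = score_embeddings(embed_a, texts, labels)
# acc_b = score_embeddings(embed_b, texts, labels)
# "Higher accuracy" now means something concrete: the better score on this task.
```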

1