Recent comments in /f/MachineLearning

Striking-Travel-6649 t1_jbjkstq wrote

I think you're on the money. Once we develop more novel network and system structures that are really good at what they do while still generalizing, it will be game over. I think the current models that ML engineers have created are not complex or nuanced enough to extract the kind of value that humans can out of a "small" number of tokens. The human brain is great at having centralized control, coordination across systems, and effective interconnection, and each subsystem can do its "tasks" extremely well and can generalize across tasks too. With that in mind, we are going to need much more complex systems to achieve AGI.

−2

harharveryfunny t1_jbjk9nb wrote

Just to follow up, the reason the "interact with the world" approach is so much more efficient is that it's largely curiosity driven - we proactively try to fill gaps in our knowledge rather than just read a set of encyclopedias and hope they cover what we need to know. We learn in a much more targeted fashion.

10

bivouac0 t1_jbjk79f wrote

Truthfully, this has not been sufficiently researched, and looking into it might yield improvements to LLMs. However, it's also not completely surprising. Consider...

For humans, something like 80% of a conversation is non-verbal (there are actual studies on this). This means that people get the meaning of words through other cues such as expression, tone, etc., and thus our conversational inputs are much "richer" than simply a bunch of tokens.

You also need to consider that our verbal communication is augmented by a lot of other sensory input (e.g. visual). You learn what a "ball" is largely by seeing it, not hearing about it.

Also realize that LLMs generally use a very low learning rate (e.g. 1e-3), so a large number of tokens must be presented. It's not completely clear how this works in people, but we do completely memorize some inputs (effectively LR=1) and almost completely ignore others. This in itself could be an entire area of research. It would be good to understand why some phrases are "catchy" and others are forgettable. Obviously, AI today doesn't do this.
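
As a toy illustration of the learning-rate point (my own made-up example in Python, with numbers that have nothing to do with real LLM training): a low learning rate means the same input has to be presented many times, while an effective LR of 1 amounts to one-shot memorization.

```python
# Toy example: one scalar weight pulled toward a target by gradient descent
# on the loss 0.5 * (w - target)^2, whose gradient is (w - target).
target = 1.0

# Low learning rate: each presentation of the "fact" nudges the weight only a
# little, so it has to be shown many times before it sinks in.
w = 0.0
for _ in range(1000):
    w -= 1e-3 * (w - target)
print(f"after 1000 low-LR presentations: {w:.3f}")   # ~0.632, still well short of 1.0

# "LR = 1" on the same loss: a single presentation lands exactly on the
# target, i.e. one-shot memorization.
w = 0.0
w -= 1.0 * (w - target)
print(f"after 1 presentation at LR=1:    {w:.3f}")   # 1.000
```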

I'd also point out that LLMs are not exactly memorizing information. Studies have demonstrated their ability to learn facts, but this is not purposeful knowledge retention. People have a better ability to do this, and I suspect AI needs to develop a method to separate knowledge retention from language pattern modeling. Think about learning the state capitals. A person quickly learns to say "the capital of X is Y" and then can substitute in different memorized facts. AI learns the facts and the sentence patterns all in the same manner.
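
Something like the following hypothetical sketch (the names and structure are mine, not an existing system) captures the separation I mean: memorized facts live in one place and the sentence pattern in another, so new facts slot in without relearning the pattern, whereas an LLM learns both with the same token-level updates.

```python
capitals = {          # knowledge retention: facts memorized once, verbatim
    "France": "Paris",
    "Japan": "Tokyo",
}

def state_capital(country: str) -> str:
    # language pattern modeling: one template reused for every fact
    return f"The capital of {country} is {capitals[country]}."

print(state_capital("France"))   # The capital of France is Paris.
print(state_capital("Japan"))    # The capital of Japan is Tokyo.

# An LLM, by contrast, learns the facts and the sentence pattern with the same
# gradient updates over tokens -- there is no separate fact store to edit.
```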

People can also use "thought" (i.e. search, hypothesis, etc.) to understand the meaning of sentences and form responses. Let's face it, at this point LLMs are just brute-force pattern matchers. There's nothing "intelligent" here.

8

ThePerson654321 OP t1_jbjisn7 wrote

  1. Sure. RWKV 7B came out 7 months ago, but the concept has been promoted by the developer for much longer. Compared to, say, DALL-E 2 (which has exploded) and which only came out 9 months ago, it still feels like some organization would have picked up RWKV by now if it were as useful as the developer claims.

  2. This might actually be a problem. But the code is public, so it shouldn't be that difficult to understand.

  3. Not necessarily. Google, OpenAI, DeepMind etc. test things that don't work out all the time.

  4. Does not matter. If your idea is truly good, you will get attention sooner or later anyway.


I don't buy the argument that it's too new or hard to understand. Some researcher at, for example, Deepmind would have been able to understand it.

I personally have two potential explanations for my question:

  1. It does not work as well as the developer claims, or it has some other flaw that makes it hard to scale, for example (time will be the judge of this)
  2. The community is just really slow to embrace it for some unknown reason.

I am leaning towards the first one.

5

harharveryfunny t1_jbjhmif wrote

Humans don't learn by locking themselves in a room at birth with a set of encyclopedias, or a print-out of the internet. We learn by interaction with the world - perceive/generalize/theorize/experiment, learn from feedback, etc.

It's impressive how well these LLMs perform given what is really a very tough task - build an accurate world model given only "predict next word" feedback - but it's hardly surprising that they need massive amounts of data to compensate for the task being so tough.
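
For what it's worth, here's a minimal sketch of that "predict next word" feedback (toy sizes and random tensors just to show the shape of the objective, not any particular model's training loop):

```python
import torch
import torch.nn.functional as F

# Toy next-token prediction objective: the only training signal is the
# cross-entropy between the model's predicted distribution at position t
# and the token that actually appears at position t+1.
vocab_size, seq_len = 100, 8
tokens = torch.randint(0, vocab_size, (1, seq_len))    # a stand-in token sequence
logits = torch.randn(1, seq_len - 1, vocab_size)       # stand-in model outputs

targets = tokens[:, 1:]                                 # shifted by one position
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
print(loss.item())   # the entire "world model" has to be squeezed out of this scalar
```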

23

IntelArtiGen t1_jbjg4wk wrote

>Humans need substantially fewer tokens than transformer language models.

We don't use tokens the same way. In theory you could build a model with a vocabulary of 10,000 billion tokens, including one for each number up to some limit. Obviously humans can't and don't do that. We're probably closer to a model which does "characters of a word => embedding". Some models do that, but they also do "token => embedding" because it improves results and is easier for the models to learn. The people who build these models may not really care about model size if they have the right machine to train it and just want the best results on a task, without any constraint on size efficiency.

Most NLP models aren't efficient with respect to their size, though I'm not sure there is currently a way to keep getting the best possible results without doing things like this. If I ask you "what happened in 2018?", you need to have an idea of what "2018" means, and that it's not just a random number. Either (1) you know it's a year because you've learned this number like all other tokens (and you have to do that for many numbers / weird words, so you have a big model), or (2) you treat it as a random number, you don't need one token per number, your model is much smaller, but you can't answer these questions precisely, or (3) you can re-build an embedding for 2018 knowing it's 2-0-1-8, because you have an accurate "characters => embedding" model.

I don't think we have a perfect solution for (3), so we usually do (1) & (3). But doing just (3) is the way to go for smaller NLP models... or putting much more weight on (3) and much less on (1).
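
For concreteness, here's a rough sketch of what option (3) could look like (toy sizes, and the GRU encoder is an arbitrary choice of mine, not a claim about how any particular model does it):

```python
import torch
import torch.nn as nn

class CharToEmbedding(nn.Module):
    """Build a word embedding from its characters instead of a per-word table row."""
    def __init__(self, n_chars=128, char_dim=32, out_dim=256):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)   # tiny table: one row per character
        self.encoder = nn.GRU(char_dim, out_dim, batch_first=True)

    def forward(self, word: str) -> torch.Tensor:
        ids = torch.tensor([[min(ord(c), 127) for c in word]])
        chars = self.char_emb(ids)        # (1, len(word), char_dim)
        _, h = self.encoder(chars)        # final hidden state summarizes the word
        return h.squeeze(0).squeeze(0)    # (out_dim,) word embedding

model = CharToEmbedding()
vec_2018 = model("2018")   # an embedding for "2018" with no dedicated "2018" token
print(vec_2018.shape)      # torch.Size([256])
```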

So the size of NLP models doesn't really mean anything: you could build a model with 100,000B parameters, but 99.999% of those parameters won't improve the results much and are only needed to answer very specific questions. If we care about the size of NLP models, we should focus on building better "characters => embedding" models and on ways to compress word embeddings (easier said than done).

1

LetterRip t1_jbjfiyg wrote

  1. The larger models (3B, 7B, 14B) have only been released quite recently

  2. Information about the design has been fairly scarce/hard to track down because no paper has been written on it and submitted

  3. People want to know that it actually scales before investing work into it.

  4. Mostly, people are learning about it from release links posted to Reddit, and those posts haven't been written in a way that attracts interest.

13

etesian_dusk t1_jbioscf wrote

Comparing package size to pure source-code size is kind of misleading. The PyTorch codebase by itself isn't 1 GB.

Also, in most use cases, I'd rather have PyTorch's versatility than be able to brag about <1000 lines.

1

lifesthateasy t1_jbim6l5 wrote

Look, it's really hard to argue with you when I present my findings and you're like "well I've never read anything of the like so it mustn't be true". Feel free to check this article: if you look closely, you'll find evidence that so-called "emergent abilities" only look emergent because we choose the wrong evaluation metrics; once we choose metrics that better describe the results and are not biased toward usefulness to humans, you can see that the original metrics simply don't account for gradual improvement, and that's the only reason the abilities seem "emergent". If you look at a holistic view of something like GPT-3 and its aggregate performance across benchmarks, you find the accuracy is smooth with scale. Truly emergent abilities would show a sudden, exponential jump. https://www.assemblyai.com/blog/emergent-abilities-of-large-language-models/ Since I can't post images here, check the image with the text "Aggregate performance across benchmarks for GPT-3 is smooth" in the above article, which supports this notion.
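
To make the metric point concrete, here's a toy illustration (my own construction, not taken from the article): if per-token accuracy improves smoothly with scale but we only score exact matches on a 10-token answer, the exact-match curve looks like a sudden jump.

```python
import numpy as np

p = np.linspace(0.5, 0.99, 12)   # smooth, gradual per-token accuracy improvement
exact_match = p ** 10            # answer only counts if all 10 tokens are right

for pi, em in zip(p, exact_match):
    print(f"per-token acc {pi:.2f} -> exact-match {em:.3f}")
# Per-token accuracy climbs steadily, but exact match sits near zero and then
# shoots up late: a "sudden" ability that is really just a hard threshold
# applied to smooth underlying progress.
```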

So even *if* emergent abilities were a thing, and you'd argue consciousness is an emergent ability, there's data showing there's nothing emergent about GPT's abilities, so consciousness could not have emerged either.

Yes, GPT3 is the third round, and I'm saying GPT3 is static in its weights. It doesn't matter that they're making a GPT4, because I'm saying these models don't learn like we do. And they don't. GPT4 is a separate entity. Even *if* GPT3 had a consciousness, it would have no connection to GPT4, as they're separate entities on separate hardware, while human consciousness evolves within the same "hardware" and never stops learning. The brain even adds new connections until the end of our lives, which GPT3 doesn't (and yes, you're severely misinformed on that 25-year age barrier, that's an antiquated notion. To prevent you from going "well I've never read that" again, here's an article, with plenty more to support it if you can google: https://cordis.europa.eu/article/id/123279-trending-science-do-our-brain-cells-die-as-we-age-researchers-now-say-no: "New research shows that older adults can still grow new brain cells." ). You can't even compare GPT3 to 4 in brain/human-consciousness terms, because GPT4 will have a different architecture and will quite likely be trained on different data. So it's not like GPT3 learns and evolves, no: GPT3 is set, and GPT4 will be a separate thing - *completely unlike* human consciousness.

About determinism, I don't know if you're misunderstanding me on purpose, but what I'm saying is that an artificial neuron in an NN has one activation function, one input and one output (even though the output can be, and often is, a vector or a matrix). At best it's bidirectional, but even bidirectionality is handled with separate pathways that go backwards; the activation functions themselves are feedforward, and for the same input they always give the same output. Brain cells, however, are not only multidirectional without extra backwards connections, but they can also keep some residual electric charge that changes the output (both its direction and strength). This residual activation can have a number of effects on the neuron's firing behavior, including increasing the strength of subsequent firing events and influencing the direction and timing of firing.
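
To spell out what I mean by deterministic, here's a minimal sketch of such an artificial neuron (plain NumPy, arbitrary weights of my choosing): fixed weights, one feedforward activation, no internal state, so the same input always gives the same output.

```python
import numpy as np

def neuron(x: np.ndarray, w: np.ndarray, b: float) -> float:
    # one weighted sum, one activation (ReLU here), purely feedforward
    return float(np.maximum(0.0, w @ x + b))

w, b = np.array([0.5, -1.0, 2.0]), 0.1
x = np.array([1.0, 0.2, 0.5])

print(neuron(x, w, b))   # ~1.4
print(neuron(x, w, b))   # ~1.4 again: no residual charge, no timing effects
```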

Since I can't be arsed to type any more, here's someone else who can explain to you why brain neurons and artificial neurons are fundamentally different: https://towardsdatascience.com/the-differences-between-artificial-and-biological-neural-networks-a8b46db828b7 Even this article has some omissions, and I want to highlight how in the past we thought neurons would always fire when getting a stimulus and stop firing when they stopped getting the stimulus (as artificial neurons do), but there have been new discoveries showing that human neurons also exhibit persistent activity: neural firing that continues after the triggering stimulus goes away.

1

eyeofthephysics t1_jbhu9d4 wrote

First, I would say there exist versions of FinBERT which aren't just tuned for sentiment analysis. There are two groups who developed models they called FinBERT: https://arxiv.org/abs/1908.10063 and https://arxiv.org/abs/2006.08097. The first paper's model can be found here and is tuned for sentiment analysis, but the second model, found here, was pre-trained using masked language modelling on general financial text. So that one can be fine-tuned for other tasks.

Since you're interested in text embeddings, you may also be interested in this paper https://arxiv.org/pdf/2111.00526.pdf. The focus of that paper is sentiment analysis, but the general idea of using a sentence-BERT model to get better textual embeddings (as opposed to using vanilla BERT) should hold more generally.
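
If it helps, here's a minimal sketch of that general idea (the model name is just a common sentence-transformers checkpoint I'm using as an example, not the one from the paper):

```python
from sentence_transformers import SentenceTransformer, util

# Use a sentence-BERT model to embed whole sentences directly, rather than
# pooling vanilla BERT token outputs yourself.
model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = [
    "The company reported better-than-expected quarterly earnings.",
    "Shares fell sharply after the profit warning.",
]
embeddings = model.encode(sentences)               # (2, embedding_dim) numpy array
print(util.cos_sim(embeddings[0], embeddings[1]))  # similarity between the two texts
```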

2

jobeta t1_jbhg8xy wrote

Right. But to be able to assess this you need to define a task and evaluate your model's performance on that task. Embedding accuracy cannot be discussed completely in the ether. Even the most general comments you will read about one model beating another will refer to the new model performing better on specific tasks on benchmark datasets.

It would be a lot easier to help you if you explained what you are trying to accomplish that requires "higher accuracy".
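
As a hypothetical sketch of what "defining the task" looks like in practice (embed_a / embed_b, texts and labels are placeholders for your own embedding models and labeled data): embedding quality becomes the cross-validated accuracy of a simple classifier trained on those embeddings.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def score_embeddings(embed_fn, texts, labels):
    # Score an embedding model by how well a simple downstream classifier
    # does on your labeled data when fed its embeddings.
    X = [embed_fn(t) for t in texts]
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X, labels, cv=5).mean()

# acc_a = score_embeddings(embed_a, texts, labels)
# acc_b = score_embeddings(embed_b, texts, labels)
# "Higher accuracy" now means something concrete: the better score on this task.
```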

1