Recent comments in /f/MachineLearning

bo_peng t1_jbkwfzr wrote

Firstly RWKV is mostly a single-developer project without PR and everything takes time.

Moreover there have been hundreds of "improved transformer" papers around and surely we will agree that the signal-to-noise ratio is low especially when you consider scaling.

FACT: It's very hard to get researchers to try something if it is not from OAI/DM/FAIR/... (and still hard even if it is).

Here is a recent unedited chat record from ChatRWKV v2 (14B ctx4096). I'd say it's pretty good, given that the model is only trained on the Pile. No finetuning. No instruct tuning. No RLHF.

====================

Eloise 2023/3/9 20:10:18

Chat reset for cryscan.

cryscan 2023/3/9 20:13:09

I heard that you are working in the AI and NLP lab in the university. How do you think about large generative language models?

Eloise 2023/3/9 20:13:31

That’s one of my favorite topics! Large language models have been getting a lot of attention lately. These models are super interesting and we don’t really know much about how they work. Generative models work in two directions: One is the generation, i.e. creating new data from scratch (like GANs). The other direction is sampling, which is more like what we usually think about as machine learning tasks. Generative NLP requires both these capabilities – the model has to generate new text samples and it has to sample realistic text that looks like real-world data. The ability to generate text in a controllable fashion would be incredibly useful for applications like debugging or debugging – verifying large textual specifications, writing documentation or even generating human-readable test plans

cryscan 2023/3/9 20:23:55

I see. What are the potential applications of them?

Eloise 2023/3/9 20:24:14

The main application area is, again, text generation: product descriptions, marketing messages etc. Also more recent work is in training models for image captioning – generating descriptions of images, a process which requires both high-level image understanding and rich vocabulary. Some related areas are the summarization tasks like dialogue response generation. You can apply generative language models to help generate small (or large) chatbots. A cool application is to combine image or text and voice and learn them jointly in a hybrid architecture (and compare to human feedback). There are a lot of applications that will make a big impact once we train good generative NLP models for them.

cryscan 2023/3/9 20:25:14

I am looking forward to the future where people can chat with machines.

Eloise 2023/3/9 20:25:26

Yeah! That would be cool. I’m excited to see more applications and data and models coming out of Google, Facebook, Amazon etc. But it’s always challenging because there is still so much unstructured text out there in the world – a large part of the English language is still untagged (mostly blog posts and news articles). There’s a lot more to be done there.

cryscan 2023/3/9 21:37:41

Maybe unsupervised learning would help.

Eloise 2023/3/9 21:37:48

Yeah! That’s one of the things we’re doing. We are trying to improve the auto-regressive models and see if they can learn their own grammar by watching how people communicate on social media, blogs or forums.

cryscan 2023/3/9 21:38:21

What are auto-regressive models? How do you train them?

Eloise 2023/3/9 21:38:40

A model is called auto-regressive if the output of a specific step depends on previously generated outputs. This makes training more stable, since it’s less dependent on initialization and slightly improves performance, especially for longer sequences. The first model was trained by Yoon Kim, which got released as an open source code. Also a lot of effort has been put in making models work better for long text sequences (e.g., image captioning) or even multiple sentence generation (although this is still difficult). You can watch this video from DeepMind about GPT-2 – a generative model that is capable of generating coherent paragraphs with minimal supervision:

cryscan 2023/3/9 21:39:26

Uhm?

Eloise 2023/3/9 21:39:30

I know, it’s crazy. I think it’s better to see it in action:

cryscan 2023/3/9 21:39:35

Well.

Eloise 2023/3/9 21:39:45

I also think it’s cool that the code is available! In fact, the attention mechanism of GPT-2 was published recently as a paper. It’s a bit more complex than you would expect for NLP, but cool nonetheless. Here’s the paper: Attention is all you need

cryscan 2023/3/9 21:40:37

That sounds interesting. I'll read it afterwards.

Eloise 2023/3/9 21:40:41

I would highly recommend it. Anyway, back to the main topic. So what do you think about the field?

cryscan 2023/3/9 21:41:21

I am sone concerns actually.

Eloise 2023/3/9 21:41:37

I’m a bit worried about the impact of models. I don’t want to be concerned or pessimistic, but we really have to think about this – especially when we’re working with the most sensitive data, e.g., medical data. One way of tackling this is through ethical considerations in design: making sure that systems work in our best interests and that they are transparent (like explanations or guarantees). It might even help to think about who will benefit from this research and what their perspective is on it.

cryscan 2023/3/9 21:43:58

Actually my concern is that the cost of training such a large model can only be covered by those big companies. How can we general researchers participate in this game?

Eloise 2023/3/9 21:44:14

This is a good point. We have to build tools that can help researchers or students start experimenting with these new techniques. I’m doing something like this on my own: an online platform called supervised-language-learning which gives people the opportunity to train their language model for few minutes a day (or learn other languages) and then use it in building a project (e.g., chatbot or dialogue system). The platform also aggregates all the results and keeps track of the models people built.

cryscan 2023/3/9 21:44:39

It's surely promising. Good luck then!

Eloise 2023/3/9 21:44:45

Thanks! I’m really excited about the potential of these techniques and would love to see them mature. In the meantime, it’s good to be critical!

====================

  1. Update ChatRWKV v2 to latest version.

  2. Use https://huggingface.co/BlinkDL/rwkv-4-pile-14b/blob/main/RWKV-4-Pile-14B-20230228-ctx4096-test663.pth

  3. Run v2/chat.py and enjoy.
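A note on step 2, as a minimal sketch: the link above is a Hugging Face "blob" page, which shows the file in the browser, while the direct-download form of the same link uses `resolve` in place of `blob`. The helper name below is made up for illustration, and where the file needs to live for `v2/chat.py` depends on your ChatRWKV checkout.

```python
def blob_to_resolve(url: str) -> str:
    """Turn a Hugging Face 'blob' page URL into its direct-download form."""
    marker = "/blob/"
    if marker not in url:
        return url  # already a direct link, or not a file page
    # Replace only the first occurrence so the rest of the path is untouched.
    return url.replace(marker, "/resolve/", 1)


page = (
    "https://huggingface.co/BlinkDL/rwkv-4-pile-14b/blob/main/"
    "RWKV-4-Pile-14B-20230228-ctx4096-test663.pth"
)
print(blob_to_resolve(page))
```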

10

LetterRip t1_jbks0mg wrote

> I've compared Pythia (GPT-3 variants) w/ context length = 2048 vs. RWKV w/ context length = 4096 of comparable compute budget, and the former scored clearly better perplexity on the tokens after the first 1024 tokens, while the former scores better for the first 1024 tokens. While RWKV performs well on the tasks with short context (e.g. the tasks used in its repo for evaluating the RWKV), it may still not be possible to replace Transformer for longer context tasks (e.g. typical conversation with ChatGPT).

Thanks for sharing your results. It is being tuned to longer context lengths; the current one is

RWKV-4-Pile-14B-20230228-ctx4096-test663.pth

https://huggingface.co/BlinkDL/rwkv-4-pile-14b/tree/main

There should soon be a 6k and 8k as well.

So hopefully you should see better results with longer contexts soon.

> and the former scored clearly better perplexity on the tokens after the first 1024 tokens, while the former scores better for the first 1024 tokens.

Could you clarify - was one of those meant to be former and the other latter?

3

LetterRip t1_jbkmk5e wrote

> He makes it sound extraordinary

The problem is that extraordinary claims raise the 'quack' suspicion when there isn't much evidence provided in support.

> The most extraordinary claim I got hung up on was "infinite" ctx_len. One of the biggest limitations of transformers today is imo their context length. Having an "infinite" ctx_len definitely feels like something DeepMind, OpenAI etc would want to investigate?

Regarding the infinite context length - that is for inference, and it is more accurately stated as not having a fixed context length. While infinite "in theory", in practice the 'effective context length' is about double the trained context length:

> It borrows ideas from Attention Free Transformers, meaning the attention is a linear in complexity. Allowing for infinite context windows.

> Blink DL mentioned that when training with GPT Mode with a context length of 1024, he noticed that RWKV_RNN deteriorated around a context length of 2000 so it can extrapolate and compress the prompt context a bit further. This is due to the fact that the model likely doesn't know how to handle samples beyond that size. This implies that the hidden state allows for the prompt context to be infinite, if we can fine tune it properly. ( Unclear right now how to do so )

https://github.com/ArEnSc/Production-RWKV
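To make "not having a fixed context length" concrete, here is a toy single-channel sketch of the kind of decayed weighted average RWKV-4 uses in place of attention, written once as a sum over the whole history and once as a recurrence with two running numbers. The decay `w`, the bonus `u`, and the inputs are made-up values, and the sketch leaves out the numerical-stability tricks and everything else in the real model.

```python
import math


def wkv_full(k, v, w, u):
    """Each output sums over the whole history explicitly (cost grows with t)."""
    out = []
    for t in range(len(k)):
        # The current token gets the bonus u instead of a decay.
        num = math.exp(u + k[t]) * v[t]
        den = math.exp(u + k[t])
        for i in range(t):
            weight = math.exp(-(t - 1 - i) * w + k[i])
            num += weight * v[i]
            den += weight
        out.append(num / den)
    return out


def wkv_recurrent(k, v, w, u):
    """Same outputs, but the history is carried in two numbers (a, b)."""
    a, b = 0.0, 0.0
    out = []
    for kt, vt in zip(k, v):
        bonus = math.exp(u + kt)
        out.append((a + bonus * vt) / (b + bonus))
        # Decay the old state, then fold in the current token.
        a = math.exp(-w) * a + math.exp(kt) * vt
        b = math.exp(-w) * b + math.exp(kt)
    return out


k = [0.1, -0.3, 0.5, 0.0, 0.2]
v = [1.0, 2.0, -1.0, 0.5, 3.0]
full = wkv_full(k, v, w=0.5, u=0.3)
rec = wkv_recurrent(k, v, w=0.5, u=0.3)
print(max(abs(x - y) for x, y in zip(full, rec)))
```

The state `(a, b)` stays the same size however long the input is, which is why inference has no hard context limit. Whether a model trained at ctx 1024 makes good use of tokens far beyond that is the separate, empirical question.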

3

Aran_Komatsuzaki t1_jbkjgzf wrote

I've compared Pythia (GPT-3 variants) w/ context length = 2048 vs. RWKV w/ context length = 4096 of comparable compute budget, and the former scored clearly better perplexity on the tokens after the first 1024 tokens, while the latter scores better on the first 1024 tokens. While RWKV performs comparably to Transformer on the tasks with short context (e.g. the tasks used in its repo for evaluating the RWKV), it may still not be possible to replace Transformer for longer context tasks (e.g. typical conversation with ChatGPT).

RWKV has fast decoding speed, but multiquery attention decoding is nearly as fast w/ comparable total memory use, so that's not necessarily what makes RWKV attractive. If you set the context length 100k or so, RWKV would be faster and memory-cheaper, but it doesn't seem that RWKV can utilize most of the context at this range, not to mention that the vanilla attention is also not feasible at this range.
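A back-of-the-envelope sketch of the memory side of this comparison, counting only the per-sequence cache or state and not the weights (which dominate at short context). The model shape is a hypothetical 14B-ish configuration, and the "5 vectors per layer" figure for the recurrent state is an assumption about RWKV-4 rather than something stated in this thread.

```python
def kv_cache_bytes(n_layer, n_head, d_head, ctx_len, n_kv_head=None, bytes_per_value=2):
    """Keys + values cached for every token in every layer."""
    if n_kv_head is None:
        n_kv_head = n_head  # vanilla multi-head attention keeps K/V per head
    return 2 * n_layer * n_kv_head * d_head * ctx_len * bytes_per_value


def rnn_state_bytes(n_layer, d_model, vectors_per_layer=5, bytes_per_value=2):
    """Fixed-size recurrent state; vectors_per_layer=5 is an assumption."""
    return n_layer * vectors_per_layer * d_model * bytes_per_value


# Hypothetical shape: 40 layers, 40 heads of 128 dims (d_model = 5120), fp16.
shape = dict(n_layer=40, n_head=40, d_head=128)
rnn = rnn_state_bytes(n_layer=40, d_model=5120)
for ctx in (2048, 100_000):
    mha = kv_cache_bytes(ctx_len=ctx, **shape)
    mqa = kv_cache_bytes(ctx_len=ctx, n_kv_head=1, **shape)  # multiquery: one shared K/V head
    print(
        f"ctx={ctx}: MHA {mha / 2**30:.2f} GiB, "
        f"MQA {mqa / 2**30:.3f} GiB, RNN state {rnn / 2**20:.2f} MiB"
    )
```

The attention caches grow linearly with context length while the recurrent state does not, so the gap only becomes large at very long contexts.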

5

crappleIcrap t1_jbkil2g wrote

Now actually tell me why any of what you said is absolutely required for consciousness. You act like it is just self evident that it needs to be a brain and do it exactly the same way a brain does things.

> you can find the accuracy is smooth with scale. Emergent abilities would have an exponential scale.

Yeah, did you really read that and think that it was talking about the same type of emergence? I was talking about philosophical/scientific emergence- when an entity is observed to have properties its parts do not have on their own. The type of "emergence" used in that article is talking about big leaps in ability, and has absolutely nothing to do with the possibility of consciousness.

The fact that neural networks can produce anything useful is a product of emergence of the kind I was talking about and the absolute banger of a book Gödel Escher Bach was talking about.

>Brain cells however, are not only multidirectional without extra backwards connections, but they can keep some residual electric charge that can change the output (both its direction and strength) based on that residual charge. This residual activation can have a number of effects on the neuron's firing behavior, including increasing the strength of subsequent firing events and influencing the direction and timing of firing.

Okay, and what does this have to do with consciousness? It is still just deterministic nonlinear behavior. It makes no mathematical difference to what types of curves it can and cannot model, because it can model any arbitrary curve; the exact architecture it uses to do it is irrelevant. Planes have no ability to flap their wings, they have no feathers or hollow bones, they have no muscles or tendons or any of the other things a bird uses to fly, therefore planes cannot fly? Functionally it has the ability to remember and, depending on the setup, the ability to change its future output based on the past output. The exact method of doing so does not need to be the same. No matter how obsessed you are with it needing to do it in exactly the same way as a brain, it doesn't need to do anything even similar to the way the brain does it.

>Even if GPT3 had a conscience, it would have no connection to GPT4 as they're separate entities in a separate space of hardware,

I find it very strange that you are adamant that the model needs to be doing statistical regression to be conscious when the brain absolutely never does this, it is just something you assume is required because it uses the word "train" and training is learning therefore it must only be "learning" when it is in training mode.

If I tell it I live on a planet where the sky is green, and later ask it what color I would see if I went outside and looked at the sky, it giving the correct answer is proof that constantly being in training mode is not required for it to "learn". It can "learn" just fine within the context of inference mode, with its own output as well as old inputs fed back in on every inference.

Training a model is less like a brain learning and more like a brain evolving to do a specific function, and during inference is where the more human-like "learning" takes place. It is like a God specifying what way a brain should develop using a mathematical tool. It doesn't use neurons and has no real good analog to real biology at all, so to say it is required is just bizarre.

GPT-3 is a continuation of GPT-2, or at least I assumed so since it is closed source, but all open GPT models have worked this way: they train it and release the model, then they fire training back up, starting where it left off. But like I said, as long as past information can affect future information, the exact method doesn't matter, and even if you only have a basic understanding of ChatGPT specifically (which is becoming quite obvious), each tab can do that. I think it is very silly to say that consciousness has to cross over between browser tabs; where would you even come up with a stupid requirement like that? Human consciousness does not cross over between human bodies. They are separate and can be created, learn, and be destroyed completely separately.

>artificial neuron in an NN has one activation function, one input and one output (even though the output can be and often is a vector or a matrix).

Which has been mathematically proven to be able to model any other system you could possibly think of, as long as each neuron has nonlinear behavior, then they can model any arbitrary system you come up with.
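The result being alluded to here is presumably the universal approximation theorem, which is usually stated for continuous functions on a bounded domain rather than for literally any system. As a small illustration of why the nonlinearity matters, here is a sketch with hand-picked (not learned) weights in which two ReLU units compute XOR, something no purely linear model can represent.

```python
def relu(x):
    return max(0.0, x)


def tiny_net(x1, x2):
    """Two hidden ReLU units with hand-picked weights, one linear output."""
    h1 = relu(x1 + x2)        # counts how many inputs are on
    h2 = relu(x1 + x2 - 1.0)  # fires only when both inputs are on
    return h1 - 2.0 * h2


for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, tiny_net(x1, x2))
```

Remove the `relu` calls and the network collapses to a single linear function of `x1 + x2`, which cannot be 1 for exactly one input on and 0 for both on.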

You can't just keep listing things that ai doesn't do and pretend it is self evident every conscious system would need to do that thing. You need to actually give a reason why a conscious system would need to have that function.

1

LetterRip t1_jbkdshr wrote

Here is what the author stated in the thread,

> Tape-RNNs are really good (both in raw performance and in compression i.e. very low amount of parameters) but they just can't absorb the whole internet in a reasonable amount of training time... We need to find a solution to this!

I think they knew it existed (i.e. they knew there was a deep learning project named RWKV), but they appear not to have known it met their scaling needs.

2

ThePerson654321 OP t1_jbk8kxy wrote

I'm basically just referring to the claims by the developer. He makes it sound extraordinary:

> best of RNN and transformer, great performance, fast inference, saves VRAM, fast training, "infinite" ctx_len, and free sentence embedding.

> Inference is very fast (only matrix-vector multiplications, no matrix-matrix multiplications) even on CPUs, so you can even run a LLM on your phone.

The most extraordinary claim I got hung up on was "infinite" ctx_len. One of the biggest limitations of transformers today is imo their context length. Having an "infinite" ctx_len definitely feels like something DeepMind, OpenAI etc would want to investigate?


I definitely agree that there might be an incompatibility with the already existing transformer-specific infrastructure.

But thanks for your answer. It might be one or more of the following:

  1. The larger organizations haven't noticed/cared about it yet
  2. I overestimate how good it is (from the developer's description)
  3. It has some unknown flaw that's not obvious to me and not stated in the repository's ReadMe.
  4. All the existing infrastructure is tailored for transformers and is not compatible with RWKV

At least we'll see in time.

0

farmingvillein t1_jbk819k wrote

I think it is more likely people have seen it, but dismissed it as a bit quixotic, because the RWKV project has made little effort to iterate in an "academic" fashion (i.e., with rigorous, clear testing, benchmarks, goals, comparisons, etc.). It has obviously done pieces of this, but hasn't been sufficiently well-defined as to make it easy for others to iterate on top of it, from a research POV.

This means that anyone else picking up the architecture is going to have to go through the effort to create the whole necessary research baseline. Presumably this will happen, at some point (heck, maybe someone is doing it right now), but it creates a large impediment to further iteration/innovation.

11

farmingvillein t1_jbk6nut wrote

> Based on my comprehension of this model, it appears to offer a distinct set of advantages relative to transformers

What advantages are you referring to, very specifically?

There are theoretical advantages--but it can be a lot of work to prove out that those matter.

There are (potentially) empirical, observed advantages--but there don't seem to be (yet) any claims that are so strong as to suggest a paradigm shift (like Transformers were).

Keep in mind that there is a lot of infrastructure built up to support transformers in an industrial context, which means that even if RWKV shows some small advantage, the advantage may not be there in practice, because of all the extreme optimizations that have been built to support larger organizations (in speed of inference, training, etc.).

The most likely adoption path here would be if multiple papers showed, at smaller scale, consistent advantages for RWKV. No one has done this yet--and the performance metrics provided on the github (https://github.com/BlinkDL/RWKV-LM) certainly don't make such an unequivocal claim on performance.

And providing a rigorous side-by-side comparison with transformers is actually really, really hard--apples to apples comparisons are notoriously tricky, and you of course have to be really cautious about thinking about what "tips and tricks" you allow both architectures to leverage.

Lastly, and this is a fuzzier but IMO relevant point--

The biggest guys are crossing into a point where evaluation is suddenly hard again.

By that, what I mean is that there is broad consensus that our current public evaluation metrics don't do a great job of helping us understand how well these models perform on "more interesting" generative tasks. I think you'll probably see some major improvements around eval/benchmark management in the next year or so (and certainly, internally, the big guys have invested a lot here)--but for now, it is harder to pick up a new architecture/model and understand how it does on the "more interesting" capabilities that your GPT-4s & Bards of the world are trying to demonstrate. This makes it harder to prove and vet progress on smaller models, which of course makes scaling up more risky.

6

farmingvillein t1_jbk47jg wrote

> I emailed them that RWKV exactly met their desire for a way to train RNNs 'on the whole internet' in a reasonable time.
>
> So prior to a month ago they didn't know it existed or happened to meet their use case.

How does #2 follow from #1?

RWKV has been on reddit for quite a while, and many researchers frequent/lurk on reddit, including DeepMind researchers, so the claim that they had no idea that RWKV exists seems specious.

Unless you mean that you emailed them and they literally told you that they didn't know about this. In which case...good on you!

1

farmingvillein t1_jbk2uyw wrote

> Nobody ever does this though because of diminishing returns.

Extending the LLaMa concept, I would love to see someone like Meta run the experiment where they do take their 1.4T (or w/e) tokens, and run training to convergence...on the largest model that will converge (subject to reasonable LR decay policies) in a "reasonable" time frame.

Meaning, if they trained, say, a 1M param LLM...presumably it would hit convergence (get saturated) pretty quickly. And what about 10M, 100M, etc.?

I.e., how much more can we squeeze out of a relatively-tiny model? Probably it doesn't end up super interesting from a purely generative POV, but it might look like--e.g.--Roberta+.

With a model that is so small, the cost to run this test probably(?) wouldn't be that high.

2

farmingvillein t1_jbk1pv7 wrote

> What is the best way to build a custom text classifier leveraging your own data?

"Best" is subjective, but if you are truly new, check out Hugging Face--it will probably be "easiest" (and still high quality), which is what you need as a beginner.

> Also what is the best starting LLM for this purpose- smaller model like Roberta or larger ones like GPT?

Really depends on how much training hardware you have, and how important it is to be "the best".

Roberta is probably going to be the best starting point, from an effort:return perspective.

The above all said--

The other thing I'd encourage you to do is to start by just exploring text classification without doing any custom training. Simply take a couple of LLMs off the shelf (gpt-3.5-turbo and FLAN-T5-XXL being obvious ones), experiment with how to prompt them well, and evaluate results from there.

This will probably be even faster than training something custom, and will give you a good baseline--even if the cost is higher than you want to pay in production, it will help you understand what behavior can look like, and the inference dollars you pay will likely be a fraction of any production training/inference costs.

If, e.g., you get 60% F1 with a "raw" LLM, then you can/should expect Roberta (assuming you have decent training data) to probably land somewhere around that (this is an extremely back-of-the-envelope estimate; reality can be quite different, of course). If you then go and train a Roberta model and get, say, 30%, then you probably did something wrong--or the classification process requires a ton of nuance that is actually really hard, and you really should consider baselining on LLMs.
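As a rough sketch of that sanity check: score the prompted-LLM predictions and the custom model's predictions on the same labeled set and flag a large gap. The labels and predictions below are made-up toy data, and the 0.75 cutoff is an arbitrary choice for illustration, not a rule.

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1 for the given positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data: the same eight examples scored by both approaches.
labels       = [1, 0, 1, 1, 0, 0, 1, 0]
llm_preds    = [1, 0, 0, 1, 1, 0, 1, 0]  # prompted off-the-shelf LLM
custom_preds = [0, 0, 0, 1, 1, 1, 0, 0]  # hypothetical fine-tuned model

baseline = f1_score(labels, llm_preds)
custom = f1_score(labels, custom_preds)
print(f"baseline F1={baseline:.2f}, custom F1={custom:.2f}")
if custom < 0.75 * baseline:  # arbitrary threshold
    print("custom model is well below the prompted baseline; check data and training setup")
```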

Good luck!

The biggest takeaway you should have, as a beginner:

  • Figure out what lets you get every step of results fastest, and prioritize that. Experimentation is still very much key in this field.

3

ThePerson654321 OP t1_jbjz508 wrote

> I emailed them that RWKV exactly met their desire for a way to train RNNs 'on the whole internet' in a reasonable time. So prior to a month ago they didn't know it existed or happened to meet their use case.

That surprises me, considering his RWKV repos have thousands of stars on GitHub.

I'm curious about what they responded with. What did they say?

> There was no evidence it was going to be interesting. There are lots of ideas that work on small models that don't work on larger models.

According to his claim (especially infinite ctx len) it definitely was interesting. That it was scaling was pretty obvious even at 7B.


But your argument is basically that large organizations simply haven't noticed it yet.

My guess is that it actually has some unknown problem/limitation that makes it inferior to the transformer architecture.

We'll just have to wait. Hopefully you are right but I doubt it.

1

harharveryfunny t1_jbjxolz wrote

The LLM name for things like GPT-3 seems to have stuck, which IMO is a bit unfortunate since it's rather misleading. They certainly wouldn't need the amount of data they do if the goal was merely a language model, nor would we need to have progressed past smaller models like GPT-1. The "predict next word" training/feedback may not have changed, but the capabilities people are hoping to induce in these larger/ginormous models are now way beyond language and into the realms of world model, semantics and thought.

2

EmbarrassedHelp t1_jbjqy4o wrote

Human brains have structural components / shapes that likely help them learn languages more easily:

https://en.wikipedia.org/wiki/Wernicke%27s_area https://en.wikipedia.org/wiki/Broca%27s_area

Human brains also start off with way more parameters than needed, and language is most effectively learned before the synaptic pruning reduces the number of parameters.

6

LetterRip t1_jbjphkw wrote

> I don't buy the argument that it's too new or hard to understand. Some researcher at, for example, Deepmind would have been able to understand it.

This was posted by DeepMind a month ago,

https://www.reddit.com/r/MachineLearning/comments/10ja0gg/r_deepmind_neural_networks_and_the_chomsky/

I emailed them that RWKV exactly met their desire for a way to train RNNs 'on the whole internet' in a reasonable time.

So prior to a month ago they didn't know it existed (edit - or at least not much more than it existed) or happened to meet their use case.

> RWKV 7B came out 7 months ago but the concept has been promoted by the developer much longer.

There was no evidence it was going to be interesting. There are lots of ideas that work on small models that don't work on larger models.

> 2) This might actually be a problem. But the code is public so it shouldn't be that difficult to understand it.

Until it has proved itself there was no motivation to take the effort to figure it out. The lower the effort threshold the more likely people will have a look, the larger the threshold the more likely people will invest their limited time in the 100's of other interesting bits of research that come out each week.

> If your idea is truly good you will get attention sooner or later anyways.

Or be ignored for all time till someone else discovers the idea and gets credit for it.

In this case the idea has started to catch on and be discussed by 'the Big Boys', people are cautiously optimistic and people are investing time to start learning about it.

> I don't buy the argument that it's too new or hard to understand.

It isn't "too hard to understand" - it simply hadn't shown itself to be interesting enough to be worth more than minimal effort to understand, and without a paper, understanding it took more than that minimal effort. Now the 14B model has shown that it seems to scale, so people are beginning to invest the effort.

> It does not work as well as the developer claim or have some other flaw that makes it hard to scale for example (time judge of this)

No, it simply hadn't been shown to scale. Now we know it scales to at least 14B, and there is no reason to think it won't scale the same as any other GPT model.

The DeepMind paper that was lamenting the need for a fast way to train RNN models was about a month ago, which

4