Recent comments in /f/MachineLearning

sid_276 t1_j8q4yfx wrote

It depends on where in the ML spectrum you are. In RL it’s common to have agents spend some fraction of their time exploring and the rest exploiting the environment. In neural nets there is the whole “online learning” field that addresses just that. It is generally possible but not always practical. There are other ways to update information. You mention ChatGPT. One way is giving such models access to browsing to provide updated results. Technically one could retrain it on the conversations, and I believe they will do that. But practically it makes more sense to do it in batches, e.g. once a week or whenever a lot of new data has accumulated. But yeah, if you Google (or Bing lol) “online learning” you will find a lot of papers.
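The explore/exploit split mentioned above is commonly implemented as epsilon-greedy. A minimal, hypothetical bandit sketch (arm means, epsilon, and step counts are made up for illustration):

```python
import random

def epsilon_greedy_bandit(arm_means, steps=10_000, epsilon=0.1, seed=0):
    """Epsilon-greedy: with probability epsilon pull a random arm
    (explore), otherwise pull the arm with the best running
    reward estimate (exploit)."""
    rng = random.Random(seed)
    n = len(arm_means)
    counts = [0] * n      # pulls per arm
    values = [0.0] * n    # running mean reward per arm
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n)                        # explore
        else:
            arm = max(range(n), key=values.__getitem__)   # exploit
        reward = rng.gauss(arm_means[arm], 1.0)           # noisy payoff
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean

    return counts, values

# With enough steps, the best arm (mean 0.9) dominates the pull counts.
counts, values = epsilon_greedy_bandit([0.1, 0.5, 0.9])
```

Online learning in the neural-net sense is the same idea at a different level: keep applying incremental updates as new data streams in, rather than retraining from scratch.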

2

gwern t1_j8psc8m wrote

> Not clear to me what you are looking for here.

The question asked was pretty clear: to justify the statement:

>> in practice, their effective "context window" often doesn't look much different than a reasonable transformer, when we look at performance metrics against long sequences.

Simply comparing RNNs with memory and RNNs without it doesn't tell you anything about how fast the memory fades out, or show that it never winds up being bigger than a Transformer's. For example, you could construct a toy problem which requires memory reaching back exactly 1 state, and show that an arch with any memory outperforms a memory-less arch; this would obviously tell you nothing of interest like 'this memory makes little use of history further back than 50 steps and none past 200 (and so is easily outperformed by history-stacking like a Transformer)'. Nor does comparing a Transformer with a history of, say, l=500 against an RNN, and the Transformer winning, tell you anything about why the RNN lost - ok, the Transformer did better, great, we have a superior new tool, but why? Maybe it has similar memory problems and is just way better at the modeling part, or memorizes better, or something entirely different.

Likewise, unless you are comparing RNN baselines which somehow have known hard history constraints, they cannot tell you anything useful about how fast the effective memory fades out, how the accuracy of the memory is 'distributed' over the effective context window, if there are hard cutoffs, if the RNN is basically only using the last few states and so on.

In contrast, a Transformer has direct shortcut access to the history (we don't need any paper to know this, literally any GPT output exhibiting coherent long-range references past a few paragraphs demonstrates this directly), and so if you show that an RNN uses primarily the past 50 steps and simply 'fades out' completely past 200 steps and so the 'infinite history' is meaningless in practice, well, we know perfectly well that Transformers make excellent use of context windows larger than 50 or 200 tokens (as my two references show), so a direct comparison is otiose. Directly examining an RNN's understanding of its history, as those papers do, is much better than some higher-level performance comparison, which is what most of those referenced papers do; direct performance comparisons are great, but do not ablate where the problem is on the RNN's end. (Although if I really needed one, I would prefer to point at the RNN vs Transformer scaling laws in context window anyway, like Kaplan et al 2020 IIRC, to show that the Transformers are making good use of it, not merely some sort of better-than-RNN use or gains elsewhere.)
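The toy-problem point can be made concrete with a hypothetical lag-k recall task (all names and parameters here are illustrative): a memory-less baseline is perfect when the answer is the last token, and falls to chance once the answer lies even one step beyond what it can see - which is exactly why beating such a baseline says nothing about the effective memory horizon.

```python
import random

def lag_recall_dataset(n, seq_len, lag, vocab=10, seed=0):
    """Toy task: the label is the token that appeared exactly `lag`
    steps before the final position. Any memory at all solves small
    lags; large lags probe how far back the memory actually reaches."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        seq = [rng.randrange(vocab) for _ in range(seq_len)]
        data.append((seq, seq[-1 - lag]))
    return data

def last_token_model(seq):
    """A memory-less 'model' that only sees the final input token."""
    return seq[-1]

def accuracy(model, data):
    return sum(model(seq) == label for seq, label in data) / len(data)

# lag=0 needs no memory at all; lag=20 defeats the memory-less model,
# which drops to roughly chance (~1/vocab).
easy_acc = accuracy(last_token_model, lag_recall_dataset(1000, 50, lag=0))
hard_acc = accuracy(last_token_model, lag_recall_dataset(1000, 50, lag=20))
```

Sweeping `lag` against a trained model's accuracy is the kind of direct probe of the memory's 'distribution' over the context window that a single head-to-head benchmark score cannot give you.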

4

MysteryInc152 t1_j8ppoiq wrote

I'd rather the basic senses at least (vision as well as audio) be pretrained in as well. We know from multimodal chain-of-thought, as well as the scaling laws for generative mixed-modal language models, that multimodal models far outperform single-modal models at the same data and scale. You won't get that kind of performance gain by offloading those basic senses to outside tools.

https://arxiv.org/abs/2302.00923

https://arxiv.org/abs/2301.03728

2

Alarming_Turnover578 t1_j8poufw wrote

According to the Cambridge Declaration on Consciousness, that would be correct. The unique property of the Homo sapiens mind is sapience, not consciousness or sentience.

2

farmingvillein t1_j8piz80 wrote

Not clear to me what you are looking for here.

> It simply provides doodads people claim help memory without papers showing that the memory doesn't work.

The very first reference I pulled, Graves 2014, specifically compares w/ and w/o memory.

Or Dai et al., which tries to compare against various RNN-style baselines with similar parameter counts.

Perhaps we're talking past each other?

2

No_Stretch_9237 t1_j8pd8s0 wrote

Is it possible to run DeepSpeed+ZeRO with a Tesla P40 (24 GB) to make use of my 256 GB of main system memory during training? If so, are there any examples of this particular setup, or of the required CUDA/driver versions?
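For what it's worth, the setup being asked about is usually expressed as a ZeRO config with optimizer-state offload to CPU RAM. A minimal sketch of such a config (shown as a Python dict rather than the usual `ds_config.json`; all values are illustrative guesses, not a tested recipe for a P40):

```python
# Hypothetical DeepSpeed config using ZeRO stage 2 with optimizer-state
# offload to CPU memory. Batch size, fp16 choice, and tuning flags are
# assumptions for illustration only.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "fp16": {"enabled": False},  # Pascal-era P40 has weak fp16 throughput
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "contiguous_gradients": True,
        "overlap_comm": True,
    },
}
```

Stage 3 additionally allows parameter offload (`offload_param`), which is what lets system RAM hold state well beyond the 24 GB of the card, at the cost of PCIe transfer overhead.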

1

farmingvillein t1_j8p7qa8 wrote

Any of the papers that address building NLP for long contexts will tend to have a relevant related works section. E.g., https://arxiv.org/pdf/2109.00301.pdf.

(The one qualifier here is that, at "modern" scale, RNNs have not really been well-tested (since people tend to just use...transformers). So, maaaybe they are actually simply superior. Evidence so far says "doubtful", however (at least for more vanilla implementations).)

12