Recent comments in /f/MachineLearning

farmingvillein t1_j8p269l wrote

> I hope more catch on because the lack of a limited context length is a game changer.

I'd be cautious about concluding this, without more testing.

RNNs, in some theoretical sense, support infinite context more easily than N^2 transformers; in practice, their effective "context window" often doesn't look much different from a reasonable transformer's when you look at performance metrics on long sequences.
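To illustrate the point (a toy scalar recurrence, not RWKV's actual architecture; all names and weights here are made up): the whole history gets squeezed into a fixed-size state, which is what makes the context "infinite" in theory, but the influence of old inputs can decay to nothing, which is why the effective window can be short in practice.

```python
import math

def rnn_step(state, x, w_state=0.5, w_in=1.0):
    # One recurrence step: the new state depends only on the old state
    # and the current input, so memory is O(1) in sequence length.
    return math.tanh(w_state * state + w_in * x)

def run(seq):
    state = 0.0
    for x in seq:
        state = rnn_step(state, x)
    return state  # a single number summarizes the entire sequence

# A token 60 steps back has essentially no influence left:
# run([1.0] + [0.0]*60) is numerically indistinguishable from run([0.0]*61).
```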

30

zdss t1_j8osnth wrote

I've just skimmed the paper, but this is a confusing result. I can see a simpler optimizer paying off at a similar amount of compute, since you can run more iterations, but they claim it's also better on a per-iteration basis across the entire learning task. There's not a lot going on in this algorithm, so where is the magic coming from?

It's kind of hard to believe that, while people were experimenting with all these more complex optimizers, no one tried something this simple and noticed it got state-of-the-art results.
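For a sense of scale: a sign-of-momentum update, one family of "simple" optimizers, fits in a few lines. This is a hypothetical sketch with illustrative hyperparameters, not necessarily the paper's exact algorithm:

```python
def sign(x):
    # Returns -1, 0, or 1.
    return (x > 0) - (x < 0)

def simple_step(params, grads, momentum, lr=1e-4, beta1=0.9, beta2=0.99, wd=0.0):
    # Sign-momentum update: the step direction is just the sign of an
    # interpolation between the momentum buffer and the current gradient,
    # so every parameter moves by exactly +/- lr (plus optional decay).
    for i in range(len(params)):
        update = sign(beta1 * momentum[i] + (1 - beta1) * grads[i])
        params[i] -= lr * (update + wd * params[i])
        momentum[i] = beta2 * momentum[i] + (1 - beta2) * grads[i]
    return params, momentum
```

Compared to Adam there is no second-moment estimate and no bias correction, which is roughly the kind of "not a lot going on" being described.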

10

currentscurrents t1_j8op44d wrote

Does it though? There was a reproducibility survey recently that found that many optimizers claiming better performance did not in fact work for anything other than the tasks tested in their papers.

Essentially they were doing hyperparameter tuning - just the hyperparameter was the optimizer design itself.

64

mz_gt t1_j8ofbiq wrote

This is really awesome! I’ve been seeing the progress of your work on RWKV and I have to ask: I know you’ve mentioned a lot of RWKV is using tricks from here and there, and adding a lot of your own tweaks of course, but have you considered writing a paper? There are plenty of highly renowned published works with less to say than RWKV.

I think a renewed discussion about RNNs is more than warranted right now given the current direction with transformers, and personally I don't see something as highly complicated as HiPPO replacing them anytime soon.

60

terath t1_j8oemyz wrote

Another key phrase to use with Google Scholar is "online learning": you have a stream of new examples and you update a model one example at a time. Usually you can use the model for inference at any point in this process, and some algorithms in this area are designed to be a bit more aggressive, or at least to control the update rates so the model adapts more quickly or more slowly to new data.
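A minimal sketch of the one-example-at-a-time updating described above (a toy 2D perceptron; everything here is illustrative, and real online learners are more sophisticated about update rates):

```python
def online_perceptron(stream, lr=1.0):
    # Weights and bias are updated one (x, y) example at a time; the
    # model is usable for inference at any point in the stream.
    w = [0.0, 0.0]
    b = 0.0
    for x, y in stream:  # y is -1 or +1
        pred = 1 if (w[0] * x[0] + w[1] * x[1] + b) > 0 else -1
        if pred != y:  # mistake-driven update
            w[0] += lr * y * x[0]
            w[1] += lr * y * x[1]
            b += lr * y
    return w, b
```

The mistake-driven `if` is where the "aggressiveness" knob lives: more cautious online algorithms scale each update instead of applying it at a fixed rate.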

21