Recent comments in /f/MachineLearning
csreid t1_j8p5z30 wrote
Reply to comment by farmingvillein in [R] RWKV-4 14B release (and ChatRWKV) - a surprisingly strong RNN Language Model by bo_peng
But they theoretically support infinite context length. Getting it in practice is an engineering problem to be solved, not a fundamental incompatibility as it is with transformers.
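To make that concrete, here's a toy sketch (a plain GRU cell, nothing RWKV-specific) of why recurrence has no hard context limit: the entire history is compressed into a fixed-size state.

```python
import torch

# Toy illustration: an RNN carries a fixed-size hidden state, so it can
# keep consuming tokens indefinitely at constant memory per step.
rnn = torch.nn.GRUCell(input_size=64, hidden_size=64)
h = torch.zeros(1, 64)  # the whole "context" lives in this one tensor

for step in range(10_000):        # an arbitrarily long stream
    x = torch.randn(1, 64)        # stand-in for the next token embedding
    h = rnn(x, h)                 # state size never grows with history

# A vanilla transformer, by contrast, attends over all N cached tokens,
# so per-step cost and memory grow with the window instead of staying flat.
```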
Binliner42 t1_j8p5iqj wrote
Reinforcement learning?
avocadoughnut t1_j8p3psq wrote
Reply to comment by redv in [R] RWKV-4 14B release (and ChatRWKV) - a surprisingly strong RNN Language Model by bo_peng
He has trained several smaller RWKV models. You can find them on Hugging Face.
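Something like this should work with the `rwkv` pip package (a rough sketch from memory; the checkpoint and tokenizer paths are placeholders for files you'd download, and the `cpu fp32` strategy keeps everything on the CPU):

```python
# pip install rwkv
# Rough sketch assuming the `rwkv` pip package API; the paths below are
# placeholders for files downloaded from Hugging Face (one of the smaller
# RWKV-4 Pile checkpoints should fit comfortably in laptop RAM).
from rwkv.model import RWKV
from rwkv.utils import PIPELINE

model = RWKV(model='path/to/RWKV-4-Pile-430M-checkpoint',
             strategy='cpu fp32')                  # CPU-only, full precision
pipeline = PIPELINE(model, 'path/to/20B_tokenizer.json')

print(pipeline.generate('The capital of France is', token_count=20))
```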
maizeq t1_j8p3f1s wrote
Reply to comment by farmingvillein in [R] RWKV-4 14B release (and ChatRWKV) - a surprisingly strong RNN Language Model by bo_peng
Any papers I can refer to for that last paragraph? I expect it's true but would love to see some empirical work.
MysteryInc152 t1_j8p2jrd wrote
Reply to comment by farmingvillein in [R] RWKV-4 14B release (and ChatRWKV) - a surprisingly strong RNN Language Model by bo_peng
That's fair. We won't know for sure till it's tested.
farmingvillein t1_j8p269l wrote
Reply to comment by MysteryInc152 in [R] RWKV-4 14B release (and ChatRWKV) - a surprisingly strong RNN Language Model by bo_peng
> I hope more catch on, because not having a hard limit on context length is a game changer.
I'd be cautious about concluding this, without more testing.
RNNs, in some theoretical sense, support infinite context more easily than N^2-attention transformers; in practice, their effective "context window" often doesn't look much different from a reasonable transformer's when you measure performance on long sequences.
TheGamingPhoenix_000 t1_j8p1o9l wrote
Reply to [D] Simple Questions Thread by AutoModerator
Dumb question: where's a good resource for understanding the actual math going on? Most resources I find with a simple Google search only cover API usage, not what all the parameters and such actually mean.
MadScientist-1214 t1_j8ox26g wrote
Reply to [D] Lion , An Optimizer That Outperforms Adam - Symbolic Discovery of Optimization Algorithms by ExponentialCookie
Better than AdamW if (a) the model is a transformer and (b) not a lot of augmentations are used. Otherwise, the improvements are not that large. I doubt this optimizer works well with regular CNNs like EfficientNet or ConvNeXt.
Jean-Porte t1_j8oswiy wrote
Reply to [D] Lion , An Optimizer That Outperforms Adam - Symbolic Discovery of Optimization Algorithms by ExponentialCookie
I'm waiting for DeBERTa GLUE/SuperGLUE results; it's weird that they picked T5 for that.
zdss t1_j8osnth wrote
Reply to [D] Lion , An Optimizer That Outperforms Adam - Symbolic Discovery of Optimization Algorithms by ExponentialCookie
I've just skimmed the paper, but this is a confusing result. I can see a simpler optimizer paying off at a similar compute budget by running more iterations, but they claim it's also better on a per-iteration basis across the entire learning task. There's not a lot going on in this algorithm, so where is the magic coming from?
It's kind of hard to believe that, while people were experimenting with all these more complex optimizers, no one tried something this simple and saw that it had state-of-the-art results.
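For reference, here's the entire update as I read the paper's pseudocode (a simplified sketch; the function name and defaults are mine):

```python
import numpy as np

def lion_step(theta, grad, m, lr=1e-4, beta1=0.9, beta2=0.99, wd=0.01):
    """One Lion update, simplified from the paper's pseudocode.

    The only state is the momentum m; the update direction is just the
    sign of an interpolation between m and the current gradient.
    """
    update = np.sign(beta1 * m + (1 - beta1) * grad)  # sign -> uniform step magnitude
    theta = theta - lr * (update + wd * theta)        # decoupled weight decay
    m = beta2 * m + (1 - beta2) * grad                # momentum tracks the gradient
    return theta, m
```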
DigThatData t1_j8or4jr wrote
I feel like voila is pretty hard to beat, especially considering it already ships with Jupyter. Just change the word "tree" in your URL to "voila" and bam: your notebook's a web app.
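E.g., assuming a notebook server on the default port with the voila server extension installed:

```
http://localhost:8888/tree   ->  the usual Jupyter file browser
http://localhost:8888/voila  ->  the same notebooks rendered as web apps
```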
redv t1_j8opi69 wrote
Reply to [R] RWKV-4 14B release (and ChatRWKV) - a surprisingly strong RNN Language Model by bo_peng
Is it possible to run this on a laptop using the CPU and less than 16GB of RAM? If yes, how does one do this? Thanks.
currentscurrents t1_j8op44d wrote
Reply to [D] Lion , An Optimizer That Outperforms Adam - Symbolic Discovery of Optimization Algorithms by ExponentialCookie
Does it though? There was a reproducibility survey recently that found that many optimizers claiming better performance did not in fact work for anything other than the tasks tested in their papers.
Essentially, they were doing hyperparameter tuning; the hyperparameter was just the optimizer design itself.
SnuggleWuggleSleep t1_j8omznf wrote
Reply to [D] Simple Questions Thread by AutoModerator
How do LSTMs for sports prediction work? My understanding is that LSTMs predict the next step in a sequence, but a sports match is two sequences coming up against each other.
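Is the idea to encode each team's recent-match history separately and combine the final states? Something like this (just my guess at a setup, not from any particular paper):

```python
import torch
import torch.nn as nn

class MatchPredictor(nn.Module):
    """Guess at a setup: one shared LSTM encodes each team's recent-match
    features; the two final hidden states feed a win-probability head."""
    def __init__(self, feat_dim=16, hidden=32):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)  # shared weights
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, home_seq, away_seq):
        # each seq: (batch, num_past_matches, feat_dim)
        _, (h_home, _) = self.encoder(home_seq)
        _, (h_away, _) = self.encoder(away_seq)
        both = torch.cat([h_home[-1], h_away[-1]], dim=-1)
        return torch.sigmoid(self.head(both))  # P(home team wins)
```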
MunichNLP32 t1_j8omtkk wrote
IMHO, in-context learning is doing that. For more literature, read: https://arxiv.org/abs/2211.15661
cantfindaname2take t1_j8omdij wrote
Reply to comment by terath in [D] Is anyone working on ML models that infer and train at the same time? by Cogwheel
This! Online learning is a very common term that is used in time series modeling, for example in anomaly or change point detection.
deitscherdeifl t1_j8om1z7 wrote
I'm not deep into it, but maybe hierarchical temporal memory (HTM) by Numenta is interesting for you.
MysteryInc152 t1_j8oj9qx wrote
Reply to [R] RWKV-4 14B release (and ChatRWKV) - a surprisingly strong RNN Language Model by bo_peng
Fantastic work. Thanks for doing this. Good luck scaling to 24B. I hope more catch on, because not having a hard limit on context length is a game changer.
mz_gt t1_j8ofbiq wrote
Reply to [R] RWKV-4 14B release (and ChatRWKV) - a surprisingly strong RNN Language Model by bo_peng
This is really awesome! I've been seeing the progress of your work on RWKV and I have to ask: I know you've mentioned that a lot of RWKV uses tricks from here and there, plus plenty of your own tweaks of course, but have you considered writing a paper? There are plenty of highly renowned published works with less to say than RWKV.
I think a renewed discussion about RNNs is more than warranted right now, given the current direction with transformers, and the highly complicated HiPPO-based approaches are not something I personally see replacing them anytime soon.
terath t1_j8oemyz wrote
Another key phrase to use with Google Scholar is "online learning": you have a stream of new examples and update the model one example at a time. Usually you can use the model for inference at any point in this process, and some algorithms in the area are designed to be more aggressive, or at least to control the update rate, so they adapt more quickly or more slowly to new data.
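In the simplest case that's just a per-example update loop, e.g. (a toy logistic-regression sketch, not any particular library):

```python
import numpy as np

# Toy online learner: logistic regression updated one example at a time.
# The model is usable for inference at any point in the stream.
rng = np.random.default_rng(0)
w = np.zeros(10)

def predict(x):
    return 1 / (1 + np.exp(-w @ x))        # P(y = 1)

for _ in range(1000):                      # the incoming stream
    x = rng.normal(size=10)
    y = float(x[0] + x[1] > 0)             # toy label
    w += 0.1 * (y - predict(x)) * x        # one per-example gradient step
```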
HighLevelJerk t1_j8oacj8 wrote
Reply to comment by Reddit1990 in [P] Introducing arxivGPT: chrome extension that summarizes arxived research papers using chatGPT by _sshin_
If ChatGPT really was that smart, it would just copy that
farmingvillein t1_j8p7lci wrote
Reply to comment by csreid in [R] RWKV-4 14B release (and ChatRWKV) - a surprisingly strong RNN Language Model by bo_peng
Neither really works for super long contexts, so it is kind of a moot point.
Both, empirically, end up needing bolt-on approaches to enhance memory over very long contexts, so it isn't really clear (a priori) that the RNN has a true advantage here.