Recent comments in /f/MachineLearning

sid_276 t1_j8q4yfx wrote

It depends on where in the ML spectrum you are. In RL it’s common to have agents spend some fraction of their time exploring and the rest exploiting the environment. In neural nets there is the whole “online learning” field that addresses just that. It is generally possible but not always practical. There are other ways to update information. You mention ChatGPT. One way is giving such models access to browsing to provide updated results. Technically one could retrain it on the conversations, and I believe they will do that. But practically it makes more sense to do it in batches, e.g. once a week or whenever a lot of new data has accumulated. But yeah, if you Google (or Bing lol) “online learning” you will find a lot of papers.
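The explore/exploit split mentioned above is commonly implemented as epsilon-greedy. A minimal, hypothetical bandit sketch (arm means, epsilon, and step counts are made up for illustration):

```python
import random

def epsilon_greedy_bandit(arm_means, steps=10_000, epsilon=0.1, seed=0):
    """Epsilon-greedy: with probability epsilon pull a random arm
    (explore), otherwise pull the arm with the best running
    reward estimate (exploit)."""
    rng = random.Random(seed)
    n = len(arm_means)
    counts = [0] * n      # pulls per arm
    values = [0.0] * n    # running mean reward per arm
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n)                        # explore
        else:
            arm = max(range(n), key=values.__getitem__)   # exploit
        reward = rng.gauss(arm_means[arm], 1.0)           # noisy payoff
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean

    return counts, values

# With enough steps, the best arm (mean 0.9) dominates the pull counts.
counts, values = epsilon_greedy_bandit([0.1, 0.5, 0.9])
```

Online learning in the neural-net sense is the same idea at a different level: keep applying incremental updates as new data streams in, rather than retraining from scratch.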

2

gwern t1_j8psc8m wrote

> Not clear to me what you are looking for here.

The question asked was pretty clear: to justify the statement:

>> in practice, their effective "context window" often doesn't look much different than a reasonable transformer, when we look at performance metrics against long sequences.

Simply comparing RNNs with memory and RNNs without it doesn't tell you anything about how fast the memory fades out, or show that it never winds up being bigger than a Transformer's. For example, you could construct a toy problem which requires memory reaching back exactly 1 state, and show that an arch with any memory outperforms a memory-less arch; this would obviously tell you nothing of interest like 'this memory makes little use of history further back than 50 steps and none past 200 (and so is easily outperformed by history-stacking like a Transformer)'. Nor does comparing a Transformer with a history of, say, l=500 against an RNN, and the Transformer winning, tell you anything about why the RNN lost - ok, the Transformer did better, great, we have a superior new tool, but why? Maybe it has similar memory problems and is just way better at the modeling part, or memorizes better, or something entirely different.

Likewise, unless you are comparing RNN baselines which somehow have known hard history constraints, they cannot tell you anything useful about how fast the effective memory fades out, how the accuracy of the memory is 'distributed' over the effective context window, if there are hard cutoffs, if the RNN is basically only using the last few states and so on.

In contrast, a Transformer has direct shortcut access to the history (we don't need any paper to know this, literally any GPT output exhibiting coherent long-range references past a few paragraphs demonstrates this directly), and so if you show that an RNN uses primarily the past 50 steps and simply 'fades out' completely past 200 steps and so the 'infinite history' is meaningless in practice, well, we know perfectly well that Transformers make excellent use of context windows larger than 50 or 200 tokens (as my two references show), so a direct comparison is otiose. Directly examining an RNN's understanding of its history, as those papers do, is much better than some higher-level performance comparison, which is what most of those referenced papers do; direct performance comparisons are great, but do not ablate where the problem is on the RNN's end. (Although if I really needed one, I would prefer to point at the RNN vs Transformer scaling laws in context window anyway, like Kaplan et al 2020 IIRC, to show that the Transformers are making good use of it, not merely some sort of better-than-RNN use or gains elsewhere.)
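The toy-problem point can be made concrete with a hypothetical lag-k recall task (all names and parameters here are illustrative): a memory-less baseline is perfect when the answer is the last token, and falls to chance once the answer lies even one step beyond what it can see - which is exactly why beating such a baseline says nothing about the effective memory horizon.

```python
import random

def lag_recall_dataset(n, seq_len, lag, vocab=10, seed=0):
    """Toy task: the label is the token that appeared exactly `lag`
    steps before the final position. Any memory at all solves small
    lags; large lags probe how far back the memory actually reaches."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        seq = [rng.randrange(vocab) for _ in range(seq_len)]
        data.append((seq, seq[-1 - lag]))
    return data

def last_token_model(seq):
    """A memory-less 'model' that only sees the final input token."""
    return seq[-1]

def accuracy(model, data):
    return sum(model(seq) == label for seq, label in data) / len(data)

# lag=0 needs no memory at all; lag=20 defeats the memory-less model,
# which drops to roughly chance (~1/vocab).
easy_acc = accuracy(last_token_model, lag_recall_dataset(1000, 50, lag=0))
hard_acc = accuracy(last_token_model, lag_recall_dataset(1000, 50, lag=20))
```

Sweeping `lag` against a trained model's accuracy is the kind of direct probe of the memory's 'distribution' over the context window that a single head-to-head benchmark score cannot give you.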

4

MysteryInc152 t1_j8ppoiq wrote

I'd rather the basic senses at least (vision as well as audio) be pretrained in as well. We know from multimodal chain-of-thought, as well as the scaling laws for generative mixed-modal language models, that multimodal models far outperform single-modal models at the same data and scale. You won't get that kind of performance gain by offloading those basic senses to outside tools.

https://arxiv.org/abs/2302.00923

https://arxiv.org/abs/2301.03728

2

Alarming_Turnover578 t1_j8poufw wrote

According to the Cambridge Declaration on Consciousness, that would be correct. The unique property of the Homo sapiens mind is sapience, not consciousness or sentience.

2

farmingvillein t1_j8piz80 wrote

Not clear to me what you are looking for here.

> It simply provides doodads people claim help memory without papers showing that the memory doesn't work.

The very first reference I pulled, Graves 2014, specifically compares w/ and w/o memory.

Or Dai et al., which tries to compare against various RNN-style baselines with similar parameter counts.

Perhaps we're talking past each other?

2

No_Stretch_9237 t1_j8pd8s0 wrote

Is it possible to run DeepSpeed+ZeRO with a Tesla P40 (24 GB) to make use of my 256 GB of main system memory during training? If so, are there any examples of this particular setup, or of the required CUDA/driver versions?
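For what it's worth, the setup being asked about is usually expressed as a ZeRO config with optimizer-state offload to CPU RAM. A minimal sketch of such a config (shown as a Python dict rather than the usual `ds_config.json`; all values are illustrative guesses, not a tested recipe for a P40):

```python
# Hypothetical DeepSpeed config using ZeRO stage 2 with optimizer-state
# offload to CPU memory. Batch size, fp16 choice, and tuning flags are
# assumptions for illustration only.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "fp16": {"enabled": False},  # Pascal-era P40 has weak fp16 throughput
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "contiguous_gradients": True,
        "overlap_comm": True,
    },
}
```

Stage 3 additionally allows parameter offload (`offload_param`), which is what lets system RAM hold state well beyond the 24 GB of the card, at the cost of PCIe transfer overhead.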

1

farmingvillein t1_j8p7qa8 wrote

Any of the papers that address building NLP for long contexts will tend to have a relevant related works section. E.g., https://arxiv.org/pdf/2109.00301.pdf.

(The one qualifier here is that, at "modern" scale, RNNs have not really been well-tested (since people tend to just use...transformers). So, maaaybe they are actually simply superior. Evidence so far says "doubtful", however (at least for more vanilla implementations).)

12