Recent comments in /f/MachineLearning

LetterRip t1_j6uj087 wrote

> ​Further, "Let's think step by step" is outperformed by "Write Python code to solve this."

Interesting I was just wondering while reading that paper how well that would work compared to the n-shot prompts.

> Ah I see, thanks for clarifying. I see your point, but I wouldn't say that the prompts require an extensive knowledge of the test set. After all:

>> As an example, for the ~10 math reasoning datasets used in PaL, identical prompts were used (same prompt for all datasets, without changing anything).

That's fair. My thoughts were mostly directed at the "Table 2: Solve rate on three symbolic reasoning datasets and two algorithmic datasets" items. I think you could be right that my comments don't apply to the results in Figure 5 (GSM8K GSM-HARD SVAMP ASDIV SINGLEEQ SINGLEOP ADDSUB MULTIARITH).

Would be curious how well the 'write python code to solve this' performs in and of itself vs the "Let's think things through step by step" prompt.

1

Blutorangensaft t1_j6uiygz wrote

Disclaimer: no help, more a request

Once you're done with this project, would you mind sharing your speed and accuracy? I'm kind of on the lookout for a good English NER model. Problem is, spacy has some issues with casing.

1

DigThatData t1_j6ugpgr wrote

very difficult is correct. The authors identified 350,000 candidate prompt/image pairs that were likely to have been memorized because they were duplicated repeatedly in the training data, and were only able to find 109 cases of memorization in Stable Diffusion in that 350k.

EDIT:

Conflict of Interest Disclosure: I'm a Stability.AI employee, and as such I have a financial interest in protecting the reputation of generative models generally and SD in particular. Read the paper for yourself. Everything here is my own personal opinion, and I am not speaking as a representative of Stability AI.

My reading is that yes: they demonstrated these models are clearly capable of memorizing images, but also that they are clearly capable of being trained in a way that makes them fairly robust to this phenomenon. Imagen has a higher capacity and was trained on much less data: it unsurprisingly is more prone to memorization. SD was trained on a massive dataset and has a smaller capacity: after constraining attention to the content we think it had the best excuse to have memorized, it barely memorized any of it.

There's almost certainly a scaling law here, and finding it will permit us to be even more principled about robustness to memorization. My personal reading of this experiment is that SD is probably pretty close to the pareto boundary here, and we could probably flush out the memorization phenomenon entirely if we train it on more data or trim away at the capacity tinker with the model's topology.

26

axm92 t1_j6uf2a7 wrote

Ah I see, thanks for clarifying. I see your point, but I wouldn't say that the prompts require an extensive knowledge of the test set. After all:

> As an example, for the ~10 math reasoning datasets used in PaL, identical prompts were used (same prompt for all datasets, without changing anything).

​

Notably, take a look at the section on GSM-hard (4.1). You may also enjoy the analysis in the new version of the paper (Section 6: https://arxiv.org/pdf/2211.10435.pdf).

​

Further, "Let's think step by step" is outperformed by "Write Python code to solve this." We'll add the numbers in the next version, but if you are interested please lmk and I can share the results earlier.

Thanks again for reading our work and sharing your feedback, I really appreciate it.

1

gdahl t1_j6uc1bh wrote

Deep learning existed as a field in 2012. The speech recognition community had already adopted deep learning by that point. The Brain team at Google already existed. Microsoft, IBM, and Google were all using deep learning. As an academic subfield, researchers started to coalesce around "deep learning" as a brand in 2006, but it certainly was very niche at that point.

23

gdahl t1_j6ubet7 wrote

Deep learning roles 10 years ago (in 2013) were pretty similar to what they look like now, except they are much more numerous now. I'm sure there will be some changes and a proliferation of more entry-level roles and "neural network technician" roles, but it isn't going to be that different.

1

LetterRip t1_j6u7cu9 wrote

In my view something like "Let's think things through step by step" prompt is extremely generic and requires no knowledge specific to the upcoming questions.

I was basing my comment on the content of this folder mostly,

https://github.com/reasoning-machines/pal/tree/main/pal/prompt

Each of the prompts seem to require extensive knowledge of the test set to have formulated the prompts.

This seems more akin to Watson where the computer scientists analyzed the form of a variety of questions and did programs for each type of question.

1

jtpaquet OP t1_j6tzmhw wrote

Ok thanks I’ll look into it, I was thinking maybe to do minimax as a base for RL so it has already a starting point to improve. I considered checking every possible option at first but ruled it out since I thought there would be too many outcomes. Pruning seems to reduce the number of outcomes so that might be possible after all. Thanks for making me see the problem in an other way!

1

PassingTumbleweed OP t1_j6tz0wq wrote

I agree everyone should take predictions with a huge grain of salt (obviously some clever person might find a way to make Open-ChatGPT on mobile... We can only hope), however this does seem like a conversation worth having, since LLMs appear to have a massive impact across many areas at once. Already I find a lot of the insights here interesting!

2

axm92 t1_j6tw995 wrote

Thanks! Can you please clarify what do you mean by prompts are specific to the datasets for PaL?

​

As an example, for the ~10 math reasoning datasets used in PaL, identical prompts were used (same prompt for all datasets, without changing anything). The prompts/code is also open sourced at https://reasonwithpal.com/ if you want to check if out!

Incidentally, the idea that Python programs lead to faithful reasoning chains was used in PaL to create a new split of GSM, called GSM-hard. GSM-hard is available on huggingface.

​

(I'm a co-author of the PaL paper. )

2

Screye t1_j6tu8mc wrote

> in ten years?

10 years ago was 2012. Deep Learning didn't even exist as field back then.

Tempting as it might be, I'd recommend caution in predicting the future of a field that went from non-existence to near-dominance within its profession in the last 10 years.

49