Recent comments in /f/MachineLearning

cthorrez t1_ja70abd wrote

I find it a little weird that RLHF is considered to be reinforcement learning.

The human feedback is collected offline and forms a static dataset. They use the PPO objective, but it's really more a form of supervised learning. There is no agent interacting with an env: the "env" is just sampling text from a static dataset, and the reward is the score from a neural net that was itself trained on a static dataset.
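For what it's worth, here's roughly what that objective looks like stripped down; a minimal sketch (all function and variable names are mine, and the frozen reward model is abstracted into precomputed advantages):

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    # Clipped PPO surrogate. The "advantages" come from a frozen reward
    # model scoring sampled text, not from interacting with an environment.
    ratio = torch.exp(logp_new - logp_old)  # pi_new / pi_old per sampled token
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()  # negate to maximize

# Toy usage: log-probs of 8 sampled tokens under the new and old policies.
logp_old = torch.randn(8)
logp_new = logp_old + 0.1 * torch.randn(8)
advantages = torch.randn(8)  # stand-in for (whitened) reward-model scores
loss = ppo_clip_loss(logp_new, logp_old, advantages)
```

Nothing in that loop touches a live environment; every ingredient is sampled from or scored by something trained on static data.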

15

_learn_faster_ OP t1_ja6zovh wrote

We have GPUs (e.g. A100s) but can only use one GPU per request (no multi-GPU). We are also willing to take a bit of an accuracy hit.

What do you think would be best for us?

When you say compression, do you mean things like pruning and distillation?
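(For context on what I mean by distillation: something like the standard soft-target loss; a rough sketch, with the names and the temperature value purely illustrative:)

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # Student matches the teacher's softened output distribution
    # (soft-target KL, scaled by T^2 as in Hinton et al. 2015).
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)

# Toy usage: batch of 4 examples, 10-way logits.
student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)  # frozen large model's outputs
loss = distillation_loss(student_logits, teacher_logits)
```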

1

impossiblefork t1_ja6rt6s wrote

I doubt it's possible, but I imagine something like the DAN thing with ChatGPT.

Most likely you'd talk to the AI so that the rationality it has obtained from its training data makes it reason out things that its owner would rather it stay silent about.

1

darthstargazer t1_ja6qchw wrote

Haha, I'm sure you will get some good advice! I would say:
1. Enjoy the trip.
2. If you are aiming for postdocs, it's time to do some networking.
3. Enjoy the free food!
4. If you don't understand much about what other papers are about, don't stress 😊

1

CellWithoutCulture t1_ja6pjet wrote

Seems more like an AskML question.

But RL is for situations where you can't backprop through the loss (the reward signal isn't differentiable). Its gradient estimates are noisier than supervised learning's, so if you can use supervised learning, that's generally what you should use.
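To illustrate "can't backprop the loss": a score-function (REINFORCE) estimator only backprops through the action log-probabilities, never through the reward itself. A minimal sketch, with made-up shapes and names:

```python
import torch

def reinforce_loss(logits, actions, rewards):
    # Score-function (REINFORCE) gradient: the reward is treated as a
    # constant; gradients flow only through the action log-probabilities.
    logp = torch.log_softmax(logits, dim=-1)
    logp_taken = logp.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    return -(logp_taken * rewards).mean()

# Toy usage: 8 decisions over 5 actions, rewards from a black box.
logits = torch.randn(8, 5, requires_grad=True)
actions = torch.randint(0, 5, (8,))
rewards = torch.randn(8)  # non-differentiable feedback
reinforce_loss(logits, actions, rewards).backward()
```

The rewards enter only as scaling factors, which is exactly why these estimates are noisier than a supervised gradient.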

RL is still used, for example in the recent Gato and DreamerV3, or in training an LLM to use tools like in Toolformer. And also OpenAI's famous RLHF, which stands for reinforcement learning from human feedback. This is what they use to make ChatGPT "aligned", although in reality it doesn't fully get there.

12