Recent comments in /f/MachineLearning

farmingvillein t1_jadqg1l wrote

You're missing the point here, or I wasn't clear--the question isn't whether performance will improve with more params (and potentially) data; no doubt there.

The question is whether a model trained at scale on text & images will outperform a model trained at scale solely on text, in the text-only domain (or similarly, the image-only).

To-date, all* of the public research (and Kosmos is no different) on multimodal models have showed, at best, multimodal models generally performing equal to unimodal variants in unimodal domains. And often they are a shade worse (like Kosmos).

(*=unless you count code+natural language.)

The holy grail, of course, is that the two help one another, so that your multimodal variant outperforms the unimodal variants on unimodal tasks. GPT-* gets better at talking to you because it has ingested all of the Youtube videos in the world, e.g.

If you can demonstrate that (and it certainly makes intuitive human sense that this could/should be true), then of course there is a giant truckload of image (including video!) and audio data you can slam into your text models to make text-based scenarios better (and similarly for images, etc.). (And it also more plausibly suggests that massive amounts of synthetic world exploration data could be accretive, too...)

There is a bunch of research (https://arxiv.org/abs/2301.03728 being one of the most exciting) suggesting that this can occur, with enough data/params, but no one has publicly demonstrated it. (And it'd surprise no one, probably, if this was part of GPT-4's or Gato-2's mix.)

40

1azytux t1_jadp0aa wrote

do you know which foundation models we can use though, or are open sourced? It seems like every other model is either not available or their weights aren't released yet. It's case with, CoCa, Florence, Flamingo, BEiT3, FILIP, ALIGN. I was able to find weights for ALBEF.

7

dancingnightly t1_jadj7fa wrote

Edit: Seems like for this one yes. They do consider human instructions (similarish to the goal of a RLHF which requires more RAM), by adding them directly in the text dataset, as mentioned in 3.3 Language-Only Instruction Tuning-

For other models, like OpenAssistant coming up, one thing to note is that, although the generative model itself may be runnable locally, the reward model (the bit that "adds finishing touches" and ensures following instructions) can be much bigger. Even if the GPT-J underlying model is 11GB on RAM and 6B params, the RLHF could seriously increase that.

This models is in the realm of the smaller T5, BART and GPT-2 models released 3 years ago and runnable then on decent gaming GPUs

7

I_will_delete_myself t1_jad9amj wrote

ChatGPT uses GPT-3.5, which is a pre-trained model. Google uses pertained models. Facebook created a pre-trained model recently.

If these models satisfy their needs it will definitely satisfy yours. Unless if you are going beyond a kind of problem that hasn't been tackled before, a pre-trained model will save you so much time training and require a lot less data to get it to converge and actually be useful.

2