Recent comments in /f/MachineLearning

master3243 t1_j7tmpsz wrote

Exactly, the CLIP component at the front of the whole DALL·E model is trained to take any English text and map it into an embedding space.

It's completely natural (and it would arguably be surprising if it didn't happen) that CLIP maps (some) gibberish words to a part of the embedding space that is sufficiently close, in L2 distance, to the embedding of a real word.

In that case, the diffusion model decodes that gibberish word into an image similar to the one the real word would generate.
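A toy sketch of that nearest-real-word idea (plain Python, made-up 4-d vectors standing in for real CLIP embeddings, which are 512-d or larger):

```python
import math

def l2(a, b):
    """Euclidean (L2) distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical embeddings of real words.
embeddings = {
    "bird":  [0.9, 0.1, 0.0, 0.2],
    "plane": [0.1, 0.8, 0.3, 0.0],
}
# Hypothetical embedding that CLIP assigned to a nonsense token.
gibberish = [0.85, 0.15, 0.05, 0.18]

# The decoder doesn't know the token was gibberish: it only sees a point
# in embedding space, which here lands closest to "bird".
nearest = min(embeddings, key=lambda w: l2(gibberish, embeddings[w]))
print(nearest)  # -> bird
```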

2

pommedeterresautee OP t1_j7tk4fx wrote

On large DL models like Whisper large, CPU is never on par with GPU, because a CPU is latency-oriented hardware while a GPU is throughput-oriented. The only way large models run on CPU is by reducing the number of operations to perform, e.g., through sparsification or pruning.

Moreover, PyTorch is mostly C++ with a Python layer on top (for now, at least; PyTorch 2.0 may be the start of a change in that architecture). The Python layer accounts for most of PyTorch's latency.

And then, even a C++ engine launching operations on the GPU cannot be on par with CUDA graphs (most of the time, at least), because you still have to send one instruction at a time, and there is still some latency overhead associated with running things that way, just much less than with Python. With CUDA graphs there is almost none at all. There is a second benefit not discussed here: the graph of instructions itself is optimized.

The main drawback of CUDA graphs is the memory overhead: you need at least double the space for input tensors. On generative models with a K/V cache, this matters, as explained in this post. Plus you need to copy input tensors into the captured buffers, which offsets a (very small) part of the gains (at least that's what we saw in our tests on Whisper and BERT/RoBERTa).
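For anyone who hasn't used them, here is a minimal capture-and-replay sketch with PyTorch's CUDA graph API (`torch.cuda.CUDAGraph`; the model and shapes are made up for illustration, and it falls back gracefully on CPU-only machines). Note the static input buffer: fresh data has to be copied into it before each replay, which is the copy overhead mentioned above.

```python
import torch

def run():
    if not torch.cuda.is_available():
        return "skipped"  # CUDA graphs need a GPU

    model = torch.nn.Linear(64, 64).cuda()
    # Static buffer: lives for the lifetime of the graph (the memory overhead).
    static_input = torch.randn(8, 64, device="cuda")

    # Warm-up on a side stream (required before capture).
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(3):
            model(static_input)
    torch.cuda.current_stream().wait_stream(s)

    # Capture: every kernel launch is recorded once into the graph.
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        static_output = model(static_input)

    # Replay: copy fresh data into the static buffer, then launch the whole
    # recorded graph with a single call -- no per-kernel launch overhead.
    static_input.copy_(torch.randn(8, 64, device="cuda"))
    g.replay()
    return "replayed"

print(run())
```

This follows the capture pattern from the PyTorch docs; in a real generative loop you would replay the same graph on every decoding step.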

That is why TensorRT (a big piece of C++), for instance, supports CUDA graphs.

Still, TBH, as you pointed out, the most important thing is that ... it's easier to build and run :-)

14

DigThatData t1_j7tb03a wrote

i'm not sure that's an emergent ability so much as it is explicitly what the model is being trained to learn. it's not surprising to me that there is a "painting signature" concept it has learned and samples from when it generates gibberish of a particular length and size in the bottom right corner (for example). that sounds like one of the easier "concepts" it would have learned.

11

andreichiffa t1_j7t9ul8 wrote

I am pretty sure that was an Anthropic paper first (Predictability and Surprise in Large Generative Models). Makes me truly wonder WTF exactly is going on in Google lately.

As to your question, no one has stacked enough attention layers yet, but there is a very high probability that someone will. Someone already mentioned the ability to spell, but it could potentially help with things such as hands, the number of hands/feet/legs/arms/paws/tails, and other details that make a lot of today's generated images disturbing.

The issue will most likely be finding enough data, given that, unlike text, most images on the internet are copyrighted (cough Getty cough).

6