Recent comments in /f/MachineLearning

numpee t1_j67dlt8 wrote

So with ImageNet, the main performance (speed) bottleneck is actually data loading, especially if your models are not that large (such as Res18 or Res50). ImageNet is roughly 1.2M images (train) at roughly <1MB each, which means you're performing 1.2M random reads every epoch. Modern NVMe SSDs have great sequential read speeds, but random read performance still lags behind, and random reads are exactly what you get when you shuffle the image order at each epoch. BTW, data loading won't be a bottleneck if you're training models like ViT or even Res152.
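For context, here's a minimal sketch (mine, with placeholder paths) of the standard torchvision pipeline this bottleneck comes from: with shuffle=True, every epoch opens all ~1.2M JPEGs individually, in random order, so each sample is a separate random read from disk.

```python
import torch
from torchvision import datasets, transforms

# Standard ImageNet loading: one JPEG file opened per sample.
# shuffle=True randomizes the access order, so every epoch
# turns into ~1.2M random reads against the SSD.
train_ds = datasets.ImageFolder(
    "/path/to/imagenet/train",  # placeholder path
    transform=transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.ToTensor(),
    ]),
)
train_loader = torch.utils.data.DataLoader(
    train_ds, batch_size=256, shuffle=True, num_workers=16, pin_memory=True
)
```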

I highly suggest you try out a dataset format such as FFCV or WebDataset. I personally use FFCV, which is extremely fast because it caches data in your RAM. There are definitely some limitations, though, such as code compatibility or not having enough RAM to cache all the images (something you should check on the server side). You can remap ImageNet to the FFCV/WebDataset format on a local machine, then transfer the data to the server for training.
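A rough conversion sketch, loosely following FFCV's documented writer API (paths and max_resolution here are placeholder choices, not my actual settings):

```python
from ffcv.writer import DatasetWriter
from ffcv.fields import RGBImageField, IntField
from torchvision.datasets import ImageFolder

# Read ImageNet once as a plain folder dataset...
train_ds = ImageFolder("/path/to/imagenet/train")  # placeholder path

# ...and write it out as a single .beton file that FFCV's loader
# can cache in RAM and read without per-image random file access.
writer = DatasetWriter(
    "/path/to/imagenet_train.beton",  # placeholder path
    {
        "image": RGBImageField(max_resolution=256),
        "label": IntField(),
    },
)
writer.from_indexed_dataset(train_ds)
```

On the training side you then swap the PyTorch DataLoader for FFCV's own Loader pointed at the .beton file.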

Just for reference, one epoch of ImageNet training with Res18 on 4x A6000 (roughly 2~2.5x slower than A100) takes me around 3 minutes using FFCV. Using A100s won't necessarily be faster, though, because even with FFCV the data loading alone takes 2~3 minutes without any model forward/backward. IIRC, with ordinary data loading you'd be looking at around 10~15 minutes per epoch.

If you want more details, feel free to DM me.

4

golongandprosper t1_j67dc19 wrote

I read an article that it’s so good because they hired “almost slaves” at the lowest possible price.. $2 was the rate.. don’t know if that’s per day or per hour.. from some downtrodden country.

And hundreds to thousands of these serfs spent their days testing and manually training it. So they apparently got hundreds of thousands of hours of human manual training, at a price that many Americans could only afford by taking a mortgage against their house.. and apparently they are still there, manually watching and reacting to queries in real time to verify the answers are decent.. while the rest of the world gives them more data for free.

So when it says the servers are busy, to wait? That could mean the humans are busy ;p

1

Maximum-Nectarine-13 t1_j67czqw wrote

Here is a recent and similar text-to-music work; the generated music sounds better to me than MusicLM. Check the Waveform model at https://noise2music.github.io/

The full paper isn't available yet, so I'll copy the abstract here.

>We introduce Noise2Music, where a series of diffusion models is trained to generate high-quality 30-second music clips from text prompts. Two types of diffusion models, a generator model, which generates an intermediate representation conditioned on text, and a cascader model, which generates high-fidelity audio conditioned on the intermediate representation and possibly the text, are trained and utilized in succession to generate high-fidelity music.
>
>We explore two options for the intermediate representation, one using a spectrogram and the other using audio with lower fidelity. We find that the generated audio is not only able to faithfully reflect key elements of the text prompt such as genre, tempo, instruments, mood and era, but goes beyond to ground fine-grained semantics of the prompt. Pretrained large language models play a key role in this story---they are used to generate paired text for the audio of the training set and to extract embeddings of the text prompts ingested by the diffusion models.

3

nmfisher t1_j67czbb wrote

Those models do exist (just search for "ASR seq2seq"); it's just that CTC has always been a faster/more stable/more effective training method, since it avoids the need to learn specific alignments between input features and output "units" (phonemes/subwords/letters/whatever).
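As a quick illustration (my sketch, with made-up shapes), this is what CTC training looks like in PyTorch: the loss marginalizes over every monotonic alignment between encoder frames and target tokens, so no frame-level alignment labels are ever needed.

```python
import torch
import torch.nn as nn

T, N, C = 100, 4, 32   # encoder frames, batch size, vocab size (index 0 = blank)
S = 10                 # max target length

# Stand-in for per-frame encoder outputs (log-probabilities over the vocab).
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(2)
targets = torch.randint(1, C, (N, S), dtype=torch.long)       # label sequences
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.randint(5, S + 1, (N,), dtype=torch.long)

# CTC sums over all monotonic alignments of targets to frames,
# so the model never has to predict which frame maps to which token.
ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```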

The view was that encoder/decoder models needed considerably more training data and longer training times, and usually underperformed. However, I just came across https://arxiv.org/pdf/2205.01086.pdf which found a method for fine-tuning a pre-trained seq2seq encoder that actually outperformed CTC on small datasets, so that may no longer be the case.

1

cthorrez t1_j67csjx wrote

That's an interesting topic that I think deserves further investigation. On the surface it sounds like the size of the LM impacts the mechanism by which the LM is able to "secretly perform gradient descent".

Is finetuning similarly unstable for small-sized LMs?

1

golongandprosper t1_j67azrb wrote

I wouldn’t think so. The code for the video is digital, and patterns can be detected directly from the rendered frames, whereas a monitor just converts that data into analog light patterns. The only reason to involve a monitor would be if the detector is a camera placed in front of it, sensing the light patterns and converting them back into digital patterns similar to the original code. That might be useful for interacting with the analog world and accounting for the way light reflects in an analog space, but I think that’s future tech, or maybe automated cars. You’d hope they’ve done some control/experiment to account for lighting changes like this.

1

Complex_Candidate_28 t1_j67aytx wrote

Because for small-size LMs, ICL is unstable, i.e., it sometimes degenerates to classifying all examples into one category. The protocol tries to ensure we analyze ICL only when it works well. (For much larger LMs the performance variance is much smaller, so this step can be skipped.)

1

currentscurrents OP t1_j674tf3 wrote

They have some downsides though. HOGWILD! requires a single shared memory, and horovod requires every machine to have a copy of the entire model.

A truly local training method would mean your model could be as big as all the machines put together. That order-of-magnitude increase in size could outweigh the poorer performance of forward-forward learning.

No idea how you'd handle them coming and going; you'd have to dynamically resize the network somehow. There are still other unsolved problems before we could have a GPT@home.

15

Red-Portal t1_j673lux wrote

> If all your layers are on different machines connected by a high-latency internet connection, this will take a long time.

This is called model parallelism, and it's exactly why you don't want to do it.... unless you're forced to. That is, at the scale of current large language monstrosities, the model might not fit on a single node. But other than that, model parallelism is well known to perform poorly, so people avoid it. Nonetheless, this is a known issue, and lots of work has gone into improving data parallelism with asynchronous updates, like HOGWILD! and horovod, because we know that scales better.
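To make the contrast concrete (my sketch, assuming two CUDA devices), naive model parallelism looks like this in PyTorch: every forward pass serializes an activation transfer across the device boundary, which is exactly the cost that blows up when the "devices" are machines talking over the internet.

```python
import torch
import torch.nn as nn

# Naive model parallelism: each stage lives on a different device, so the
# forward pass must wait for activations to cross the device boundary.
class TwoStageNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.stage2 = nn.Linear(4096, 10).to("cuda:1")

    def forward(self, x):
        h = self.stage1(x.to("cuda:0"))
        return self.stage2(h.to("cuda:1"))  # blocking transfer on every step

# Data parallelism instead gives every worker a full model replica and only
# synchronizes gradients (what DistributedDataParallel / horovod do), which
# tolerates latency far better.
```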

19

picardythird t1_j66kza2 wrote

Whenever I see music generation models, I immediately go to the "classical" examples (or as close to classical as are provided). The reason for this is that while some genres such as techno, drum 'n' bass, 8-bit, and hip hop are "simple" (from a music theory perspective), and other genres such as ambient, relaxing jazz, swing, and dream pop are vague enough that the model can get by just from spitting out the right general timbre, generating classical music requires understanding of structure, style, and form.

Frankly, I'm not particularly impressed. For the piano snippets, it seems to have mixed in sounds from strings, and both the "professional piano player" and "crazy fast piano player" snippets are basically just random notes with no particular structure. Meanwhile, the "opera" snippet uses piano sounds, which are non-idiomatic to opera. The "string quartet" snippets are not idiomatic to the style of a string quartet (in particular, the "camptown races" snippet completely falls apart at the end, and the "fingerstyle guitar" snippet barely even sounds like string instruments).

I'm also not especially convinced by the Painting Caption Conditioning section. I suspect that there is quite a bit of Barnum Effect going on here; the captions are primed to be accepted as corresponding to the "correct" paintings because they are presented that way, but this is just a framing device. As a self-experiment, play a track from any of the paintings, and look at any of the other paintings. Can you really say that the track could not feasibly correspond to the "other" painting? (Also, as someone who has literally written a piece of music inspired by the Caspar David Friedrich painting, I find myself unconvinced by the model's interpretation... but this is a wholly subjective critique).

This is not to say that the model is not impressive in other ways. Its ability to mimic the styles of different genres is quite good (although the "swing" example in the Long Generation section loses focus halfway through), and the style transfer elements are quite interesting as well. However, music generation models have a long way to go when it comes to idiomatic understanding of the structural elements of music.

22