Recent comments in /f/MachineLearning
golongandprosper t1_j67dc19 wrote
I read an article that it’s so good because they hired “almost slaves” at lowest possible price.. $2 was the rate.. don’t know if that’s per day or hour.. from some downtrodden country.
And hundreds to thousands of these serfs spent their days testing and manually training it. So they apparently got hundreds of thousands of hours of human manual training, at a price that many Americans could only afford by taking a mortgage against their house. And apparently they are still there, manually watching and reacting to queries in real time to verify answers are decent, while the rest of the world gives them more data for free.
So when it says the servers are busy, to wait? That could mean the humans are busy ;p
janck12 t1_j67d4qt wrote
I am not sure if there are huge differences from one model to another. It depends heavily on the training data you can get.
I would suggest using some existing NER models and possibly fine-tuning them on your own data. Have a look at GENRE https://github.com/facebookresearch/GENRE
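Whichever existing model you start from, you'll want a way to compare candidates on your own data. A minimal entity-level F1 scorer (the usual NER yardstick), as a sketch with illustrative span format, not tied to any particular library:

```python
def entity_f1(gold, pred):
    """Micro-averaged entity-level precision/recall/F1.

    gold, pred: lists of sets of (start, end, label) spans, one set
    per sentence. An entity counts as correct only if its boundaries
    and label all match exactly.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)  # exact matches
        fp += len(p - g)  # spurious predictions
        fn += len(g - p)  # missed entities
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# One sentence: gold has two entities, prediction gets one label wrong.
gold = [{(0, 5, "PER"), (10, 16, "ORG")}]
pred = [{(0, 5, "PER"), (10, 16, "LOC")}]
p, r, f = entity_f1(gold, pred)  # p = r = f = 0.5
```

Scoring at the entity level (rather than per token) is what makes the comparison honest: a model that finds the right span but the wrong type gets no credit.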
Maximum-Nectarine-13 t1_j67czqw wrote
Reply to [D] MusicLM: Generating Music From Text by carlthome
Here is a recent, similar text-to-music work; the generated music sounds better to me than MusicLM. Check the Waveform model at https://noise2music.github.io/
The full paper isn't out yet, so I'll copy the abstract here.
>We introduce Noise2Music, where a series of diffusion models is trained to generate high-quality 30-second music clips from text prompts. Two types of diffusion models, a generator model, which generates an intermediate representation conditioned on text, and a cascader model, which generates high-fidelity audio conditioned on the intermediate representation and possibly the text, are trained and utilized in succession to generate high-fidelity music.
>
>We explore two options for the intermediate representation, one using a spectrogram and the other using audio with lower fidelity. We find that the generated audio is not only able to faithfully reflect key elements of the text prompt such as genre, tempo, instruments, mood and era, but goes beyond to ground fine-grained semantics of the prompt. Pretrained large language models play a key role in this story---they are used to generate paired text for the audio of the training set and to extract embeddings of the text prompts ingested by the diffusion models.
nmfisher t1_j67czbb wrote
Reply to [D] Why are there no End2End Speech Recognition models using the same Encoder-Decoder learning process as BART (no CTC) ? by KarmaCut132
Those models do exist (just search for "ASR seq2seq"); it's just that CTC has always been a faster/more stable/more effective training method, since it avoids the need to learn explicit alignments between input features and output "units" (phonemes/subwords/letters/whatever).
The view was that encoder/decoder models needed considerably more training data/longer training times, and usually underperformed. However, I just came across https://arxiv.org/pdf/2205.01086.pdf which found a method for fine-tuning a pre-trained seq2seq encoder that actually outperformed CTC on small datasets, so that may no longer be the case.
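For anyone unfamiliar with why CTC is alignment-free: the loss marginalizes over every frame-level alignment that "collapses" to the target sequence, and the collapse rule itself is tiny. A sketch, with "-" standing in for the blank token:

```python
def ctc_collapse(alignment, blank="-"):
    """Map a frame-level alignment to its label sequence:
    merge consecutive repeats, then drop blanks."""
    out = []
    prev = None
    for sym in alignment:
        if sym != prev:       # merge repeated symbols
            if sym != blank:  # drop blank frames
                out.append(sym)
        prev = sym
    return "".join(out)

# Many alignments map to the same output; CTC sums probability over all of them,
# so the model never has to commit to one input/output alignment.
assert ctc_collapse("cc-aa-t") == "cat"
assert ctc_collapse("c-a-t--") == "cat"
assert ctc_collapse("caa--tt") == "cat"
# A blank between repeats keeps them distinct:
assert ctc_collapse("to-o") == "too"
```

A seq2seq decoder, by contrast, has to learn that input-to-output mapping through attention, which is part of why it historically needed more data.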
Complex_Candidate_28 t1_j67cx4a wrote
Reply to comment by cthorrez in [R] Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers by currentscurrents
Yes, size also affects fine-tuning, but it is much less sensitive to it.
cthorrez t1_j67csjx wrote
Reply to comment by Complex_Candidate_28 in [R] Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers by currentscurrents
That's an interesting topic that I think deserves further investigation. On the surface it sounds like the size of the LM impacts the mechanism by which the LM is able to "secretly perform gradient descent".
Is finetuning similarly unstable for small sized LMs?
sEi_ t1_j67c4rp wrote
Reply to comment by thegreatmarker in [D] MusicLM: Generating Music From Text by carlthome
I have not slept since Nov 2022, when things, for me, started to escalate with the release of Stable Diffusion.
golongandprosper t1_j67azrb wrote
Reply to comment by [deleted] in [D] Simple Questions Thread by AutoModerator
I wouldn’t think so. The code for the video is digital, and patterns can be detected from the rendered frames, whereas a monitor converts that data into analog light patterns. The only reason to involve a monitor is if the detector is a camera in front of it sensing the light patterns, which it would then convert back into digital patterns similar to the original code. That may be useful for interacting with the analog world and accounting for the way light reflects in an analog space, but I think that’s future tech, or maybe automated cars. You’d hope they’ve done some control/experiment to account for lighting changes like this.
Complex_Candidate_28 t1_j67aytx wrote
Reply to comment by cthorrez in [R] Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers by currentscurrents
Because for small LMs, ICL is unstable, i.e., it sometimes degrades to classifying all examples into one category. The protocol tries to ensure we analyze ICL only when it works well. (For much larger LMs, the performance variance would be much smaller, and this step can be ignored.)
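That "collapses to one category" failure mode is easy to screen for mechanically. A minimal sketch of the kind of filter such a protocol might use (the function name and threshold are ours, not from the paper):

```python
from collections import Counter

def looks_degenerate(predictions, threshold=0.95):
    """Flag a run whose predictions collapse onto a single label.

    predictions: list of predicted class labels over an eval set.
    Returns True if one label covers >= threshold of all predictions.
    """
    if not predictions:
        return True
    top_count = Counter(predictions).most_common(1)[0][1]
    return top_count / len(predictions) >= threshold

assert looks_degenerate(["pos"] * 100)            # collapsed run, discard
assert not looks_degenerate(["pos", "neg"] * 50)  # healthy label spread, keep
```

Runs flagged this way would be excluded before comparing ICL against fine-tuning, so the mechanism analysis isn't dominated by degenerate outputs.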
cthorrez t1_j67aa39 wrote
Reply to comment by Complex_Candidate_28 in [R] Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers by currentscurrents
If the goal is the mechanism rather than the performance, why tune the seed for performance in the first place? The examples used don't change the mechanism.
rainy_moon_bear t1_j676oo9 wrote
Reply to comment by maizeq in [R] SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot by Secure-Technology-78
This is something people don't seem to understand. Pretty much all models 100B+ are undertrained.
Complex_Candidate_28 t1_j675z5i wrote
Reply to comment by cthorrez in [R] Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers by currentscurrents
The purpose of the experiments is not to compare the performance between them; the goal is to compare the mechanisms behind them, so it doesn't affect the conclusion itself. The point is to use the same set of examples for analysis.
nmfisher t1_j675n9m wrote
Reply to [D] ImageNet2012 Advice by MyActualUserName99
Have you tried applying for the Google TPU Research program?
currentscurrents OP t1_j674tf3 wrote
Reply to comment by Red-Portal in [D] Could forward-forward learning enable training large models with distributed computing? by currentscurrents
They have some downsides though. HOGWILD! requires a single shared memory, and horovod requires every machine to have a copy of the entire model.
A truly local training method would mean your model could be as big as all the machines put together. The order of magnitude in size increase could outweigh the poorer performance of forward-forward learning.
No idea how you'd handle machines coming and going; you'd have to dynamically resize the network somehow. There are still other unsolved problems before we could have a GPT@home.
Red-Portal t1_j673lux wrote
Reply to [D] Could forward-forward learning enable training large models with distributed computing? by currentscurrents
> If all your layers are on different machines connected by a high-latency internet connection, this will take a long time.
This is called model parallelism, and it is exactly why you don't want to do it... unless you're forced to. That is, at the scale of current large language monstrosities, the model might not fit on a single node. But other than that, model parallelism is well known to be bad, so people avoid it. Nonetheless, this is a known issue, and lots of work has been done on improving data parallelism with asynchronous updates, like HOGWILD! and horovod, because we know it scales better.
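A back-of-envelope sketch of why model parallelism over the internet is so painful: with layers split one per machine, every training step pays the link latency once per hop, serially, in both directions, while data parallelism pays roughly one synchronization per step. All numbers below are made-up assumptions for illustration, not measurements (and the data-parallel figure ignores gradient bandwidth, which also matters):

```python
# Hypothetical numbers, for illustration only.
layers = 96             # a large transformer split one layer per machine
internet_rtt = 0.050    # 50 ms round trip over the public internet
compute_per_step = 0.3  # seconds of actual math per step (assumed)

# Model parallelism: activations hop through every layer boundary,
# forward and backward, so latency is paid 2 * layers times per step.
model_parallel_step = compute_per_step + 2 * layers * internet_rtt

# Data parallelism: full model copy per machine, one gradient sync per step.
data_parallel_step = compute_per_step + internet_rtt

print(f"model-parallel step: {model_parallel_step:.2f} s")  # ~9.90 s
print(f"data-parallel step:  {data_parallel_step:.2f} s")   # ~0.35 s
```

Under these assumptions the model-parallel step is dominated almost entirely by latency rather than compute, which is the core of the objection above; inside a datacenter, where the RTT is ~100x smaller, the same arithmetic comes out fine.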
[deleted] t1_j6722bh wrote
Reply to comment by [deleted] in [D] Could forward-forward learning enable training large models with distributed computing? by currentscurrents
[deleted]
[deleted] t1_j670cvp wrote
Reply to [D] Could forward-forward learning enable training large models with distributed computing? by currentscurrents
Commenting because I’m also interested.
[deleted] t1_j66ygkc wrote
Reply to comment by thegreatmarker in [D] MusicLM: Generating Music From Text by carlthome
[removed]
NormalManufacturer61 t1_j66u0fb wrote
Reply to [D] Simple Questions Thread by AutoModerator
I am a non-data scientist interested in a layman's-to-introductory-level book/primer on ML/AI, specifically on the principles and mechanics of the topic(s). Any recommendations?
TankAttack OP t1_j66s7js wrote
Reply to comment by thatphotoguy89 in [D] Best large language model for Named Entity Extraction? by TankAttack
Cool, will have a look! Do they list any models as being good at question answering?
[deleted] t1_j66qjez wrote
Reply to [D] MusicLM: Generating Music From Text by carlthome
[removed]
thatphotoguy89 t1_j66n16v wrote
Reply to comment by TankAttack in [D] Best large language model for Named Entity Extraction? by TankAttack
Yeah. HF has a tutorial on how to do this: their QA tutorial
picardythird t1_j66kza2 wrote
Reply to [D] MusicLM: Generating Music From Text by carlthome
Whenever I see music generation models, I immediately go to the "classical" examples (or as close to classical as are provided). The reason for this is that while some genres such as techno, drum 'n' bass, 8-bit, and hip hop are "simple" (from a music theory perspective), and other genres such as ambient, relaxing jazz, swing, and dream pop are vague enough that the model can get by just from spitting out the right general timbre, generating classical music requires understanding of structure, style, and form.
Frankly, I'm not particularly impressed. For the piano snippets, it seems to have mixed in sounds from strings, and both the "professional piano player" and "crazy fast piano player" snippets are basically just random notes with no particular structure. Meanwhile, the "opera" snippet uses piano sounds, which are non-idiomatic to opera. The "string quartet" snippets are not idiomatic to the style of a string quartet (in particular, the "camptown races" snippet completely falls apart at the end, and the "fingerstyle guitar" snippet barely even sounds like string instruments).
I'm also not especially convinced by the Painting Caption Conditioning section. I suspect that there is quite a bit of Barnum Effect going on here; the captions are primed to be accepted as corresponding to the "correct" paintings because they are presented that way, but this is just a framing device. As a self-experiment, play a track from any of the paintings, and look at any of the other paintings. Can you really say that the track could not feasibly correspond to the "other" painting? (Also, as someone who has literally written a piece of music inspired by the Caspar David Friedrich painting, I find myself unconvinced by the model's interpretation... but this is a wholly subjective critique).
This is not to say that the model is not impressive in other ways. Its ability to mimic the styles of different genres is quite good (although the "swing" example in the Long Generation section loses focus halfway through), and the style transfer elements are quite interesting as well. However, music generation models have a long way to go when it comes to idiomatic understanding of the structural elements of music.
[deleted] t1_j66jfh5 wrote
Reply to [D] Why are GANs worse than (Latent) Diffusion Models for text2img generation? by TheCockatoo
[deleted]
numpee t1_j67dlt8 wrote
Reply to [D] ImageNet2012 Advice by MyActualUserName99
So with ImageNet, the main performance (speed) bottleneck is actually data loading, especially if your models are not that large (such as Res18 or Res50). ImageNet has roughly 1.2M training images of roughly <1MB each, which means you're doing 1.2M random reads per epoch. Modern NVMe SSDs have great sequential read speeds but still lag on random reads (which is what you get if you're shuffling the image order each epoch). BTW, data loading won't be a bottleneck if you're training models like ViT or even Res152.
I highly suggest you try out a dataset format such as FFCV or WebDataset. I personally use FFCV, which is extremely fast because it caches data onto your RAM. But there definitely are some limitations, such as code compatibility or not having enough RAM to cache all images (this is something you should check on the server side). You can remap ImageNet to the FFCV/WebDataset format on a local machine, then transfer your data to the server for training.
Just for reference, one epoch of training ImageNet on 4x A6000 (roughly 2~2.5x slower than A100) with Res18 takes me around 3 minutes using FFCV. But, using A100s won't necessarily be faster because even with FFCV, data loading itself takes 2~3mins without model forward/backward. IIRC, with ordinary data loading, you'd be looking at around 10~15 minutes per epoch.
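Those epoch times are easy to sanity-check from effective loader throughput. The images-per-second figures below are assumptions chosen to illustrate the gap, not benchmarks:

```python
images = 1_200_000  # ImageNet-1k train split, roughly

def epoch_minutes(images_per_second):
    """Minutes per epoch at a given effective loader throughput."""
    return images / images_per_second / 60

# Assumed effective throughputs, for illustration only:
# ~2k img/s: random reads + per-image JPEG decode (ordinary loader)
# ~8k img/s: RAM-cached, pre-decoded data (FFCV-style)
print(f"ordinary loader: ~{epoch_minutes(2_000):.0f} min/epoch")   # ~10 min
print(f"cached loader:   ~{epoch_minutes(8_000):.1f} min/epoch")   # ~2.5 min
```

The point is that a 4x change in effective loader throughput moves you between the ~10-15 min and ~2-3 min regimes quoted above, regardless of how fast the GPUs are.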
If you want more details, feel free to DM me.