Recent comments in /f/MachineLearning

ItsJustMeJerk t1_j6uymag wrote

You're right, it's not exclusive. But I believe that while the absolute amount of data memorized might go up with scale, it occupies a smaller fraction of the output because it's only used where verbatim recitation is necessary instead of as a crutch (I could be wrong though). Anyway, I don't think that crippling the model by removing all copyrighted data from the dataset is a good long-term solution. You don't keep students from plagiarizing by preventing them from looking at any source related to what they're writing.

4

znihilist t1_j6uy7z0 wrote

I think people are using words and disagreeing on conclusions without agreeing first on what is exactly meant by those words.

I am not sure that everyone is using the word "memorize" the same way. I think those who use it in the context of defense are saying that those images are nowhere to be found in the model itself. It is just a function that takes words as input and outputs a picture. Is the model memorizing the training data if it can recreate it? I don't know, but my initial intuition tells me there is a difference between memorizing and pattern recreation, even if the two aren't easily distinguishable in this particular scenario.

19

DigThatData t1_j6uxsdj wrote

> full image comparison.

That's not actually the metric they used, precisely for the reasons you suggest: they found full-image comparison unreliable. Specifically, images with large black backgrounds were producing spuriously high similarity scores, so they chunked up each image into regions and used the score for the most dissimilar (but corresponding) regions to represent the whole image.
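
If I'm reading it right, the region-wise scoring works out to something like this sketch (the patch grid size and the RMS-style per-patch normalization are my own assumptions, not details from the paper):

```python
import numpy as np

def patchwise_distance(a, b, grid=4):
    """Split two same-sized grayscale images into a grid of corresponding
    patches and return the distance of the *most dissimilar* patch pair.
    A matching region (e.g. a shared black background) then can't drag
    the whole-image score down; every region has to match for the
    distance to stay low."""
    h, w = a.shape[0] // grid, a.shape[1] // grid
    worst = 0.0
    for i in range(grid):
        for j in range(grid):
            pa = a[i*h:(i+1)*h, j*w:(j+1)*w]
            pb = b[i*h:(i+1)*h, j*w:(j+1)*w]
            d = np.sqrt(np.mean((pa - pb) ** 2))  # per-patch RMS distance
            worst = max(worst, d)
    return worst  # low only if *every* region matches
```

The point of taking the max over patches is exactly the black-background case: a full-image average would call two mostly-black images near-duplicates, while the max over regions won't.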

Further, I think they demonstrated their methodology probably wasn't too conservative when they were able to use the same approach to get a 2.3% hit rate from Imagen (concretely: 23 memorized images across 1,000 tested prompts). That hit rate is very likely a big overestimate of Imagen's propensity to memorize, but it demonstrates that the authors' L2 metric is able to do its job.

Also, it's not like the authors didn't look at the images. They did, and found a handful more hits, which that 0.03% figure already accounts for.

2

bubudumbdumb t1_j6uux46 wrote

I would expect a lot of work around regulation. Formal qualification requirements will probably emerge for who can tell a legal jury how to interpret the behavior of ML models and the practices of those who develop them. In other words, there will be DL lawyers. Lawyers might get themselves automated out of courtrooms: if that's the case, humans will be involved only in DL trials, and LLMs will settle everything else from tax fraud to parking tickets. Do you want to appeal the verdict of the LLMs? You need a DL lawyer.

Coding might be automated, but it's really a question of how much good code there is out there to learn from.

Books, movies, music, and VR experiences will be prompted. Maybe even psychoactive substances could be generated and synthesized from prompts (if a DL lawyer signs off on the ML for it). Writing values will change: if words are cheap and attention is scarce, writing in short form becomes valuable.

The real question is who we are going to be to each other, and even more importantly to kids up to age 6.

−2

Agreeable_Dog6536 t1_j6uuq7r wrote

He's asking the opposite - remove the bits with no speech.

I used to do more or less this same thing manually, years ago, for a corporate vlog in which people drove around all day fixing pipe leaks and occasionally commented on what they'd done - they wanted the clips where they commented, edited together.

I basically just looked at the audio waveform and figured out where I should probably cut, and then listened to it to narrow it down.
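
That eyeball-the-waveform heuristic is basically energy-based voice activity detection, which is simple enough to script. A minimal sketch (frame size, RMS threshold, and minimum-gap length below are placeholder values I made up; you'd tune them by ear, same as eyeballing it):

```python
import numpy as np

def speech_regions(samples, rate, frame_ms=50, threshold=0.02, min_gap_s=1.0):
    """Rough cut-point finder: flag frames whose RMS amplitude exceeds a
    threshold, then close a region only after a long enough quiet gap.
    Returns a list of (start_sec, end_sec) spans likely to contain speech."""
    frame = int(rate * frame_ms / 1000)
    n = len(samples) // frame
    # RMS energy per fixed-size frame
    rms = np.sqrt(np.mean(samples[:n * frame].reshape(n, frame) ** 2, axis=1))
    loud = rms > threshold
    gap_frames = int(min_gap_s * 1000 / frame_ms)
    regions, start, quiet_run = [], None, 0
    for i, is_loud in enumerate(loud):
        if is_loud:
            if start is None:
                start = i
            quiet_run = 0
        elif start is not None:
            quiet_run += 1
            if quiet_run >= gap_frames:  # gap long enough: close the region
                regions.append((start * frame / rate,
                                (i - quiet_run + 1) * frame / rate))
                start, quiet_run = None, 0
    if start is not None:  # audio ended mid-region
        regions.append((start * frame / rate, n * frame / rate))
    return regions
```

The minimum-gap merge is what keeps normal pauses between sentences from splitting one comment into a dozen clips.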

If someone hasn't already trained an AI for this, they should.

1

LetterRip t1_j6ut9kc wrote

> I can't tell which is crazier: that it memorizes images at all, or that memorization is such a small fraction of its overall outputs.

It sees most images between 1 time (LAION-2B) and 10 times (the aesthetic dataset is run for multiple epochs). It simply can't learn that much about an image with that few exposures. If you've ever fine-tuned a model on a handful of images, you know it takes a huge number of exposures to memorize an image.

Also, the model capacity is small enough that on average it can learn only about 2 bits of unique information per image.
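
For context on where a figure like that could come from, here's a back-of-envelope division of assumed capacity by dataset size (every number below is my own illustrative guess, not something from the comment):

```python
# Capacity-per-image back-of-envelope. All inputs are assumptions.
params = 1e9          # rough parameter count of a large diffusion model
bits_per_param = 4    # assumed *extractable* information per weight,
                      # far below the 16-32 bits of raw storage
images = 2e9          # LAION-2B-scale training set

capacity_bits = params * bits_per_param
bits_per_image = capacity_bits / images
print(bits_per_image)  # → 2.0 bits of unique information per image
```

Whatever the exact inputs, the shape of the argument is the same: total capacity spread over billions of images leaves almost nothing per image, so wholesale memorization has to be the rare exception.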

10

gdahl t1_j6upct4 wrote

I would say the turning point was when we published the first successful large vocabulary results with deep acoustic models in April 2011, based on work conducted over the summer of 2010. When we published the paper you mention, it was to recognize that these techniques were the new standard in top speech recognition groups.

Regardless, there were deep learning roles in tech companies in 2012, just not very many of them compared to today.

8

Nhabls t1_j6uokwb wrote

It's incredibly easy to make giant LLMs regurgitate training data near verbatim. There's very little reason to believe that this won't just start happening more frequently with image models as they grow in scale as well.

Personally I just hope it brings a reality check in the courts to these companies that think they can just monetize generative models trained on copyrighted material without permission.

3

IDoCodingStuffs t1_j6uk67h wrote

~~In this case the paper seems to use a very conservative threshold to avoid false positives -- L2 distance < 0.1, full image comparison. Which makes sense for their purposes, since they are trying to establish the concept rather than investigating its prevalence.

It is definitely a larger number than 0.03% when you pick a threshold that optimizes the F-score rather than just precision. How much larger? That's a question for a bunch of follow-up studies.~~

17