Recent comments in /f/MachineLearning
pm_me_your_pay_slips OP t1_j6wn43x wrote
Reply to comment by DigThatData in [R] Extracting Training Data from Diffusion Models by pm_me_your_pay_slips
That models which memorize more also generalize better has been observed in large language models:
https://arxiv.org/pdf/2202.07646.pdf
https://arxiv.org/pdf/2205.10770.pdf
An interesting way to quantify memorization is proposed here, although it will be expensive for a model like SD: https://proceedings.neurips.cc/paper/2021/file/eae15aabaa768ae4a5993a8a4f4fa6e4-Paper.pdf.
Basically: you perform K-fold cross-validation and measure how much more likely an image is under the model when it was included in the training set versus when it was held out. For memorized images, the likelihood when the image is held out drops to near zero. Note that the authors caution against using nearest-neighbour distance to quantify memorization, as it is not correlated with the described memorization score.
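A toy sketch of that in-vs-out likelihood comparison, with a KDE standing in for the actual generative model (function names and setup are my own invention, purely illustrative; doing this for real with SD is exactly the expensive part):

```python
import numpy as np
from scipy.stats import gaussian_kde

def memorization_scores(data, k=5, seed=0):
    """For each point, compare its log-density under a model fit WITH it
    vs a model fit WITHOUT its fold. Memorized points should show a
    large gap (held-out density collapses)."""
    rng = np.random.default_rng(seed)
    n = len(data)
    folds = rng.permutation(n) % k        # assign each point to one of k folds
    full = gaussian_kde(data)             # "model" trained on everything
    scores = np.empty(n)
    for f in range(k):
        held_out = folds == f
        partial = gaussian_kde(data[~held_out])  # retrained without fold f
        # log-likelihood ratio: in-training vs held-out
        scores[held_out] = np.log(full(data[held_out])) - np.log(partial(data[held_out]))
    return scores
```

Swap the KDE for the diffusion model's (approximate) likelihood and this is the same bookkeeping, just K expensive retrainings instead of K cheap KDE fits.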
iqisoverrated t1_j6wmqjs wrote
Either you do game theory or machine learning for this...but using both at the same time is sorta dumb because you'll be making either approach less effective.
enryu42 t1_j6wme0p wrote
Nice! It is pretty clear that big models memorize some of their training examples, but the ease of extraction is impressive.
I wonder what would be the best mitigation strategies (besides the obvious one of de-duplicating training images). Theoretically sound approaches (like differential privacy) will perhaps cripple the training too much. I wonder if some simple hacks would work: e.g. train the model as-is first, then generate an entirely new training set using the model and synthetic prompts, and train a new model from scratch only on the generated data.
Another aspect of this is on the user-experience side. People can reproduce copyrighted images with just pen and paper, but they'll be fully aware of what they're doing in such a case. With diffusion models, the danger is that the user can reproduce an existing image without realizing it. Maybe augmenting the various UIs with reverse image search / nearest-neighbour lookup would be a good idea? Or computing training-set attributions for generated images with something along the lines of TracIn.
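What I mean by the UI-side lookup is something like this, assuming you have precomputed embeddings (e.g. CLIP) for the training images; the function name and threshold are made up:

```python
import numpy as np

def nearest_training_match(gen_emb, train_embs, threshold=0.95):
    """Flag a generated image whose embedding is suspiciously close to
    some training image. Returns (index, similarity, flagged)."""
    # normalize to unit length so the dot product is cosine similarity
    g = gen_emb / np.linalg.norm(gen_emb)
    t = train_embs / np.linalg.norm(train_embs, axis=1, keepdims=True)
    sims = t @ g
    idx = int(np.argmax(sims))
    return idx, float(sims[idx]), bool(sims[idx] >= threshold)
```

At training-set scale you'd replace the brute-force matrix product with an approximate nearest-neighbour index, but the UI warning logic stays the same.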
EducationalCicada t1_j6wkdlr wrote
Reply to comment by lmericle in [D] What does a DL role look like in ten years? by PassingTumbleweed
I would even say that neural networks are not the be-all end-all of machine learning.
Acceptable-Cress-374 t1_j6wiuq3 wrote
Reply to comment by the_architect_ai in [P] AI Poker/Machine Learning/Game-Theory by Much_Blacksmith_1857
Is Pluribus open source? I just remembered that I listened to Lex's podcast with the guy who worked on Pluribus, really interesting story. Can't remember if they released the software or not...
Acceptable-Cress-374 t1_j6wirsd wrote
Reply to comment by Mysterious_Ad_8286 in [P] AI Poker/Machine Learning/Game-Theory by Much_Blacksmith_1857
Not really wanting to contradict you, but how would they do that? The mere idea of detecting a poker playing bot seems much more complicated than detecting chess bots, and they're still having trouble over there. How'd you go about detecting bot play in a game with imperfect information, high variance and a very large decision state?
GoofAckYoorsElf t1_j6whme3 wrote
Reply to comment by Ulfgardleo in [R] Extracting Training Data from Diffusion Models by pm_me_your_pay_slips
There is no simple answer to that. It clearly depends on the person whose work I use, on the purpose (fair use, inspiration), on the credit that I give, on the way society benefits from either them clinging to their business model or me being allowed to use their work, on so many different things that there simply is no simple answer.
krand16 t1_j6whjv5 wrote
Much_Blacksmith_1857 OP t1_j6wg7vi wrote
Reply to comment by the_architect_ai in [P] AI Poker/Machine Learning/Game-Theory by Much_Blacksmith_1857
6-Max. We are looking at 8-Max.
badabummbadabing t1_j6wfsok wrote
Reply to comment by jimmymvp in [D] Normalizing Flows in 2023? by wellfriedbeans
Exact likelihoods are what attracted me to normalizing flows once, too. But I soon found them too hard to train to yield any useful likelihoods. The bijectivity constraint (meaning that your 'latent' space is just as large as your data space) seems like too much of a restriction in practice. For my application, switching to variational models and just accepting that I'll only get lower bounds on the likelihood got me further in the end. Diffusion models would be a more 'modern' option in this regard as well.
Are you aware of any applications, where people actually use NFs for likelihoods? I am aware of some research papers, but I'd say that their experiments are too much of a contrived example to convince me that this will ever find its way into an actual application.
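For what it's worth, the exact-likelihood property being discussed is just the change-of-variables formula; a minimal 1-D "flow" sketch (affine map, standard normal base distribution, all names mine):

```python
import numpy as np

def affine_flow_loglik(x, scale, shift):
    """Exact log-likelihood of a trivial 1-D affine flow:
    z = scale * x + shift, base density N(0, 1).
    log p(x) = log N(z; 0, 1) + log|dz/dx|  (change of variables)."""
    z = scale * x + shift
    log_base = -0.5 * (z**2 + np.log(2 * np.pi))  # standard normal log-density
    log_det = np.log(np.abs(scale))               # log|Jacobian| of the map
    return log_base + log_det
```

The bijectivity complaint above is visible even here: the map has to be invertible, so z lives in a space exactly as large as x; stacking many such invertible layers is the whole NF game.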
Mysterious_Ad_8286 t1_j6wfj8u wrote
Reply to comment by Remco32 in [P] AI Poker/Machine Learning/Game-Theory by Much_Blacksmith_1857
Maybe OP is working on a new application?
MemeBox t1_j6wffzx wrote
In 10 years I'm not sure we will need humans at all, let alone DL specialists. Look at the progress curve, we are a hop skip and a jump from an Einstein in every home.
Mysterious_Ad_8286 t1_j6wfft9 wrote
Reply to comment by PigException in [P] AI Poker/Machine Learning/Game-Theory by Much_Blacksmith_1857
Bots have been banned from online poker; otherwise we would not stand a chance.
MemeBox t1_j6wfbl3 wrote
Reply to comment by visarga in [D] What does a DL role look like in ten years? by PassingTumbleweed
ha. So all people are useless? The walking talking GAI that is the human form is completely useless?
the_architect_ai t1_j6weyc4 wrote
Has already been done: Pluribus.
mongoosefist t1_j6wed0f wrote
Reply to comment by bushrod in [R] Extracting Training Data from Diffusion Models by pm_me_your_pay_slips
When the latent representation is trained, it should learn an accurate representation of the training set, but obviously with some noise because of the regularization that happens by learning the features along with some Gaussian noise in the latent space.
So by theoretically, I meant that due to the way the VAE is trained, on paper you could prove that you should be able to get an arbitrarily close representation of any training image if you can direct the denoising process in a very specific way. Which is exactly what these people did.
I will say there is some hand-waving involved, however, because even though it should be possible, if you have enough images that are similar enough in the latent space that there is significant overlap between their distributions, it's going to be intractably difficult to recover these 'memorized' images.
NitroXSC t1_j6wdaje wrote
> Compute CLIP embeddings for the images in a training dataset.
A good follow-up question is whether it would be possible to recover much of the training data if you don't know the training data a priori.
red_b3 t1_j6wd8yx wrote
Reply to comment by alkibijad in [D] Apple's ane-transformers - experiences? by alkibijad
Second that!
[deleted] t1_j6wcxc9 wrote
Reply to comment by Monoranos in [N] OpenAI starts selling subscriptions to its ChatGPT bot by bikeskata
Stolen from whom? This comment you posted doesn’t belong to you. Images you post on Instagram don’t belong to you.
Can you explain your thinking a bit more?
Or are you basically realizing how important SOPA was 7 years later, well into the next AI boom when the horse has very much left the barn?
Perhaps you are young and inexperienced in this domain — or both?
BobSteva t1_j6wckhc wrote
Reply to [P] An open source tool for repeatable PyTorch experiments by embedding your code in each model checkpoint by latefordinnerstudios
Love the donuts on your website!
Ulfgardleo t1_j6wcfav wrote
Reply to comment by GoofAckYoorsElf in [R] Extracting Training Data from Diffusion Models by pm_me_your_pay_slips
No, you are now just writing whatever you like.
Is it right to use someone else's work without asking or paying for it?
Mescallan t1_j6wbz6f wrote
Reply to comment by HateRedditCantQuitit in [R] Extracting Training Data from Diffusion Models by pm_me_your_pay_slips
Surmisable information is not the same as memorization.
fraktall t1_j6wbugp wrote
I don’t get why graph NNs aren’t attracting more attention tbh
PigException t1_j6wblon wrote
Reply to comment by ProSmokerPlayer in [P] AI Poker/Machine Learning/Game-Theory by Much_Blacksmith_1857
From what I heard it seems convincing that it has been solved, so does it mean that bots are now playing online poker?
iqisoverrated t1_j6wn72a wrote
Reply to comment by Acceptable-Cress-374 in [P] AI Poker/Machine Learning/Game-Theory by Much_Blacksmith_1857
>The mere idea of detecting a poker playing bot seems much more complicated than detecting chess bots
It just takes more hands to detect, but it's not that hard. You can look at extremely low-frequency plays that hit exactly the right mixing frequency in spots where a human would use an always/never approach. If you see such plays in several different spots, you can be fairly confident it's a bot.
(Just like in chess. A human could make all perfect moves - but after some perfect moves it just becomes very unlikely)
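The frequency argument could be sketched as a simple statistical check; the 1/3 mixing target, the function, and the z-score cutoff here are hypothetical, just to illustrate:

```python
import math

def looks_like_mixed_strategy(action_count, spot_count, target=1/3, z=2.0):
    """Over many near-identical marginal spots, a bot mixes at a precise
    rate (a hypothetical 1/3 here) while a human tends toward always
    (freq ~1) or never (freq ~0). Flag the sample if its observed
    frequency sits within z standard errors of the target mixing rate
    but clearly away from both pure strategies."""
    freq = action_count / spot_count
    se = math.sqrt(target * (1 - target) / spot_count)  # binomial std. error
    near_target = abs(freq - target) <= z * se
    far_from_pure = min(freq, 1 - freq) > z * se
    return near_target and far_from_pure
```

Which is why it "just takes more hands": the standard error shrinks like 1/sqrt(spot_count), so with enough samples the mixed frequency becomes unmistakable.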