Recent comments in /f/MachineLearning

LetterRip t1_j6vo0zz wrote

> The model capacity is not spent on learning specific images

I'm completely aware of this. It doesn't change the fact that the average information retained per image is about 2 bits (roughly 2 GB of parameters divided by the total number of images the model trained on).
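For what it's worth, the arithmetic behind that figure can be sketched in a couple of lines. The image count below is an assumption, picked so the ratio matches the ~2-bit figure quoted in this thread; the actual number of training exposures isn't stated here:

```python
# Back-of-envelope: average information capacity per training image.
# Assumed figures (rough, from the comment): ~2 GB of parameters,
# and ~8 billion total image exposures during training.
param_bytes = 2e9             # ~2 GB checkpoint
model_bits = param_bytes * 8  # total bits of parameter storage
images = 8e9                  # assumed total image exposures

bits_per_image = model_bits / images
print(f"{bits_per_image:.1f} bits per image")  # -> 2.0 bits per image
```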

> As an extreme example, imagine you ask 175 million humans each to draw a random digit between 0 and 9 on a piece of paper. You then collect all the drawings into a dataset of 256×256 images. Would you still argue that the SD model's capacity is not enough to fit that hypothetical digits dataset because it can only learn 2 bits per image?

I didn't say it learned 2 bits of pixel data; it learned 2 bits of information. That information lives in a higher-dimensional space, so it is much more informative than 2 bits of raw pixel data, but it is still an extremely small amount of information.

Given that it often takes about 1,000 repetitions of an image to approximately memorize its key attributes, we can infer that memorizing an image takes about 2 × 2^10 ≈ 2,000 bits on average. So each time the model sees an image, it learns about 1/1000 of the available image data, or about 1/2 kB equivalent of compressed image data.
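Reading the two ratios in this comment together (all figures here are the comment's rough estimates, not measurements; the ~500 kB compressed-image size is an assumption implied by the 1/2 kB claim):

```python
# Hypothetical figures from the comment: ~2 bits retained per exposure,
# ~1000 exposures needed to memorize an image's key attributes.
bits_per_exposure = 2
exposures_to_memorize = 1000

# Total model-space information accumulated over ~1000 exposures.
bits_to_memorize = bits_per_exposure * exposures_to_memorize  # ~2**11

# If a compressed training image is ~500 kB, one exposure captures
# roughly 1/1000 of it.
compressed_image_kb = 500  # assumed compressed image size
kb_per_exposure = compressed_image_kb / exposures_to_memorize

print(bits_to_memorize, kb_per_exposure)  # -> 2000 0.5
```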

11

[deleted] t1_j6vhw51 wrote

You must be new here from a gaming subreddit or something where people talk like this, and not actually in a research field.

ChatGPT is the only free, hosted product they have exposed people to. This is actually the norm for OpenAI, and you would be dying on a stale hill.

Other than that, their inference code is open. You can run a local version of GPT with your own code and a locally stored model right now (if you know what you are doing, minor caveat).

Same for their Whisper code. It doesn’t get more open than that. The compute required to train a multi-billion-parameter model isn’t something you could do anyway.

Lastly, “open” doesn’t just mean free of cost. It means intellectually transparent about the code (that is what it has always meant). There’s no reason to confuse the two. It costs on the order of $100k per day to run these models, so I’m not sure what leads you to think that cost should be part of an intellectually open philosophy when you can just deploy GPT yourself if you’re so inclined.

Welcome to the sub.

75

pm_me_your_pay_slips OP t1_j6vgxpe wrote

> on average it can learn 2 bits of unique information per image.

The model capacity is not spent on learning specific images, but on learning the mapping from noise to latent vectors corresponding to natural images. Human-made or human-captured images have common features shared across images, and that's what matters for learning the mapping.

As an extreme example, imagine you ask 175 million humans each to draw a random digit between 0 and 9 on a piece of paper. You then collect all the drawings into a dataset of 256×256 images. Would you still argue that the SD model's capacity is not enough to fit that hypothetical digits dataset because it can only learn 2 bits per image?
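One way to make this thought experiment concrete: since the stroke structure of the ten digits is shared across the whole dataset, the unique information per image is essentially just which class it belongs to, and that costs only log2(10) bits. A quick check of the totals (the 2 GB model size is the figure quoted elsewhere in this thread):

```python
import math

# Unique information needed to identify each image in the hypothetical
# digits dataset: which of the 10 classes it shows.
bits_per_label = math.log2(10)
print(f"{bits_per_label:.2f} bits per image")  # -> 3.32 bits per image

# Total label information across 175 million drawings, in GB,
# compared against a ~2 GB model.
total_gb = 175e6 * bits_per_label / 8 / 1e9
print(f"{total_gb:.3f} GB of labels")  # -> 0.073 GB of labels
```

So the per-image "capacity" needed to pin down every image's class is in the same ballpark as the 2-bit figure, which is the point of the example: shared structure is learned once, not per image.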

1

neanderthal_math t1_j6v9qoj wrote

OK, I’ll bite. : )

The vast majority of the coding, data ingestion, model discovery, and training work that we currently do will go away.

The job will become much more interesting, because researchers will try to understand why certain architectures/training regimes are unable to perform certain tasks. Also, I think the architectures for some fundamental tasks like computer vision and audio are going to become modular. This whole training-models-end-to-end thing is going to be verboten.

2

LetterRip t1_j6v57y5 wrote

Mostly the language model. Imagen uses T5-XXL (the 4.6-billion-parameter variant), DALL-E 2 uses GPT-3 (presumably the 2.7B variant, not the much larger ones used for ChatGPT). SD just uses CLIP, with nothing else. The more sophisticated the language model, the better the image generator can understand what you want; CLIP is close to a bag-of-words model.

18

-xXpurplypunkXx- t1_j6v3fab wrote

Thanks for the context. Maybe a little too much woo in my post.

For me, the fact that some images end up stored almost completely, and with high fidelity, is either an interesting artifact or an interesting property of the model.

But regardless, it is very unintuitive to me given how diffusion models train and behave, due both to the perturbation of training images and to the foreseeable lack of space to encode that much information into a single model state. Admittedly, I don't have much working experience with these sorts of models.

1

starstruckmon t1_j6v3etd wrote

From the paper:

> Our attack extracts images from Stable Diffusion most often when they have been duplicated at least k = 100 times

That's where the 100 comes from. The 10 is supposed to be the number of epochs, but I don't think it was trained for that many; more like 5 or so (you can look at the model card; it's not easy to give an exact number).

13

Wiskkey t1_j6v0hqg wrote

The fact that Stable Diffusion v1.x models memorize images is noted in the various v1.x model cards. For example, the following text is from the Stable Diffusion v1.5 model card:

>No additional measures were used to deduplicate the dataset. As a result, we observe some degree of memorization for images that are duplicated in the training data. The training data can be searched at https://rom1504.github.io/clip-retrieval/ to possibly assist in the detection of memorized images.

11

znihilist t1_j6uz705 wrote

If you have a set of number pairs, (1, 1), (2, 3.95), (3, 9.05), (4, 16.001), etc., they can be fitted with x^2. But x^2 does not contain the four pairs anywhere; it can only recreate them to a certain degree of precision if you plug in the x values.

Is f(x) = x^2 memorizing the inputs, or just able to recreate them because they lie in its possible outcome space?
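The analogy can be run directly. A minimal sketch with numpy, using the comment's own noisy values: the fit stores three coefficients, not the four data pairs, yet it reproduces them closely.

```python
import numpy as np

# The four pairs from the comment; y is roughly x**2 plus small noise.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.95, 9.05, 16.001])

# Least-squares fit of a degree-2 polynomial: three coefficients total.
coeffs = np.polyfit(x, y, deg=2)
y_hat = np.polyval(coeffs, x)

print(np.round(coeffs, 3))        # leading coefficient close to 1, i.e. ~x**2
print(np.max(np.abs(y_hat - y)))  # small residual: recreation, not storage
```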

36