Recent comments in /f/MachineLearning

BlazeObsidian t1_j6xbu8f wrote

You can try Kaggle notebooks and Google Colab notebooks, but their sessions don't persist for long; they typically shut down after about 6 hours. You'll have to periodically save your best model/hyperparameters (see the checkpointing sketch below), but that might be a viable free option.
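For the periodic saving, a minimal checkpointing sketch (assuming PyTorch; the model, optimizer, and loss here are placeholders, not from the thread):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                      # stand-in for your actual model
optimizer = torch.optim.Adam(model.parameters())

best_loss = float("inf")
for epoch in range(100):
    loss = torch.rand(1).item()               # placeholder for your training step
    if loss < best_loss:
        best_loss = loss
        torch.save({
            "epoch": epoch,
            "model_state": model.state_dict(),
            "optimizer_state": optimizer.state_dict(),
            "best_loss": best_loss,
        }, "checkpoint.pt")                   # on Colab, point this at mounted Drive

# After the session is recycled, resume from the checkpoint:
ckpt = torch.load("checkpoint.pt")
model.load_state_dict(ckpt["model_state"])
optimizer.load_state_dict(ckpt["optimizer_state"])
```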

Google Colab also has a paid option where you can upgrade the RAM, GPU, etc. to meet your needs.

But I am curious as to why it's taking 21 hours. Have you checked your course forums/discussions for the expected runtime?

https://www.kaggle.com/

https://colab.research.google.com/

4

znihilist t1_j6xa0o3 wrote

My point is more that f(x) = x^2 doesn't have 3.95 in it anywhere. Another option would be to write f(x) as the interpolating polynomial

f(x) = -(x-2)(x-3)(x-4)\*1/6 + (x-1)(x-3)(x-4)\*3.95/2 - (x-1)(x-2)(x-4)\*9.05/2 + (x-1)(x-2)(x-3)\*16.001/6

This recreates the original points: plug in x = 1 and you get -(-1)(-2)(-3)\*1/6 + (0)(-2)(-3)\*3.95/2 - (0)(-1)(-3)\*9.05/2 + (0)(-1)(-2)\*16.001/6, which is just 1.

This version of f(x) has "memorized" the inputs and is written as a direct function of them, whereas x^2 has nothing in it that can be traced back to the original inputs. Both functions recreate the original points, although one to infinite precision (RMSE = 0) and the other to an RMSE of ~0.035.
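As a quick numerical check (my own sketch; the points (1, 1), (2, 3.95), (3, 9.05), (4, 16.001) are inferred from the interpolation formula above):

```python
import numpy as np

xs = np.array([1.0, 2.0, 3.0, 4.0])
ys = np.array([1.0, 3.95, 9.05, 16.001])

def f_interp(x):
    # Interpolating polynomial written directly in terms of the data
    return (-(x - 2) * (x - 3) * (x - 4) * 1 / 6
            + (x - 1) * (x - 3) * (x - 4) * 3.95 / 2
            - (x - 1) * (x - 2) * (x - 4) * 9.05 / 2
            + (x - 1) * (x - 2) * (x - 3) * 16.001 / 6)

print(f_interp(xs))                           # recovers ys exactly (RMSE = 0)
print(np.sqrt(np.mean((xs**2 - ys) ** 2)))    # x^2 fit: RMSE ~= 0.035
```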

I think we intuitively recognize that these two functions are not the same, even beyond the obvious difference that the first is a cubic polynomial and the other a quadratic. The point is that while "memorize" is applicable in both cases, one stores a copy of the inputs and the other recreates them from scratch, and I believe the two mean different things in their legal implications.

Also, I find the philosophical divide on this very interesting, and with the genie out of the bottle, short of strong societal change and pressure, it is never going back in.

1

visarga t1_j6x8zna wrote

I think open source implementations will eventually get there. They probably need much more multi-task and RLHF data, or they had too little code in their initial pre-training. Training GPT-3.5-like models is like following a recipe, and the formula and ingredients are gradually becoming available.

3

schwagggg t1_j6x7eh7 wrote

i recently found a paper from Blei's lab that uses normalizing flows to learn a KL(p||q) objective instead of the usual KL(q||p) in variational inference (might be what the other commenter is referring to), but i'm afraid that's not what you're interested in.
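for anyone unfamiliar with the shorthand, a toy sketch of the two objectives (illustrative Gaussians, nothing from the paper itself):

```python
import torch
from torch.distributions import Normal

p = Normal(0.0, 1.0)   # stand-in for the true posterior
q = Normal(0.5, 2.0)   # variational approximation

# Reverse KL(q||p): expectation under q -- the usual VI objective (mode-seeking).
z_q = q.sample((100_000,))
kl_qp = (q.log_prob(z_q) - p.log_prob(z_q)).mean()

# Forward KL(p||q): expectation under p -- the "klpq" objective (mass-covering).
z_p = p.sample((100_000,))
kl_pq = (p.log_prob(z_p) - q.log_prob(z_p)).mean()

print(kl_qp.item(), kl_pq.item())   # the two directions generally differ
```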

apart from that, the last application-wise SOTA i can remember was GLOW.

2

TrevorIRL t1_j6x5uer wrote

So it costs them $100,000/day to run.

30 days * $100,000/day = $3,000,000/month in costs.

10 million users * 20% who will buy (Pareto Principle) = 2 million users who buy a subscription.

2 million * $20/month = $40,000,000/month in revenue.

Assuming I did my math right, that’s some pretty amazing margins and it’s only going to get better!
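Plugging those numbers in (same assumptions as above, including the assumed 20% conversion):

```python
daily_cost = 100_000
monthly_cost = 30 * daily_cost               # $3,000,000
subscribers = 10_000_000 * 0.20              # 2,000,000 paying users
monthly_revenue = subscribers * 20           # $40,000,000
margin = (monthly_revenue - monthly_cost) / monthly_revenue
print(f"{margin:.1%}")                       # 92.5%
```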

4

visarga t1_j6x1uwy wrote

> The extent to which something is memorized ... is certainly something to be discussed.

A one-in-a-million chance of memorisation, even when you're actively looking for it, is hardly worth discussing.

> We select the 350,000 most-duplicated examples from the training dataset and generate 500 candidate images for each of these prompts (totaling 175 million generated images). We find 109 images are near-copies of training examples.

On the other hand, these models compress billions of images into a few GB. That is less than 1 byte on average per training example, so there's no space for significant memorisation, which is probably why only 109 memorised images were found.

I would say I am impressed there were so few of them. If you use a blacklist for these images, you can be 100% sure the model is not regurgitating that training data verbatim.

I would suggest the model developers remove these images from the training set and replace them with variations generated by the previous model, so the new model learns only the style and not the exact composition of the originals. Replacing originals with variations (same style, different composition) would be a legitimate way to avoid close duplication.
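One possible shape for that blacklist filter, as a sketch (perceptual hashing via the imagehash library; all file paths are placeholders):

```python
from PIL import Image
import imagehash

# Hashes of the known-memorised images to exclude
blacklist = [imagehash.phash(Image.open(p))
             for p in ["memorised_001.png", "memorised_002.png"]]

def is_blacklisted(path, max_distance=5):
    # True if the image is a near-duplicate of a blacklisted one
    h = imagehash.phash(Image.open(path))
    return any(h - b <= max_distance for b in blacklist)

training_set = [p for p in ["img_a.png", "img_b.png"]
                if not is_blacklisted(p)]
```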

2

alpha-meta OP t1_j6x1r2j wrote

But isn't this only the case if you train it on the next-word-prediction loss (negative log-likelihood), i.e., what they do during pretraining?

If you instead use the ranks (from having users rank the documents) to compute the loss, rather than the words as labels, would that still be the case?
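For concreteness, a sketch of the kind of pairwise ranking loss used for RLHF reward models (assumed setup; the scores are dummy values):

```python
import torch
import torch.nn.functional as F

def ranking_loss(score_preferred, score_rejected):
    # -log sigmoid(r_w - r_l): pushes preferred completions above rejected ones
    return -F.logsigmoid(score_preferred - score_rejected).mean()

# Dummy scalar scores for a batch of ranked pairs
score_preferred = torch.tensor([1.2, 0.3, 2.0])
score_rejected = torch.tensor([0.7, 0.9, 1.1])
print(ranking_loss(score_preferred, score_rejected))
```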

4