Recent comments in /f/MachineLearning

mil24havoc t1_j64ogl0 wrote

It basically means you read the paper and write the code to do what the paper describes yourself.

If you start with their code base, then your work is derivative of that copyrighted work and the question becomes a bit more complicated.

Yes, the line is fuzzy. However, it's typically very easy to stay on the "not copyright or license infringing" side of the line if you make an honest effort to rewrite the code from scratch and simply use their code base to check your understanding of the algorithm.

Again, IANAL, but changing a for loop to a while loop is probably not sufficient to distinguish their work from yours. Rewriting the code in another language may be. Rewriting it in the same language but making substantial changes to (for example) the user interface, data preprocessing, training data, hyperparameters, etc. may be.
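
To make that concrete, here's a toy example (entirely made up, not from any real code base) of the kind of mechanical change that almost certainly isn't enough:

```python
weights, grads, lr = [0.5, -0.2], [0.1, 0.3], 0.01  # toy values

# "Original" gradient update:
for i in range(len(weights)):
    weights[i] -= lr * grads[i]

# The for-to-while "rewrite": line-for-line the same logic, almost
# certainly still derivative of whatever it was copied from.
i = 0
while i < len(weights):
    weights[i] -= lr * grads[i]
    i += 1
```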

Edit: courts and lawyers usually aren't too concerned with technical details. Think of it like a book. The same story gets told over and over again by different authors who use different words to tell it. Your implementation needs to tell the same story but in different words, basically.

14

romantimm25 OP t1_j64mzgp wrote

What I never quite understand is what it means to "reimplement" the algorithm.

I mean, where does the line lie between being too similar to the original and being completely different?

Of course there are the obvious cases, like changing a "for loop" to a "while loop". But then, does switching out a library that the paper's code depends on mean that the implementation is different enough?

4

mil24havoc t1_j64lhxk wrote

IANAL, but copyright protects the paper's text, data, and code. Algorithms themselves can't be copyrighted. If you reimplement the algorithm, you can do whatever you want with it.

Edit to add: licenses on (trained) models haven't been tested in court as far as I'm aware. I can imagine this being very complicated. Can you copyright and license a linear regression fit to simple economic data? For example: log(gdp) = alpha + beta×population? That seems silly. So why would a Transformer (e.g.) be any different? If you add Gaussian noise to every weight in a Transformer, is the license still valid?
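
To make that thought experiment concrete, a toy sketch (randomly initialized model and a made-up noise scale, so purely illustrative):

```python
import torch

# Perturb every weight of a Transformer layer with Gaussian noise.
# If this were a licensed, trained model, is the result still the
# "licensed artifact"? Nobody has tested that in court.
model = torch.nn.TransformerEncoderLayer(d_model=64, nhead=4)
with torch.no_grad():
    for p in model.parameters():
        p.add_(torch.randn_like(p) * 0.01)  # sigma = 0.01, chosen arbitrarily
```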

20

EarthquakeBass t1_j64jhk3 wrote

https://en.m.wikipedia.org/wiki/Huang%27s_law

A bit of marketing flair for sure, but I think at the crossroads of hardware improvements, ensembling, clever optimizations, etc., we will keep improving models at a pretty darn fast pace. GPT-3 alone has dramatically improved the productivity of engineers, I'm sure of it.

3

WarProfessional3278 t1_j649od6 wrote

Does anyone know of any good AI-generated text detectors? I know there's GPTZero but it's not very good in my experience.

My research has led me to Hive AI, but I'm sure there are better alternatives out there that don't claim such strong results (99.9% accuracy) while still producing a lot of false positives in my tests.

1

HateRedditCantQuitit t1_j647xm6 wrote

I'm not sure how long you've been around, but before BPE came along, large vocabularies were actually quite a pain in the ass. You can find lots of literature about it from before maybe 2016 (can't remember the exact dates to look up, and I'm feeling lazy).

IIRC, a big issue was the final prediction layer. Say you're predicting a sequence 4k tokens long. Then you have 4k times vocab-size predictions. With a 50k-token vocab, that's 200M logits in memory (roughly 1 gig in floats). Let's say we want to compress 20x more languages equally well, so we get 1M tokens (speaking super roughly), which means nearly 20GB just to represent the logits. And if we wanted to handle a 40k-long sequence, it's the difference between 20GB and 200GB of logits.
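
A quick back-of-the-envelope check of those numbers (4-byte floats, single sequence, no other overhead):

```python
def logit_memory_gib(seq_len: int, vocab_size: int, bytes_per_float: int = 4) -> float:
    """GiB needed just to hold the output logits for one sequence."""
    return seq_len * vocab_size * bytes_per_float / 2**30

print(logit_memory_gib(4_000, 50_000))      # ~0.75 GiB  (the "roughly 1 gig")
print(logit_memory_gib(4_000, 1_000_000))   # ~15 GiB    (the "nearly 20GB")
print(logit_memory_gib(40_000, 1_000_000))  # ~149 GiB   (the "200GB")
```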

That said, BPE just takes in sequences of simpler tokens. If you want to feed it Unicode, go ahead. If you want to feed it something else, that will work too. It seems like you're mostly frustrated that LLM investment is focused on English right now, which is valid. Tech investment in general has a strong Silicon Valley bias, and a zillion people want to recreate that elsewhere. But that's a very hard economic question.

1

WikiSummarizerBot t1_j646sbr wrote

Ship of Theseus

>The Ship of Theseus is a thought experiment about whether an object that has had all of its original components replaced remains the same object. According to legend, Theseus, the mythical Greek founder-king of Athens, had rescued the children of Athens from King Minos after slaying the minotaur and then escaped on a ship to Delos. Every year, the Athenians commemorated this legend by taking the ship on a pilgrimage to Delos to honor Apollo. The question was raised by ancient philosophers: After several centuries of maintenance, if every part of the Ship of Theseus had been replaced, one at a time, was it still the same ship?


1

john_the_jedi t1_j646qh1 wrote

Hey everyone, I'm the first author of the preprint "A Watermark For Large Language Models": https://arxiv.org/abs/2301.10226. I thought I'd jump in with a few comments on some of the questions in this thread, especially those relating to our approach.

  1. Our watermark is mathematically constructed to minimize false positives (flagging human text as machine generated), even if that costs us a few missed detections of actual machine-generated text. At any sufficient length of text, say 100-200 words, there is a near-zero chance of a false positive. This is obviously the type of error we'd all most like to avoid (a rough sketch of the detection test is below, after this list).
  2. We are not anti-LLM in any general way; these are amazing tools for everyone to use! Rather, we think it's much better to have a new tool, watermarks, embedded in these models sooner rather than later. A world in which we have limited (currently zero, really) ways of distinguishing AI-generated and human-generated content is likely to have consequences that are difficult to wrestle with. We're concerned with bot farms, and with accidentally retraining "GPT-10" on tons of old GPT-3 outputs.
  3. On removing the watermark: we don't claim it can't be removed, just that we've constructed the watermarking procedure so that removal is difficult and comes with a cost to output quality. The fact that many people suggest they'll just use another LM to paraphrase the output, or paraphrase it themselves, gets at a philosophical point we couldn't spend much time on in the paper (though we do run some attack experiments that try to remove the watermark). À la the Ship of Theseus, if you rewrite the watermark out of the text, it's no longer the original text anyway, even if it feels conceptually similar. Rewriting and rephrasing a paragraph from a textbook in your own words and putting it in a term paper has always been a way to try to pass off the thoughts and ideas of others as your own. That fact of the world is unchanged.
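
For the curious, here's a rough, simplified sketch of the detection test behind point 1 (the gamma value and counts here are illustrative; see the paper for the exact procedure):

```python
import math

# Simplified one-proportion z-test: gamma is the fraction of the vocabulary
# placed on the "green list" at each step. Watermarked generation is nudged
# toward green tokens, so their count in watermarked text sits far above the
# gamma * T expected by chance in human-written text.
def detection_z_score(green_count: int, total_tokens: int, gamma: float = 0.25) -> float:
    expected = gamma * total_tokens
    std = math.sqrt(total_tokens * gamma * (1.0 - gamma))
    return (green_count - expected) / std

# e.g. 100 tokens with 60 on the green list: z ~ 8.1, which essentially
# never happens by chance in unwatermarked text.
print(detection_z_score(60, 100))
```

"Detected" is then just a cutoff on z (say z > 4), which is what lets us push the false positive rate toward zero at the cost of missing some weakly watermarked text.
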
1

MadScientist-1214 t1_j6433qc wrote

At my institute, nobody had trained on ImageNet, so I had to figure it out myself too. If you train architectures like VGG, it does not take long: under 2 days on a single A100, or at most 5 days on a worse GPU. The most important thing is to use an SSD; that alone cut around 2 days off training for me. A good learning-rate scheduler is really important. Most researchers ignore the test set and use only the validation set. Also important: use mixed precision. You should really tune for training speed if you need to run a lot of experiments.
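
For reference, a minimal sketch of that setup in PyTorch (AMP for mixed precision plus a scheduler; the fake dataset and hyperparameters are placeholders, not my exact recipe):

```python
import torch
import torchvision

# Stand-in data so the sketch runs; swap in ImageNet read from an SSD for real runs.
train_loader = torch.utils.data.DataLoader(
    torchvision.datasets.FakeData(size=256, num_classes=1000,
                                  transform=torchvision.transforms.ToTensor()),
    batch_size=64, num_workers=8, pin_memory=True)

model = torchvision.models.vgg16(num_classes=1000).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=90)
criterion = torch.nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()  # keeps fp16 gradients numerically stable

for epoch in range(90):
    for images, targets in train_loader:
        images, targets = images.cuda(), targets.cuda()
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():  # forward pass in mixed precision
            loss = criterion(model(images), targets)
        scaler.scale(loss).backward()    # backprop on the scaled loss
        scaler.step(optimizer)           # unscales grads, then steps
        scaler.update()
    scheduler.step()                     # the learning-rate schedule matters a lot
```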

13

Quaxi_ t1_j6421fo wrote

And while being easier to train, they give better results.

Diffusion models are also so much more versatile in their application because of their iterative process.

For example, you can do inpainting or img-to-img just by conditioning the noising process in different ways, whereas you would have to retrain the whole GAN to achieve that.
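
A conceptual sketch of the img-to-img trick (DDPM-style notation; `model.denoise_step` and the `alpha_bar` schedule are hypothetical stand-ins for a trained diffusion model and its noise schedule):

```python
import torch

def img2img(model, alpha_bar: torch.Tensor, x0: torch.Tensor,
            strength: float = 0.6) -> torch.Tensor:
    """Start the reverse process from a partially-noised input image
    instead of pure noise; strength controls how much of the chain to redo."""
    t_start = int(strength * (len(alpha_bar) - 1))
    # Forward-diffuse the input to timestep t_start: q(x_t | x_0) in DDPM.
    noise = torch.randn_like(x0)
    x = alpha_bar[t_start].sqrt() * x0 + (1 - alpha_bar[t_start]).sqrt() * noise
    # Then run the ordinary reverse denoising steps from there.
    for t in reversed(range(t_start)):
        x = model.denoise_step(x, t)  # hypothetical: one reverse-diffusion step
    return x
```

Inpainting works on the same principle: at each reverse step you overwrite the known pixels with their appropriately-noised originals, so the model only fills in the masked region.
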

3

cthorrez t1_j63uc5a wrote

I have an issue with the experiments.

> For ICL, we fix the number of demonstration examples to 32 and tune the random seed for each task to find a set of demonstration examples that achieves the best validation performance. For finetuning, we use the same demonstration examples for ICL as the training examples and use SGD as the optimizer

They go through a set of random seeds to pick the "best" possible samples for in context learning, and then use the same set of examples for fine tuning. I think this biases the results in favor of in context learning.

A fairer way to do this would be to use a truly random set of examples, or to use the same approach and tune the seed to find the "best" set of examples for finetuning as well.
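
A toy simulation of why that seed search matters (numbers made up, purely illustrative):

```python
import random

random.seed(0)
# 20 candidate demonstration sets whose validation scores differ only by noise:
scores = [random.gauss(0.70, 0.03) for _ in range(20)]
print(f"single random seed: {scores[0]:.3f}")    # what an untuned baseline sees
print(f"best of 20 seeds:   {max(scores):.3f}")  # what the seed-tuned ICL setup reports
```

Even when every demonstration set is equally good in expectation, reporting the max over 20 noisy draws inflates the ICL number by several points.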

5