Recent comments in /f/MachineLearning

Leptino t1_j4oxrdp wrote

It shouldn't be too difficult to produce a watermark provided the output is something on the order of a paragraph. However, I don't think it's always possible. For instance, I could ask ChatGPT to rewrite the previous paragraph, replacing all nouns and verbs while keeping the same meaning.

Further tweaking by a human should completely destroy any residual watermark.
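To make the "on the order of a paragraph" point concrete, here is a minimal sketch of how a statistical watermark might be detected. It assumes a "green list" style scheme in which the generator favors a pseudo-random subset of tokens. That scheme, the `gamma` fraction, and every function name below are assumptions for illustration, not anything ChatGPT is known to use.

```python
import hashlib
import math


def is_green(prev_token, token, gamma=0.5):
    """Pseudo-randomly assign `token` to the green list, seeded by the previous token.

    Hypothetical partition rule: roughly a fraction `gamma` of tokens is green
    for any given previous token.
    """
    digest = hashlib.sha256(f"{prev_token}\x00{token}".encode("utf-8")).digest()
    return digest[0] / 256 < gamma


def z_score(green_count, total, gamma=0.5):
    """Standard deviations by which the green count exceeds what unwatermarked text would give."""
    return (green_count - gamma * total) / math.sqrt(total * gamma * (1 - gamma))


def detect(tokens, gamma=0.5):
    """Score a token sequence; a higher score means more evidence of a watermark."""
    pairs = list(zip(tokens, tokens[1:]))
    if not pairs:
        return 0.0
    green = sum(is_green(prev, tok, gamma) for prev, tok in pairs)
    return z_score(green, len(pairs), gamma)
```

With the same 90% green fraction, `z_score(9, 10)` is about 2.5 while `z_score(90, 100)` is 8.0, which is why short outputs are hard to flag. Swapping out nouns and verbs pushes the green fraction back toward `gamma`, and the score drops with it.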

1

armchair-progamer t1_j4ovjjm wrote

> digital watermark

Wouldn't it be easier to store the model outputs or a perceptual hash, and then provide a way to determine if some text is similar to prior ChatGPT output? I assumed they were already doing something like this to collect usage data as they scrape new content.
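The store-and-compare idea above could look roughly like this. It is a minimal sketch assuming word shingles hashed with SHA-1 and compared by Jaccard similarity, which is a stand-in for whatever perceptual hash one would really pick. `OutputStore`, the shingle size, and the 0.5 threshold are all hypothetical choices.

```python
import hashlib
import re


def shingles(text, k=3):
    """Return the set of k-word shingles in `text` (lowercased, punctuation ignored)."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def fingerprint(text, k=3):
    """Hash each shingle so the store keeps fingerprints rather than raw text."""
    return {hashlib.sha1(s.encode("utf-8")).hexdigest()[:16] for s in shingles(text, k)}


def jaccard(a, b):
    """Size of the intersection over size of the union; 0.0 for two empty sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


class OutputStore:
    """Hypothetical store of fingerprints of previously generated outputs."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold
        self.fingerprints = []

    def add(self, text):
        self.fingerprints.append(fingerprint(text))

    def best_match(self, text):
        """Highest similarity between `text` and any stored output (linear scan)."""
        query = fingerprint(text)
        return max((jaccard(query, fp) for fp in self.fingerprints), default=0.0)

    def looks_generated(self, text):
        return self.best_match(text) >= self.threshold
```

A lightly trimmed copy of a stored output still scores high, while a thorough paraphrase shares few shingles and slips through. The linear scan is only for illustration; at real scale you would want an index such as MinHash with locality-sensitive hashing.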

ChatGPT already has a unique writing style. I'm not sure how you could add anything to the text that couldn't be trivially removed and still do better than that.

1

nateharada OP t1_j4otocf wrote

This tool actually doesn't look at memory right now, only at compute utilization. Loading your model into memory usually takes up close to the maximum memory until training is done, even if compute usage is very low.

If your training is hanging but still burning GPU cycles, that'd be harder to detect, I think.
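For anyone curious what a compute-only check can look like, here is a minimal sketch. It is not this tool's actual code: the function names, the 5% threshold, and the six-sample window are assumptions, and the reading comes from `nvidia-smi`'s `utilization.gpu` query.

```python
import subprocess


def read_gpu_utilization():
    """Return the current compute utilization (percent) of each visible GPU.

    Requires nvidia-smi on PATH. Assumes every GPU reports a number;
    a value like "[N/A]" would raise ValueError here.
    """
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [int(line) for line in out.splitlines() if line.strip()]


def is_idle(samples, threshold=5, min_samples=6):
    """True if the most recent `min_samples` readings are all below `threshold` percent."""
    if len(samples) < min_samples:
        return False
    return all(s < threshold for s in samples[-min_samples:])
```

Polling `read_gpu_utilization()` every few seconds and feeding one GPU's history into `is_idle` flags a machine whose job has finished or crashed. A hung job that keeps spinning on the GPU stays above the threshold, so this check alone won't catch it.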

4

sad_dad_is_a_mad_lad t1_j4ohl7t wrote

I don't think there are any laws that protect their data in this way, except perhaps contract law because they have a hidden ToS that you have to accept to use their service. As long as you use it for free though, I'm not sure there is consideration, and well... I don't know how they would go about proving misuse or damages.

Certainly it would not be copyright law, given that GPT3 itself was trained on copyrighted data...

2

MegavirusOfDoom t1_j4oelbd wrote

Less than 500MB is used for code learning; 690GB is used for culture, geography, history, fiction and non-fiction... 2GB for cats, 2GB bread, horses, dogs, cheese, wine, Italy, France, politics, television, music, Japan, Africa. Less than 1% of the training is on science and technology, i.e. 300MB is biology, 200MB chemistry, 100MB physics, 400MB maths...

2

nmfisher t1_j4odkrt wrote

  1. Choose your niche (speech recognition/image classification/LLMs/whatever)
  2. Start your own blog with good* technical content (i.e. not the shovel crap you see on Medium), and see if you can write some guest posts for an existing blog with decent traffic. Open-source your code on GitHub. Spread it on social media.
  3. Give presentations at a few local events and make it clear you're also available for freelancing.

It might take a month or two but people will start contacting you.

* This is important: your blog content/presentation actually has to be worth reading. It doesn't have to be cutting-edge, but it has to be novel enough to convince someone that you have something special to offer. Implementing a lesser-known paper and showing your results is usually a good start (it also teaches you just how hard it is to recreate something based on a paper alone).

19