Recent comments in /f/MachineLearning

romek_ziomek t1_j5hhn2h wrote

Of course you can use them both, provided that you have free PCI-E slots. I use a 3060 and a 2060 Super in my setup. I'm not sure what exactly you wanna do, but I can tell you that I'm working in PyTorch and it's a painless process: you can choose which GPU to use with a single variable, or use a wrapper class (DataParallel) to train on both of them simultaneously. One trick that was specific to my motherboard, and that I had to figure out by trial and error (since it wasn't in the documentation), was that my second GPU wouldn't work if I had two NVMe drives installed. Other than that it works flawlessly.
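A minimal sketch of the two approaches mentioned above, assuming two visible CUDA devices (the model here is just a placeholder):

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 10)  # placeholder model

# Option 1: pick a single GPU with one variable
device = torch.device("cuda:1" if torch.cuda.device_count() > 1 else "cuda:0")
# model = model.to(device)  # move the model there on a CUDA machine

# Option 2: wrap the model so each batch is split across both GPUs
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model, device_ids=[0, 1])
```

Note that the PyTorch docs now steer people toward DistributedDataParallel for serious multi-GPU training, but DataParallel is the one-line version described in the comment.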

3

sothatsit t1_j5hhb31 wrote

I’ve actually done some work on this and the real issue here is that:

  1. You’d need a lot of text from other sources with people’s real names.
  2. You’d need the user to have written a lot of Reddit comments or posts.
  3. The user’s writing style would need to match between Reddit and your other source.

If you’re interested though, I made the following library for my Master’s thesis, which can be used for this: https://github.com/TycheLibrary/Tyche

However, it would need more work to get close to identifying thousands, never mind millions, of users.

3

neanderthal_math t1_j5henyu wrote

People have been working on the Author Identification problem for about 20 years.

https://dergipark.org.tr/en/download/article-file/2482752

https://en.wikipedia.org/wiki/Author_profiling?wprov=sfti1

There is no way to unmask all of Reddit though. Too many people, and many text samples are way too short. Some Redditors only speak in emoji and GIFs.

13

Z1ndabad t1_j5hbncl wrote

Hey guys, new to ML and I can't seem to wrap my head around the concept. I want to make a used car price prediction model using a large dataset, and most of the tutorials I watch just use the linear regression library. However, can you use neural networks instead, like Levenberg-Marquardt?

1

Loquzofaricoalaphar OP t1_j5h6s4z wrote

That is interesting to think about. I’m biased to think text patterns have lots of variables and are fairly unique. Perhaps analyzing it at scale without getting mush is more of a modeling problem than a compute problem.

1

PredictorX1 t1_j5h5pb5 wrote

The biggest technical challenges I see:

  1. Having enough reference samples from known people
  2. The difference between how people write on Reddit and how they write elsewhere (professional articles, e-mail, etc., presumably used as reference)
  3. If too many Reddit users are being considered, it may all dissolve into mush (estimated probabilities would all be low)

3

Loquzofaricoalaphar OP t1_j5h5kq4 wrote

Perhaps it could return the top 10 likelihoods for the author of the account. Some patterns of writing and grammatical errors might be pretty unique, and the more posts an account has, the more unique, right?

−4

Loquzofaricoalaphar OP t1_j5h59id wrote

So, like, if you fed it samples from 200 people you were looking for and then fed it Reddit? Perhaps all of Reddit would be tricky, because some users might not have public text, and it would be difficult to label all the text on Facebook or LinkedIn, etc.

2

PredictorX1 t1_j5h3ymz wrote

>With more compute could it be easy to quickly un Mask all the people on Reddit by using text correlations to non masked publicly available text data?

With labeled samples of text, I think it would be pretty easy to come up with a likelihood model giving a reasonable educated guess at the identity of some Reddit members, and I don't think it would take much computing power.
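A toy version of such a likelihood model, using TF-IDF over character n-grams (a common stylometric baseline, not necessarily what the commenter has in mind; the authors and texts below are made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Labeled reference samples from known (hypothetical) people
reference = {
    "alice": "I reckon the model overfits, to be honest. The validation loss diverges.",
    "bob":   "IMO u should just crank the learning rate lol, works every time",
}

# Character n-grams capture punctuation and spelling habits, not just topic
vec = TfidfVectorizer(analyzer="char", ngram_range=(2, 4))
X = vec.fit_transform(reference.values())

def likelihoods(comment: str):
    """Rank known authors by stylistic similarity to an anonymous comment."""
    sims = cosine_similarity(vec.transform([comment]), X)[0]
    return sorted(zip(reference, sims), key=lambda pair: -pair[1])

ranked = likelihoods("i reckon u should lower the learning rate tbh")
```

With only a couple of short reference texts the similarities are noisy, which is exactly the "dissolves into mush" failure mode described above: as the candidate pool grows, the top scores cluster together and stop being informative.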

2

Appropriate_Ant_4629 t1_j5gy6e3 wrote

> Last I checked image watermarks were super weak against rotations

Obviously depends on the technique. The old-school popular technique of "slap a signature in the painting", like Dürer's stylized A/D logo, is very robust to rotations, but in that case not robust to cropping from the bottom.

> seems to still be the case - but the better methods could cope with cropping way better than these.

It's near impossible to have a watermark technology that's robust to all transformations, at least if you reveal what watermark algorithm you used.

One easy attack that works on some techniques would be to just re-encode the content, writing your own watermark over the original using the same watermarking algorithm.

7

kannkeinMathe t1_j5gxi7i wrote

Hey,
I want to build a chatbot for a domain-specific purpose, for example to talk with a person about their mental state and depression. For that I would like to train the bot on texts from the domain.
So my question is: how should I start?
What approach would you use? Would you use an intent-based solution?
What are the standard models for chatbots? BERT?
Is it even possible to fine-tune models on large text corpora? If yes, how?
Thanks, guys

1

Fabulous-Possible758 t1_j5gtzlk wrote

On my phone so I can’t read the blog yet: does it say how well it handles false positives? I.e., flagging stuff not written by GPT as being written by GPT?

I could see a really shitty world coming about where the filter is effectively useless, because everyone will need to make sure their content passes the watermark detector.

3