Recent comments in /f/MachineLearning
sothatsit t1_j5hhb31 wrote
Reply to [D] With more compute could it be easy to quickly un Mask all the people on Reddit by using text correlations to non masked publicly available text data? by Loquzofaricoalaphar
I’ve actually done some work on this, and the real issues here are:
- You’d need a lot of text from other sources with people’s real names.
- You’d need the user to have written a lot of Reddit comments or posts.
- The user’s writing style would need to match between Reddit and your other source.
If you’re interested though, I made the following library for my Master’s thesis, which can be used for this: https://github.com/TycheLibrary/Tyche
However, it would need more work to get close to identifying thousands, never mind millions, of users.
Loquzofaricoalaphar OP t1_j5hf96p wrote
Reply to comment by neanderthal_math in [D] With more compute could it be easy to quickly un Mask all the people on Reddit by using text correlations to non masked publicly available text data? by Loquzofaricoalaphar
Thanks, that’s a very interesting resource.
neanderthal_math t1_j5henyu wrote
Reply to [D] With more compute could it be easy to quickly un Mask all the people on Reddit by using text correlations to non masked publicly available text data? by Loquzofaricoalaphar
People have been working on the Author Identification problem for about 20 years.
https://dergipark.org.tr/en/download/article-file/2482752
https://en.wikipedia.org/wiki/Author_profiling?wprov=sfti1
There is no way to unmask all of Reddit though. There are too many people, and many text samples are way too short. Some Redditors only speak in emoji and GIFs.
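For anyone curious, the classical stylometry approach behind author identification can be sketched in a few lines: build a character n-gram profile per known author and attribute an unknown text to the nearest profile by cosine similarity. The tiny corpus and author names below are made up purely for illustration; real attribution needs far more text and features.

```python
from collections import Counter
import math

def char_ngrams(text, n=3):
    """Character n-gram counts -- a classic stylometric feature."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    num = sum(a[g] * b[g] for g in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def attribute(unknown, profiles):
    """Return the known author whose profile is closest to the unknown text."""
    scores = {name: cosine(char_ngrams(unknown), prof) for name, prof in profiles.items()}
    return max(scores, key=scores.get)

# Toy reference corpus: in practice you'd need many labelled samples per author.
profiles = {
    "alice": char_ngrams("honestly i reckon the gradient just vanishes, mate"),
    "bob": char_ngrams("The results, as demonstrated previously, are conclusive."),
}
print(attribute("honestly mate i reckon it just overfits", profiles))  # alice
```

This nearest-profile scheme is essentially the decades-old baseline; modern systems swap in learned embeddings but keep the same "compare to reference profiles" shape.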
Forsaken-Indication t1_j5hc9hz wrote
Reply to comment by jpercivalhackworth in [D] How to deal with COVID-19-era data for time series forecasting? by PM_ME_YOUR_GIGI
OP said it did, and that after Jan 2022 they see a return to some sort of baseline.
Trying to predict the next global pandemic as part of a product forecasting model seems pretty out-of-scope.
PryomancerMTGA t1_j5hbwrg wrote
Reply to [D] With more compute could it be easy to quickly un Mask all the people on Reddit by using text correlations to non masked publicly available text data? by Loquzofaricoalaphar
Trying to match all businesses with fuzzy matching is hard enough when you have misspellings. To think you could identify redditors with any degree of certainty is optimistic at best.
Z1ndabad t1_j5hbncl wrote
Reply to [D] Simple Questions Thread by AutoModerator
Hey guys, I'm new to ML and can't seem to wrap my head around the concept. I want to make a used car price prediction model using a large data set, and most of the tutorials I watch just use a linear regression library. However, can you use neural networks instead, trained with something like Levenberg-Marquardt?
Loquzofaricoalaphar OP t1_j5h6s4z wrote
Reply to comment by PredictorX1 in [D] With more compute could it be easy to quickly un Mask all the people on Reddit by using text correlations to non masked publicly available text data? by Loquzofaricoalaphar
That is interesting to think about. I’m biased to think text patterns have lots of variables and are fairly unique. Perhaps it’s more of a model problem than a compute problem to analyze it at scale without getting mush.
PredictorX1 t1_j5h5pb5 wrote
Reply to comment by Loquzofaricoalaphar in [D] With more compute could it be easy to quickly un Mask all the people on Reddit by using text correlations to non masked publicly available text data? by Loquzofaricoalaphar
The biggest technical challenges I see:
- Having enough reference samples from known people
- The difference between how people write on Reddit and how they write elsewhere (professional articles, e-mail, etc.: presumably used as reference)
- If too many Reddit users are being considered, it may all dissolve into mush (estimated probabilities would all be low)
Loquzofaricoalaphar OP t1_j5h5kq4 wrote
Reply to comment by [deleted] in [D] With more compute could it be easy to quickly un Mask all the people on Reddit by using text correlations to non masked publicly available text data? by Loquzofaricoalaphar
Perhaps it could return the top 10 likelihoods for the author of the account. Some patterns of writing and grammatical errors might be pretty unique, and the more posts an account has, the more unique, right?
arkkienkeli t1_j5h5fa5 wrote
Reply to ChatGPT is not all you need [R] by EduCGM
There was a paper with a similar message 2 years ago: https://arxiv.org/abs/2103.05247
Loquzofaricoalaphar OP t1_j5h59id wrote
Reply to comment by PredictorX1 in [D] With more compute could it be easy to quickly un Mask all the people on Reddit by using text correlations to non masked publicly available text data? by Loquzofaricoalaphar
So like if you fed it samples from the 200 people you were looking for and then fed it Reddit? Perhaps all of Reddit would be tricky, because some users might not have public text, and it would be difficult to label all the text on Facebook or LinkedIn, etc.
PredictorX1 t1_j5h3ymz wrote
Reply to [D] With more compute could it be easy to quickly un Mask all the people on Reddit by using text correlations to non masked publicly available text data? by Loquzofaricoalaphar
>With more compute could it be easy to quickly un Mask all the people on Reddit by using text correlations to non masked publicly available text data?
With labeled samples of text, I think it would be pretty easy to come up with a likelihood model giving a reasonable educated guess of the identity of some Reddit members, and I don't think it would take much computing power.
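A minimal sketch of such a likelihood model, assuming a toy set-up with made-up authors and add-one-smoothed unigram log-probabilities (one simple choice among many), which also naturally produces the ranked "top N" list discussed elsewhere in the thread:

```python
from collections import Counter
import math

def train(samples):
    """samples: {author: reference text}. Build smoothed unigram log-probs per author."""
    vocab = {w for text in samples.values() for w in text.lower().split()}
    models = {}
    for author, text in samples.items():
        counts = Counter(text.lower().split())
        total = sum(counts.values()) + len(vocab)  # add-one smoothing over the vocab
        models[author] = {w: math.log((counts[w] + 1) / total) for w in vocab}
    return models

def rank(models, text):
    """Rank candidate authors by log-likelihood of the unknown text (most likely first)."""
    words = text.lower().split()
    scores = {a: sum(m.get(w, min(m.values())) for w in words)  # floor unseen words
              for a, m in models.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Made-up reference samples; real use would need far more text per person.
models = train({
    "alice": "i love cats cats are great",
    "bob": "dogs are the best dogs rule",
})
print(rank(models, "cats are great"))  # ['alice', 'bob']
```

With only a couple of hundred candidates this is cheap to run, which matches the point above: the bottleneck is labeled reference text, not compute.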
jpercivalhackworth t1_j5h3cvx wrote
What makes you think that COVID is not going to impact the demand for your product?
NovaBom8 t1_j5h30af wrote
Very cool, great work!!
In the context of running .pt (or any other device-agnostic filetypes), I’m guessing dynamic batching is the reason for Triton’s superior throughput?
Appropriate_Ant_4629 t1_j5gy6e3 wrote
Reply to comment by ThisIsNotAnAlias in [D] Couldn't devs of major GPTs have added an invisible but detectable watermark in the models? by scarynut
> Last I checked image watermarks were super weak against rotations
Obviously depends on the technique. The old-school popular technique of "slap a signature in the painting" like Dürer's stylized A/D logo is very robust to rotations, but not robust to cropping from the bottom in that case.
> seems to still be the case - but the better methods could cope with cropping way better than these.
It's near impossible to have a watermark technology that's robust to all transformations, at least if you reveal what watermark algorithm you used.
One easy attack that works on many techniques would be to just re-encode the content, writing your own watermark over the original using the same watermarking algorithm.
kannkeinMathe t1_j5gxi7i wrote
Reply to [D] Simple Questions Thread by AutoModerator
Hey all,
I want to build a chatbot for a domain-specific purpose, for example to talk with a person about their mental state and their depression. For that I would like to train the bot on texts from the domain.
So my question is: how should I start?
What approach would you use? Would you use an intent-based solution?
What are the standard models for chatbots, e.g. BERT?
Is it even possible to fine-tune models on large text corpora? If yes, how?
Thank you, guys
mje-nz t1_j5gw3n5 wrote
Reply to comment by twiztidsoulz in [D] Couldn't devs of major GPTs have added an invisible but detectable watermark in the models? by scarynut
Are you talking about the model? We’re talking about the output. If you’re talking about signing the model, what does that achieve? If you’re talking about signing the output, how do you sign a chat transcript?
hasiemasie t1_j5gv221 wrote
Reply to comment by Maxerature in [D] Multiple Different GPUs? by Maxerature
Not to my knowledge
careless25 t1_j5gu7tn wrote
Reply to comment by franciscrot in [D] Couldn't devs of major GPTs have added an invisible but detectable watermark in the models? by scarynut
Very simple explanation
Give each word a unique number. Add all the numbers up. That's your unique identifier for the GPT output.
If you switch some words around, the sum won't change very much.
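A toy version of that idea (the hash-based word numbering here is my own illustrative stand-in, not any production scheme); note that pure reordering actually leaves the sum *exactly* unchanged, since addition ignores order:

```python
import hashlib

def fingerprint(text):
    """Map each word to a number (a hash here, as an illustrative stand-in)
    and sum them. Addition is order-independent, so shuffling words is invisible."""
    word_num = lambda w: int(hashlib.md5(w.encode()).hexdigest(), 16) % 10**6
    return sum(word_num(w) for w in text.lower().split())

a = fingerprint("the model generated this sentence")
b = fingerprint("this sentence the model generated")  # same words, reordered
print(a == b)  # True -- word swaps don't change the sum at all
```

The flip side is the weakness discussed above: replacing even one word with a synonym changes the sum, so a detector has to tolerate some drift rather than demand an exact match.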
Fabulous-Possible758 t1_j5gtzlk wrote
Reply to comment by adt in [D] Couldn't devs of major GPTs have added an invisible but detectable watermark in the models? by scarynut
On phone so can’t read the blog yet: does it say how well it handles false positives? I.e., flagging stuff not written by GPT as being written by GPT?
I could see a really shitty world coming about where the filter is effectively useless, because everyone will have to make sure their content passes the watermark detector.
fraktall t1_j5gtq5j wrote
Reply to [D] Couldn't devs of major GPTs have added an invisible but detectable watermark in the models? by scarynut
It will be reverse engineered in no time. The output of the model is just text.
romek_ziomek t1_j5hhn2h wrote
Reply to comment by Maxerature in [D] Multiple Different GPUs? by Maxerature
Of course you can use them both, provided that you have free PCI-E slots. I use a 3060 and a 2060 Super in my setup. I'm not sure what exactly you wanna do, but I can tell you that I'm working in PyTorch and it's a painless process: you can choose with one variable which GPU to use, or use a wrapper class (DataParallel) to train on both of them simultaneously. One trick that was specific to my motherboard, and that I had to figure out by trial and error (since it wasn't in the documentation), was that my second GPU wouldn't work if I had two NVMe drives installed. Other than that it works flawlessly.
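For reference, the one-variable device selection and the DataParallel wrapper described above might look roughly like this in PyTorch (the model and tensor sizes are placeholders):

```python
import torch
import torch.nn as nn

# Pick a device with a single variable; falls back to CPU if no GPU is visible.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = nn.Linear(16, 4)  # placeholder model

# Wrap the model so each batch is split across all visible GPUs.
# On a single-device machine this branch is simply skipped.
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)

model = model.to(device)
x = torch.randn(8, 16, device=device)
out = model(x)
print(out.shape)  # torch.Size([8, 4])
```

DataParallel replicates the module per GPU and scatters the batch, so mismatched cards like a 3060 and a 2060 Super will work, though the slower card tends to set the pace of each step.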