Recent comments in /f/MachineLearning

YoutubeStruggle OP t1_j6efuzb wrote

I agree, but the point is that AI, e.g. ChatGPT, will always generate content in one way, whereas humans write in diverse ways. In an essay or an article, a human's style varies from sentence to sentence, but an AI's stays the same throughout. That's how AI-generated content can be detected. Paragraph-wise analysis gives better results and a clearer picture than sentence-wise analysis. And for a given human author, there should be no realistic case where every paragraph they write gets detected as AI-generated.
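A toy sketch of the paragraph-wise idea: using sentence-length variance as a crude stand-in for a real stylometric feature (the function names and the feature choice are my own illustration, not how any actual detector works).

```python
import statistics

def sentence_lengths(paragraph):
    # Naive sentence split on periods; a real system would use a proper tokenizer.
    sentences = [s.strip() for s in paragraph.split(".") if s.strip()]
    return [len(s.split()) for s in sentences]

def style_variance(paragraph):
    # Variance of sentence lengths as a crude proxy for stylistic diversity.
    lengths = sentence_lengths(paragraph)
    return statistics.pvariance(lengths) if len(lengths) > 1 else 0.0

human = ("Short one. Then a much longer, winding sentence that meanders "
         "quite a bit before stopping. Brief again.")
uniform = ("This sentence has exactly seven words here. That sentence has "
           "exactly seven words too. Every sentence has exactly seven words always.")

print(style_variance(human) > style_variance(uniform))  # True: varied writing scores higher
```

The claim in the comment is essentially that human text shows high variance on features like this from paragraph to paragraph, while model output clusters tightly.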

−10

Red-Portal t1_j6efht6 wrote

That's a more recent trend. Until the late 2000s, computer vision was basically combining machine learning techniques with image processing: Design filters to extract features, and slap them into a classifier. Naturally, lots of Fourier, wavelets, and other weird bases. Very different times.

8

a_khalid1999 OP t1_j6eev77 wrote

Interesting perspective. I did not know Computer Vision was an EE-dominant field at one point. I knew Image Processing was an EE thing, but Vision always gave off a CS vibe; when I took it in my Bachelor's, it was even labelled as a CS course.

So one way of looking at it, and as an EE I'm obviously biased, is that it's not the signal processing engineers moving into ML; it's the CS folks starting to use signal processing. Where I'm from, the impression has always been that AI is entirely a CS thing, and that the EEs entering the field are doing so due to a lack of job opportunities.

3

Red-Portal t1_j6edkjk wrote

One of the bull's-eye contributions of signal processing to deep learning was this paper. From a signal processing perspective, naive pooling is obviously problematic: you are decimating without limiting the signal bandwidth first. That paper showed this in 2019. It shows how much computer vision has changed from an EE-dominant field to a CS field, where signal processing is no longer common knowledge.
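The aliasing problem can be seen in a few lines of numpy. This is just an illustration of the general principle (blur before you decimate), not the paper's actual implementation: a Nyquist-rate alternating signal is maximally phase-sensitive under plain stride-2 decimation, but nearly shift-invariant after a small low-pass filter.

```python
import numpy as np

def naive_downsample(x, stride=2):
    # Plain decimation: keep every `stride`-th sample, no band-limiting.
    return x[::stride]

def blurred_downsample(x, stride=2):
    # Low-pass with a small binomial filter before decimating,
    # in the spirit of anti-aliased ("blur") pooling.
    kernel = np.array([0.25, 0.5, 0.25])
    blurred = np.convolve(x, kernel, mode="same")
    return blurred[::stride]

# A signal right at the Nyquist rate: 1, 0, 1, 0, ...
x = np.tile([1.0, 0.0], 8)

# Naive decimation is wildly phase-dependent: a one-sample shift
# flips the output from all ones to all zeros.
print(naive_downsample(x))              # [1. 1. 1. ...]
print(naive_downsample(np.roll(x, 1)))  # [0. 0. 0. ...]

# After blurring, both phases give (nearly) the same output.
print(blurred_downsample(x))
print(blurred_downsample(np.roll(x, 1)))
```

The same mechanism is why stride-2 max pooling in a CNN breaks shift-equivariance, and why blurring before subsampling largely restores it.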

7

mkzoucha t1_j6ed1z9 wrote

I did not have time to try this specific one, but I have tried at least 10 others. Sorry, not trying to be negative or anything. There are just tons of different models, each of which would need a separate detection model. These models are trained on human writing, so their output is bound to sound human, and some humans are bound to have a writing voice similar to the output of AI content generators. There is also no real standard 'human' way of writing that cleanly separates the two. Combine that with how much results vary with the prompt, and it quickly becomes an insurmountable task in my opinion.

At the end of the day, I applaud your efforts, truly, but realistically I think your model is significantly overfit to a very small slice of the possible samples, both AI- and human-generated.

9

YoutubeStruggle OP t1_j6ec8vq wrote

Reply to comment by MrEloi in [P] AI Content Detector by YoutubeStruggle

The use of AI tools should definitely be appreciated; they save a lot of time, and as a fellow developer I would highly encourage them. But classifying content as human- or AI-generated is still necessary, because AI-generated content can be misleading. Detecting it also helps ensure the quality of information being shared and consumed, especially in sensitive domains such as news and medicine.

1

albertzeyer t1_j6ebian wrote

It's a bit strange indeed that the GCP and Azure results are not so great. As I said, I actually do research on speech recognition, and Google is probably the biggest player in this field, usually with the very best results.

My explanation is that they don't use their best and biggest models for GCP, perhaps to keep the computational cost as low as possible.

But you also have to be careful about what you compare. Your results might be flawed if your fine-tuning data is close to your validation set (e.g. similar domain, similar sound conditions), because GCP's models are very generic, built to work across all kinds of domains and conditions.

1

Vegetable-Skill-9700 OP t1_j6e8o99 wrote

Firstly, by measuring data drift and analyzing user behavior, UpTrain identifies which prompts/questions were unseen by the model, or the cases where the user was unsatisfied with the model's output. It automatically collects those cases for the model to retrain on.
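To illustrate the "unseen prompt" idea (this is not UpTrain's actual API, just a toy sketch of the underlying mechanism): embed each prompt and flag the ones that sit far from everything in the training distribution.

```python
import numpy as np

def cosine_distance(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def flag_unseen(prompt_vec, train_vecs, threshold=0.3):
    # A prompt counts as "unseen" if it is far from every training embedding.
    dists = [cosine_distance(prompt_vec, t) for t in train_vecs]
    return bool(min(dists) > threshold)

# Toy 3-d "embeddings"; a real system would use a sentence encoder.
train = [np.array([1.0, 0.0, 0.0]), np.array([0.9, 0.1, 0.0])]
in_domain = np.array([0.95, 0.05, 0.0])
novel = np.array([0.0, 0.0, 1.0])

print(flag_unseen(in_domain, train))  # False: close to the training data
print(flag_unseen(novel, train))      # True: far from everything seen
```

Flagged prompts would then be queued up as retraining candidates.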

Secondly, you can use the package to define a custom rule and filter out the relevant data to retrain ChatGPT for your use case.

Say you want to use LLM to write product descriptions for Nike shoes and have a database of Nike customer chats:
a) Rachel - I don't like these shoes. I want to return them. How do I do that?
b) Ross - These shoes are great! I love them. I wear them every day while practicing unagi.
c) Chandler - Are there any better shoes than Nike? 👟 😍
You probably want to filter out cases with positive sentiments or cases with lots of emojis. With UpTrain, you can easily define such a rule as a Python function and collect those cases.
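A minimal sketch of such a rule as a plain Python function (this is not UpTrain's actual API; the keyword list is a toy stand-in for a real sentiment model):

```python
import re

# Toy positive-word list as a stand-in for a real sentiment model.
POSITIVE_WORDS = {"great", "love", "awesome", "best"}
EMOJI_PATTERN = re.compile(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def should_filter_out(chat: str) -> bool:
    """Rule: flag chats with positive sentiment or two or more emojis."""
    words = set(re.findall(r"[a-z']+", chat.lower()))
    positive = bool(words & POSITIVE_WORDS)
    many_emojis = len(EMOJI_PATTERN.findall(chat)) >= 2
    return positive or many_emojis

ross = "These shoes are great! I love them. I wear them every day."
rachel = "I don't like these shoes. I want to return them. How do I do that?"
chandler = "Are there any better shoes than Nike? 👟 😍"

print(should_filter_out(ross), should_filter_out(rachel), should_filter_out(chandler))
# True False True
```

You would register a function like this with the framework so that matching chats are pulled out of the stream automatically.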

I am working on an example highlighting how all of the above can be done. It should be ready in a week. Stay tuned!

3