Recent comments in /f/MachineLearning

FastestLearner OP t1_j4uilg8 wrote

Yes. I initially thought of having a neural net trained on the audio track of a particular YT video, but I think the transcripts would provide just enough information, and fine-tuning existing language models would work quite well, especially given the recent tremendous growth of NLP. Collecting the audio would also require far more storage space than text, and would probably require more RAM, VRAM and compute.

If you are leaning towards crowd-sourcing the inference, I think it would be possible to do that using JS libs (such as TensorFlow.js), although I have no experience with these. The good thing is, once you run inference on a video, you just upload the result to the central server and everyone can get it for free (with no further inference costs).

1

FastestLearner OP t1_j4uhkbm wrote

Yes. I did think about that and potential solutions could be:

(1) A startup offering the service in exchange for a small fee - The good thing about it is that once you run inference on a video, you can serve the result to thousands of customers at no additional cost (except for server maintenance and bandwidth; there is no extra GPU cost beyond the first run on a particular video).

(2) Crowd-sourced inference - The current sponsor-blocking extension requires manual user input, which it sources from the crowd and collects at a central server. So it's basically crowd-sourced (or peer-sourced) manual labour. I'm sure that if someone came up with an automated version, like an executable which runs in the background with very small resource usage, then inference could be crowd-sourced too: the timestamps could be collected at a central server and distributed across the planet. The good thing about this is that the more people join the peer-sourced inference, the lower the cost to any one peer of keeping their GPU busy.
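
Both options rest on the same idea: pay the inference cost once per video and serve the stored timestamps to everyone afterwards. A minimal sketch of that caching behaviour, assuming an in-memory dictionary as a stand-in for the central server; `SegmentCache`, `fake_model` and the segment values are hypothetical names and data, not part of any existing extension.

```python
from typing import Callable, Dict, List, Tuple

# (start_seconds, end_seconds) of a sponsored span; illustrative type only.
Segment = Tuple[float, float]


class SegmentCache:
    """In-memory stand-in for the central server's timestamp store."""

    def __init__(self, run_inference: Callable[[str], List[Segment]]) -> None:
        self._run_inference = run_inference
        self._store: Dict[str, List[Segment]] = {}
        self.inference_calls = 0

    def get_segments(self, video_id: str) -> List[Segment]:
        # Only the first request for a given video pays the inference cost.
        if video_id not in self._store:
            self.inference_calls += 1
            self._store[video_id] = self._run_inference(video_id)
        return self._store[video_id]


def fake_model(video_id: str) -> List[Segment]:
    # Placeholder for a real model; returns one made-up segment.
    return [(30.0, 75.5)]


cache = SegmentCache(fake_model)
for _ in range(1000):
    cache.get_segments("video-abc")

# Number of times the (placeholder) model actually ran for 1000 requests.
print(cache.inference_calls)
```

In the crowd-sourced variant, the peer that runs the model would upload its result into the same store instead of the server computing it.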

1

FastestLearner OP t1_j4ufxgc wrote

I am not well acquainted with NLP tasks, so I have no idea how much resource it would take to train a transformer on it (or fine-tune an existing model like BERT on the dataset). If resources are a concern, one could do crowd-sourced training, like LeelaChessZero. I think it's only a matter of time before someone comes along and does this, because blocking ads is the inevitable future of the internet. Also, some company/startup could do it on a subscription model, like the already existing paid ad-blocking software. It's a potential startup idea IMO.

1

JClub OP t1_j4uc0bg wrote

PPO's formula always keeps the gradient update rather smaller than in other RL algorithms. I get that the reward measures the human's preference, but that does not answer my question 🤔: what rewards work best for PPO?

1

buzzbuzzimafuzz t1_j4u5jrz wrote

I think what OpenAI and Anthropic typically do is provide evaluators with two possible responses and have them select which one is better. If you have numerical ratings, it might be hard to calibrate them. From the original paper "Deep reinforcement learning from human preferences" (2017):

>We ask the human to compare short video clips of the agent’s behavior, rather than to supply an absolute numerical score. We found comparisons to be easier for humans to provide in some domains, while being equally useful for learning human preferences. Comparing short video clips is nearly as fast as comparing individual states, but we show that the resulting comparisons are significantly more helpful.

ChatGPT seems to be trained from a combination of expert-written examples and upvotes and downvotes on individual messages.
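
For reference, the 2017 paper turns such pairwise choices into a training signal for a reward model by treating the summed predicted rewards of each clip as logits and minimising cross-entropy against the human's choice. A minimal sketch of that loss for a single comparison, assuming hard (non-tied) labels; the function names and the reward values are illustrative, not taken from any released code.

```python
import math
from typing import Sequence


def preference_probability(rewards_a: Sequence[float],
                           rewards_b: Sequence[float]) -> float:
    """P(A preferred over B) = exp(sum r_A) / (exp(sum r_A) + exp(sum r_B))."""
    diff = sum(rewards_a) - sum(rewards_b)
    # Equivalent to a sigmoid of the difference; branch for numerical stability.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def preference_loss(rewards_a: Sequence[float],
                    rewards_b: Sequence[float],
                    human_prefers_a: bool) -> float:
    """Cross-entropy between the predicted preference and the human's choice."""
    z = sum(rewards_a) - sum(rewards_b)
    if not human_prefers_a:
        z = -z
    # -log(sigmoid(z)) written as a numerically stable softplus(-z).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Hypothetical per-step rewards predicted for two clips.
clip_a = [0.5, 1.0, 0.5]
clip_b = [0.0, 0.5, 0.0]

print(preference_probability(clip_a, clip_b))
print(preference_loss(clip_a, clip_b, human_prefers_a=True))
print(preference_loss(clip_a, clip_b, human_prefers_a=False))
```

The loss is smaller when the reward model already ranks the clips the way the human did, so no absolute rating scale has to be calibrated.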

9

2blazen OP t1_j4u5jf7 wrote

With my RTX 3060 it takes 3m50s to diarize 1 hour and 20m to do 3 hours (this can be reduced to 16m by presetting the number of speakers; I didn't check a 1h segment that way, and keep in mind it takes time to load the models into VRAM). However, 5-hour episodes keep getting my process killed after around 40m. It's probably a memory issue, and could even happen during the segmentation, but reusing clusters is a commonly raised issue on GitHub, so it wouldn't be just for my use case.
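
As a rough check on those timings, here is the processing time expressed as a fraction of audio duration (a real-time factor). This is only arithmetic on the numbers quoted above; the helper name is made up for illustration, and the figures include model loading time.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration; lower is faster."""
    return processing_seconds / audio_seconds


# (processing seconds, audio seconds) taken from the timings in the comment.
runs = {
    "1 h audio": (3 * 60 + 50, 1 * 3600),
    "3 h audio": (20 * 60, 3 * 3600),
    "3 h audio, speaker count preset": (16 * 60, 3 * 3600),
}

for label, (processing, audio) in runs.items():
    print(f"{label}: {real_time_factor(processing, audio):.3f}x real time")
```

The cost per hour of audio is higher for the 3-hour run than for the 1-hour run, which would be consistent with some step scaling worse than linearly in audio length, although two data points cannot show which step.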

2

nmfisher t1_j4typw0 wrote

If I were using one of the newer search engines that let you block domains, then Medium would definitely be on my blacklist. The signal-to-noise ratio is just way too low.

towardsdatascience might be slightly better but even if you find something worthwhile, it's probably available somewhere else that doesn't clog up your search results.

1

mrconter1 OP t1_j4tuaal wrote

> This is not testing intelligence; this is testing whether a human was trained on computer usage, knows what e-mail is, and has used Gmail before.

I don't think it's binary. I think intelligence is a large part here.

> Someone from a tribe in Africa would fail your test even though he is human and intelligent;

Could you train a bird to pass all questions on this benchmark? No. Because it's not as intelligent as a human.

> train him on this task like you would train a current-gen multimodal system and it will pass your benchmark. You train an LLM in combination with an image model and an RL model, train it on instruction following using the inputs you described, and now it understands what it sees and can follow what you want it to do.

Solving this benchmark is an easy problem? How long do you think it will take until we have a model that can casually solve all the instructions I gave in the previous comment?

1

velcher t1_j4ts9n0 wrote

Disclaimer: I'm a deep RL person, so I'm speaking from a pure RL viewpoint. I have never trained an LLM with RLHF (yet ;) ).

You can think of rewards as a way of expressing preferences to the model. Then you can reason about what types of rewards to use.

Binary: either the output is good or bad. There is no preference between outputs that are good (they are all 1) or outputs that are bad (they are all 0).

Scale of 1-5: there are 5 preferences of increasing order. In particular, the rank 1 choice is exactly 1 real value (see aside for what the real value does) more than rank 2.

Ranking 4 different model outputs: Not sure what you mean here.

Aside: Reward scale can affect the RL process. RL policies are commonly trained through something called the "Policy Gradient", which weights the policy update by the scale of the return (sum of rewards). So the larger your reward scaling, the larger this gradient. Rewards that are too large can make the gradient too large and lead to an unstable policy; rewards that are too small can result in small gradients and therefore slow-to-converge policies. This reward scale can be counteracted by the learning rate, or by reward normalization. But all of this needs to be tuned for the specific task.
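
To make the scaling effect concrete, here is a minimal sketch that computes the exact expected REINFORCE gradient for a softmax policy over three discrete actions. The logits and rewards are made-up illustrative values, and a real RLHF setup would use a neural policy and sampled returns rather than this closed form.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_gradient(logits: List[float], rewards: List[float]) -> List[float]:
    """Expected REINFORCE gradient w.r.t. the logits of a softmax policy."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for a, (p_a, r_a) in enumerate(zip(probs, rewards)):
        for i, p_i in enumerate(probs):
            # d log pi(a) / d logit_i = 1[i == a] - pi(i)
            grad[i] += p_a * r_a * ((1.0 if i == a else 0.0) - p_i)
    return grad


def normalize(rewards: List[float]) -> List[float]:
    """Zero-mean, unit-variance rewards (one common normalization choice)."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + 1e-8) for r in rewards]


def norm(v: List[float]) -> float:
    return math.sqrt(sum(x * x for x in v))


logits = [0.0, 0.5, -0.5]          # hypothetical policy parameters
rewards = [1.0, 3.0, 5.0]          # e.g. ratings on a 1-5 scale
scaled = [10.0 * r for r in rewards]

g = policy_gradient(logits, rewards)
g_scaled = policy_gradient(logits, scaled)
g_norm = policy_gradient(logits, normalize(rewards))
g_scaled_norm = policy_gradient(logits, normalize(scaled))

# Gradient magnitude grows in proportion to the reward scale...
print(norm(g_scaled) / norm(g))
# ...while normalized rewards give the same gradient at either scale.
print(norm(g_scaled_norm) / norm(g_norm))
```

Because the gradient is linear in the rewards, multiplying rewards by 10 has the same effect on the update as multiplying the learning rate by 10, which is why the two can be traded off against each other.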

Reward scaling can also affect your RL algorithm, particularly if it uses an entropy bonus for exploration (SAC, and commonly PPO).

5

BrotherAmazing t1_j4tdklr wrote

No it’s not.

Anyone who wanted to cheat on a take-home essay or assignment always could. Anyone writing a monitored in-class essay for a more critical, competitive standardized test cannot pull out a device and type into ChatGPT, which in any case doesn’t write A+ essays that a teacher can’t tell are “a little off”.

As a former educator myself, I always knew which students had mastered the material and could intelligently talk about it in class discussions, during office hours, and through in-class essays/quizzes where they could not cheat while I closely monitored. They couldn’t get an A+ by simply cheating on a few of the take-home essays, and the typical cheaters are cheating just to get by and still end up with inferior grades to those who master the subject.

Furthermore, concentrating too much on catching cheaters takes away from time you could be spending enriching the learning experience of everyone else.

It also sounds corny but is true: When you cheat, you’re only cheating yourself. Cheating really is self-policing in many instances. When we interview candidates who have a degree and a high GPA, it’s very obvious if they just got good grades but are clueless, and we don’t hire them. It might be cheating, or maybe grade inflation, or perhaps just short-term memorizing without actually retaining or understanding what they were learning, but it’s night and day.

Those who truly care to learn will excel in their jobs and get better promotions. ChatGPT isn’t going to help you there.

Having said that, I would consider modifying the curriculum if you give only take-home work that counts for 90% of the grade and you are able to change it, but it’s not worth stressing over. Put your effort into teaching and enriching the lives of those who want to learn and yearn for knowledge. You’re an educator first; police work is a side gig you can’t ignore, but it isn’t your main purpose.

2

blose1 t1_j4td3lq wrote

>Recognize the Gmail icon if I say "send an email"

This is not testing intelligence; this is testing whether a human was trained on computer usage, knows what e-mail is, and has used Gmail before.

Someone from a tribe in Africa would fail your test even though he is human and intelligent; train him on this task like you would train a current-gen multimodal system and it will pass your benchmark. You train an LLM in combination with an image model and an RL model, train it on instruction following using the inputs you described, and now it understands what it sees and can follow what you want it to do.

1

MrAcurite t1_j4t9ch1 wrote

At work, we've got this thing that will notify you if a cloud instance has been running for 24 hours. However, it does this by messaging your work email, and you can't configure it to go to a personal device or anything. Meaning, if you set a job to run at the end of the week, you can come back on Monday to over a thousand dollars of cloud charges and like fifty angry emails about it.

1