Recent comments in /f/MachineLearning

Smallpaul t1_ja6orxv wrote

> occasionally beat a much stronger player

We might occasionally win a battle against SkyNet? I actually don't understand how this is comforting at all.

> The world we live in is one of chance and imperfect information, which limits any agent's control over the outcomes.

I might win a single game against a Poker World Champion, but if we play every day for a week, the chances of me winning are infinitesimal. I still don't see this as very comforting.
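Back-of-envelope version of that (purely illustrative numbers, say I take any single game 10% of the time):

```python
from math import comb

p = 0.10   # assumed chance I win any single game against the champion
n = 7      # one game per day for a week

# Probability I win the majority (4 or more) of the 7 games
p_week = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(4, n + 1))
print(f"{p_week:.4f}")   # ~0.0027, i.e. well under 1%
```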

2

Measurex2 t1_ja6ca1j wrote

They've been going after the Healthcare vertical heavily for the last decade, and Benioff has donated like half a billion to hospitals and research from his own wallet.

https://www.salesforce.com/solutions/industries/healthcare/pharma/health-care-innovation/

I buy every dip in Salesforce because they keep delivering in so many interesting ways. The news of the co-CEO and all their product and department leads leaving dropped the stock in Q4. It's up 25% since then.

11

currentscurrents t1_ja5n5xi wrote

Those are all internal rewards, which your brain creates because it knows (according to the world model) that these events lead to real rewards. It can only do this because it has learned to predict the future.

>PPO can handle this quite well.

"Quite well" is still trying random actions millions of times. World modeling allows you to learn from two orders of magnitude less data.

2

AmalgamDragon t1_ja5lz5b wrote

This really comes down to how 'reward' is defined. I think we likely disagree on that definition, with yours being a lot narrower than mine. For example, during the cooking process there is usually a point before the meal is done where it 'smells good', which is a reward. There's dopamine release as well, which could be triggered when completing some of the steps (I don't know if that's the case or not), but simply observing that a step is complete is rewarding for lots of folks.

> Pure RL will quickly teach you not to touch the burner, but it really struggles with tasks that involve planning or delayed rewards.

Depends on which algorithms you're using, but PPO can handle this quite well.
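A minimal sketch of the kind of thing I mean (Stable-Baselines3's PPO on Gymnasium's FrozenLake, picked here just as stand-ins; FrozenLake only pays out at the goal, so the reward is fully delayed):

```python
import gymnasium as gym
from stable_baselines3 import PPO

# FrozenLake rewards you only when you reach the goal; every step before that is
# unrewarded, so credit has to be assigned back through the whole trajectory.
env = gym.make("FrozenLake-v1", is_slippery=False)

model = PPO("MlpPolicy", env, gamma=0.99, verbose=0)
model.learn(total_timesteps=100_000)

# Roll out the learned policy once to check it reaches the goal
obs, _ = env.reset()
done = False
while not done:
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, _ = env.step(int(action))
    done = terminated or truncated
print("final reward:", reward)
```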

1

currentscurrents t1_ja5isuz wrote

Imagine you need to cook some food. None of the steps of cooking gives you any reward; you only get the reward at the end.

Pure RL will quickly teach you not to touch the burner, but it really struggles with tasks that involve planning or delayed rewards. Self-supervised learning helps with this by building a world model that you can use to predict future rewards.
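Rough picture of what that looks like, as a toy sketch (every name and number here is made up for illustration): fit a model that predicts the next state and reward from random-policy experience, then use it to imagine a few steps ahead and spot the delayed reward before ever acting.

```python
import torch
import torch.nn as nn

# Environment: 1-D position, reward 1.0 only once you reach x >= 5 (delayed reward).
def env_step(x, a):                      # a in {-1, +1}
    x2 = x + a
    return x2, (1.0 if x2 >= 5 else 0.0)

# 1) Collect transitions with a random policy. This is the self-supervised part:
#    the only "label" is what actually happened next.
data = []
for _ in range(2000):
    x = float(torch.randint(-5, 6, (1,)))
    a = float(torch.randint(0, 2, (1,)) * 2 - 1)
    x2, r = env_step(x, a)
    data.append((x, a, x2, r))
X = torch.tensor([[x, a] for x, a, _, _ in data])
Y = torch.tensor([[x2, r] for _, _, x2, r in data])

# 2) Fit the world model: predict (next state, reward) from (state, action).
model = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(500):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(X), Y)
    loss.backward()
    opt.step()

# 3) Plan with the model: from x = 0, imagine 6 steps of "always left" vs
#    "always right" and sum the predicted rewards.
def imagine(x0, action, steps=6):
    x, total = torch.tensor([x0]), 0.0
    for _ in range(steps):
        pred = model(torch.tensor([[float(x), action]]))
        x, total = pred[0, 0], total + float(pred[0, 1])
    return total

print("imagined return going left :", imagine(0.0, -1.0))
print("imagined return going right:", imagine(0.0, +1.0))
```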

1