Recent comments in /f/MachineLearning

AuspiciousApple t1_jb557q8 wrote

Not obviously so.

First, de-duplicating text data didn't help much in the cramming paper. Second, even if the images are duplicates, the captions might be different, so you still learn more than if you only had one copy of each image.

Finally, even with exact copies of text and image, duplicates would just weight those images more heavily than the rest - which could harm performance, not matter at all, or even help it (for instance, if those images tend to be higher quality, more interesting, etc.).
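To make the upweighting point concrete, here's a minimal sketch (assuming PyTorch; the linear model and data are made-up toys): k exact copies of a sample contribute the same gradient as a single copy whose loss is weighted by k.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
model = nn.Linear(3, 1)            # toy model, arbitrary sizes
x, y = torch.randn(1, 3), torch.randn(1, 1)
k = 4                              # number of exact duplicates

# Loss summed over k duplicated copies of the same sample
loss_dup = sum(F.mse_loss(model(x), y) for _ in range(k))
grad_dup = torch.autograd.grad(loss_dup, model.weight)[0]

# Single copy with its loss weighted by k
loss_weighted = k * F.mse_loss(model(x), y)
grad_weighted = torch.autograd.grad(loss_weighted, model.weight)[0]

print(torch.allclose(grad_dup, grad_weighted))  # True: duplication == upweighting
```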

98

Philpax t1_jb53hvo wrote

You are probably fine, but note that a) people will likely be very angry with you, whether or not the licensing permits it, and b) this is a non-trivial problem, and even more non-trivial to train.

Good luck, though!

5

I_will_delete_myself t1_jb532v5 wrote

Intelligence is the ability to distill complex information into a simple explanation that a child can understand.

It makes me skeptical when someone can't explain their choice beyond performance reasons. Most people just use the cloud because ML networks, regardless of size, drain a lot of battery.

−1

Hiitstyty t1_jb4wjnj wrote

It helps to think of the bias-variance trade-off in terms of the hypothesis space. Dropout trains a random subnetwork at every iteration. The hypothesis space of the full network always contains (and is larger than) the hypothesis space of any subnetwork, because the full network has greater expressive capacity. Thus, no subnetwork can be less biased than the full network. However, any subnetwork will have reduced variance because of its smaller hypothesis space. Dropout helps because its reduction in variance offsets its increase in bias. However, as the dropout proportion is set higher, the bias eventually becomes too great to overcome.
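As a minimal illustration of the subnetwork view (assuming PyTorch; the layer sizes and input are arbitrary), each forward pass in training mode samples a different random subnetwork, while eval mode uses the full network:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

net = nn.Sequential(
    nn.Linear(4, 8),
    nn.ReLU(),
    nn.Dropout(p=0.5),  # the dropout proportion discussed above
    nn.Linear(8, 1),
)
x = torch.randn(1, 4)

net.train()            # dropout active: a random subnetwork per call
print(net(x), net(x))  # two different outputs
net.eval()             # dropout off: the full (deterministic) network
print(net(x))
```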

8