Recent comments in /f/MachineLearning

Animated-AI OP t1_j9gvww8 wrote

Thanks for the feedback! I agree; the animations are only meant to be visual aids in the context of some larger explanation (lecture, blog post, etc). In my case, I'm making YouTube videos to serve as complete explanations.

Transformers have been the most requested topic on my YouTube channel. So I'm going to attempt to make videos/animations about that when I finish my current series on convolution.

24

currentscurrents t1_j9gvv4k wrote

(a) Lower-dimensional features are useful for most tasks, not just the output, and (b) real data almost always has a lower intrinsic dimension.

For example, if you want to recognize faces, you'd have a much easier time recognizing patterns in things like gender, the shape of facial features, and hair color than in raw pixel data. Most individual pixel values are irrelevant.
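
Here's a quick toy sketch of what "lower intrinsic dimension" means (my own illustration, all the numbers are arbitrary): 3-dimensional data embedded in 100 ambient dimensions, where PCA immediately finds the 3 real factors.

```python
import numpy as np

rng = np.random.default_rng(0)
latent = rng.normal(size=(1000, 3))              # the "real" 3-d factors
embed = rng.normal(size=(3, 100))                # random linear embedding into 100-d
data = latent @ embed + 0.01 * rng.normal(size=(1000, 100))  # tiny noise

# PCA via SVD: almost all variance sits in the first 3 components.
_, s, _ = np.linalg.svd(data - data.mean(axis=0), full_matrices=False)
var = s**2 / (s**2).sum()
print(var[:5])  # first 3 entries dominate; the rest are ~0
```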

5

vladosaurus OP t1_j9gu3cr wrote

Ideally, we would generate many such examples without having seen them, wrap them in a test suite, and use automatic differentiation to check how many come out correct.

Something similar to what the authors did with the OpenAI Codex model: they provided the function signature and the docstring and prompted the model to generate the rest. Then they wrapped each generated function in a test suite and calculated how many of them pass. That's the pass@k metric.
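
For reference, the unbiased pass@k estimator from the Codex paper looks roughly like this (n generated samples per problem, c of which pass the tests):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k: 1 - C(n-c, k) / C(n, k),
    computed in a numerically stable way."""
    if n - c < k:
        return 1.0  # fewer than k failing samples: any k-subset contains a pass
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# e.g. 200 samples, 12 passing; probability at least one of 10 draws passes:
print(pass_at_k(n=200, c=12, k=10))
```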

I am not aware of anything similar being done for differentiation; maybe there is, I'll have to search for it.

0

currentscurrents t1_j9gp4uq wrote

> From an information theory standpoint, it creates potential information loss due to the lower dimensionality.

Exactly! That's the point.

The bottleneck forces the network to throw away the parts of the data that don't contain much information. It learns to encode the data in an information-dense representation so that the decoder on the other side of the bottleneck can work with high-level ideas instead of pixel values.

If you manually tweak the values in the bottleneck, you'll notice it changes high-level attributes of the data, like the gender or shape of a face, not individual pixel values. This is how autoencoders work; a U-Net is basically an autoencoder with skip connections.
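
If it helps, here's a minimal sketch of that bottleneck idea in PyTorch (layer sizes are arbitrary, just for illustration):

```python
import torch
from torch import nn

# Minimal autoencoder: 784-d input squeezed through a 32-d bottleneck.
encoder = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 32))
decoder = nn.Sequential(nn.Linear(32, 256), nn.ReLU(), nn.Linear(256, 784))

x = torch.rand(16, 784)                   # stand-in for flattened images
z = encoder(x)                            # 32 numbers must carry all the info
recon = decoder(z)
loss = nn.functional.mse_loss(recon, x)   # train to reconstruct through the squeeze

# Tweaking z and decoding again changes high-level structure, not single pixels:
recon2 = decoder(z + 0.1 * torch.randn_like(z))
```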

Interestingly, biological neural networks that handle feedforward perception seem to do the same thing. Take a look at the structure of an insect antenna; thousands of input neurons bottleneck down to only 150 neurons, before expanding again for processing in the rest of the brain.

26

Animated-AI OP t1_j9gojs5 wrote

I'm using Blender and making heavy use of the Geometry Nodes feature. Unfortunately, these animations have taken a lot of effort and Blender-specific knowledge, and building on top of my work for a new application would require more of both. But if others aren't deterred by that, I could publish the Blender files.

23

pyepyepie t1_j9gnf6y wrote

I've heard this anecdote, but I was kind of hoping for non-trivial cases from everyday life at work. I feel I understand SGD perfectly fine without learning to solve complicated DEs, but maybe that's limiting me on other tasks, or in my ability to analyze ML algorithms. Are you sure it's the right hierarchy to say that SGD is rooted in differential equations? I mean, I agree you're right that it's a differential equation, but are the methods you learn in differential equations courses useful for ML?

I found a nice article about the link to SGD: https://tivadardanka.com/blog/why-does-gradient-descent-work - but I'm not sure I'm convinced (again, I'm still an idiot about this and shouldn't have opinions about links to differential equations lol - but to me, fitting SGD into the framework of differential equations goes against the KISS principle). Sorry if I'm going too deep; I'm just trying to figure out how much effort to put into it (I could study it all day for fun, but we have work and so on), since our time is limited :)
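
From what I can tell, the article's claim is just that gradient descent is the Euler discretization of the gradient-flow ODE x'(t) = -∇f(x(t)). A toy sketch of what that correspondence means (my own example on f(x) = x²):

```python
# Toy objective f(x) = x^2, gradient f'(x) = 2x.
grad = lambda x: 2 * x

# Gradient descent with step size 0.1...
x_gd = 1.0
for _ in range(10):
    x_gd -= 0.1 * grad(x_gd)

# ...is exactly Euler integration of x'(t) = -f'(x) with dt = 0.1.
x_ode = 1.0
for _ in range(10):
    x_ode += 0.1 * (-grad(x_ode))

print(x_gd, x_ode)  # identical trajectories
```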

Thanks for the answer! You (and my own thinking today) convinced me that it's terrible I don't know this and that I should learn it ASAP.

1

MonsieurBlunt t1_j9glzsp wrote

Accommodating as much space for information as you can is not really a good idea: it is prone to overfitting and also harder to learn. You can think of the bottleneck as a form of regularisation; you are forcing the model to keep the useful information and drop the rest, or put differently, you leave less space where it can encode the training data and overfit.

3

Professional_Poet489 t1_j9gk545 wrote

Re: regularization - by using fewer numbers to represent the same output info, you are implicitly reducing the dimensionality of your function approximator.

Re: (a), (b) - generally, in big nets you want to regularize because you will otherwise overfit. It's not about the output dimension; it's that you have a giant approximator (i.e., a billion params) fitting data of much smaller dimensionality, and you have to do something about that. The output can be "cat or not" and you'll still have the same problem.
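
A scaled-down toy version of that problem (my own sketch, obviously nothing like a billion params): a high-capacity fit on few points nails the training data and falls apart off of it.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + 0.1 * rng.normal(size=10)

# Big approximator (degree-9 polynomial, 10 params for 10 points) vs a small one.
big = np.polynomial.Polynomial.fit(x_train, y_train, deg=9)
small = np.polynomial.Polynomial.fit(x_train, y_train, deg=3)

x_test = np.linspace(0, 1, 100)
y_test = np.sin(2 * np.pi * x_test)
for name, model in [("deg 9", big), ("deg 3", small)]:
    train_err = np.mean((model(x_train) - y_train) ** 2)
    test_err = np.mean((model(x_test) - y_test) ** 2)
    print(name, train_err, test_err)  # deg 9: ~0 train error, typically worse test error
```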

9

TemperatureStatus435 t1_j9gk2mn wrote

Regularization in some vague sense applies, but there are different kinds of it, so you have to be more specific. For example, an autoencoder uses a bottleneck layer to learn information-dense representations of the input space, and it may employ some mathematical regularization so that the raw numbers don't explode to infinity.

However, a variational autoencoder employs the methods above plus an additional type of regularization: a KL-divergence term that normalizes the distribution of the bottleneck layer so that it is close to Gaussian. This is extremely useful to do, but for entirely different reasons.
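
Concretely, for the usual diagonal-Gaussian encoder, that extra regularizer is the closed-form KL divergence to a standard normal; a minimal sketch (not tied to any particular implementation):

```python
import torch

def vae_kl(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """KL( N(mu, exp(logvar)) || N(0, 1) ), summed over latent dims.
    Added to the reconstruction loss, it pulls the bottleneck toward a Gaussian."""
    return 0.5 * torch.sum(logvar.exp() + mu.pow(2) - 1.0 - logvar, dim=-1)

mu, logvar = torch.zeros(4, 32), torch.zeros(4, 32)
print(vae_kl(mu, logvar))  # zero when the bottleneck is already standard normal
```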

Long story short, don’t just say “regularization” and think you understand what’s going on.

6