Recent comments in /f/MachineLearning

tysam_and_co OP t1_j6hfdaj wrote

And, safe to say, there's stuff I'm not sharing here (yet?) that I've found as a result of that. Some hyperparameters are more network-specific, some are dataset-specific. And some behave in ways just weird enough that you might get an entirely new adaptive method out of it... ;))))

I hadn't thought about it in quite the words you put it for a long while, but I think you're very much right! It is quite a fun game, and very interesting to play in its own right. There's very much this bizarre, high-dimensional "messy zen" to it all. ;D

Thanks again for your comment, it warmed my heart and made me smile seeing it. Have a good evening/night/etc! :D :))) <3 <3 :D

3

Acceptable-Cress-374 t1_j6haw2s wrote

I tested this with text-davinci-003.

Prompt: Write a post about liking fruity icecream, in the style of a reddit post. Make some minor grammar mistakes while writing the post.

> hey everyone, so I'm totally loving this fruity icecream lately! I got it from a store near my house and its been so good. Its really sweet but not overly sickeningly sweet, like some other ice cream. Plus the fact that its fruity makes it even better in my opinion. Does anyone else like fruity icecream too? It'd be awesome to hear some of your thoughts on it!

This site gave me this:

85%

HUMAN-GENERATED CONTENT
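
For reference, a sketch of how that test could be reproduced with the legacy (pre-1.0) openai Python client that was current when text-davinci-003 was live; the sampling parameters here are guesses, not what I actually used:

```python
import openai  # openai<1.0, legacy Completions API

# Hedged reproduction of the test above; max_tokens/temperature are assumptions.
response = openai.Completion.create(
    model="text-davinci-003",
    prompt=(
        "Write a post about liking fruity icecream, in the style of a "
        "reddit post. Make some minor grammar mistakes while writing the post."
    ),
    max_tokens=200,
    temperature=0.7,
)
print(response["choices"][0]["text"].strip())
```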

3

JamesBaxter_Horse t1_j6h8tib wrote

If I understand correctly, you're tuning hyperparameters with the intent of minimising training time. What do you see as the purpose of this? Presumably, all you're achieving is successfully minimising the inductive space and optimising the learning parameters so as to converge as quickly as possible, but these results are completely specific to CIFAR and would not be reproducible on a different dataset.

19

tysam_and_co OP t1_j6h8nhh wrote

Excellent, and thank you very much for sharing that paper, I shall have to take a look at it! :D

I might need to do some operator fusion manually at some point in the future, though I'm hoping the torch.compile() command does it well (but I am somewhat scared, because compiled territory can be more rigid and error-prone).
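
As a rough illustration (not the repo's actual code), torch.compile in PyTorch 2.x will fuse chains of pointwise ops into fewer kernels without any manual work; the toy model below is a stand-in:

```python
import torch
import torch.nn as nn

# Toy stand-in model; TorchInductor can fuse the pointwise norm/activation
# ops into the surrounding kernels on the first (compiling) call.
model = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=2, padding=0),
    nn.BatchNorm2d(64),
    nn.GELU(),
)
compiled = torch.compile(model)

x = torch.randn(64, 3, 32, 32)
y = compiled(x)  # first call compiles + fuses; later calls reuse the cached kernels
```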

1

tysam_and_co OP t1_j6h8h8n wrote

I'm sure there's the classical leakage of the val set into the network design via val-set performance guiding super-tight tweaks. Thankfully, some of the compression from simplifying things in the network seems to act as a defending principle against that, but if I were hard-pressed, doing a final round with k-fold validation would probably be really good for final tuning runs.
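
A minimal sketch of what that final k-fold round could look like, assuming a generic `run_training(train_subset, val_subset) -> float` helper (a placeholder, not anything from the actual repo):

```python
import torch
from torch.utils.data import Subset

def kfold_eval(dataset, k=5):
    # run_training is an assumed helper that trains once and returns val accuracy.
    # Shuffle indices once, then rotate which fold plays validation set.
    idx = torch.randperm(len(dataset))
    folds = idx.chunk(k)
    scores = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = torch.cat([f for j, f in enumerate(folds) if j != i])
        scores.append(run_training(Subset(dataset, train_idx.tolist()),
                                   Subset(dataset, val_idx.tolist())))
    # Tune against the mean instead of one fixed split to limit leakage.
    return sum(scores) / k
```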

There might be a CIFAR10 test set (see, goes to show how little I know about the CIFAR10 split, lol), and there has been a lot of work put into studying some aspects (and flaws, like mislabeling, etc) of the structure of CIFAR10 in general.

Mainly -- this task is primarily a proxy for "how fast can we ingest the information encoded in dataset A into a compressed form B so that it will perform adequately well on a task for a dataset we call C". It starts getting down to the engineering barebones of information flow at that point, and a lot of other really involved stuff that I don't want to break into while our training times are 4-5 seconds or so.

Any usable concepts distilled from this would apply to nearly any dataset and any problem -- say, training GPT on an absolutely massive one-shot dataset, for example. The math is the same anywhere you go.

I don't know if that answers your question or not, my apologies if I didn't understand properly.

5

batrobin t1_j6h8d9d wrote

Thank you. You have answered what I had in mind. I was thinking about techniques like changing memory access patterns, changing memory layout, custom CUDA kernels, fusing operations, reducing overheads, etc., some of which are mentioned in this paper: https://arxiv.org/abs/2007.00072. I also see that you have done some profiling in your issue; that should be interesting to read.

I was previously working on some large-scale transformer code optimization; seems like this repo would be good to learn from. Thanks a lot.

3

amxdx t1_j6h51rh wrote

I studied EE signal processing in my Bachelor's and Master's, and now work as a machine learning engineer.

In signal processing, you take the signal, define a desirable output and try to create a system to get to that output.

In ML, you typically have the input and the output, and your machine figures out the system depending on the network architecture you provide.

Both are solving the same problem most of the time, and extensive signal processing knowledge helps a ton in understanding what's happening (and what should happen) within a network.

And FWIW, EE signal processing coursework covers most of the requirements for AI, and makes for good ML engineers, minus the production-level coding.

2

tysam_and_co OP t1_j6h2qlh wrote

That's a good question, and I'm somewhat curious what you mean by HPC/MLHPC techniques or code optimizations. Do you mean something like distributed computing? (which is an interesting rabbit hole to get into on its own....)

Regardless -- yep! I'm a professional in this industry, and there's a lot of detail underlying a ton of seemingly-simple changes (and it's potentially even more frustrating when simple, understandable changes shave off large chunks of what was previously the world record). So basically everything I'm doing is informed by, well, years of doing basically this exact same thing over and over again. Something I've found is that younger/newer ML engineers (myself included when I was at that point) are often really attracted to the "new shiny", when in reality good HPC on a smaller scale is like writing a really nice, quiet, tight haiku. Less is more, but a single 'syllable' equivalent can make or break the whole thing.

Lots of people use models inefficiently. This model is still somewhat inefficient in its own ways, though I think it is by far more efficient than nearly all of the ones it's currently competing with. When I design a model, I'm thinking about keeping GPU occupancy high, utilizing tensor cores as much as possible, mathematically fusing operations to reduce overhead, managing memory layout to make sure the right paths get activated in the GPU (like tensor cores, etc.), and seeing if there are good approximations or alternatives to some operations that are mathematically much cheaper (or if there are alternate paths with specialized kernels that I can boutique-design the network around).
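
As a concrete (if simplified) illustration of two of those levers: channels-last memory layout plus half precision is the standard plain-PyTorch way to steer convolutions onto tensor cores. The layer below is a stand-in, not the actual network:

```python
import torch
import torch.nn as nn

# Stand-in conv layer; NHWC ("channels_last") layout + fp16 let cuDNN
# select tensor-core kernels instead of generic fp32 ones.
layer = nn.Conv2d(3, 64, kernel_size=2, padding=0).cuda()
layer = layer.to(memory_format=torch.channels_last).half()

x = torch.randn(512, 3, 32, 32, device="cuda")
x = x.to(memory_format=torch.channels_last).half()

with torch.no_grad():
    y = layer(x)  # hits tensor cores on Volta+ GPUs (given supported shapes)
```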

I'll have to cut myself short, but I'll leave you with a single example: a technical breakdown behind what was, in practice, a very 'simple' change in the backend. I also made another comment in this reddit thread (https://www.reddit.com/r/MachineLearning/comments/10op6va/comment/j6h0z6b/?utm_source=share&utm_medium=web2x&context=3) with a technical breakdown behind one other very 'simple' change. Don't get pulled away by the shiny, fancy techniques that are slow/etc.; sometimes the simplest is the best!

Here's the breakdown: https://github.com/tysam-code/hlb-CIFAR10/issues/2#issuecomment-1379711156

Let me know if this answered your question at all or if you have any follow-ups, much love, cheers, and thanks! <3 :D :D :D :D :))))

2

Ch1nada OP t1_j6h1gdw wrote

Thank you! Tbh I really over-engineered it at first: trained a model to classify articles into sub-categories, then built a query around that to fetch contextualized videos from Pexels, but it was really clunky (e.g., "stock market" could return something like someone buying fruit at a farmer's market). Currently, I have two pools of videos I curate manually, and the content creator script just picks randomly from either the stocks pool or the crypto pool.
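
For the curious, that pool-picking step reduces to something like the sketch below; the file paths and the even 50/50 split are assumptions for illustration, not details from the actual script:

```python
import random

# Two hand-curated pools; paths here are made up for illustration.
stocks_pool = ["videos/stocks/ticker_board.mp4", "videos/stocks/trading_floor.mp4"]
crypto_pool = ["videos/crypto/btc_logo.mp4", "videos/crypto/mining_rig.mp4"]

pool = random.choice([stocks_pool, crypto_pool])  # pick a pool at random
clip = random.choice(pool)                        # then a clip from that pool
```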

2

tysam_and_co OP t1_j6h0z6b wrote

Thanks for asking, great question! It's really hard to pick at this point -- mostly it's just a hardcore case of "do the basics and do them really, _really_ well" as best I can, with a few smaller tricks along the way. There may be some much more exotic things later on, but experience has taught me to delay that for as long as is humanly possible! Plus, the bonus is that things get to stay simpler. Arguably, some aspects of this code are actually simpler than the baseline, believe it or not!

That said, if I had to pick a trick, I think it would be 'upgrading' the whitening convolution from 3x3 to 2x2. I think that saved around a full second and a half alone, when combined with the 'padding=0' change at the start. Most of the in-practice changes here are pretty simple, but what's happening is that we're projecting from the input image to a whitened feature space. The 3x3 convs result in a 3*3*3 = 27-dimensional input feature space without any downstriding, and this can be horribly slow, as the spatially large layers are always the slowest compute-wise -- deeper layers without much spatial width or height are very snappy by comparison (correct me if I'm wrong, but I think this has to do with the SIMD architecture of GPUs -- in any case, spatial stuff with 2D convolutions tends to be hilariously inefficient).

Not padding cuts off a potentially expensive kernel call (I don't know if it's fused or not...), and reduces the spatial size IIRC from 32x32 -> 30x30. This is a deceptively large (roughly ~12%) savings in pixel count, but not everything is lost, as that lovely 2x2 convolution still touches everything (I could theorize about the efficiency of processing the edges of pictures, but I could also be horribly wrong, so I'm going to keep my mouth shut here). So, summing it up: we move from a 3*3*3 = 27-dimensional input feature space to a 2*2*3 = 12-dimensional one, remove ~12% of the pixels without necessarily deleting that information directly, and, most importantly, we only have to pay 2*2/(3*3) = 4/9 ≈ 44% of the input kernel cost.

And that's why I'm proud of that little trick. It's very unassuming, since it's just:

Conv(input_depth, output_depth, kernel_size=3, padding='same') -> Conv(input_depth, output_depth, kernel_size=2, padding=0)
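
In PyTorch terms, the before/after is roughly the following (channel counts are placeholders; the real network's widths may differ):

```python
import torch.nn as nn

# Before: 3x3 patches -> 3*3*3 = 27-dim input space; 'same' padding keeps 32x32.
before = nn.Conv2d(3, 32, kernel_size=3, padding="same")

# After: 2x2 patches -> 2*2*3 = 12-dim input space, no padding,
# and only 4/9 (~44%) of the per-position kernel cost.
after = nn.Conv2d(3, 32, kernel_size=2, padding=0)
```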

Now of course there's a bit of a hit to accuracy, but the name of the game here is leverage, and that's what the squeeze-and-excite layers are for. They're very fast but add a huge amount of accuracy, though (and I unfortunately don't think I've noted this anywhere else) for some reason they are very sensitive to the compression dimension -- 16, in this case.
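
For reference, a generic squeeze-and-excite block looks like the sketch below; the compression dimension of 16 is the number from above, but the rest is the standard formulation (Hu et al.), not necessarily this repo's exact code:

```python
import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    """Standard squeeze-and-excite: global pool, bottleneck MLP, channel gates."""
    def __init__(self, channels, compression_dim=16):  # 16 = the sensitive dim noted above
        super().__init__()
        self.excite = nn.Sequential(
            nn.Linear(channels, compression_dim),
            nn.GELU(),
            nn.Linear(compression_dim, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.excite(x.mean(dim=(2, 3)))  # squeeze: global average pool
        return x * w.view(b, c, 1, 1)        # excite: per-channel reweighting
```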

Though to be frank, I started with squeeze-and-excite and got my accuracy increase, then pulled this trick off the shelf to 'cash in' the speed increase. I've been sitting on this one since before even the last release; I've found it's good not to be like (noise warning) https://www.youtube.com/watch?v=ndVhgq1yHdA on projects like these. Taking the time to be good and slow is good!

I hope that helps answer your question; I know this was a really long answer. Paradoxically, I get far more verbose the more tired I get, and poor next-day-me has to deal with it, lol.

Again, to get below 2 seconds we're going to have to get progressively more fancy and "flashy", but for now the plan is to build a really, really freaking solid core of a network, and then get into the more 'exotic' stuff. And even then, hopefully sticking to the more mundane end of the exotic stuff while we're at it.

Feel free to let me know if you have any other questions (or follow-ups, or if this wasn't what you were looking for, etc.)! :D

5