Recent comments in /f/MachineLearning
tysam_and_co OP t1_j6hfdaj wrote
Reply to comment by LeanderKu in [R] Train CIFAR10 in under 10 seconds on an A100 (new world record!) by tysam_and_co
And, safe to say, there's stuff I'm not sharing here (yet?) that I've found as a result of that. Some hyperparameters are more network-specific, some are dataset-specific. And some behave in ways just weird enough that you might get an entirely new adaptive method out of them... ;))))
I hadn't thought about it in the exact words you put it for a long while, but I think you're very much right! It is quite a fun game, and very interesting to play in its own right. There's very much this bizarre, high-dimensional "messy zen" to it all. ;D
Thanks again for your comment, it warmed my heart and made me smile seeing it. Have a good evening/night/etc! :D :))) <3 <3 :D
LeanderKu t1_j6hd6hp wrote
Reply to comment by JamesBaxter_Horse in [R] Train CIFAR10 in under 10 seconds on an A100 (new world record!) by tysam_and_co
Well, this is also an assumption. It would be interesting to see which lessons translate and which don't. I wouldn't dismiss it so quickly. Also, it's a fun game to play and interesting in its own right!
Acceptable-Cress-374 t1_j6haw2s wrote
Reply to [P] AI Content Detector by YoutubeStruggle
I tested this with text-davinci-003.
Prompt: Write a post about liking fruity icecream, in the style of a reddit post. Make some minor grammar mistakes while writing the post.
> hey everyone, so I'm totally loving this fruity icecream lately! I got it from a store near my house and its been so good. Its really sweet but not overly sickeningly sweet, like some other ice cream. Plus the fact that its fruity makes it even better in my opinion. Does anyone else like fruity icecream too? It'd be awesome to hear some of your thoughts on it!
This site gave me this:
85 %
HUMAN-GENERATED CONTENT
JamesBaxter_Horse t1_j6h8tib wrote
If I understand correctly, you're tuning hyperparameters with the intent to minimise training time. What do you see as the purpose of this? Presumably, all you're achieving is successfully minimising the inductive space and optimising the learning parameters so as to converge as quickly as possible, but these results are completely specific to CIFAR and would not be reproducible on a different dataset.
yauangon t1_j6h8s2s wrote
Reply to comment by Anvilondre in [D] Simple Questions Thread by AutoModerator
I will give it a shot :D Thank you a lot :D
tysam_and_co OP t1_j6h8nhh wrote
Reply to comment by batrobin in [R] Train CIFAR10 in under 10 seconds on an A100 (new world record!) by tysam_and_co
Excellent, and thank you very much for sharing that paper, I shall have to take a look at it! :D
I might need to do some operator fusion manually at some point in the future, though I'm hoping the torch.compile() command does it well (but I am somewhat scared because compiling territory can be more rigid and error-prone).
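(For the curious, a minimal numpy sketch of what manual operator fusion can buy you — here, folding an inference-time batchnorm's per-channel scale and shift into a preceding 1x1 conv, treated as a matmul. All names are illustrative, not from the hlb-CIFAR10 repo.)

```python
import numpy as np

rng = np.random.default_rng(0)

# A 1x1 conv over C_in channels is just a matmul: y = W @ x + b
C_in, C_out, pixels = 3, 8, 16
W = rng.normal(size=(C_out, C_in))
b = rng.normal(size=(C_out, 1))
x = rng.normal(size=(C_in, pixels))

# BatchNorm at inference time is an affine map per channel:
#   bn(y) = gamma * (y - mean) / sqrt(var + eps) + beta
gamma = rng.normal(size=(C_out, 1))
beta = rng.normal(size=(C_out, 1))
mean = rng.normal(size=(C_out, 1))
var = rng.uniform(0.5, 2.0, size=(C_out, 1))
eps = 1e-5

scale = gamma / np.sqrt(var + eps)

# Unfused: conv, then batchnorm (two passes over the activations)
unfused = scale * ((W @ x + b) - mean) + beta

# Fused: fold the scale/shift into the conv weights ahead of time,
# so inference is a single matmul + bias
W_fused = scale * W
b_fused = scale * (b - mean) + beta
fused = W_fused @ x + b_fused

assert np.allclose(unfused, fused)
```

This is the same algebra `torch.compile()`-style fusion does automatically; doing it by hand just guarantees it happens.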
tysam_and_co OP t1_j6h8h8n wrote
Reply to comment by oh__boy in [R] Train CIFAR10 in under 10 seconds on an A100 (new world record!) by tysam_and_co
I'm sure there's the classical leakage of the val set into the network design via val set performance for super tight tweaks. Thankfully some of the compression from simplifying things in the network seems to be a defending principle against that, but if I was hard-pressed, doing a final round with k-fold validation would probably be really good for final tuning runs.
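(The k-fold idea above, as a minimal numpy sketch — `kfold_indices` is a hypothetical helper, not something from the repo.)

```python
import numpy as np

def kfold_indices(n_samples: int, k: int, seed: int = 0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    folds = np.array_split(idx, k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

# e.g. 5 folds over CIFAR10's 50k training images:
splits = list(kfold_indices(50_000, k=5))
```

Each image lands in exactly one validation fold, so tuning against the average of the five val scores leaks far less than tuning against one fixed split.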
There might be a CIFAR10 test set (see, goes to show how little I know about the CIFAR10 split, lol), and there has been a lot of work put into studying some aspects (and flaws, like mislabeling, etc) of the structure of CIFAR10 in general.
Mainly -- this task is primarily a proxy for "how fast can we ingest the information encoded in dataset A into a compressed form B so that it will perform adequately well on a task for a dataset we call C". It starts getting down to the engineering barebones of information flow at that point, and a lot of other really involved stuff that I don't want to break into while our training times are 4-5 seconds or so.
The concepts, if any usable ones, distilled from this, would apply to nearly any dataset and any problem -- say, training GPT on an absolutely massive oneshot dataset, for example. The math is the same anywhere you go.
I don't know if that answers your question or not, my apologies if I didn't understand properly.
batrobin t1_j6h8d9d wrote
Reply to comment by tysam_and_co in [R] Train CIFAR10 in under 10 seconds on an A100 (new world record!) by tysam_and_co
Thank you, you've answered what I had in mind. I was thinking about techniques like changing memory access patterns, changing memory layout, custom CUDA kernels, fusing operations, reducing overheads, etc., some of which are mentioned in this paper: https://arxiv.org/abs/2007.00072. I also see that you have done some profiling in your issue; it should be interesting to read into.
I was previously working on some large scale transformer code optimization, seems like this repo would be good to learn from, thanks a lot.
oh__boy t1_j6h7zc2 wrote
Reply to comment by tysam_and_co in [R] Train CIFAR10 in under 10 seconds on an A100 (new world record!) by tysam_and_co
Do you have a dedicated development set apart from your test set to tune these hyperparameters? Or am I missing the point, and this is not meant to be a general improvement but rather to see just how fast you can train on this single dataset?
a_khalid1999 OP t1_j6h6e57 wrote
Reply to comment by amxdx in [D] AI Theory - Signal Processing? by a_khalid1999
I see. As far as research is concerned, I guess, EE's in ML shouldn't be seen as "changing their field"
amxdx t1_j6h51rh wrote
Reply to [D] AI Theory - Signal Processing? by a_khalid1999
I studied EE Signal processing in Bachelors and Masters, and now work as a machine learning engineer.
In signal processing, you take the signal, define a desirable output and try to create a system to get to that output.
In ML, you typically have the input and the output, and your machine figures out the system depending on the network architecture you provide.
Both are solving the same problem most times, and extensive signal processing knowledge helps a ton in understanding what's happening/what should happen within a network.
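(A tiny illustration of "the machine figures out the system": recovering an unknown FIR filter purely from input/output data via least squares. The filter and names here are made up for the example.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Classic signal processing: you *design* h.
# The ML view: given input x and output y, *learn* the system h from data.
true_h = np.array([0.5, -0.25, 0.125])          # unknown 3-tap FIR filter
x = rng.normal(size=200)
y = np.convolve(x, true_h, mode="full")[: len(x)]  # observed system output

# Build the convolution design matrix (column i = x delayed by i samples)
# and solve least squares for the filter taps.
taps = len(true_h)
X = np.column_stack(
    [np.concatenate([np.zeros(i), x[: len(x) - i]]) for i in range(taps)]
)
h_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

assert np.allclose(h_hat, true_h, atol=1e-8)
```

A linear neuron trained on (x, y) pairs would converge to the same taps; the signal-processing framing just makes it obvious what the "weights" mean.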
And FWIW, EE signal processing coursework has most of the requirements for AI covered, and it makes for good ML engineers, minus the production-level coding.
tysam_and_co OP t1_j6h2qlh wrote
Reply to comment by batrobin in [R] Train CIFAR10 in under 10 seconds on an A100 (new world record!) by tysam_and_co
That's a good question, and I'm somewhat curious what you mean by HPC/MLHPC techniques or code optimizations. Do you mean something like distributed computing? (which that is an interesting rabbit-hole alone to get into....)
Regardless -- yep! I'm a professional in this industry, and there's a lot of detail underlying a ton of seemingly-simple changes (and potentially even more frustrating if simple, understandable changes shave off large chunks and swathes of what was previously the world record). So basically everything I'm doing is informed by, well, years of doing basically this exact same thing over and over again. Something that I've found is that the younger/newer ML engineers (myself included when I was at that point) are often really attracted to the "new shiny", when in reality, good HPC on a smaller scale is like writing a really nice, quiet, tight haiku. Less is more, but a single 'syllable' equivalent can make or break the whole thing.
Lots of people use models inefficiently. This model is still somewhat inefficient in its own ways, though I think it's far more efficient than nearly all of the ones it's currently competing with. When I design a model, I'm thinking about keeping the GPU occupancy high, utilizing tensor cores as much as possible, mathematically fusing operations to reduce overhead, managing memory layout to make sure the right paths get activated in the GPU (like tensor cores, etc), and seeing if there are good approximations or alternatives to some things that are much more mathematically cheap (or if there are alternate paths with specialized kernels that I can boutique-design the network around).
I'll have to cut myself short early, but I'll leave you with a singular example, which is a technical breakdown which was behind what was in practice a very 'simple' change in the backend. I also made another comment in here, this reddit thread (https://www.reddit.com/r/MachineLearning/comments/10op6va/comment/j6h0z6b/?utm_source=share&utm_medium=web2x&context=3), with a technical breakdown behind one other very 'simple' change. Don't get pulled away by the shiny fancy techniques that are slow/etc, sometimes the simplest is the best!
Here's the breakdown: https://github.com/tysam-code/hlb-CIFAR10/issues/2#issuecomment-1379711156
Let me know if this answered your question at all or if you have any follow-ups, much love, cheers, and thanks! <3 :D :D :D :D :))))
a_khalid1999 OP t1_j6h20iq wrote
Reply to comment by CriticalTemperature1 in [D] AI Theory - Signal Processing? by a_khalid1999
Thanks, will look into it
a_khalid1999 OP t1_j6h1z02 wrote
Reply to comment by Main_Mathematician77 in [D] AI Theory - Signal Processing? by a_khalid1999
Thanks
a_khalid1999 OP t1_j6h1win wrote
Reply to comment by mr_birrd in [D] AI Theory - Signal Processing? by a_khalid1999
RL sure seems like a fun field to get into
a_khalid1999 OP t1_j6h1uli wrote
Reply to comment by MrAcurite in [D] AI Theory - Signal Processing? by a_khalid1999
Guess we shouldn't be taking the EE background for granted
a_khalid1999 OP t1_j6h1rxb wrote
Reply to comment by Autogazer in [D] AI Theory - Signal Processing? by a_khalid1999
Seems to make sense considering how convolutions are simply multiplications in the Fourier domain
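(A quick numerical check of that convolution theorem, for anyone who hasn't seen it in code — circular convolution computed directly versus pointwise multiplication of FFTs.)

```python
import numpy as np

rng = np.random.default_rng(0)
N = 64
x = rng.normal(size=N)
h = rng.normal(size=N)

# Circular convolution computed directly from the definition...
direct = np.array(
    [sum(x[m] * h[(n - m) % N] for m in range(N)) for n in range(N)]
)

# ...equals pointwise multiplication in the Fourier domain.
via_fft = np.fft.ifft(np.fft.fft(x) * np.fft.fft(h)).real

assert np.allclose(direct, via_fft)
```

This is also why FFT-based convolution wins for large kernels: O(N log N) versus O(N^2).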
batrobin t1_j6h1o7u wrote
I am surprised to see that most of the work you have done is on hyperparameter tuning and model tricks. Have you tried any HPC/MLHPC techniques, profiling, or code optimizations? Are they on a future roadmap, not the goal of this project, or is there just not much to improve in that direction?
Ch1nada OP t1_j6h1gdw wrote
Reply to comment by doctorjuice in [P] Automating a Youtube Shorts channel with Huggingface Transformers and After Effects by Ch1nada
Thank you! Tbh I really over-engineered it at first: trained a model to classify articles into sub-categories, then built a query around that to fetch contextualized videos from Pexels, but it was really clunky (e.g., "stock market" could return something like someone buying fruit at a farmer's market). Currently, I have two pools of videos I curate manually, and the content-creator script just picks randomly from either the stocks pool or the crypto pool.
tysam_and_co OP t1_j6h0z6b wrote
Reply to comment by jobeta in [R] Train CIFAR10 in under 10 seconds on an A100 (new world record!) by tysam_and_co
Thanks for asking, great question! I'd say it's really hard to pick at this point -- mostly it's just a hardcore case of "do the basics and do them really, _really_ well" as best as I can, with a few smaller tricks along the way. There may be some much more exotic things later on, but experience has taught me to try to delay that for as long as is humanly possible! Plus, the bonus is that things get to be simpler. Arguably, some aspects of this code are actually simpler than the baseline, believe it or not!
That said, if I had to pick a trick, I think it would be 'upgrading' the whitening convolution to 2x2 from 3x3 or so. I think that saved around a full second and a half alone, when combined with the 'padding=0' change at the start. Most of the in-practice things here are pretty simple, but what's happening here is that we're projecting from the input image to a whitened feature space. The 3x3 convs are going to result in a 3*3*3 = 27-depth input feature space without any downstriding, and this can be horribly slow, as the spatially large layers are always the slowest compute-wise -- deeper layers without much spatial width or height are by comparison very snappy (correct me if I'm wrong, I think this has to do with the SIMD architecture of GPUs -- in any case, spatial stuff with 2d convolutions at least tends to be hilariously inefficient).
Not padding cuts off a potentially expensive kernel call (I don't know if it's fused or not...), and reduces the spatial size IIRC from 32x32->30x30. This is actually a deceptively large (roughly ~12%) savings in spatial pixel count, but not everything is lost, as that lovely 2x2 convolution is still going to touch everything (I could theorize about the efficiency of processing the edges of pictures, but I could also be horribly wrong, so I'm going to keep my mouth shut here). So in any case, summing it up: we move from a 3*3*3 = 27-dimensional input feature space to a new 2*2*3 = 12-dimensional input feature space, remove ~12% of the pixels without necessarily deleting that information, and most importantly we only have to run with 2*2/(3*3) = 4/9 ≈ 44% of the input kernel cost.
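(Sanity-checking the arithmetic above in code -- this is just the FLOP-count ratio, not a claim about real kernel timings, and the 32x32->30x30 figure mirrors the comment's own numbers.)

```python
# Input-feature dimensionality per output pixel (RGB input, depth 3):
in_dims_3x3 = 3 * 3 * 3      # 27-dim with a 3x3 whitening conv
in_dims_2x2 = 2 * 2 * 3      # 12-dim with a 2x2 whitening conv

# Spatial pixel count: padding='same' keeps 32x32; dropping padding
# shrinks the map to 30x30 (per the comment's figures).
pixels_padded = 32 * 32
pixels_valid = 30 * 30
pixel_savings = 1 - pixels_valid / pixels_padded   # roughly 12%

# Per-pixel kernel cost ratio of 2x2 vs 3x3:
kernel_cost_ratio = (2 * 2) / (3 * 3)              # 4/9, roughly 44%
```

So the "simple" kernel-size change compounds: fewer input dims, fewer pixels, and less than half the per-pixel kernel cost.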
And that's why I'm proud of that little trick. It's very unassuming, since it's just:
Conv(input_depth, output_depth, kernel_size=3, padding='same') -> Conv(input_depth, output_depth, kernel_size=2, padding=0)
Now of course there's a bit of a hit to accuracy, but the name of the game here is leverage, and that's what the squeeze-and-excite layers are for. They're very fast but add a huge amount of accuracy, though (and I unfortunately don't think I've noted this anywhere else) for some reason they are very sensitive to the compression dimension -- 16 here in this case.
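(For reference, a minimal numpy sketch of what a squeeze-and-excite layer does, with the compression dim of 16 mentioned above -- shapes and weights here are illustrative, not taken from the repo.)

```python
import numpy as np

rng = np.random.default_rng(0)

def squeeze_excite(x, w1, w2):
    """Squeeze-and-excite: global-average-pool each channel, pass the
    pooled vector through a bottleneck MLP, then rescale the channels.
    x: (C, H, W); w1: (compress, C); w2: (C, compress)."""
    squeezed = x.mean(axis=(1, 2))                # (C,) global average pool
    hidden = np.maximum(w1 @ squeezed, 0.0)       # ReLU down to the compression dim
    gates = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))  # sigmoid back up to C gates in (0, 1)
    return x * gates[:, None, None]               # cheap per-channel rescale

C, compress = 64, 16                              # compression dim 16, as in the comment
x = rng.normal(size=(C, 8, 8))
w1 = rng.normal(size=(compress, C)) * 0.1
w2 = rng.normal(size=(C, compress)) * 0.1
out = squeeze_excite(x, w1, w2)
```

The whole thing is two tiny matmuls plus a pooled mean, which is why it adds accuracy at almost no wall-clock cost.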
Though to be frank, I started with squeeze-and-excite and got my accuracy increase, then pulled this off the shelf to 'cash in' the speed increase. I have been sitting on this one since before even the last release, I've found it's good to not be like (noise warning) https://www.youtube.com/watch?v=ndVhgq1yHdA on projects like these. Taking time to be good and slow is good!
I hope that helps answer your question, I know this was a really long answer, paradoxically I get far more verbose the more tired I get and poor next-day-me has to deal with it, lol.
Again, to get below 2 seconds, we're going to have to get progressively more fancy and "flashy", but for now it's: build a really, really freaking solid core of a network, then get into the more 'exotic' stuff. And even then, hopefully the more mundane exotic stuff while we're at it.
Hope that helps answer your question, feel free to let me know if you have any others (or follow-ups, or if this wasn't what you were looking for, etc)! :D
malayboar t1_j6hgv7t wrote
Reply to comment by AlphaBookGuess in [D] CVPR Reviews are out by banmeyoucoward
Wow...