Recent comments in /f/MachineLearning

MustachedSpud t1_j8t25bb wrote

Not true in any architecture that uses convolution, attention, or recurrence, which covers most modern applications. In all of these cases the activation count grows with how often the weights are reused, as well as with batch size, and those activations dwarf the optimizer state unless you use a tiny batch size.
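
Rough back-of-the-envelope version of what I mean, assuming a single PyTorch conv layer trained with Adam (the layer and batch sizes are made up, purely illustrative):

```python
import torch
import torch.nn as nn

# Toy conv layer: 64 filters of 3x3x64, applied to 128x128 feature maps.
conv = nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3, padding=1)
x = torch.randn(8, 64, 128, 128)  # batch of 8

weights = sum(p.numel() for p in conv.parameters())  # parameter count
adam_state = 2 * weights                              # Adam keeps m and v per weight
activations = x.numel() + conv(x).numel()             # input + output elements (what autograd would keep for backward)

print(f"weights:     {weights:,}")      # ~37 thousand
print(f"Adam state:  {adam_state:,}")   # ~74 thousand
print(f"activations: {activations:,}")  # ~17 million, and it scales with batch size
```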

That's why checkpointing can be useful. This paper does a solid job covering memory usage: https://scholar.google.com/scholar?q=low+memory+neural+network+training+checkpoint&hl=en&as_sdt=0&as_vis=1&oi=scholart#d=gs_qabs&t=1676575377350&u=%23p%3DOLSwmmdygaoJ
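
And a minimal sketch of checkpointing itself, assuming PyTorch's torch.utils.checkpoint (the use_reentrant flag only exists in recent versions):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

blocks = nn.ModuleList(
    [nn.Sequential(nn.Linear(512, 512), nn.ReLU()) for _ in range(12)]
)

def forward(x, use_checkpointing=True):
    for block in blocks:
        if use_checkpointing:
            # Don't keep this block's intermediate activations; recompute them
            # during the backward pass instead (trades extra compute for memory).
            x = checkpoint(block, x, use_reentrant=False)
        else:
            x = block(x)
    return x

out = forward(torch.randn(64, 512, requires_grad=True))
out.sum().backward()
```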

2

lostmsu t1_j8stb1b wrote

Love the project, but after reading many papers I've realized that the lack of verbosity in formulas is deeply misguided.

Take this picture that explains RWKV attention: https://raw.githubusercontent.com/BlinkDL/RWKV-LM/main/RWKV-formula.png

What are the semantics of i, j, R, u, W, and the function σ? They should be obvious at first glance.
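
I can't annotate the image here, but as a rough paraphrase of what I believe the later RWKV write-up calls the time-mixing step, this is the kind of legend I'd want next to it (my notation, which may not match the picture exactly):

```latex
% wkv_t: decay-weighted average of past values, plus a bonus for the current token
\mathrm{wkv}_t =
  \frac{\sum_{i=1}^{t-1} e^{-(t-1-i)\,w + k_i}\, v_i \;+\; e^{u + k_t}\, v_t}
       {\sum_{i=1}^{t-1} e^{-(t-1-i)\,w + k_i} \;+\; e^{u + k_t}},
\qquad
o_t = W_o\,\bigl(\sigma(r_t) \odot \mathrm{wkv}_t\bigr)
```

Here i indexes past timesteps, k_i / v_i / r_t are key, value, and receptance projections of the input, w and u are learned per-channel decay and current-token bonus vectors, W_o is the output projection, and σ is the sigmoid gate. That one sentence is all the verbosity I'm asking for.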

1

dojoteef t1_j8sqm4i wrote

I commend what Huggingface is trying to do (be the source for the latest models that is consistent and easy to use), but every time I've used the library I've had to tackle bugs that were very time consuming to pinpoint, which is exacerbated by the structure of the code. The worst bugs have been subtle heisenbugs: the code seemed to work most of the time, but failed at other times. The heisenbugs are what made me stop using Huggingface altogether, unless it's my only option.

For example, I ran into a bug that only manifested when downloading a specific pretrained model for a task, which in turn downloaded a config file that contained the bug. As a user it was super difficult to know where the source of the bug was without extensive spelunking. I've had many similarly difficult-to-diagnose issues each time I've used the Huggingface ecosystem.

I understand that what you're tasked with as a company is a huge undertaking for such a small team. Maybe splitting the package into a "stable" package and a "nightly" package could help (with stable being extensively bug tested more like an Ubuntu LTS release). My guess is that your team is likely too small to support that approach while adding new features at the same speed.

14

drinkingsomuchcoffee OP t1_j8sm1y7 wrote

Thank you for replying. I apologize for the harsh tone; I was hoping to phrase it as a wake-up call that people are reading the code and do care about quality.

Do continue to avoid inheritance. In fact, probably ban inheritance unless it's only one layer deep and inherits from an abstract base class.
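
Something like this is the only shape I'd allow (names made up for illustration):

```python
from abc import ABC, abstractmethod

class Tokenizer(ABC):
    """Abstract base class: defines the interface and nothing else."""

    @abstractmethod
    def encode(self, text: str) -> list[int]:
        ...

class WhitespaceTokenizer(Tokenizer):
    """Concrete implementation, exactly one level below the ABC."""

    def __init__(self, vocab: dict[str, int]):
        self.vocab = vocab

    def encode(self, text: str) -> list[int]:
        return [self.vocab[token] for token in text.split()]
```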

But don't misunderstand DRY. DRY is not about compressing code as much as possible. That's code golfing. DRY is about having one place for information to live, that's it. If you see a dev creating a poorly named function or abstraction to reduce 5 lines of duplicate code, that's not DRY, that's just bad code.

You can achieve DRY by using code generators as you mention, but splitting things into separate modules is also fine. A code generator is DRY because the generator is the point of truth for the information, even if it creates "duplicate" code. This is what a real understanding of DRY is.

People wanting to "hack" on code do not mind having to copy a few folders. If you have a beautiful module of pure functions for calculating statistics, it is flat-out stupid to copy+paste it into every folder to be more "hackable". Don't do this. Instead, factor these out into simple pure modules.
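
Hypothetical example of what I mean (not anything from the actual transformers layout):

```python
# stats.py -- the one place this information lives
from math import sqrt

def mean(xs: list[float]) -> float:
    return sum(xs) / len(xs)

def stddev(xs: list[float]) -> float:
    m = mean(xs)
    return sqrt(sum((x - m) ** 2 for x in xs) / len(xs))
```

Every model folder that needs it just does `from stats import mean, stddev`. That stays perfectly hackable, and there is still exactly one source of truth.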

6

narsilouu t1_j8sckwb wrote

Hijacking the top answer.
Disclaimer: I work at HF.

First of all, thanks for stating the things that go wrong. This is the only means we have to get better (we work with our own tools, but we cannot possibly use them in all the various ways our community does, so we can't fix every issue, since we're simply not aware of them all).

For all the issues you mention above, have you tried opening issues when you encountered your problems? We're usually keen to answer promptly, and while I cannot promise things will move your way (there are many tradeoffs in our libs), it at least helps inform the relevant people.

Just to give you an overview, there are three things we're trying to achieve:

- Never introduce breaking changes. (Or only very rarely: when something is super new and we realize it's hurting users rather than helping, we feel OK breaking it. If something is really old, we cannot break it, since people rely on it even if it's somewhat buggy.)
- Add SOTA models as fast as possible (and with as many options as possible). That requires help from the community, but also reusing tools that already exist, which sometimes requires creativity on our end to bring widely different codebases into a somewhat consistent form. Most research codebases don't try to support widely different architectures (there's only a handful), so many things are hardcoded and have to be changed, and some bugs in the original code have to be copied into ours to stay consistent (like position_ids starting at 2 for RoBERTa: https://github.com/huggingface/transformers/issues/10736).

- And have a very hackable codebase. Contrary to most "beautiful" code where DRY is the dogma, transformers tries to be hackable instead. This comes from our origin of research-heavy users, who don't want to spend two hours understanding class inheritance and hunting for the code that does X to the input tensor just to create a new layer. That means transformers contains a lot of duplicated code (we even have an internal cookie-cutter tool to maintain the copies as easily as possible).

The consequence is that if you have a clever idea X to improve upon, let's say, Whisper, you should be able to copy-paste the whisper folder and get going. While it might seem odd to some, it is still a design choice, and it comes with pros and cons like any design choice.

And just to set things straight: we don't try to shovel our hub into our tools. We have a lot of testing to make sure local models work all the time, and we actually rely on that in several internal projects.
Breaking changes are a very big concern of ours. Subtle breaking changes are most likely unintentional (please report them!).

As for reinventing things that exist in other libraries, do you have examples in mind? We're very careful about the use of our time, and also about the number of dependencies we rely on. Adding a dependency for an is_pair function is not something we like to do. If the dependency is too large for what we need, we don't take it. If we can't have the functionality in a reasonable time, then it's mostly going to be an optional dependency.

Thanks for reading this to the end.
And for all readers, please rest assured we are continuously trying to have the best code given our three constraints above. Please report any issue or pain point, no matter how trivial; it does help us improve. Our open-source and free code may not be the best (we're aware of some warts), but please, please, never doubt that we're trying to do our best. And do not hesitate to contribute to make it better if you feel you know better than us (you could definitely be right!)

71

eigenham t1_j8sbr6j wrote

It's an arms race. Do Russia and the US really know that each other's nukes work? Or would they rather just not find out? It's like that... will these patents hold up in court? Well, if I have enough of them and it's my 20k patents vs your 15k patents, we're probably going to settle based on the likelihood that enough of them will hold up to give us proportional mutually assured financial destruction.

The crappy part here is that a patent is essentially useless and inaccessible for the individual inventor. A bigger entity will dwarf them. In the end it comes down to money, and it's just a matter of understanding the nature of the investment, which is not as concretely defined as the general public might think.

1

MustachedSpud t1_j8sacz8 wrote

They might be thinking in a different direction than me, but in most cases the majority of memory use during training is not from the model weights or optimizer state. It comes from storing all the activations of the training batch. If you think about a CNN, each filter gets used across the whole image, so you will have many more activations than filters. So optimizer memory savings have very limited benefits.

3

farmingvillein t1_j8s7ygo wrote

This...is pretty astounding. Just have the grace to admit you were wrong, and move on.

> Telling someone to read the Related Works section of every one of a dozen papers in the Related Works section of a paper is a ridiculous thing to suggest

Then how can you possibly say:

> I don't think the Related Works section of that paper provides any useful references.

?

This is hardcore trolling. You can, and frequently do, do better than this.

You are literally pushing posts that are factually incorrect, and that you either know are factually incorrect, or are too lazy to validate either way.

This is the type of thing which blows up post quality in this sub.

> Giving someone a random reference and telling them to manually crawl the literature is not helpful.

This...is ridiculous. This is--traditionally--a very academic-friendly sub. This is how research works. "Here is where you can start a literature review on a bundle of related papers" is an extremely classic response which is generally considered helpful to complex and nuanced questions.

And the underlying issue is actually very complex, as evidenced in part by the fact that your references do not actually answer the question. "Go read related works" can be obnoxious when there are one or two papers that do answer the question--but that is not the case here.

> In contrast, the two references I provided directly bore on the question

No they did not. They did not touch at all upon Transformers versus RNNs, which was the question. You've chosen to cherry-pick one slice of the problem and declare victory.

> It's not a strawman.

You don't seem to understand what a strawman is. Strawman:

> an intentionally misrepresented proposition that is set up because it is easier to defeat than an opponent's real argument.

I was not making this argument. You were making this argument. QED, this is a strawman.

2

gwern t1_j8s2du5 wrote

> There is no possible way that you actually read the Related Works section you dismissed, given that the papers you cited are already covered in the same references you dismissed.

Telling someone to read the Related Works section of every one of a dozen papers in the Related Works section of a paper is a ridiculous thing to suggest, and no, I did not recurse down n deep in a breadth-first search. I read the Related Works of that paper, as I said ("I don't think the Related Works section of that paper"), noted that they were a bunch of memory-related papers which might or might not cite the actually relevant research I had in mind, but life was too short to queue up a dozen papers just to check their RW when I already knew some useful ones. Giving someone a random reference and telling them to manually crawl the literature is not helpful. In contrast, the two references I provided directly bore on the question, they didn't maybe cite papers which might bury something relevant in a footnote or cite papers which might someday answer the question...

> I never said this, so I'm not sure what your argument is.

I was pointing out why it was irrelevant to bring up a paper which "compares w/ and w/o memory." Mildly interesting, but such a comparison cannot show what was actually asked about: the effective memory of RNNs. Of course it is better to have (any) memory than not.

> which, among other things, is why I reference Dai et al, who (among others!) do a fairly extensive breakdown of empirical performance differences of RNNs- versus transformer-type architectures against long text sequences.

Dai would in fact have been useful, had you referenced it in your original comment. Unless you mean 'vaguely gestured in the direction of a paper which has 50+ references, with 35 in the RW section alone, any of which could have been relevant, where the relevant benchmarking of Dai was not highlighted in the paper to begin with, nor is the relevant context work mentioned in the abstract of Dai but buried at the end of the paper (with the RNN results hidden inside a table), so you just have to know it's already there' - and call that 'referencing it'. Then sure, yeah, that was a useful reference. Thanks for the input.

> If your claim is that the papers indicated that RNNs have a small window (sure) and that Transformers have a longer one, you're arguing (as you seem to be in your entire post) again against a strawman.

It's not a strawman. It's not obvious a priori that Transformers would work so much better or that RNN histories fade out so fast, which is why it had to be empirically established that the history fades out completely, as opposed to any of the other reasons that RNNs could underperform (maybe they have history but can't learn a good algorithm exploiting their memory, say, or they could but they are poorly optimized - there are so many ways for NNs to break) and people were surprised by how well Transformers work. It is completely understandable that OP would expect RNN history to work better than it does, and would want some hard citeable evidence that it works so badly that Transformers, with their apparently brutal hard cutoff, wind up having much closer to 'infinite context' than RNNs themselves.

Thus, it's useful to provide references showing that. (Not references to unspecified references which may or may not show that - gl.)
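
If anyone wants to see the fade-out directly rather than take a citation's word for it, here's a toy probe; it's just an untrained vanilla torch RNN, so it illustrates the vanishing-influence effect and is not a benchmark of anything:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=32, hidden_size=32, batch_first=True)
x = torch.randn(1, 200, 32, requires_grad=True)  # one sequence of 200 steps

out, _ = rnn(x)
out[0, -1].sum().backward()  # gradient of the final hidden state w.r.t. every input step

grad_norms = x.grad[0].norm(dim=-1)
print(grad_norms[0].item(), grad_norms[100].item(), grad_norms[-1].item())
# Early positions typically come out orders of magnitude smaller (or underflow to 0):
# the history is there formally, but its influence on the final state decays fast.
```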

1

redflexer t1_j8ryw02 wrote

Actually, I find this notion harmful. I consider senior PhD students able to assess whether an idea in their field is novel, feasible, and in the right scope given fixed resources. I would never expect that from Master's students. That does not, of course, mean that students can't have great ideas, but it's not mandatory for a degree.

1