Recent comments in /f/MachineLearning
adt t1_j8v1vlp wrote
Reply to [D] Compare open source LLMs by President_Xi_
For models, see my up-to-date list:
For performance, Papers with Code keeps good benchmarks:
fasttosmile t1_j8v03xu wrote
Reply to comment by drinkingsomuchcoffee in [D] HuggingFace considered harmful to the community. /rant by drinkingsomuchcoffee
> I don't know what hackable means. You haven't defined it. I'm going to use the most generous interpretation: you can modify it without impacting other places. Well, you can do that even if it's centralized; just copy-paste it into your file and edit it. That's no excuse to completely ban centralization! Alternatively, decompose the centralized function further and only use the pieces you need.
Your definition of hackable is close. What's missing is that being decentralized makes things much, much easier to understand, because the code is very straightforward and doesn't have to take 10 different things into account.
You can't just copy-paste a file if it's centralized; you'll have to copy-paste multiple files, and the main issue is that it's going to take a while to figure out which ones (and you'll have to modify the imports etc., unless you copy the entire repo! Are you seriously suggesting that? lmao) and what's safe to modify inside them. Decomposing is just going to make things more complicated for no gain.
Deep learning is about the details. Whenever you break things apart and scatter the details into different corners, you end up with code that is hard to understand, and with people making mistakes because they don't see what's going on.
> Maybe it should cause 100s of failures if it's a breaking change (a bug). That's a pretty good sign you really did screw something up.
It's a syntax/interface/some-other-non-fundamental bug. A real bug would have already been spotted when checking the test-set performance.
> No, it's not. If new code uses a battle-tested core, I don't have to review those parts as thoroughly. If it's copy-pasted, I still have to review it and make sure they didn't copy an old version with bugs, or slightly modify it and break something. Sounds like this is common, as many people have complained about dozens of bugs!
The way code is shown to be correct is by getting SOTA results. If it does that, it is "battle tested". If it didn't, no one would even think of merging it in the first place.
> Yep, you've identified a place where you shouldn't try to fit every idea under a single "Attention" class. That's just common sense programming, not an argument against writing good shared functions or classes.
It is an argument against having shared classes. At the same time, sure, you can have some shared code; Huggingface does that.
> It can sometimes. But not always. Having one massive file named main.py is not more readable than a well-split program. This seems like basic common sense to me, but here's an actual paper on the subject:
There is an important distinction that you're ignoring here. Having semantically separate objects in one file is indeed confusing. But if you put everything related to the model in one file, that simplifies things and reduces the working memory people need to read your code.
> Then why does the Bert module have changes as recent as this week, from dozens of authors going back years?
The recent change for Bert is some inference interface code which has to be kept common across all models. That's their decision; I wouldn't even do that, just make kwargs mandatory imo (toy sketch below).
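To illustrate the kwargs point, here's a toy sketch (hypothetical model, not Huggingface's actual code): with keyword-only arguments, interface code can grow without silently breaking callers.

```python
import torch


class ToyModel(torch.nn.Module):
    """Hypothetical model, only here to illustrate mandatory kwargs."""

    def forward(self, *, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # The bare `*` makes every argument keyword-only, so shared interface
        # code can add or reorder parameters without breaking positional calls.
        return input_ids * attention_mask


model = ToyModel()
out = model(input_ids=torch.ones(2, 4), attention_mask=torch.ones(2, 4))
# model(torch.ones(2, 4), torch.ones(2, 4))  # would raise TypeError
```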
> Maybe you should check your assumptions before you make a fundamental decision (you know, basic engineering). There are plenty of forked libraries that were never modified and were forked purely for archival purposes. Nor should you cater to a small minority if most people aren't doing this.
Everyone in deep learning likes to gamble on making some tweaks to the model hoping they’ll get the next ICLR oral. Why else would they care about modifying the model code?
--
I suggest you go read some modeling code from different frameworks; one example is fairseq. I like fairseq, and I think it's well done considering its aims and constraints. But you're crazy if you think it's easier to understand and modify the code for some specific model there than in Huggingface. Here's the link to fairseq's roberta: you'll need to look at a dozen files to see what's happening. In contrast, Huggingface is one file.
Spent too much time on this already, not gonna reply anymore.
pc_4_life t1_j8uzqxo wrote
Huggingface is an incredible library that makes NLP tasks trivial. If you can't get the same code to work on multiple machines, that's on you. Learn how to use Docker and containerize your code.
drinkingsomuchcoffee OP t1_j8uvzdr wrote
Reply to comment by dahdarknite in [D] HuggingFace considered harmful to the community. /rant by drinkingsomuchcoffee
This is such a terrible attitude to have. This isn't about money at all.
You don't pay for many services. Does this mean they should be able to treat you like garbage? Should Google be able to lock you out of all your services because their automated system falsely accused you? By your logic, you don't pay so you have no right to be annoyed.
HuggingFace is a for profit company. They will be asking for your money now or in the future. This isn't a bad thing, they need to eat too.
By even existing, HuggingFace has disincentivized possibly more competent devs from creating their own framework. That's fine, but it's a very real effect. In fact, it's pretty common for a business to corner a market at a loss and then ratchet up prices.
Finally you may work for a company that chooses HuggingFace and you will be forced to use the library whether you want to or not.
dahdarknite t1_j8usx4b wrote
It’s literally software that you don’t pay a dime for. Ok there’s bugs, but guess what? It’s fully open source so you can fix them.
As someone who maintains an open source project in my spare time, there’s nothing that irks me more than entitled users.
borisfin t1_j8uqpud wrote
Reply to [R] [P] OpenAssistant is a fully open-source chat-based assistant that understands tasks, can interact with third-party systems, and retrieve information dynamically to do so. by radi-cho
Excited for the future of dynamic intelligent systems: ones that can influence, retrieve, and alter the state of the web using the same tools we do. What a world we are living in; soon most operations done over the web will be AI-based.
borisfin t1_j8upuj9 wrote
The huggingface devs will clean up their libraries over time. It's not fair to denounce the value and convenience they provide for new users. What other comparable options even exist?
borisfin t1_j8upatv wrote
Reply to [D] Compare open source LLMs by President_Xi_
There are some interesting comparisons in the Flan-T5 paper; check out "Scaling Instruction-Finetuned Language Models". Hope this helps.
drinkingsomuchcoffee OP t1_j8unlkp wrote
Reply to comment by fasttosmile in [D] HuggingFace considered harmful to the community. /rant by drinkingsomuchcoffee
Alright, I have a bit of time so I'll address a few things.
>You need to understand that there is an unavoidable trade-off between centralizing [...] and keeping it hackable.
I don't know what hackable means. You haven't defined it. I'm going to use the most generous interpretation: you can modify it without impacting other places. Well, you can do that even if it's centralized; just copy-paste it into your file and edit it. That's no excuse to completely ban centralization! Alternatively, decompose the centralized function further and only use the pieces you need. A minimal sketch of what I mean is below.
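Here's that workflow sketched with entirely hypothetical names: keep one centralized helper, and anyone who needs to hack on it copies it locally and edits the copy.

```python
import torch

# shared/attention.py (hypothetical): the centralized, battle-tested helper.
def scaled_dot_product(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

# my_experiment.py (hypothetical): a local copy, edited freely without touching
# the shared module or any of the other models that import it.
def scaled_dot_product_tweaked(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                               temperature: float = 2.0) -> torch.Tensor:
    scores = q @ k.transpose(-2, -1) / (temperature * q.size(-1) ** 0.5)
    return torch.softmax(scores, dim=-1) @ v
```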
Now onto the blog post.
>If a bug is found in one of the model files, we want to make it as easy as possible for the finder to fix it. There is little that is more demotivating than fixing a bug only to see that it caused 100 failures of other models.
Maybe it should cause 100s of failures if it's a breaking change (a bug). That's a pretty good sign you really did screw something up.
>Similarly, it's easier to add new modeling code and review the corresponding PR if only a single new model file is added.
No, it's not. If new code uses a battle-tested core, I don't have to review those parts as thoroughly. If it's copy-pasted, I still have to review it and make sure they didn't copy an old version with bugs, or slightly modify it and break something. Sounds like this is common, as many people have complained about dozens of bugs!
>We assume that a significant amount of users of the Transformers library not only read the documentation, but also look into the actual modeling code and potentially modify it. This hypothesis is backed by the Transformers library being forked over 10,000 times and the Transformers paper being cited over a thousand times.
Maybe you should check your assumptions before you make a fundamental decision (you know, basic engineering). There are plenty of forked libraries that were never modified and were forked purely for archival purposes. Nor should you cater to a small minority if most people _aren't_ doing this.
> Providing all the necessary logical components in order in a single modeling file helps a lot to achieve improved readability and adaptability.
It can _sometimes_. But not always. Having one massive file named `main.py` is not more readable than a well-split program. This seems like basic common sense to me, but here's an actual paper on the subject: http://www.catb.org/esr/writings/taoup/html/ch04s01.html
>Every time we would have to have asked ourselves whether the "standard" attention function should be adapted or whether it would have been better to add a new attention function to attention.py. But then how do we name it? attention_with_positional_embd, reformer_attention, deberta_attention?
Yep, you've identified a place where you shouldn't try to fit every idea under a single "Attention" class. That's just common sense programming, not an argument against writing good shared functions or classes.
>Once a machine learning model is published, it is rarely adapted or changed afterward.
Then why does the Bert module have changes as recent as this week, from dozens of authors going back years?
https://github.com/huggingface/transformers/tree/main/src/transformers/models/bert
This is irrefutable hard evidence against your argument.
> Sylvain Gugger found a great mechanism that respects both the single file policy and keeps maintainability cost in bounds. This mechanism, loosely called "the copying mechanism", allows us to mark logical components, such as an attention layer function, with a # Copied from <predecessor_model>.<function> statement
OK, so the programmer you mentioned before is going to "break 100s of tests" when she changes this ad-hoc C-preprocessor knock-off. You're still doing "DRY"; you're just doing it the way C programmers did it 30 years ago, in a much more complicated manner. (Sketch of the pattern below.)
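For readers who haven't seen the mechanism, it looks roughly like this (a simplified sketch based on the blog post's description; the real classes take a config object, and a repo script diffs each marked block against its source):

```python
import torch.nn as nn

# Copied from transformers.models.bert.modeling_bert.BertIntermediate with Bert->Roberta
class RobertaIntermediate(nn.Module):
    # The comment above is the machine-checked marker: a CI script compares
    # this block to the referenced Bert code and fails if they drift apart.
    def __init__(self, hidden_size: int = 768, intermediate_size: int = 3072):
        super().__init__()
        self.dense = nn.Linear(hidden_size, intermediate_size)
        self.act = nn.GELU()

    def forward(self, hidden_states):
        return self.act(self.dense(hidden_states))
```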
If anyone here works at HuggingFace, please forward this to the author of that article.
syb3ria t1_j8u9p09 wrote
Reply to [R] RWKV-4 14B release (and ChatRWKV) - a surprisingly strong RNN Language Model by bo_peng
Thanks for sharing your work, OP. How does it compare to Bloom?
Tawa-online t1_j8u8wtz wrote
So apart from Hugging Face, what other alternatives would you suggest using?
Seankala t1_j8u8ogp wrote
I hear my colleagues complain about the same thing. And then go back to doing `AutoModel.from_pretrained(sdfsdf)`.
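Which, to be fair, is about all it takes (the checkpoint name below is just an example):

```python
from transformers import AutoModel, AutoTokenizer

# The one-liner convenience everyone keeps coming back to.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("can't quit you", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, sequence_length, 768)
```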
HumanSpinach2 t1_j8u513j wrote
Reply to comment by bernhard-lehner in [D] Lion , An Optimizer That Outperforms Adam - Symbolic Discovery of Optimization Algorithms by ExponentialCookie
That one is already in use :P
drinkingsomuchcoffee OP t1_j8u09ez wrote
Reply to comment by [deleted] in [D] HuggingFace considered harmful to the community. /rant by drinkingsomuchcoffee
Not an argument.
[deleted] t1_j8u05yg wrote
Reply to comment by drinkingsomuchcoffee in [D] HuggingFace considered harmful to the community. /rant by drinkingsomuchcoffee
[removed]
drinkingsomuchcoffee OP t1_j8tw3yt wrote
Reply to comment by fasttosmile in [D] HuggingFace considered harmful to the community. /rant by drinkingsomuchcoffee
There are so many contradictions and fallacies in that blog post that I don't even know where to begin. I think I'll let empirical evidence do the talking for me, i.e. the many people agreeing with my post.
fasttosmile t1_j8tudd4 wrote
Reply to comment by drinkingsomuchcoffee in [D] HuggingFace considered harmful to the community. /rant by drinkingsomuchcoffee
You don't need to explain what DRY is. You need to understand that there is an unavoidable trade-off between centralizing a codebase (creating shared functions/classes in modules that many other modules import from) and keeping it hackable.
martianunlimited t1_j8tm2qd wrote
This is an ELI5 explanation of why we start from noise and conditionally denoise it with the text encoder: look at the clouds, and I tell you that I see an elephant in them. It is easier to imagine the elephant in the clouds than if I tell you to imagine an elephant on a blank piece of white paper.
(The less-ELI5 explanation is that the entropy going from noise to an image is lower than going from a uniform image.) If you want to see that for yourself, with a bit of programming knowledge you can write your own diffuser pipeline to skip the noise-adding stage and try img2img from a blank image; it's literally just ~3 lines of edits (rough sketch below).
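A rough sketch of that experiment using the stock diffusers img2img pipeline (model id and strength are just example choices; the actual ~3-line edit described above would go inside the pipeline source itself):

```python
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

blank = Image.new("RGB", (512, 512), "white")  # the "blank piece of paper"

# Low strength adds almost no noise to the init image, so the model has to find
# the elephant in a white page rather than in clouds; expect a washed-out,
# low-detail result compared to starting from pure noise.
image = pipe(prompt="an elephant", image=blank, strength=0.3).images[0]
image.save("elephant_from_blank.png")
```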
(Side note: someone brought up a similar question, but in a different vein, namely removing the random seed.)
MustachedSpud t1_j8t65fh wrote
Reply to comment by ChuckSeven in [D] Lion , An Optimizer That Outperforms Adam - Symbolic Discovery of Optimization Algorithms by ExponentialCookie
Yeah, it's very configuration-dependent, but larger batch sizes usually learn faster, so there's a tendency to lean into that.
ChuckSeven t1_j8t5r5m wrote
Reply to comment by MustachedSpud in [D] Lion , An Optimizer That Outperforms Adam - Symbolic Discovery of Optimization Algorithms by ExponentialCookie
Yeah, it depends. Even just the batch size makes a difference. But for really big models, I'd assume the number of weights far outweighs the number of activations (rough numbers below).
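Back-of-envelope toy numbers (illustrative formulas only; this ignores embeddings, layernorms, and the attention score matrices):

```python
# Parameter count vs. activation count per forward pass for a GPT-3-ish shape.
hidden, layers = 12288, 96   # illustrative model shape
batch, seq = 1, 2048

params = layers * 12 * hidden**2                  # ~4 attention + 2 MLP (4x wide) matrices
activations = layers * batch * seq * hidden * 10  # ~10 saved tensors per layer, ballpark

print(f"weights:     {params / 1e9:.0f}B values")       # ~174B
print(f"activations: {activations / 1e9:.0f}B values")  # ~24B at batch size 1
# Weights dominate at small batch; grow batch*seq ~8x and activations catch up,
# which is the batch-size dependence mentioned above.
```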
Franck_Dernoncourt t1_j8v206a wrote
Reply to [D] Compare open source LLMs by President_Xi_
For summarization: Tianyi Zhang, Faisal Ladhak, Esin Durmus, Percy Liang, Kathleen McKeown, Tatsunori B. Hashimoto. Benchmarking Large Language Models for News Summarization. arXiv:2301.13848.