Recent comments in /f/MachineLearning
curiousshortguy t1_j61silr wrote
Reply to comment by currentscurrents in [R] Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers by currentscurrents
This is cool, thanks for sharing
wind_dude t1_j61rnjt wrote
huggingface
flyer2403 t1_j61nnvc wrote
Check out Dagshub!
ReginaldIII t1_j61nlno wrote
Reply to comment by fernandocamargoti in [P] EvoTorch 0.4.0 dropped with GPU-accelerated implementations of CMA-ES, MAP-Elites and NSGA-II. by NaturalGradient
Trying to force these things into a pure hierarchy sounds nothing short of an exercise in pedantry.
And to what end? You make up your own distinctions that no one else agrees with and you lose your ability to communicate ideas to people because you're talking a different language to them.
If you're so caught up on the "is a" part, have you studied any programming languages that support multiple inheritance?
currentscurrents OP t1_j61ndkl wrote
Reply to comment by lucidraisin in [R] Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers by currentscurrents
Thanks for the link!
I think it's interesting that they spent so much time in the 90s trying to make meta-learning work, and now it appears emergently just from throwing scale at the problem.
[deleted] t1_j61n377 wrote
Reply to comment by fernandocamargoti in [P] EvoTorch 0.4.0 dropped with GPU-accelerated implementations of CMA-ES, MAP-Elites and NSGA-II. by NaturalGradient
[deleted]
royalemate357 t1_j61k9vy wrote
Reply to comment by Individual-Cause-616 in [D] score based vs. Diffusion models by Individual-Cause-616
The speed and quality of score-based/diffusion models depend on which sampler you use. If you're using Euler's method to solve the ODE, for example, that might be slower than some of the newer methods developed for diffusion models, like Tero Karras's ODE solvers. AFAIK there isn't consensus on the best sampler to use, though.
I don't think it affects training convergence much, though, since it's more or less the same objective.
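To make the sampler point above concrete, here is a minimal sketch of plain Euler integration of a probability-flow ODE (not one of the Karras solvers) with a toy score function; the linear beta schedule, step count, and the Gaussian toy score are illustrative assumptions, not anyone's actual implementation:

```python
import numpy as np

def euler_ode_sampler(score_fn, x_init, t_start=1.0, t_end=1e-3, n_steps=100):
    """Euler steps on the VP probability-flow ODE
    dx = -0.5 * beta(t) * (x + score) dt, integrated backward in time.
    score_fn(x, t) approximates grad_x log p_t(x)."""
    x = x_init
    ts = np.linspace(t_start, t_end, n_steps)
    dt = ts[1] - ts[0]  # negative: we integrate from t=1 down toward t=0
    for t in ts:
        beta = 0.1 + t * (20.0 - 0.1)  # assumed linear beta schedule
        drift = -0.5 * beta * (x + score_fn(x, t))
        x = x + drift * dt
    return x

# toy score for a standard Gaussian target: score(x) = -x
samples = euler_ode_sampler(lambda x, t: -x, np.random.randn(1000))
```

Fancier samplers (Heun, Karras-style step-size schedules) mostly change how this integration loop takes its steps, which is why sampling speed and quality vary while the trained score model stays the same.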
lucidraisin t1_j61h7lf wrote
Reply to comment by currentscurrents in [R] Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers by currentscurrents
And one more paper along the same lines: https://arxiv.org/abs/2212.07677
[deleted] t1_j61h1lt wrote
Reply to comment by currentscurrents in [R] Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers by currentscurrents
[deleted]
bo_peng OP t1_j61fdtp wrote
Reply to comment by Gody_Godee in [P] RWKV 14B Language Model & ChatRWKV : pure RNN (attention-free), scalable and parallelizable like Transformers by bo_peng
No. It's highly competitive.
youngintegrator t1_j61dfqk wrote
Reply to [D] Self-Supervised Contrastive Approaches that don’t use large batch size. by shingekichan1996
Is there any reason you'd like a contrastive algorithm? (intra-class discrimination?)
Barlow Twins has been shown to work quite well with smaller batches (32), and HSIC-SSL is a nice variant on this style of learning if you only care about clusters. I'm sure SimSiam is fine too (avoid BYOL for small batches).
In terms of contrastive approaches, methods that avoid the "coupling" of negative terms described in DCL will work with smaller batch sizes (contrastive estimates converge to the MLE assuming many noise samples). This is seen in the spectral algorithm or in align-uniform. These work because they avoid comparing representations of the same augmented sample. SwAV also does this via contrastive prototypes, which are basically free variables whose gradients don't conflict with the alignment goal. I think it's fair to say that algorithms with LSE (log-sum-exp) transforms are less stable at small batch sizes, since the gradients will be biased toward randomly coupled terms. With sufficiently many terms this coupling matters less.
From what I've noticed, methods that avoid comparing the augmented views of the same base sample require slightly more tuning to get things just right (align + weight * diversity).
Notes: NNCLR is nicer than MoCo, IMO. VICReg is good but a mess to fine-tune. I'm assuming you're using a CNN, so I've omitted transformer- and masking-based algorithms.
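To illustrate why Barlow Twins tolerates small batches, here is a hedged NumPy sketch of its objective; the normalization details and the lambda value are assumptions following the paper's general recipe, not a definitive implementation. The key point: the loss lives on the DxD feature cross-correlation matrix, not an NxN matrix of sample-to-sample similarities, so it doesn't need many negatives per batch.

```python
import numpy as np

def barlow_twins_loss(z1, z2, lam=5e-3):
    """Barlow Twins objective over two augmented views' embeddings (N x D).
    Pushes the cross-correlation matrix toward the identity:
    matched dims correlate, all other pairs decorrelate."""
    n, d = z1.shape
    # batch-normalize each view along the batch dimension
    z1 = (z1 - z1.mean(0)) / (z1.std(0) + 1e-8)
    z2 = (z2 - z2.mean(0)) / (z2.std(0) + 1e-8)
    c = z1.T @ z2 / n                                     # D x D cross-correlation
    on_diag = ((np.diag(c) - 1.0) ** 2).sum()             # invariance term
    off_diag = (c ** 2).sum() - (np.diag(c) ** 2).sum()   # redundancy-reduction term
    return on_diag + lam * off_diag

# identical views: matched dims are perfectly correlated, invariance term ~ 0
z = np.random.randn(32, 8)
loss_same = barlow_twins_loss(z, z)
```

Note the batch size (32 here) only affects how noisily the correlation matrix is estimated, which matches the observation above that it degrades gracefully at small batch sizes.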
Zealousideal_Low1287 t1_j6191sq wrote
Reply to comment by HateRedditCantQuitit in [D] Why are GANs worse than (Latent) Diffusion Models for text2img generation? by TheCockatoo
I guess for it to really count as a variational autoencoder you need to be reconstructing the input
curiousshortguy t1_j617zzd wrote
The keyword you want, analogous to DevOps where GitHub plays the role of code storage, is MLOps. Within that, look for data and model management and versioning. Quite a number of companies offer various aspects of this; see for example this random infographic: https://adataanalyst.com/wp-content/uploads/2021/05/Infra-Tooling3.png
pythonpeasant t1_j614dq6 wrote
Reply to [P] EvoTorch 0.4.0 dropped with GPU-accelerated implementations of CMA-ES, MAP-Elites and NSGA-II. by NaturalGradient
THIS IS HUGE!!!!
Please go back to the AttentionNeuron and AttentionAgent papers and retrain them on GPU with big population sizes!
_poisonedrationality t1_j60zrrk wrote
Why is this downvoted? Seems like a decent question.
Expensive-Track t1_j60z9ho wrote
Reply to comment by LetWrong1932 in [D] CVPR Reviews are out by banmeyoucoward
Same 🥲🥲
LetWrong1932 t1_j60z742 wrote
Reply to comment by Expensive-Track in [D] CVPR Reviews are out by banmeyoucoward
Wish they would, or else I'll have to spend the most nervous month of my life lol
Delicious-View-8688 t1_j60ykt5 wrote
Reply to comment by ikkeweer in [Discussion] Github like alternative for ML? by angkhandelwal749
Yes, it's true. I haven't tried using mamba with MLflow, so maybe it integrates, maybe it doesn't. The MLflow docs, as of my reading, indicated it works with conda or Docker only.
Individual-Cause-616 OP t1_j60yi8b wrote
Reply to comment by royalemate357 in [D] score based vs. Diffusion models by Individual-Cause-616
So do you think it makes a difference in practice, i.e. sampling speed and quality, convergence, etc.?
Expensive-Track t1_j60yf7a wrote
Reply to comment by LetWrong1932 in [D] CVPR Reviews are out by banmeyoucoward
Not sure myself but I doubt they'll make anything available before the final decision
ikkeweer t1_j60ych9 wrote
Reply to comment by Delicious-View-8688 in [Discussion] Github like alternative for ML? by angkhandelwal749
Try mamba if you struggle with conda being slow; it's a drop-in replacement.
ObjectManagerManager t1_j60y1rn wrote
OpenAI's LLM is special because it's open to the public. That's it. Other tech companies' internal LLMs are likely better. Google has a database of billions of websites and indexes directly at its disposal; I'm quite confident it could outperform ChatGPT with ease. If Google were really afraid of ChatGPT running it out of business, it would just release a public API for its own, better model. And it has a monopoly over the internet in terms of raw data and R&D; it would be virtually impossible for anyone else to compete.
Besides that, the whole "Google killer" thing is overreactive, IMO. The public API for ChatGPT doesn't retrain on, or even prompt-condition on, new public internet data, so if you ask it about recent news it'll spit out utter garbage. An internal version reportedly does seek out and retrain on new public internet data. But how does it find that data? With a neat tool that constantly crawls the web and builds large, efficient databases and indexes. Oh right: that's called a search engine.
So even if end users start using LLMs as a substitute for search engines (which is generally not happening at the moment, and seems unlikely to become a concern in the age of GPT-3, despite what many people believe), most LLM queries will likely be forwarded to some search engine or other for prompt conditioning. Search engines won't die; they'll just have to adapt to be useful for LLM prompt conditioning in addition to being useful to end users.
royalemate357 t1_j60xuup wrote
There's an implementation of score-based models from the paper that showed score-based models and diffusion models are the same, here: https://github.com/yang-song/score_sde_pytorch
IMO their implementation is more or less the same as a diffusion model, except score-based models use a numerical ODE/SDE solver to generate samples instead of the DDPM sampling method. It may also train in continuous time: rather than choosing t ~ randint(0, 1000), it draws t ~ uniform(0, 1).
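The discrete-vs-continuous timestep distinction can be sketched as below; the small cutoff `eps` is an assumed numerical safeguard (continuous-time implementations typically avoid sampling t exactly at 0), not a detail taken from the linked repo:

```python
import numpy as np

rng = np.random.default_rng(0)
batch = 16

# DDPM-style training: draw discrete timesteps t in {0, ..., 999}
t_discrete = rng.integers(0, 1000, size=batch)

# continuous-time score-SDE training: draw t uniformly on (eps, 1]
eps = 1e-5  # assumed cutoff to avoid numerical issues near t = 0
t_continuous = rng.uniform(eps, 1.0, size=batch)
```

Everything else in the training step (perturb data according to the noise schedule at t, regress the score/noise) is shared, which is the sense in which the two formulations coincide.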
NeoKov t1_j60xeip wrote
Fig. 8.5 mentions a "brown line" for (b), but the line appears to be black.
lucidrage t1_j61so7l wrote
Reply to comment by cdsmith in Few questions about scalability of chatGPT [D] by besabestin
>convert from PyTorch, Tensorflow, or a model in several other common formats into a Groq program
Is any effort being spent on adding a plugin for a high-level framework like Keras to automatically use Groq?