Recent comments in /f/MachineLearning
MysteryInc152 t1_j60vz8p wrote
Reply to comment by FallUpJV in Few questions about scalability of chatGPT [D] by besabestin
OpenAI's models are still undertrained as well.
avd4292 t1_j60ufsr wrote
Reply to [D] Why are GANs worse than (Latent) Diffusion Models for text2img generation? by TheCockatoo
It's hard to scale up, but check this new paper out: https://arxiv.org/pdf/2301.09515v1.pdf
programmerChilli t1_j60s9pz wrote
Reply to [P] EvoTorch 0.4.0 dropped with GPU-accelerated implementations of CMA-ES, MAP-Elites and NSGA-II. by NaturalGradient
Have you tried out the PyTorch 2.0 compilation feature (i.e. torch.compile)? It might help a lot for evolutionary computation stuff.
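Something like this is what I mean; a minimal sketch with a made-up batched fitness function (shapes and sizes are just examples):

```python
import math
import torch

def rastrigin(x: torch.Tensor) -> torch.Tensor:
    # Batched Rastrigin fitness: one scalar per row of the population matrix
    return 10 * x.shape[-1] + (x ** 2 - 10 * torch.cos(2 * math.pi * x)).sum(dim=-1)

# torch.compile (PyTorch >= 2.0) traces the function and fuses the elementwise ops
compiled_rastrigin = torch.compile(rastrigin)

population = torch.randn(10_000, 80, device="cuda" if torch.cuda.is_available() else "cpu")
fitness = compiled_rastrigin(population)
print(fitness.shape)  # torch.Size([10000])
```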
Delicious-View-8688 t1_j60s6lt wrote
git for versioning code
dvc for versioning data (and other ML things)
mlflow for managing ml pipelines (overlaps with some parts of dvc)
conda for environment management (yes, it can be slow...)
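For the mlflow bullet, a minimal sketch of the kind of run tracking I mean (the experiment name and values are made up):

```python
import mlflow

# Group runs under an experiment; each run records its params and metrics
mlflow.set_experiment("demo-experiment")

with mlflow.start_run():
    mlflow.log_param("learning_rate", 1e-3)
    mlflow.log_param("batch_size", 64)
    for epoch in range(3):
        # In a real pipeline these values would come from your training loop
        mlflow.log_metric("val_loss", 1.0 / (epoch + 1), step=epoch)
```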
HateRedditCantQuitit t1_j60rtsa wrote
Reply to [D] Why are GANs worse than (Latent) Diffusion Models for text2img generation? by TheCockatoo
This isn't the whole answer, but GANs are super hard to train, while diffusion models are an instance of much better understood methods (MLE, score matching, variational inference). That leads to a few things:
- It's more reliable to converge (which leads to enthusiasm)
- It's easier to debug (which leads to progress)
- It's better understood (which leads to progress)
- It's simpler (which leads to progress)
- It's more modular (which leads to progress)
Hypothetically, it could even be that the best simple GAN is better than the best simple diffusion model, but it's easier to iterate on diffusion models, which means we'd still be better able to find the good ways to do diffusion.
tl;dr when I worked on GANs, I felt like a monkey hitting a computer with a wrench to make it work, while when I work on diffusion models, I feel like a mathematician deriving Right Answers™.
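For concreteness, here's a rough sketch of the simplified diffusion training objective (epsilon-prediction MSE); the model and noise schedule are placeholders, not any particular codebase:

```python
import torch
import torch.nn.functional as F

def ddpm_simple_loss(eps_model, x0, alphas_cumprod):
    # Pick a random timestep and noise sample for each example in the batch
    t = torch.randint(0, len(alphas_cumprod), (x0.shape[0],), device=x0.device)
    eps = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(-1, *([1] * (x0.dim() - 1)))
    # Forward (noising) process: q(x_t | x_0)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps
    # Plain regression on the added noise -- no adversary, no instability
    return F.mse_loss(eps_model(x_t, t), eps)
```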
HateRedditCantQuitit t1_j60qzvg wrote
Reply to comment by dojoteef in [D] Why are GANs worse than (Latent) Diffusion Models for text2img generation? by TheCockatoo
I always see diffusion/score models contrasted against VAEs, but is there really a good distinction? Especially given latent diffusion and IAFs and all the other blurry lines. I feel like any time you're doing forward training & backwards inference trained with an ELBO objective, it should count as a VAE.
arg_max t1_j60qav1 wrote
Typically, if your solver is not written in PyTorch/TensorFlow itself, you can't easily calculate gradients through it because your computational graph doesn't capture the solver. If your solver is also written in the framework and is differentiable, you might be able to just backpropagate through it. Otherwise, the Neural ODE paper that was linked here a few times has an adjoint formulation that gives you the gradients as the solution to another ODE, but this is specific to their problem and won't apply to problems that aren't differential equations.
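As a rough illustration of the "solver written in the framework" case, here's a toy differentiable Euler integrator for y' = -y (everything in it is a made-up example):

```python
import torch

def euler_solve(f, y0, t0, t1, steps=100):
    # Explicit Euler written entirely in PyTorch, so autograd records every
    # step and gradients flow back to y0 (and to any parameters used inside f)
    y, t = y0, t0
    dt = (t1 - t0) / steps
    for _ in range(steps):
        y = y + dt * f(t, y)
        t = t + dt
    return y

y0 = torch.tensor([1.0], requires_grad=True)
yT = euler_solve(lambda t, y: -y, y0, 0.0, 1.0)
yT.sum().backward()
print(y0.grad)  # ~exp(-1), since d yT / d y0 = e^{-1} for y' = -y
```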
cdsmith t1_j60q0bs wrote
Reply to comment by Taenk in Few questions about scalability of chatGPT [D] by besabestin
I can only answer about Groq. I'm not trying to sell you Groq hardware; I just honestly don't know the answers for other accelerator chips.
Groq very likely increases inference speed and power efficiency over GPUs; that's actually its main purpose. How much depends on the model, though. I'm not in marketing so I probably don't have the best resources here, but there are some general performance numbers (unfortunately no comparisons) in this article, and this one talks about a very specific case where a Groq chip gets you a 1000x inference performance advantage over the A100.
To run a model on a Groq chip, you would typically start before CUDA enters the picture at all, and convert from PyTorch, Tensorflow, or a model in several other common formats into a Groq program using https://github.com/groq/groqflow. If you have custom-written CUDA code, then it's likely you've got some programming work ahead of you to run on something besides a GPU.
londons_explorer t1_j60m5ui wrote
Reply to comment by vivehelpme in Few questions about scalability of chatGPT [D] by besabestin
This isn't true.
The model generates 1 token at a time, and if you look at the network connection you can see it slowly loading the response.
I'm pretty sure the answer is returned as fast as OpenAI can generate it on their cluster of GPUs.
Reifiery t1_j60ki11 wrote
Might be helpful: https://www.pgupta.info/data/talks/huawei23.pdf
arg_max t1_j60jz1r wrote
Reply to [D] Why are GANs worse than (Latent) Diffusion Models for text2img generation? by TheCockatoo
Iterative refinement seems to be a big part of it. In a GAN, your network has to produce one image in a single forward pass. In diffusion models, the model actually sees the intermediate steps over and over and can make gradual improvements. Also, if you think about what the noise does: in the first few steps it removes all the small details and only keeps the low-frequency, large structures. Basically, in the first steps the model kind of has to focus on overall composition. Then, as the noise level goes down, it can gradually start adding all the small details.
On a more mathematical level, the noise smooths the distribution and widens its support in the [0,1]^D cube (D = image dimension, e.g. 256x256x3). People typically assume that the data manifold is low-dimensional, which can make sampling from it hard.
Some support for this claim is that people were able to improve other generative models, like autoregressive models, using similar noisy distributions. Also, you can set up GANs to sample from the intermediate distributions, which works better than standard GANs.
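To make the iterative refinement point concrete, here's a stripped-down DDPM-style sampling loop (the model and noise schedule are placeholders):

```python
import torch

@torch.no_grad()
def sample(eps_model, shape, betas):
    # Start from pure noise and refine step by step: coarse structure is
    # pinned down at high noise levels, fine details at low noise levels
    alphas = 1.0 - betas
    alphas_cumprod = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)
    for t in reversed(range(len(betas))):
        eps = eps_model(x, torch.full((shape[0],), t))
        # Remove the predicted noise, rescale, then re-inject a smaller noise
        x = (x - betas[t] / (1 - alphas_cumprod[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)
    return x
```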
IntelArtiGen t1_j60jjfg wrote
Reply to [D] Why are GANs worse than (Latent) Diffusion Models for text2img generation? by TheCockatoo
It's quite hard to answer these questions for neural networks. We don't really know if GANs are forever worse than latent diffusion models; they are now, but previously they weren't, and perhaps in the future GANs will outperform LDMs. It seems that the way we currently set up the denoising task is better suited to text2img than the way we currently set up GANs.
A model usually outperforms another when it's more efficient in how it stores information in its weights. Successive conditioned denoising layers seem to be more efficient for this task, but they also require a good enough perceptual loss, a good enough encoder, etc. We know these networks could compete with GANs, but maybe they just weren't good enough before, or weren't combined in a good enough way.
NaturalGradient OP t1_j60jc3z wrote
Reply to comment by lucidraisin in [P] EvoTorch 0.4.0 dropped with GPU-accelerated implementations of CMA-ES, MAP-Elites and NSGA-II. by NaturalGradient
Great to hear! I actually lead the CMA-ES effort and tried very hard to match the fine details of pycma so that the performance is comparable. If you run into any unexpected behavior, please do open a GitHub issue or reach out to me directly. There are a lot of fine details in a practical CMA-ES implementation, so I'd really like to know if I missed anything.
NaturalGradient OP t1_j60iyek wrote
Reply to comment by Ulfgardleo in [P] EvoTorch 0.4.0 dropped with GPU-accelerated implementations of CMA-ES, MAP-Elites and NSGA-II. by NaturalGradient
It depends what you're trying to do :)
If you want to run GPU-accelerated neuroevolution in Brax or IsaacGym, then keeping everything on the GPU is absolutely relevant. Similarly, if you're trying to do MPC or any optimisation of an NN input, then it's still very useful to be on the GPU. As you said, benchmarking is another place this GPU acceleration can be very helpful. Basically, anywhere the fitness evaluation isn't the only limiting factor.
For expensive/CPU-bound fitness functions, we have other utilities too! For example, with a single flag you can distribute your fitness evaluation across multiple actors using Ray. This means you can scale to an entire CPU cluster effortlessly!
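Roughly, the idea looks like the following sketch; it's written from memory, so treat the exact argument names (especially `num_actors`) as assumptions rather than the definitive EvoTorch API:

```python
import torch
from evotorch import Problem
from evotorch.algorithms import SNES
from evotorch.logging import StdOutLogger

def sphere(x: torch.Tensor) -> torch.Tensor:
    # Toy fitness standing in for an expensive, CPU-bound evaluation
    return torch.sum(x ** 2)

# num_actors is the flag referred to above: it spreads fitness evaluation
# across Ray actors (argument name recalled from memory, not verified)
problem = Problem("min", sphere, solution_length=100,
                  initial_bounds=(-1, 1), num_actors=8)

searcher = SNES(problem, stdev_init=5.0)
StdOutLogger(searcher)  # print progress each generation
searcher.run(100)
```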
Taenk t1_j60gdbl wrote
Reply to comment by cdsmith in Few questions about scalability of chatGPT [D] by besabestin
Do these also increase inference speed? How much work is it to switch from CUDA based software to one of these?
dojoteef t1_j60evd7 wrote
Reply to [D] Why are GANs worse than (Latent) Diffusion Models for text2img generation? by TheCockatoo
I'd guess that it's an easier optimization problem. GANs are known to have stability issues during training, likely due to the adversarial formulation.
I think a more interesting question is why it also performs better than VAEs, since diffusion models also fall under the category of variational inference. Again I'd assume it's an easier optimization problem due to having a large number of denoising steps. Perhaps a technique like DRAW could match diffusion models if used with more steps? Not sure.
RemindMeBot t1_j60a563 wrote
Reply to comment by FirstBabyChancellor in [Discussion] Github like alternative for ML? by angkhandelwal749
I will be messaging you in 2 days on 2023-01-28 20:20:17 UTC to remind you of this link
FirstBabyChancellor t1_j60a0n5 wrote
!remindme 2 days
TopCryptographer402 t1_j609ecj wrote
Reply to [D] Simple Questions Thread by AutoModerator
Does anyone have resources on how to create a simple time series transformer for a classification task? I've been trying to build one for over a month now but I haven't had any luck. I'm trying to predict a binary outcome (0 or 1) for the next 100 time steps.
currentscurrents OP t1_j608oz5 wrote
Reply to [R] Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers by currentscurrents
TL;DR:
- In-context learning (ICL) is the ability of language models to "learn from example" to perform new tasks just based on prompting. These researchers are studying the mechanism behind ICL.
- They show that the attention layers allow transformers to implement a gradient descent optimization process at inference time. This mechanism produces very similar results to explicit optimization through fine-tuning, but was itself learned by optimization through gradient descent.
- Based on this finding, they apply momentum, a technique known to improve optimizers, to transformer attention layers. This produces a small-but-consistent improvement in performance on all tested tasks. They suggest that there are more improvements to be made by explicitly biasing transformers towards meta-optimization.
This reminds me of some meta-learning architectures that try to intentionally include gradient descent as part of inference (https://arxiv.org/abs/1909.04630) - the difference here is that LLMs somehow learned this technique during training. The implication is pretty impressive: at enough scale, meta-learning just emerges by itself because it's a good solution to the problem.
Other researchers are looking into ICL as well, here's another recent paper on the topic: https://arxiv.org/abs/2211.15661
andreichiffa t1_j60625r wrote
So. First of all it’s not the size, or at least not only the size.
Before ChatGPT, OpenAI experimented with InstructGPT, which at 6B parameters completely destroyed the 175B GPT-3 when it came to satisfying the users interacting with it and not being completely psycho.
Code-generating abilities start around 12B parameters (OpenAI Codex), so most of the things you are interacting with and are impressed by could be done with a 12B-parameter model. What is really doing the heavy lifting for ChatGPT is the fine-tuning and guided generation that make it conform to the user's expectations.
Now, model size allows for nice emerging properties, but there is a relationship between dataset size and model size, meaning that without increasing the dataset, bigger models do nothing better. At 175B parameters, GPT-3 was already past that point for the curated dataset OpenAI used for it. And given that their dataset already contained CommonCrawl, it was pretty much all public writing on the internet.
They weren't short by a little, either; the gap was more than a factor of 10. Finding enough data just to finish training GPT-3 is already a challenge; larger models would need even more. That's why they could dump code and more text into GPT-3 to create GPT-3.5 without creating bottlenecks.
Now, alternative models to GPT-3 have been trained (OPT-175B or BLOOM), but at least OPT-175B underperforms. OpenAI actually did a lot of data preparation, meaning that anyone who wants to replicate it would need to figure out the "secret sauce".
Ulfgardleo t1_j603u8t wrote
Reply to [P] EvoTorch 0.4.0 dropped with GPU-accelerated implementations of CMA-ES, MAP-Elites and NSGA-II. by NaturalGradient
In my experience, this is never the bottleneck. Rastrigin does not cost much to evaluate; the real functions you would consider evolution for do. I did research in speeding up CMA-ES and in the end it felt like a useless exercise in matrix algebra for that reason.
Yes, in theory being able to speed up matrix operations is nice, but doing stuff in higher dimensions (80 is kinda irrelevant computationally, even on a CPU) always has to fight against the O(1/n) convergence rate of all evo algorithms.
So all this is likely good for is benchmarking these algorithms in a regime that is practically irrelevant for evolution.
cruddybanana1102 t1_j601vly wrote
Someone has already mentioned Neural Ordinary Differential Equations, which is also the first thing that came to mind. There are also extensions where one can use PDEs (Neural Hamiltonian Flows) or even stochastic DEs (score-based generative models) in the model, all of them covering different but overlapping use cases.
There are also techniques that use numerical solvers as black boxes to perform model-order reduction of a complicated system of equations, identify slow modes, do timescale decomposition, etc.
new_name_who_dis_ t1_j601m4q wrote
Reply to comment by fernandocamargoti in [P] EvoTorch 0.4.0 dropped with GPU-accelerated implementations of CMA-ES, MAP-Elites and NSGA-II. by NaturalGradient
Gradient descent is also about optimization... You can optimize even neural networks with a bunch of different methods other than gradient descent (including evolutionary methods). They don't work as well but you can still do it.
fernandocamargoti t1_j60xagg wrote
Reply to comment by ReginaldIII in [P] EvoTorch 0.4.0 dropped with GPU-accelerated implementations of CMA-ES, MAP-Elites and NSGA-II. by NaturalGradient
Well, what you're talking about are ways to use evolutionary algorithms to optimize the parameters of an ML model. But in my eyes, that doesn't mean it is ML. The two share a lot, but they aren't the same. For me, evolutionary algorithms are part of metaheuristics, which is part of AI (which ML is also part of). Different areas and subareas of research do interact with each other. I just mean that the "is" part is a bit too much in this case.