Recent comments in /f/MachineLearning

ahf95 t1_jan46jv wrote

They are certainly still used for RL (and other cases where you don't have a gradient), but even in those contexts there have been modern advances that cause the preferred algorithms to diverge from old-school genetic algorithms. For instance, things like Particle Swarm Optimization and the Cross-Entropy Method have their conceptual origins in sampling regimes similar to MCMC approaches, but they've become their own entities at this point, outperforming genetic algorithms and being unique and broad enough to merit their own categories.
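To make the sampling-regime comparison concrete, CEM boils down to "sample a population, keep the elites, refit the distribution." A rough sketch (all names and hyperparameters are illustrative, not from any particular library):

```python
# Minimal Cross-Entropy Method for a generic black-box objective (illustrative only).
import numpy as np

def cem(objective, dim, pop_size=50, elite_frac=0.2, iters=100):
    """Maximize `objective` by iteratively refitting a Gaussian to the elite samples."""
    mean, std = np.zeros(dim), np.ones(dim)
    n_elite = int(pop_size * elite_frac)
    for _ in range(iters):
        samples = np.random.randn(pop_size, dim) * std + mean       # sample a population
        scores = np.array([objective(s) for s in samples])          # evaluate fitness
        elites = samples[np.argsort(scores)[-n_elite:]]             # keep the best
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6  # refit the distribution
    return mean

# toy usage: maximize a simple quadratic bowl centered at 3.0
best = cem(lambda x: -np.sum((x - 3.0) ** 2), dim=5)
```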

1

-EmpiricalEvidence- t1_jan0mqg wrote

Precisely because of the computational demands, I don't think genetic algorithms have ever really been "alive", but with compute getting cheaper I could see them finding success similar to the rise of Deep Learning.

Evolution Strategies, used as stabilizers without the genetic component, are already being deployed quite successfully, e.g. in AlphaStar.
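To illustrate what "without the genetic component" looks like in practice, here is a rough sketch of an OpenAI-style ES update (no crossover, just Gaussian perturbations weighted by fitness); all names and hyperparameters are illustrative:

```python
# One Evolution Strategy update step: perturb parameters with Gaussian noise,
# then move in the direction of the fitness-weighted noise. No crossover involved.
import numpy as np

def es_step(theta, fitness_fn, pop_size=100, sigma=0.1, lr=0.01):
    noise = np.random.randn(pop_size, theta.size)                   # one perturbation per candidate
    rewards = np.array([fitness_fn(theta + sigma * n) for n in noise])
    rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # normalize fitness
    grad_est = noise.T @ rewards / (pop_size * sigma)               # estimated ascent direction
    return theta + lr * grad_est
```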

Jeff Clune was quite active in that area of research and he recently joined DeepMind.

https://twitter.com/jeffclune/status/1629132544255070209

1

nfmcclure t1_jamy7yz wrote

I don't think they are dead. Their popularity for NNs is much lower for sure.

In general, GAs can theoretically solve any problem (if you can formulate a fitness function), given long enough time. Because of that, I think they will always have some use cases.
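As a concrete illustration of "just formulate a fitness function": in a toy GA loop, everything except the fitness function is boilerplate (selection, crossover, mutation). A sketch with made-up names and parameters:

```python
# Toy genetic algorithm; the only problem-specific part is the fitness function.
import random

def ga(fitness, genome_len, pop_size=100, generations=200, mut_rate=0.01):
    pop = [[random.random() for _ in range(genome_len)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]                       # selection: keep the fitter half
        children = []
        while len(children) < pop_size:
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, genome_len)            # one-point crossover
            child = a[:cut] + b[cut:]
            child = [g + random.gauss(0, 0.1) if random.random() < mut_rate else g
                     for g in child]                         # mutation
            children.append(child)
        pop = children
    return max(pop, key=fitness)

# toy usage: push every gene toward 0.5
best = ga(lambda g: -sum((x - 0.5) ** 2 for x in g), genome_len=10)
```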

2

lucidraisin t1_jamtx7b wrote

It cannot; the compute still scales quadratically, although the memory bottleneck is now gone. However, I see everyone training at 8k or even 16k within two years, which is more than enough for previously inaccessible problems. For context lengths at the next order of magnitude (say, genomics at a million base pairs), we will have to see whether linear attention (RWKV) pans out, or whether recurrent + memory architectures make a comeback.
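As a back-of-the-envelope illustration of that quadratic compute term (constants are rough and purely illustrative):

```python
# Attention FLOPs grow quadratically with context length, no matter how
# memory-efficient the kernel is. Rough estimate with made-up model sizes.
def attn_flops(seq_len, d_model=2048, n_layers=24):
    # QK^T and the (softmax)V matmul each cost ~2 * seq_len^2 * d_model FLOPs per layer
    return 2 * 2 * seq_len**2 * d_model * n_layers

for n in (2_048, 8_192, 16_384, 1_000_000):
    print(f"{n:>9} tokens -> {attn_flops(n):.2e} attention FLOPs per forward pass")
# doubling the context (8k -> 16k) quadruples the attention compute
```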

14

Dekans t1_jamokhr wrote

> We also extend FlashAttention to block-sparse attention, yielding an approximate attention algorithm that is faster than any existing approximate attention method.

...

> FlashAttention and block-sparse FlashAttention enable longer context in Transformers, yielding higher quality models (0.7 better perplexity on GPT-2 and 6.4 points of lift on long-document classification) and entirely new capabilities: the first Transformers to achieve better-than-chance performance on the Path-X challenge (seq. length 16K, 61.4% accuracy) and Path-256 (seq. length 64K, 63.1% accuracy).

In the paper, the bolded result is achieved with the block-sparse version; the Path-X (16K length) result uses regular FlashAttention.

4