Recent comments in /f/MachineLearning
madmax_br5 OP t1_j629re3 wrote
Reply to comment by gradientpenalty in [D] Moving away from Unicode for more equal token representation across global languages? by madmax_br5
Right, but BPE is designed to compress alphabetic languages (multiple letters per word), whereas logographic languages are already compressed (one or more words per symbol, but a larger total symbol inventory). I suppose I don't get the reason for obsessing over efficiency at this step, or why it's necessary. What is the relationship between vocabulary size and a model's computational requirements? If the model input is ultimately an embedding with a fixed number of dimensions, does the token vocabulary size really make much practical difference?
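On the vocabulary-size question: the main place it shows up is the embedding table and the output softmax, both of which are V x d matrices. A rough back-of-the-envelope sketch (the hidden size and vocab sizes here are illustrative, not taken from any specific model):

```python
# Illustrative parameter cost of the vocabulary-dependent pieces of an LM:
# the input embedding table and an (untied) output projection, both V x d.
d_model = 4096  # assumed hidden/embedding dimension


def vocab_params(vocab_size: int, d_model: int) -> int:
    return 2 * vocab_size * d_model  # embedding table + output projection


print(vocab_params(32_000, d_model))   # 262,144,000 at a GPT-2-style vocab
print(vocab_params(250_000, d_model))  # 2,048,000,000 at a multilingual vocab
```

So a larger vocabulary isn't free: the output softmax is a V-wide matmul on every generated token. But once the input is an embedding, the per-layer transformer compute doesn't depend on V at all; the cost is confined to the two ends of the model.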
BeautyInUgly OP t1_j628n7e wrote
Reply to [D] Meta AI Residency 2023 by BeautyInUgly
My update: heard nothing back yet. I'll keep posting in this thread when/if I hear anything. Note that I'm probably a weak candidate tbh, since I don't have any publications.
currentscurrents OP t1_j627rd0 wrote
Reply to comment by master3243 in [R] Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers by currentscurrents
Meh, transformers have been around for like 5 years and nobody figured this out until now.
I think this mostly speaks to how hard it is to figure out what neural networks are doing. The training process (like any optimization process) doesn't care about complexity, so the algorithms the networks learn to implement can be arbitrarily complex.
(or in practice, as arbitrarily complex as the model size and dataset size allow)
endless_sea_of_stars t1_j627a9m wrote
Reply to comment by DigThatData in [R] Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers by currentscurrents
Just rent out an AWS region for a month and you'll be good to go. Hold a couple bake sales to defray the cost.
gradientpenalty t1_j6278gc wrote
Reply to [D] Moving away from Unicode for more equal token representation across global languages? by madmax_br5
It's not a problem with Unicode but with the tokenization method they're using, BPE. I don't foresee any solution in the future, because there aren't many high-paying customers.
TL;DR: English uses the fewest tokens because it gets the highest compression ratio from bytes to tokens.
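The "bytes" part is easy to check directly: byte-level BPE starts from UTF-8, where ASCII characters cost 1 byte each but most CJK characters cost 3, so non-Latin text starts with a handicap before any merges are even learned:

```python
# UTF-8 cost per character: the raw input a byte-level BPE tokenizer sees.
samples = {"hello": "English", "你好": "Chinese"}
for text, lang in samples.items():
    print(f"{lang}: {len(text)} chars -> {len(text.encode('utf-8'))} bytes")

assert len("hello".encode("utf-8")) == 5  # 1 byte per ASCII character
assert len("你好".encode("utf-8")) == 6    # 3 bytes per CJK character
```

On top of that, the merge vocabulary itself is learned from a mostly English corpus, so English byte sequences are the ones that get the long merges.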
cdsmith t1_j626c0c wrote
Reply to comment by gradientpenalty in Few questions about scalability of chatGPT [D] by besabestin
I honestly don't know the price or terms of use, for this or any other company. I'm not in sales or marketing at all. I said you don't need to be Google; obviously you have to have some amount of money, whether you're buying a GPU or some other piece of hardware.
CKtalon t1_j625s3n wrote
Reply to [D] Moving away from Unicode for more equal token representation across global languages? by madmax_br5
The tokenizer was trained on a predominantly English corpus, so it naturally tokenized the most common English words as single tokens and left words from other languages in subword form.
They could increase the vocabulary size to something like 250,000 from the current 30k+, but that would require retraining the model.
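To make the "predominantly English corpus" point concrete, here's a toy sketch of BPE training (not the actual GPT tokenizer): each step merges the most frequent adjacent symbol pair across the corpus, so whatever is frequent in training collapses into single tokens first, and everything else stays in small pieces.

```python
from collections import Counter


def bpe_merges(words, n_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    words = [list(w) for w in words]
    merges = []
    for _ in range(n_merges):
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append((a, b))
        for w in words:  # apply the merge everywhere it occurs
            i = 0
            while i < len(w) - 1:
                if w[i] == a and w[i + 1] == b:
                    w[i:i + 2] = [a + b]
                else:
                    i += 1
    return merges, words


# A corpus dominated by one frequent word: it collapses to one token fast,
# while the rarer word is still left in individual symbols.
corpus = ["the"] * 50 + ["that"] * 5 + ["ten"] * 5
merges, tokenized = bpe_merges(corpus, n_merges=2)
print(merges)        # [('t', 'h'), ('th', 'e')]
print(tokenized[0])  # ['the']
print(tokenized[-1])  # ['t', 'e', 'n']
```

Swap the corpus balance and the merge table changes completely, which is why the learned vocabulary is baked in at training time and can't be grown without retraining.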
madmax_br5 OP t1_j625fr2 wrote
Reply to comment by ww3ace in [D] Moving away from Unicode for more equal token representation across global languages? by madmax_br5
The token counts in my example were copied directly from OpenAI's tokenizer, so even if it isn't Unicode-based, it's still representing logographs very inefficiently.
ww3ace t1_j624na0 wrote
Reply to [D] Moving away from Unicode for more equal token representation across global languages? by madmax_br5
I don’t think any modern SOTA language model uses Unicode for tokenization.
VisceralExperience t1_j623jjy wrote
Reply to comment by currentscurrents in [R] Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers by currentscurrents
"secretly" is what I was referring to
currentscurrents OP t1_j623hb4 wrote
Reply to comment by robdogcronin in [R] Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers by currentscurrents
Yeah, but I want AI now. Not in 40 years when computers are 1000x better.
Also I'm not sure computers will be 1000x better in 40 years, Moore's law isn't what it used to be.
HateRedditCantQuitit t1_j621uj8 wrote
Reply to comment by Zealousideal_Low1287 in [D] Why are GANs worse than (Latent) Diffusion Models for text2img generation? by TheCockatoo
Isn't reconstructing the input exactly what the denoising objective does?
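Pretty much, with one twist: in the DDPM-style objective the network is trained to predict the added noise rather than the input directly, but a perfect noise prediction is equivalent to a perfect reconstruction via a reparameterization. A minimal numpy sketch at a single (assumed) noise level:

```python
import numpy as np

rng = np.random.default_rng(0)

x0 = rng.normal(size=8)  # clean sample
alpha_bar = 0.5          # cumulative noise-schedule value (assumed)
eps = rng.normal(size=8)  # Gaussian noise added by the forward process

# Forward (corruption) process at this noise level:
x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1 - alpha_bar) * eps

# Training minimizes ||eps_pred - eps||^2.  If eps_pred is exact, the clean
# input is recovered exactly -- denoising is reconstruction, reparameterized:
x0_hat = (x_t - np.sqrt(1 - alpha_bar) * eps) / np.sqrt(alpha_bar)
assert np.allclose(x0_hat, x0)
```

The practical difference from a plain autoencoder is that the target is the noise term and the model is conditioned on the noise level, with the objective repeated across many levels rather than in a single pass.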
currentscurrents OP t1_j620shg wrote
Reply to comment by VisceralExperience in [R] Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers by currentscurrents
What? "Meta-optimization" is not a very anthropomorphic term, and certainly not something laymen would understand. Their approach is technical in nature and describes the limitations of current models in explicit detail.
robdogcronin t1_j61zvce wrote
Reply to comment by currentscurrents in [R] Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers by currentscurrents
That's the bitter lesson
DigThatData t1_j61zv3l wrote
Reply to comment by currentscurrents in [R] Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers by currentscurrents
Compute Is All You Need
VisceralExperience t1_j61znkf wrote
Reply to [R] Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers by currentscurrents
The amount of blatant anthropomorphism that comes from AI researchers is so disgusting. Laymen knowledge about the state of the field is already twisted enough from reality, and the researchers are 100% to blame. Seriously, I'd like to see papers getting rejected for this delusional framing of results.
Mefaso t1_j61zim5 wrote
Reply to comment by NaturalGradient in [P] EvoTorch 0.4.0 dropped with GPU-accelerated implementations of CMA-ES, MAP-Elites and NSGA-II. by NaturalGradient
>If you want to run GPU-accelerated neuroevolution in Brax or IsaacGym, then keeping everything on GPU is absolutely relevant
Do you have evidence for that?
I would assume that running Brax rollouts, for example, would take 100x as long as the actual CMA-ES update.
rjromero t1_j61ytag wrote
Reply to [R] Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers by currentscurrents
This is incredible research. Finally a lead on how we might get to "true" one shot / few shot learning.
metric_logger t1_j61xa7a wrote
Comet.ml does everything you listed! Free for individuals!
master3243 t1_j61wtpt wrote
Reply to [R] Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers by currentscurrents
This is great work, done in collaboration with Microsoft Research. I'll have to do more than read the abstract and quickly skim it.
My only slight annoyance is the word "Secretly" in the title. I just feel "implicitly" would be a better word, and less clickbait-y.
ApprehensiveNature69 t1_j61wedt wrote
Reply to [R] Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers by currentscurrents
This is really awesome!
binheap t1_j61v2f2 wrote
Reply to comment by lucidrage in Few questions about scalability of chatGPT [D] by besabestin
If you believe them, model safety is why there isn't a general public release. LLMs (including ChatGPT) tend to be bad at factual accuracy and can easily hallucinate, so it's not obvious that you can work them into a product where accuracy matters a lot. That might hurt brand image in ways Google couldn't tolerate but OpenAI can.
lucidrage t1_j61u7zt wrote
Reply to comment by ObjectManagerManager in Few questions about scalability of chatGPT [D] by besabestin
>that's called a search engine.
like bing? :D
Google isn't known for developing and sticking with new products. When that Google engineer leaked the "sentient AI" story, why didn't Google beat the news by releasing a Google GPT with search-engine capabilities?
With their 150k engineers, I doubt they lack the resources to build a user-friendly version of their LLM, so how come they've been sitting on their hands this whole time?
gradientpenalty t1_j61tko2 wrote
Reply to comment by cdsmith in Few questions about scalability of chatGPT [D] by besabestin
Okay, so where can I buy it as a small startup for under $10k, without signing an NDA to use your proprietary compiler? As far as I can see, we're all still stuck with Nvidia after $10B of funding for all these "AI" hardware startups.
ElectronicCress3132 t1_j629tix wrote
Reply to comment by currentscurrents in [R] Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers by currentscurrents
> implement a gradient descent optimization process at inference time
Could you expand on what this means? At inference time, I thought all weights were frozen, so how could the attention layers be somehow performing gradient descent?
Edit: I read the paper in detail and understood it (the math is walked through in Section 3). Basically, the sentence X itself produces attention weights in the attention layer (recall how attention works: it embeds the sentence, then multiplies it by the key, value, and query matrices). If you give it some examples X' to learn from, there are of course going to be attention weights for both X and X'. It turns out the weights contributed by X' end up being equivalent to taking a step of gradient descent.
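For the linear-attention approximation the paper uses, the equivalence is a small identity you can check numerically: attention over the demonstration tokens X' factors into a sum of outer products, i.e. an implicit weight update ΔW applied to the query, which is the same algebraic form as a gradient-descent step on a linear layer. A sketch with random matrices (softmax dropped, as in the paper's approximation; shapes are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_demo = 4, 5
W_V = rng.normal(size=(d, d))           # value projection
W_K = rng.normal(size=(d, d))           # key projection
X_demo = rng.normal(size=(d, n_demo))   # demonstration (in-context) tokens
q = rng.normal(size=d)                  # attention query for the test input

# Linear attention of the query over the demonstrations:
attn_out = W_V @ X_demo @ (W_K @ X_demo).T @ q

# The same output, rewritten as an implicit weight update: a sum of outer
# products, the algebraic form of a gradient-descent step on a linear layer.
delta_W = sum(np.outer(W_V @ X_demo[:, i], W_K @ X_demo[:, i])
              for i in range(n_demo))
assert np.allclose(attn_out, delta_W @ q)
```

Note that the frozen weights are never touched: the "gradient step" lives entirely in the activations, which is why it only affects the current forward pass and vanishes once the context is gone.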