Recent comments in /f/MachineLearning
CKtalon t1_j62n9yw wrote
Reply to comment by data-drone in [R] SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot by Secure-Technology-78
About 10-12 times more than the tokens seen.
Vivid-Ad6077 t1_j62n4k1 wrote
https://wandb.ai/site - Weights & Biases does everything you listed, from versioning code, datasets and models to visualizing experiments, managing hyperparameters and even running hyperparameter search. It can be used to fully reproduce and recreate the entire state of your ML workflow. It's free for individuals and academics.
data-drone t1_j62n3b9 wrote
Reply to comment by CKtalon in [R] SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot by Secure-Technology-78
How much more training do they need?
teenaxta t1_j62mz4o wrote
Customer ID is useless, so obviously it will be dropped. The actions column is a bit trickier.
If the actions are discrete classes, then I think you should break the column up into sub-classes and one-hot encode the actions.
I can't really understand why you need an LSTM here. Do you have sequence data or any sort of temporal component? If you have to use an LSTM you can just set your sequence length to 1 and essentially use it as a plain NN, but that makes no sense honestly. It would be much better to use something like XGBoost.
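If it helps, here's a minimal sketch of the one-hot idea with pandas. The column names and data are made up for illustration; only the shape of the transformation matters:

```python
import pandas as pd

# Hypothetical frame: one row per customer, an "action" column with
# discrete classes, and the binary target.
df = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "action": ["view", "add_to_cart", "view"],
    "bought": [0, 1, 0],
})

# Drop the useless identifier, then one-hot encode the action classes
X = pd.get_dummies(df.drop(columns=["customer_id", "bought"]),
                   columns=["action"])
y = df["bought"]

print(list(X.columns))  # ['action_add_to_cart', 'action_view']
```

The resulting X can go straight into XGBoost or any sklearn-style classifier.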
guava-bandit t1_j62mxs8 wrote
For the separate columns question: depending on the importance that those in isolation would have on whether a customer would buy a product or not, you might want a feature per action and each with a flag value on whether the user did it or not. This is more something you’ll have to think about and to test out. If you do end up doing a feature per action, you might want to look at some regularisation for your logistic regression parameters, as maybe some of the actions are not as useful in predicting a good outcome.
For the training bit (.fit()), you need to pass the fit function your prepared training dataset X in 2D format, and for the y argument you need to pass in your class target data. I must say that the error you get confuses me a bit though.
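To make the .fit() shapes and the regularisation point concrete, here's a minimal sketch with sklearn (the data is made up; each column is a per-action flag as suggested above):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny made-up dataset: each column is a 0/1 flag for one action,
# y is whether the customer bought the product.
X = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 1, 1],
              [0, 0, 0]])   # 2D (n_samples, n_features), as fit() expects
y = np.array([1, 0, 1, 0])  # 1D class targets

# L1 regularisation can shrink the weights of unhelpful action flags
# towards zero; C controls the regularisation strength (smaller = stronger)
clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
clf.fit(X, y)

print(clf.coef_)  # one weight per action feature
```

If your error is about shapes, check that X is 2D and y is 1D before calling fit.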
I hope this is giving you some pointers though, and opening up the discussion to more useful input :)
LetWrong1932 t1_j62mke8 wrote
Reply to comment by drakesword514 in [D] CVPR Reviews are out by banmeyoucoward
I think it's quite good. Try your best to address reviewer 3's concerns; they and the 5 will probably convince the 2 :)
danielgafni t1_j62mh4o wrote
Reply to [P] EvoTorch 0.4.0 dropped with GPU-accelerated implementations of CMA-ES, MAP-Elites and NSGA-II. by NaturalGradient
How does it compare to evojax? A huge deal there is training all the networks in the population in parallel. This gives absolutely massive speedups as you can imagine. Can evotorch do it?
Luminite2 t1_j62kcmp wrote
Reply to comment by gradientpenalty in [D] Moving away from Unicode for more equal token representation across global languages? by madmax_br5
Your tl;dr is a bit circular. English has the highest compression ratio because the tokenizer was trained to optimize compression on mostly English data. One could train a BPE-based tokenizer that compresses some other language really well but works poorly on English if that made sense for the intended application.
lookatmetype t1_j62j0t3 wrote
Reply to comment by currentscurrents in [R] Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers by currentscurrents
is there anything he hasn't done?
moschles t1_j62iaxi wrote
Reply to [D] Why are GANs worse than (Latent) Diffusion Models for text2img generation? by TheCockatoo
GANs produce an image "cut from the whole cloth" at once.
Diffusion models are using a trick -- wherein between rounds of incremental noise removal, they perform a super resolution round.
Technically speaking, you could start from GAN output and then take it through rounds of super-resolution. The result would look a lot like what diffusion models produce. This leaves a question as to how the new details would be guided, or more technically, what the super-resolution features would be conditioned upon. If you are going to condition them on text embeddings, you might as well condition the whole process on the same embedding... now you just have a diffusion model.
A second weakness of GANs is the narrowness of their variety. When made to produce vectors corresponding to a category "dog", they tend to produce nearly exactly the same dog each time.
CKtalon t1_j62hmsr wrote
Reply to [R] SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot by Secure-Technology-78
Before people get their hopes up, BLOOM and OPT are known to be seriously undertrained (not Chinchilla-optimal, BLOOM more so than OPT), so it’s possible that most of the weights were useless to begin with. The results of this paper seem to imply that.
jobeta t1_j62eibb wrote
IMHO the buzz is mainly around the UX provided by ChatGPT. Most LLMs are not that easily accessible and most people never get to experience any aha moment with them, so most people don't care. As for Google, I do think there is real but not immediate danger for their business model. The big issue for them is that 60% of their revenue comes from ads in Google search, so rolling out an amazing ChatGPT equivalent could potentially hurt their business. They would have to rethink the entire model. For now and AFAIK, ChatGPT doesn't provide web links, so it doesn't feel like it is trying to sell you something. If Google is going to use one of their SOTA LLMs to build a conversational AI and make it available for free, surely they have to consider the implications for Alphabet as a whole.
madmax_br5 OP t1_j62d75y wrote
Reply to comment by PassingTumbleweed in [D] Moving away from Unicode for more equal token representation across global languages? by madmax_br5
I get that now, thanks! Not an ML expert so this is very helpful!
plocco-tocco t1_j62cm2x wrote
Reply to comment by royalemate357 in [D] score based vs. Diffusion models by Individual-Cause-616
What's used more in practice?
CKtalon t1_j62c6t5 wrote
Reply to comment by madmax_br5 in [D] Moving away from Unicode for more equal token representation across global languages? by madmax_br5
GPT can already model multiple languages with a 30k vocabulary, just at the cost of a high token count per (non-English) word. So increasing to 200k will ease most of the burden. It definitely won't bring other languages completely to parity with English, since there's ultimately a hard limit to each language's corpus.
PassingTumbleweed t1_j62bzdk wrote
Reply to comment by madmax_br5 in [D] Moving away from Unicode for more equal token representation across global languages? by madmax_br5
You can totally do that. There are tricks to reduce memory usage, too, such as the embedding factorization used in ALBERT.
The best part is, none of these options are precluded by Unicode. Unicode in fact has nothing to do with it!
madmax_br5 OP t1_j62bm6c wrote
Reply to comment by PassingTumbleweed in [D] Moving away from Unicode for more equal token representation across global languages? by madmax_br5
Why not just increase the vocabulary size so that each symbol in a logographic language has one token? Logograms are already a tokenization scheme, in effect! Would love to understand more about the practical effects of a larger vocabulary on model compute requirements.
madmax_br5 OP t1_j62b2jq wrote
Reply to comment by float16 in [D] Moving away from Unicode for more equal token representation across global languages? by madmax_br5
Yes, this is my point - the tokenizer OpenAI uses is optimized for European languages, as it is an alphabetic tokenizer designed for consonants and vowels. I'm wondering why they don't move away from BPE altogether and just increase the vocabulary size to give each symbol in each logographic language its own token. This problem must eventually be solved for multilingual models to have similar cost and capabilities across languages.
So the real question is what is the best tokenization approach to use for a truly multilingual model, and why?
currentscurrents OP t1_j62auto wrote
Reply to comment by rjromero in [R] Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers by currentscurrents
Yes, but I don't want to create too much optimism; meta-learning was also a promising lead when Schmidhuber wrote his PhD thesis.
Honestly, I'm not sure much has changed since then other than we got more compute power. Transformers are reportedly equivalent to 1990s meta-learning networks except that they run better on GPUs, and GPUs have gotten powerful enough to run them at very large scale.
master3243 t1_j62aoln wrote
Reply to comment by currentscurrents in [R] Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers by currentscurrents
You're right they've been around for 5 years (and the idea for attention even before that) but almost every major conference still has new papers coming out giving more insight into transformers (and sometimes algorithms/methods older than it)
I just don't want to see titles flooded with terms like "secretly" or "hidden" or "mysterious". I feel it replaces scientific terms with less scientific but more eye-catching ones.
Again I totally understand why they would choose this phrasing, and I probably would too, but in a blog post title not a research paper title.
But once again, the actual work seems great and that's all that matters really.
madmax_br5 OP t1_j62anqr wrote
Reply to comment by CKtalon in [D] Moving away from Unicode for more equal token representation across global languages? by madmax_br5
What would be the practical impacts of a larger vocabulary? There seems to ultimately be no way around this if you want a truly multilingual model; your vocabulary needs to be at least as large as the full set of symbols in all the languages in the corpus. But it would seem that the computational costs of this would be limited to the very beginning and very end of the model, which seems computationally insignificant compared to the attention layers that operate in vector space. In fact, doesn't a larger input vocabulary result in fewer net tokens to vectorize in the first place? If the vector space of the embedding has a fixed dimensionality (which I believe it does in the case of GPT3), then isn't each token the same mathematical size once embedded?
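As a back-of-envelope sketch of the embedding-table cost (the vocab sizes are illustrative: a ~50k BPE vocab versus a hypothetical 250k multilingual one; d_model = 12288 is GPT-3's embedding dimension):

```python
# Parameter cost of the input embedding table alone: vocab_size x d_model.
d_model = 12288  # GPT-3 embedding dimension

def embedding_params(vocab_size):
    return vocab_size * d_model

small = embedding_params(50_000)   # ~0.6B parameters
large = embedding_params(250_000)  # ~3.1B parameters
print(small, large)  # 614400000 3072000000
```

Even the 250k table is a small fraction of a 175B-parameter model, which is consistent with the point that the attention layers, not the vocabulary lookup, dominate the compute.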
PassingTumbleweed t1_j62anc3 wrote
Reply to [D] Moving away from Unicode for more equal token representation across global languages? by madmax_br5
You could solve the problem you describe at the tokenization level without moving away from Unicode, which is more about how text is encoded for storage and transmission purposes.
For example let's say you still represent your text as Unicode at rest, but you have a tokenizer that budgets its vocab space s.t. the average number of tokens per sentence is the same across languages (or whatever your fairness criteria is)
float16 t1_j62agci wrote
Reply to [D] Moving away from Unicode for more equal token representation across global languages? by madmax_br5
Isn't this just the result of using certain tokenizers? Using Chinese as an example, no reasonable tokenizer developed with Chinese in mind would give you 17 tokens. You'd have maybe 6 to 8:
- 你好
- ,
- 我
- 是
- 个
- 高个子
...depending on whether it thinks 你好 and 高个子 should be split.
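To make that concrete, here's a toy greedy longest-match segmenter over a small hypothetical vocab (just enough entries to cover the example sentence) that reproduces the 6-token split above:

```python
# Hypothetical vocab for illustration; a real Chinese-aware tokenizer
# would learn its vocab from a corpus.
VOCAB = {"你好", "高个子", "我", "是", "个", "，"}
MAX_LEN = max(len(w) for w in VOCAB)

def segment(text):
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible match first, falling back to single chars
        for length in range(min(MAX_LEN, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in VOCAB or length == 1:
                tokens.append(piece)
                i += length
                break
    return tokens

print(segment("你好，我是个高个子"))
# ['你好', '，', '我', '是', '个', '高个子']  -> 6 tokens, not 17
```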
cdsmith t1_j62a3yv wrote
Reply to comment by lucidrage in Few questions about scalability of chatGPT [D] by besabestin
I'm not aware of any effort to build it into Keras, but Keras models are one of the things you can easily convert to Groq chips using groqflow.
Thanos_nap OP t1_j62nsnj wrote
Reply to comment by teenaxta in [P] Building a LSTM based model for binary classification by Thanos_nap
Oh yes, customer ID will be dropped; that was just for identification. As for why we need an LSTM... that's because they just want it with an LSTM, since LSTM is the "new" thing here. That's all. I have explained to them that it's not really needed, but obviously top management knows better.