Recent comments in /f/MachineLearning

teenaxta t1_j62mz4o wrote

Customer ID is useless, so obviously it should be dropped. The actions column is a bit trickier.

If the actions are discrete classes, then I think you should break the column up into sub-classes and one-hot encode the actions.
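Something like this minimal pandas sketch (column names are made up):

```python
import pandas as pd

# Hypothetical frame: one row per customer, a discrete "action" column
df = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "action": ["view", "add_to_cart", "purchase"],
    "bought": [0, 0, 1],
})

df = df.drop(columns=["customer_id"])        # the ID carries no signal
df = pd.get_dummies(df, columns=["action"])  # one-hot encode the actions
print(df)
```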

I can't really see why you need an LSTM here. Do you have sequence data or any sort of temporal component? If you have to use an LSTM you can just set your sequence length to 1 and essentially use it as a plain NN, but honestly that makes no sense. You would be much better off with something like XGBoost.
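Roughly like this, assuming the one-hot frame from the sketch above (the split is just illustrative):

```python
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split

X = df.drop(columns=["bought"])
y = df["bought"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X_train, y_train)
print(model.predict_proba(X_test)[:, 1])  # probability each customer buys
```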

3

guava-bandit t1_j62mxs8 wrote

For the separate-columns question: depending on how much each action in isolation matters for whether a customer buys a product or not, you might want one feature per action, each with a flag value for whether the user did it. This is something you'll have to think about and test out. If you do end up with a feature per action, you might want to look at some regularisation for your logistic regression parameters, as maybe some of the actions are not that useful for predicting the outcome.

For the training bit (.fit()), you need to pass the fit function your prepared training data X in 2D format (n_samples × n_features), and your class targets as the y argument. I must say the error you're getting confuses me a bit, though.
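Roughly like this (toy arrays, and the L1 penalty is just one regularisation option):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# X must be 2D: (n_samples, n_features), one flag column per action
X = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 1, 1],
              [0, 0, 1]])
y = np.array([0, 1, 1, 0])  # did the customer buy?

# L1 regularisation can zero out flags that don't help the prediction
clf = LogisticRegression(penalty="l1", solver="liblinear")
clf.fit(X, y)
print(clf.coef_)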

I hope this is giving you some pointers though, and opening up the discussion to more useful input :)

3

Luminite2 t1_j62kcmp wrote

Your tl;dr is a bit circular. English has the highest compression ratio because the tokenizer was trained to optimize compression on mostly English data. One could train a BPE-based tokenizer that compresses some other language really well but works poorly on English if that made sense for the intended application.
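For instance, with the HuggingFace tokenizers library you can retrain BPE on whatever corpus you like (the file path is a placeholder):

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Train BPE merges on a non-English corpus; the resulting vocabulary will
# compress that language well and English comparatively poorly.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=30_000, special_tokens=["[UNK]"])
tokenizer.train(files=["my_corpus.txt"], trainer=trainer)
print(tokenizer.encode("whatever sentence you like").tokens)
```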

6

moschles t1_j62iaxi wrote

GANs produce an image "cut from whole cloth," all at once in a single forward pass.

Diffusion models use a trick: between rounds of incremental noise removal, they perform a super-resolution round.

Technically speaking, you could start from GAN output and then take it through rounds of super-resolution. The result would look a lot like what diffusion models produce. That leaves the question of how the new details would be guided, or more technically, what the super-resolution features would be conditioned on. If you are going to condition them on text embeddings, you might as well condition the whole process on the same embedding... and now you just have a diffusion model.
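Something like this hypothetical cascade (every model here is a stand-in, not a real API):

```python
# Hypothetical pipeline: GAN base image + text-conditioned super-resolution.
def generate(prompt, text_encoder, gan, sr_stages):
    emb = text_encoder(prompt)        # one shared text embedding
    img = gan.sample(emb)             # low-res image produced in a single shot
    for sr in sr_stages:              # each stage adds detail at higher res,
        img = sr(img, condition=emb)  # conditioned on that same embedding
    return img
```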

A second weakness of GANs is the narrowness of their variety. When made to produce vectors for a category like "dog", they tend to produce nearly the same dog each time (classic mode collapse).

−2

jobeta t1_j62eibb wrote

IMHO the buzz is mainly around the UX that ChatGPT provides. Most LLMs are not that easily accessible, and most people never get to experience any aha moment with them, so most people don't care. As for Google, I do think there is a real but not immediate danger to their business model. The big issue for them is that 60% of their revenue comes from ads in Google Search, so rolling out an amazing ChatGPT equivalent could cannibalize that. They would have to rethink the entire model. For now, and AFAIK, ChatGPT doesn't provide web links, so it doesn't feel like it's trying to sell you something. If Google is going to take one of their SOTA LLMs, build a conversational AI out of it, and make it available for free, surely they have to consider the implications for Alphabet as a whole.

3

CKtalon t1_j62c6t5 wrote

GPT can already model multiple languages with a 30k vocabulary, just at the cost of a high token count per (non-English) word. So increasing to 200k would ease most of that burden. It won't bring other languages fully to parity with English, though, since there's ultimately a hard limit set by the size of each language's corpus.
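You can see the per-word cost directly with OpenAI's tiktoken library (the encoding choice is just for illustration):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # ~100k-entry vocabulary
for text in ["the weather is nice", "天气很好", "погода хорошая"]:
    print(text, "->", len(enc.encode(text)), "tokens")
# Non-English text usually costs more tokens per word under a
# vocabulary trained mostly on English data.
```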

1

madmax_br5 OP t1_j62b2jq wrote

Yes, this is my point - the tokenizer OpenAI uses is optimized for European languages, since it's an alphabetic tokenizer built around consonants and vowels. I'm wondering why they don't move away from BPE altogether and just increase the vocabulary size to give each symbol in each logographic language its own token. This problem must eventually be solved for multilingual models to have similar cost and capability across languages.
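A toy character-level sketch of what I mean (tiny corpus, purely illustrative):

```python
# One token per unique symbol: logographic characters stop being split
# into multi-token byte sequences.
corpus = ["the cat sat", "你好世界", "γειά σου"]
vocab = {ch: i for i, ch in enumerate(sorted({c for text in corpus for c in text}))}

def encode(text):
    return [vocab[ch] for ch in text]

print(encode("你好"))  # exactly one token per symbol
```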

So the real question is what is the best tokenization approach to use for a truly multilingual model, and why?

0

currentscurrents OP t1_j62auto wrote

Yes, but I don't want to create too much optimism; meta-learning was also a promising lead when Schmidhuber wrote his PhD thesis.

Honestly, I'm not sure much has changed since then other than that we have more compute. Transformers are reportedly equivalent to 1990s meta-learning networks, except that they run better on GPUs, and GPUs have become powerful enough to run them at very large scale.

25

master3243 t1_j62aoln wrote

You're right that they've been around for 5 years (and the idea of attention even longer), but almost every major conference still has new papers coming out that give more insight into transformers (and sometimes into algorithms/methods older than them).

I just don't want to see titles flooded with terms like "secretly" or "hidden" or "mysterious"; it feels like replacing scientific terms with less scientific but more eye-catching ones.

Again, I totally understand why they would choose this phrasing, and I probably would too in a blog post title, just not in a research paper title.

But once again, the actual work seems great and that's all that matters really.

12

madmax_br5 OP t1_j62anqr wrote

What would be the practical impact of a larger vocabulary? There ultimately seems to be no way around it if you want a truly multilingual model: your vocabulary needs to be at least as large as the full set of symbols across all the languages in the corpus. But the computational cost of this would seem to be limited to the very beginning and very end of the model, which looks insignificant next to the attention layers operating in vector space. In fact, doesn't a larger input vocabulary mean fewer net tokens to vectorize in the first place? If the embedding space has a fixed dimensionality (which I believe it does in GPT-3's case), then isn't each token the same mathematical size once embedded?
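A quick PyTorch check of that intuition (the sizes are illustrative):

```python
import torch
import torch.nn as nn

d_model = 768  # fixed embedding width
small = nn.Embedding(num_embeddings=50_000, embedding_dim=d_model)
large = nn.Embedding(num_embeddings=200_000, embedding_dim=d_model)

tok = torch.tensor([42])
print(small(tok).shape, large(tok).shape)  # both torch.Size([1, 768])
# Only the lookup table and the final softmax over the vocabulary grow;
# every token occupies the same d_model-sized vector inside the network.
```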

1

PassingTumbleweed t1_j62anc3 wrote

You could solve the problem you describe at the tokenization level without moving away from Unicode, which is more about how text is encoded for storage and transmission purposes.

For example, say you still represent your text as Unicode at rest, but you have a tokenizer that budgets its vocab space such that the average number of tokens per sentence is the same across languages (or whatever your fairness criterion is).
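A toy sketch of that budgeting loop (all the helpers here are hypothetical stand-ins):

```python
# Rebalance vocab budget between languages until average tokens-per-sentence
# is roughly equal. train_bpe and avg_tokens_per_sentence are stand-ins.
def balance_budgets(corpora, total_budget, train_bpe,
                    avg_tokens_per_sentence, tol=0.5):
    budgets = {lang: total_budget // len(corpora) for lang in corpora}
    while True:
        cost = {lang: avg_tokens_per_sentence(
                    train_bpe(corpora[lang], budgets[lang]), corpora[lang])
                for lang in corpora}
        worst = max(cost, key=cost.get)  # language compressed worst
        best = min(cost, key=cost.get)   # language compressed best
        if cost[worst] - cost[best] <= tol:
            return budgets
        budgets[worst] += 100            # shift vocab toward the worst-off
        budgets[best] -= 100
```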

2