Recent comments in /f/MachineLearning

curiousshortguy t1_j6ak1cj wrote

Why are you using euclidiean distance? Use cosine distances. The former cares about vector magnitue, the latter doesn't. As a general rule of thumb for comparing vector embeddings, you don't care about magnitude, at best, that typically captures document length.

Do you have more than product titles, such as product descriptions? Where do you get the user queries from? Do you use a default tokenizer for BERT?

3

marcingrzegzhik t1_j6aic0m wrote

If you are looking for product-query similarity, you could try using a Word2Vec model. You can train a Word2Vec model on your dataset, and then use the model to find the most similar words for each product title and user query. This should give you a better understanding of the similarity between the two.

You can also try using an embedding-based approach, such as using an embedding layer in a neural network. This would enable you to learn more complex relationships between product titles and user queries.

You could also try using a matrix factorization technique such as Singular Value Decomposition (SVD) or Non-Negative Matrix Factorization (NMF). These methods can help you to identify latent features in your dataset, which can be used to generate better recommendations.

Hope this helps!

2

visarga t1_j6aeq98 wrote

Scaling model size continues but obtaining more organic data is over, we are at the limit. So the only way is to generate more, but they need humans in the loop to check quality. It's also possible to generate data and verify with math, code execution, simulation or other means. And AnthropicAI showed a pure LLM way to bootstrap more data (RLAIF or Constitutional AI).

I bet OpenAI is just taking the quickest route now. For example, we know that using 1800 tasks in pre-training makes the model generalise to many more tasks at first sight (Flan T5). But OpenAI might have 10,000 tasks to train their model on, hence superior abilities. They also put more effort in RLHF, so they got a more helpful model.

Besides pure organic text, there are other sources - transcribed or described videos is a big one. They released the Whisper model and it's possible they are using it to transcribe massive video datasets. Then there are walled gardens - social networks generate tons of text, not the best quality though. There is also a possibility to massage data collection as game play and get people to buy into providing exactly what they need.

12

Anvilondre t1_j6aa0er wrote

Honestly I don't think transformers are worth it for any kind of TS or tabular data (and there's research showing that). But if you really want to try, I had a good success with this library. It makes it essentially a few-liner to run tons of transformer and other architectures on any kind of tabular data. You may also want to check out HuggingFace model repo for quick solutions.

1

albertzeyer t1_j6a8qvq wrote

I mean in the research community, and also all the big players who actually have speech recognition in products, like Google, Apple, Microsoft, Amazon, etc.

Whisper is nice for others. However, as an AED model, it has some disadvantages over an RNN-T model. E.g. it does not work well for streaming (getting instant recognition results, usually within 100ms, or 500ms, or max 1sec). Also, I'm quite sure it has some strange failure cases, as AED models tend to have, like repeating some labels, or skipping to the end of a sequence (or just chunk) when it got confused.

1

pandasiloc t1_j6a4a2n wrote

I never said I didn’t remember how to differentiate multivariate functions - my point was that equating conceptual mathematical knowledge and the ability to implement a specific application of such concepts in a time-constrained and stressful situation is inappropriate.

A lot of things need to come together in answering a question like this - remembering that the chain rule is the key concept in backprop the first place, knowledge of how to implement matrix algebra in code, knowing the commonly-used loss functions, how to compute their derivatives, and how to represent the differentiation in code, etc. None of these things is complicated on its own; the difficulty arises in bringing everything together in a small amount of time. It’s fair to expect people in the field to intuitively remember what is going on but on the spot implementation in under 30 minutes requires a level of rigor that is unrealistic for even a competent person who does not have the theory fresh in their memory.

You keep using the term ‘smart’ and I don’t know what you mean by this. Your last statement is just an assertion without argument, one you’ve repeated throughout your comments but I see no reason to believe, given the above.

2

ant9zzzzzzzzzz t1_j6a37a1 wrote

Is there research about order of training examples, or running epochs on batches of data rather than full training set at a time?

I was thinking about how for people we learn better if focus on one problem at a time until grokking it, rather than randomly learning things in different domains.

I am thinking like train some epochs on one label type, then another, rather than all data in the same epoch, for example.

This is also related to state full retraining, like one probably does professionally - you have an existing model checkpoint and retrain on new data. How does it compare to retraining on all data from scratch?

1

Featureless_Bug t1_j69xpo0 wrote

Oh, a fellow mathematician. Look, I graduated from Cambridge 6 years ago, but I could still prove the fundamental theorem of algebra analytically or with Galois theory (I still remember the general ideas of both proofs I think), so I guess it depends on a person. But FTA is also a much more complicated thing to prove than the chain rule, and you don't even need to prove it to know how to use it. And sorry, if you don't remember how to differentiate multivariable functions, then you are an extraordinarily lousy mathematician. And if you know how to differentiate multivariable functions and if you are smart, you should be able to quickly come up with an implementation for backprop even if you don't remember anything else

0