Recent comments in /f/MachineLearning

mostlyhydrogen OP t1_j7238p8 wrote

As you probably know, ANN search often returns irrelevant results. How might I iteratively refine the search with human feedback, marking samples as "relevant" or "irrelevant" and repeating the search?

I've done a lit search and haven't found anything, maybe because I am using the wrong keywords.
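Concretely, I'm imagining something like Rocchio-style relevance feedback. Here's a rough sketch of the loop I have in mind (the weights and vectors are illustrative; the marked vectors would come from human judgments on ANN results from whatever index you use):

```python
import numpy as np

def refine_query(query, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query toward the mean of relevant vectors and away from
    the mean of irrelevant ones (alpha/beta/gamma are conventional
    Rocchio weights; the values here are illustrative)."""
    q = alpha * query
    if len(relevant) > 0:
        q += beta * relevant.mean(axis=0)
    if len(irrelevant) > 0:
        q -= gamma * irrelevant.mean(axis=0)
    return q

# Toy example of one feedback round; in practice the refined query would be
# fed back into the ANN index for the next search, and the loop repeated.
query = np.array([1.0, 0.0])
relevant = np.array([[0.9, 0.1], [0.8, 0.2]])
irrelevant = np.array([[0.0, 1.0]])
print(refine_query(query, relevant, irrelevant))
```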

1

asarig_ OP t1_j71wbqs wrote

Reply to comment by SatoshiNotMe in [R] Graph Mixer Networks by asarig_

Of course. The MLP-Mixer is a fairly new approach, first developed for image classification; it was proposed independently by Google and Oxford researchers in May 2021.

The MLP-Mixer, also known simply as the "Mixer", is a vision architecture that doesn't use convolutions or self-attention. Instead, it relies solely on multi-layer perceptrons (MLPs) that are applied repeatedly, either across spatial locations or across feature channels.
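For intuition, here's a minimal PyTorch sketch of one Mixer block (layer sizes are illustrative, not the exact configuration from the paper):

```python
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    """One Mixer block: a token-mixing MLP applied across spatial locations,
    then a channel-mixing MLP applied across feature channels."""

    def __init__(self, num_tokens, dim, token_hidden=256, channel_hidden=512):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(
            nn.Linear(num_tokens, token_hidden),
            nn.GELU(),
            nn.Linear(token_hidden, num_tokens),
        )
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(
            nn.Linear(dim, channel_hidden),
            nn.GELU(),
            nn.Linear(channel_hidden, dim),
        )

    def forward(self, x):  # x: (batch, num_tokens, dim)
        # Token mixing: transpose so the MLP acts along the token axis.
        y = self.norm1(x).transpose(1, 2)            # (batch, dim, num_tokens)
        x = x + self.token_mlp(y).transpose(1, 2)    # residual connection
        # Channel mixing: the MLP acts along the feature axis, per token.
        return x + self.channel_mlp(self.norm2(x))

x = torch.randn(2, 196, 512)          # e.g. 14x14 patches, 512 channels
print(MixerBlock(196, 512)(x).shape)  # torch.Size([2, 196, 512])
```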

Instead of Transformers, which are normally applied to graphs, in this work I tried to use Mixers as a new kernel method on graphs, to see how they perform with linear complexity, avoiding the O(n^2) complexity of Transformers.

5

atharvat80 t1_j71u3oa wrote

If you want to take the top-down approach, I'd recommend starting by learning what transformers are. Transformers were originally intended for language modelling, so if you look up an NLP lecture series like Stanford CS224n, they cover that in detail from an NLP perspective; it should be helpful regardless. Or you can check out CS231n, which has a whole lecture on attention, transformers, and ViT. Start there and look up whatever is unclear.

Lmk if you'd like me to link any other resources, I'll edit this later. Happy learning!

13

prototypist t1_j71p3d6 wrote

You can fine-tune language models on a dataset, and that's essentially how people have typically been doing NLP with transformer models. It's only more recently that research has had success with RL for these kinds of tasks. So whatever rationale and answers you get here, the main reason is that people were doing supervised learning before, and then the RL people started getting better results.

1

cunth t1_j71ovks wrote

Getting a good dataset to train a model is usually the most time-consuming task. You need breadth and depth of content so your model doesn't overfit and only work for a handful of narrow use cases.

Supervised learning algorithms need labeled data (e.g. classification tags), and this labeling is traditionally done by people. If it can be done with AI instead, you can complete it 100x faster and probably more accurately.

1

Jurph t1_j71nymu wrote

I recommend diving in, but getting out a notepad and writing down any term you don't understand. So if you get two paragraphs in and someone says "this simply replaces back-propagation, making the updated weights sufficient for the skip-layer convolution", and you realize that you don't understand back-prop or weights or skip-layer convolution ... then you probably need to stop, go learn those ideas, and then go back and try again.

For deep neural nets, back-propagation, etc., there will be a point where a full understanding requires calculus or other strong mathematical foundations. For example, you can't accurately explain why back-prop works without a basic intuition for the Chain Rule. Similarly, activation functions like ReLU and sigmoid require a solid algebraic background for their graphs to be a useful shorthand. But you can "take it on faith" that it works, treat that part of the system like a black box, and revisit it once you understand what it's doing.
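To make the Chain Rule point concrete, here's a toy example with a single sigmoid unit and a squared-error loss (chosen purely for illustration):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x, w, target = 2.0, 0.5, 1.0
z = w * x                      # pre-activation
y = sigmoid(z)                 # activation
loss = (y - target) ** 2       # squared error

# Chain Rule: dloss/dw = dloss/dy * dy/dz * dz/dw
dloss_dy = 2 * (y - target)
dy_dz = y * (1 - y)            # derivative of the sigmoid
dz_dw = x
print("gradient:", dloss_dy * dy_dz * dz_dw)
```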

I would say the biggest piece of foundational knowledge is the idea of "functions", their role in mappings and transforms, and how methods like Newton's Method arrive at approximate solutions over several steps. A lot of machine learning is based on the idea of expressing the problem as a composed set of mathematical expressions that can be solved iteratively. Grasping the idea of a "loss function" that can be minimized is core to the entire discipline.
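As a tiny, self-contained illustration of that last idea (the quadratic loss is purely illustrative):

```python
def loss(w):
    return (w - 3.0) ** 2      # toy loss, minimized at w = 3

def grad(w):
    return 2.0 * (w - 3.0)     # its derivative

w = 0.0                        # arbitrary starting point
for step in range(50):
    w -= 0.1 * grad(w)         # step downhill along the gradient
print(w, loss(w))              # w is now very close to 3.0
```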

18

ooonurse t1_j71mt0q wrote

In fairness, every single time I've seen someone use Grammarly, they were extremely intelligent people with English as their second or third language. I also know one person who uses it because of dyslexia, which has nothing to do with intelligence. Be careful about shaming people for using software that is commonly used for accessibility.

1

mongoosefist t1_j71dbhq wrote

Differential privacy methods already work in a way that's quite similar to the denoising process of diffusion models. The problem is that most differential privacy methods rely on the data being discrete. The latent space of a diffusion model is completely continuous, so there is no way to tell the difference between similar images, and thus you can't tell which ones, if any, come from the training data.

For example, suppose you're pretty sure the diffusion model has memorized an oil painting of Kermit the Frog. There is no way to say with any reasonable certainty whether the images you denoise that turn out to be oil paintings of Kermit come from actual training images, or from the region of latent space where the distribution of oil paintings overlaps the distribution of Kermit pictures. There is no hard point where one transitions into the other, and no meaningful difference in density between the distributions.

2

jimmymvp t1_j71cgkw wrote

There is a trick for getting away with gradually expanding your latent dimension with normalising flows: if you assume the dimensions are independent up to a certain point, you can sample the extra dimensions from a base distribution and concatenate them in the middle of the flow (see the sketch below).
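To make that concrete, here's a minimal sketch of the trick (the two flow halves are stand-ins for whatever invertible blocks you use, e.g. coupling layers):

```python
import torch

def sample(flow_part1, flow_part2, n, d_small, d_extra):
    z = torch.randn(n, d_small)          # base sample at the small dimension
    h = flow_part1(z)                    # first chunk of the flow
    z_extra = torch.randn(n, d_extra)    # extra dims, assumed independent
    h = torch.cat([h, z_extra], dim=-1)  # latent dimension expands mid-flow
    return flow_part2(h)                 # rest of the flow at full dimension

# Identity "flows" just to show the shapes; real flow_part1/flow_part2 would
# be invertible transforms such as stacks of coupling layers.
x = sample(torch.nn.Identity(), torch.nn.Identity(), n=4, d_small=8, d_extra=8)
print(x.shape)  # torch.Size([4, 16])
```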

Again, MCMC sampling and simulation-based inference are examples. Imagine you have an energy function that describes the distribution (you don't have data): how do you sample from it? You would do some MCMC. And how would you arrive at a good proposal distribution to make the MCMC algorithm more efficient? You would fit the proposal based on some limited data that you have, or on inductive biases such as known invariances.
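For instance, here's a minimal Metropolis-Hastings sketch that samples from a distribution defined only by an energy function (the Gaussian proposal and its width are illustrative; in practice you'd fit or adapt the proposal as described):

```python
import math
import random

def energy(x):
    return 0.5 * x * x  # toy energy: p(x) ∝ exp(-E(x)), a standard normal

def metropolis_hastings(n_steps=10000, step_size=1.0):
    x, samples = 0.0, []
    for _ in range(n_steps):
        proposal = x + random.gauss(0.0, step_size)  # symmetric Gaussian proposal
        # Accept with probability min(1, exp(E(x) - E(proposal)))
        if random.random() < math.exp(min(0.0, energy(x) - energy(proposal))):
            x = proposal
        samples.append(x)
    return samples

samples = metropolis_hastings()
print(sum(samples) / len(samples))  # near 0 for this target
```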

3