Recent comments in /f/MachineLearning

suflaj t1_j8qxasd wrote

That's more an issue with how you're searching. You mention sentiment analysis, for example, but that's a problem that has been considered solved for years. There is no novelty you could add there beyond a bigger model.

Obviously you need to stop looking at what people have done, and start looking at what, in the process of doing it, they didn't do or did poorly. One such thing is tokenization of text. You can't tell me that's all figured out.
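Quick illustration (a minimal sketch using Hugging Face's GPT-2 tokenizer; the exact splits depend on which tokenizer you load): even a widely used BPE vocabulary fragments near-identical strings inconsistently.

```python
# Minimal sketch: how a standard BPE tokenizer fragments similar strings.
# Requires: pip install transformers
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

for text in ["hello world", "Hello World", " hello", "hello123", "2023-02-15"]:
    # The same word can map to different token sequences depending on
    # casing, leading whitespace, or adjacent digits.
    print(repr(text), "->", tok.tokenize(text))
```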

5

teenaxta t1_j8qvnx0 wrote

I think this has more to do with probability: the sum of many independent random variables approaches a Gaussian distribution, which we can prove with the central limit theorem. What that really means is that Gaussian noise can encode all sorts of information. When you keep adding noise, you eventually reach the normal distribution, but the specific noise pattern at hand is unique. Think of it this way: (0, 0) has a mean of 0, while (-1, 1) also has a mean of 0, yet they are different patterns.

That unique noise pattern actually contains useful information, whereas if you were to start from a blank canvas, your generator would have no idea what to generate from it, because a blank canvas is a many-to-one mapping. The additive noise process, by contrast, gives each input its own (effectively unique) mapping.
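For intuition, here's a minimal sketch of the standard DDPM-style forward process (the `alpha_bar` name follows the usual convention; the value is illustrative). The noised sample depends on the clean input *and* the specific noise draw, so the pattern, not just the distribution, carries information:

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_diffuse(x0, alpha_bar, rng):
    """DDPM-style forward step: x_t = sqrt(a_bar)*x0 + sqrt(1-a_bar)*eps."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
    return xt, eps  # eps is the unique noise pattern the model learns to predict

x0 = rng.standard_normal(4)  # stand-in for a clean image
xt1, eps1 = forward_diffuse(x0, alpha_bar=0.1, rng=rng)  # heavily noised, near N(0, I)
xt2, eps2 = forward_diffuse(x0, alpha_bar=0.1, rng=rng)  # same x0, different noise draw

# Two draws from the same x0 give different x_t's, even though both are
# (approximately) normally distributed:
print(np.allclose(xt1, xt2))  # False
```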

1

hfnuser0000 t1_j8qoshn wrote

I am interested in the theoretical aspect of how your model works. Take transformers, for example: you have tokens that attend to other tokens. In the case of RNNs, a piece of information can be preserved for later use, but at the cost of reducing the memory capacity available for other information, and once information is lost, it's lost forever. So I'd think the context length of an RNN scales linearly with its memory capacity (and indirectly with the number of parameters), right?
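To make the contrast concrete, a minimal sketch (random weights, shapes only, no training): an RNN must squeeze the whole history into one fixed-size vector, while attention keeps every past token directly addressable.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 512, 64
tokens = rng.standard_normal((seq_len, d))

# RNN: the entire history is compressed into a single d-dim hidden state.
W = rng.standard_normal((d, d)) * 0.01
U = rng.standard_normal((d, d)) * 0.01
h = np.zeros(d)
for x in tokens:
    h = np.tanh(W @ h + U @ x)  # old information is overwritten as new tokens arrive
print(h.shape)                   # (64,) regardless of seq_len

# Attention: memory grows with the sequence; every past token stays reachable.
q = tokens[-1]                               # the last token queries...
scores = tokens @ q / np.sqrt(d)             # ...all 512 stored tokens
weights = np.exp(scores - scores.max())
weights /= weights.sum()
context = weights @ tokens
print(context.shape)             # (64,), but computed from all 512 tokens
```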

1

farmingvillein t1_j8qipd4 wrote

Let's think step by step:

You:

> I don't think the Related Works section of that paper provides any useful references.

Your own response to the question that was posed:

> https://arxiv.org/abs/1805.04623
> https://arxiv.org/abs/1702.04521

There is no possible way you actually read the Related Works section you dismissed, given that the papers you cited are already covered by its references.

E.g., "Sharp Nearby, Fuzzy Far Away" is directly discussed in the cited "Transformer-XL":

> Empirically, previous work has found that LSTM language models use 200 context words on average (Khandelwal et al., 2018), indicating room for further improvement

> Simply comparing RNNs with and RNNs without memory doesn't tell you anything about how fast the memory fades out and that it never winds up being bigger than a Transformer

I never said this, so I'm not sure what your argument is.

> we know perfectly well that Transformers make excellent use of context windows larger than 50 or 200 tokens (as my two references show)

Neither of the papers you link to (assuming you are talking about your own comment at https://www.reddit.com/r/MachineLearning/comments/1135aew/r_rwkv4_14b_release_and_chatrwkv_a_surprisingly/j8pg3g7/) make any reference to Transformers.

If your claim is that the papers indicated that RNNs have a small window (sure) and that Transformers have a longer one, you're arguing (as you seem to be throughout your post) against a strawman. Re-read what I actually wrote:

> in practice, their effective "context window" often doesn't look much different than a reasonable transformer, when we look at performance metrics against long sequences.

My statement here is an empirical one around performance--which, among other things, is why I reference Dai et al., who (among others!) do a fairly extensive empirical breakdown of how RNN- versus transformer-type architectures perform on long text sequences.

The whole point is that the OP said that RNNs were attractive because of their theoretically infinite context--but my response was that 1) we don't really see that in practice, when we try to measure it directly (as both of our sources point out), and 2) we don't see evidence of superior long-distance behavior when testing against real-world(ish) datasets that should theoretically reward it. Both of these points are captured if you follow the reference I shared (or, as I noted, most reasonable "long-distance transformer" papers).
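For concreteness, "measure it directly" means the kind of probe Khandelwal et al. ran: truncate the distant history and watch the loss on the next token. A minimal sketch of that probe, assuming a hypothetical `model.nll(window)` that returns the loss of the final token given the preceding ones:

```python
def effective_context(model, tokens, max_ctx=1024, step=64):
    """Truncate distant history and track loss on the final token.

    `model.nll(window)` is a hypothetical interface: the negative
    log-likelihood of the last token given the preceding ones.
    Where the loss curve flattens is the effective context window.
    """
    losses = []
    for ctx in range(step, max_ctx + 1, step):
        window = tokens[-ctx:]  # keep only the most recent `ctx` tokens
        losses.append((ctx, model.nll(window)))
    return losses  # per Khandelwal et al., LSTMs typically flatten around ~200 tokens
```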

(As with all things research...someone may come out with a small modification tomorrow that invalidates everything above--but, for now, it represents the broad public (i.e., non-private) understanding of architecture behaviors.)

−1

teb311 t1_j8qdjqv wrote

It’s not popular to do “online learning” for a variety of reasons; u/CabSauce gave a nice list. One reason I’d add: many models are exposed to relatively uncontrolled input, and that can backfire badly. Google “Microsoft Tay Twitter” for a cautionary tale. Garbage in, garbage out: letting your model learn in an uncontrolled environment risks ingesting (lots of) garbage, and sometimes even malicious/adversarial data. Making matters worse, because the garbage affects the model in real time, the actively degrading predictions get made/published/used in production immediately.

In most cases the upside of continuous learning over batched releases is small, and it makes a lot of things harder and riskier.
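If you do need online updates, the usual mitigation is to gate what the model is allowed to learn from. A minimal sketch with scikit-learn's `partial_fit` (the `looks_sane` filter is a stand-in for whatever validation/drift checks you'd actually run):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier(loss="log_loss")
classes = np.array([0, 1])  # partial_fit needs the full label set up front

def looks_sane(x, y):
    """Stand-in for real input validation: schema, ranges, drift, rate limits."""
    return np.all(np.isfinite(x)) and y in classes

def online_update(model, x, y):
    # Refuse to learn from inputs that fail validation -- the whole point is
    # that an uncontrolled stream must not feed the model directly.
    if not looks_sane(x, y):
        return False
    model.partial_fit(x.reshape(1, -1), [y], classes=classes)
    return True
```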

1