Recent comments in /f/MachineLearning

MrSpotgold OP t1_j4rqx4q wrote

This software is a nightmare for anyone in the teaching business (whether secondary school or higher education) where assessment is based on essays. I'm not kidding: a nightmare. We are going to bring up kids who will not be able to write a comprehensive text, simply because we lack the means to check that they wrote it themselves, and therefore we must abandon the assessment method altogether. It's that bad.

0

bo_peng OP t1_j4rht4i wrote

RWKV is an RNN that also works as a linear transformer (or we may say it's a linear transformer that also works as an RNN). So it has both parallel and serial modes, and you get the best of both worlds (fast and saves VRAM).

Almost all such "linear transformers" are bad at language modeling, but RWKV is the exception. The basic idea is a bit similar to https://arxiv.org/abs/2105.14103. Then I added lots of new ideas :)
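To illustrate what "parallel and serial modes" can mean, here is a minimal sketch of a decayed, exponentially weighted average over past values, computed two ways. It is a simplified toy and not RWKV's actual formulation: the function names, the scalar `decay`, and the sample inputs are all assumptions made for the example.

```python
import math

def parallel_mode(k, v, decay):
    """Each output depends only on the inputs, so all steps could run at once."""
    out = []
    for t in range(len(k)):
        # Weight of past position i seen from step t: decay^(t - i) * exp(k_i)
        weights = [decay ** (t - i) * math.exp(k[i]) for i in range(t + 1)]
        num = sum(w * v[i] for i, w in enumerate(weights))
        out.append(num / sum(weights))
    return out

def serial_mode(k, v, decay):
    """Same result, but carrying a small running state like an RNN."""
    num = den = 0.0
    out = []
    for k_t, v_t in zip(k, v):
        num = decay * num + math.exp(k_t) * v_t  # running weighted sum of values
        den = decay * den + math.exp(k_t)        # running sum of weights
        out.append(num / den)
    return out

# Hypothetical toy inputs (one scalar channel, four timesteps).
k = [0.1, -0.3, 0.7, 0.2]
v = [1.0, 2.0, -1.0, 0.5]
par = parallel_mode(k, v, decay=0.9)
ser = serial_mode(k, v, decay=0.9)
print(par)
print(ser)
```

The parallel form suits training, where the whole sequence is known up front; the serial form suits generation, since it only keeps two running numbers per channel instead of the full history.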

12

navillusr t1_j4rhitt wrote

The distinctions you’re drawing, pixels vs selenium output and browser vs os, are far less significant than the complexity of the tasks (step-by-step vs entire processes). What they’ve achieved is strictly harder for humans than what you are testing. We can argue whether perception or planning is harder for current technology (computer vision is far more developed than AI planning right now), but I think you need to reconsider the formulation of your tasks. It seems like they are designed to be easy enough for modern methods to solve.

On another note, most interesting tasks can’t be completed with just an x,y mouse location output. Why did you decide to restrict the benchmark to such a limited set of tasks?

1

currentscurrents t1_j4rcc3e wrote

Interesting! I haven't heard of RWKV before.

Getting rid of attention seems like a good way to increase training speed (since training all those attention heads at once is slow), but how can it work so well without attention?

Also aren't RNNs usually slower than transformers because they can't be parallelized?

10

__lawless t1_j4r9ebs wrote

Ok let me elaborate a bit. Imagine the old model is called m_0. Your newly obtained training data is X, y (features and labels, respectively). Now calculate the residual error, which is the difference between y and the prediction of m_0: dy = y - m_0(X). Now train a new model m_1, using X as the features and dy as the labels. Finally, at inference time the prediction is the sum of the two models: y_pred = m_0(X_new) + m_1(X_new).
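The steps above can be sketched in a few lines. This is only a toy under stated assumptions: `m_0` is a made-up fixed function standing in for the old model, the new data is synthetic, and `fit_line` is a hypothetical helper that fits a one-dimensional least-squares line in place of whatever model class you would actually use for m_1.

```python
from statistics import mean

def fit_line(xs, ys):
    """Hypothetical helper: least-squares fit of y = a*x + b on 1-D data."""
    x_bar, y_bar = mean(xs), mean(ys)
    var = sum((x - x_bar) ** 2 for x in xs)
    a = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / var
    b = y_bar - a * x_bar
    return lambda x: a * x + b

def m_0(x):
    """Stand-in for the old model, trained earlier on older data."""
    return 2.0 * x

# Synthetic "newly obtained" data whose relationship has drifted to y = 2.5*x + 1.
X = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [2.5 * x + 1.0 for x in X]

# Residual error of the old model on the new data: dy = y - m_0(X)
dy = [yi - m_0(xi) for xi, yi in zip(X, y)]

# Train m_1 with X as features and dy as labels.
m_1 = fit_line(X, dy)

def predict(x_new):
    """Inference: sum of the old model and the residual model."""
    return m_0(x_new) + m_1(x_new)

print(predict(10.0))
```

Note that m_0 is never retrained here; m_1 only learns the correction that the old model is missing on the new data.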

1

monkeysingmonkeynew OP t1_j4r6lwj wrote

This sounds pretty cool, but I don't follow every step. By "calculate the errors" do you mean, for example, subtract the predicted probabilities from the actual outcome?

Also, I didn't get your last part about inference, what exactly are you referring to there?

2

trnka t1_j4r661s wrote

Think about it more like autocomplete. It's able to complete thoughts coherently enough to fool some people, when provided enough input to complete from. It's often incorrect with very technical facts though.

It's really about how you make use of it. In scientific work, you could present your idea and ask for pros and cons of the idea, or to write a story about how the idea might fail horribly. That can be useful at times. Or to explain basic ideas from other fields.

It's kinda like posing a question to Reddit except that ChatGPT generally isn't mean.

There are other approaches like Elicit or Consensus that use LLMs more for literature review which is probably more helpful.

1

niclas_wue OP t1_j4r5wb1 wrote

Yes, it can be applied to any document. A book would be more expensive because it has more text and thus more input tokens. The PDF needs to be converted to text because the API only accepts text. Equations that can be written in Unicode are passed directly to the network, which can understand them; other equations are currently skipped. So far I have spent almost $100 on tokens to summarize the papers, so there will need to be some paid features in the near future, or a reduction in the number of papers.
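As a rough illustration of why longer documents cost more, here is a back-of-the-envelope estimate. Every number in it is an assumption for the example, not the actual pricing or tokenizer behaviour: `chars_per_token=4` is only a common rule of thumb for English text, and `price_per_1k_tokens` is a placeholder.

```python
import math

def estimate_cost(num_chars, chars_per_token=4, price_per_1k_tokens=0.02):
    """Estimate input-token cost for a document of num_chars characters.

    All defaults are hypothetical placeholders; check your provider's
    tokenizer and price list for real figures.
    """
    tokens = math.ceil(num_chars / chars_per_token)
    return tokens, tokens / 1000 * price_per_1k_tokens

# Hypothetical sizes: a short paper vs. a full book.
paper_tokens, paper_cost = estimate_cost(40_000)
book_tokens, book_cost = estimate_cost(800_000)
print(paper_tokens, paper_cost)
print(book_tokens, book_cost)
```

Cost scales linearly with input length under this model, so a book that is twenty times longer than a paper costs about twenty times as much to process.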

1

TrueBirch t1_j4r4u2n wrote

>combining these technologies is a no brainer

Agreed. I look at the GPT family of models as infrastructure. The real potential comes from layering specific applications on top of it. Imagine every random high school baseball game got a writeup on the local news website. You'd need to ingest sports data and do other pipeline work, but the result could be profitable.

1

ChangingHats t1_j4r2hxx wrote

I am trying to use TensorFlow's MultiHeadAttention to do regression on time series data, forecasting a `(batch, horizon, features)` tensor.

During training, I have `inputs ~> (1, 10, 1)` and `targets ~> (1, 10, 1)`. `targets` is a horizon-shifted output of `inputs`.

During inference, `targets` is just a zeros tensor of the same shape.

What's the best way to run attention such that the output utilizes all timesteps in `inputs` as well as each subsequent timestep of the resulting attention output, instead of ONLY the timesteps of the inputs?

Another problem I see is that attention is run between Q and K, and during inference, Q = K, so that will affect the output differently, no?

1