Recent comments in /f/MachineLearning

BeatLeJuce t1_j6mlxjc wrote

your question is answered in the abstract itself ("using only pixels and game points as input"), and repeated multiple times in the text ("In our formulation, the agent’s policy π uses the same interface available to human players. It receives raw RGB pixel input x_t from the agent’s first-person perspective at timestep t, produces control actions a_t ∼ π simulating a gamepad, and receives game points ρ_t attained"). Did you even attempt to read the paper? The concrete architecture showing the CNN is also in Figure S10.

3

PredictorX1 t1_j6mkzl0 wrote

To be clear, there are neural networks which are "deep", and others which are "shallow" (few hidden layers). From a practical standpoint, the latter have more in common with other "shallow" learning methods (tree-induction, statistical regressions, k-nearest neighbor, etc.) than they do with deep learning.

You're right that many people (especially in the non-technical press) have erroneously used "machine learning" to mean specifically "deep learning", just as they've used "artificial intelligence" to mean "machine learning". Regardless, there are still non-deep machine learning methods and other branches of A.I. In practice, non-deep machine learning represents the overwhelming majority of applications today.

I haven't followed the research as closely in recent years, but I can tell you that, deep learning aside, people have only begun to scratch the surface of machine learning application.

54

bitRAKE t1_j6mj7s2 wrote

  1. Ask ChatGPT for an explanation of anything without a known correct answer, and then tell it that "that answer is incorrect". It will proceed to dream up a new answer. This could be non-existent syntax for a programming language, for example. The sequential nature of the model means it can paint itself into a corner quite easily.

  2. Isn't knowledge accuracy a by-product of modeling correct language use to some degree, and not the design goal of the system? A fantasy story is just as valid a language use as a research paper. Accuracy seems to correlate with how the system is primed for the desired context.

2

arg_max t1_j6mg664 wrote

I think diffusion models are kind of a bad example. The SDE paper from Yang Song has shown that it's all about modeling the score function, and this can't be done with simple models. Apart from that, the big text2img models work inside the latent space of a deep VAE, make use of conditioning via cross-attention, which isn't a thing in traditional ML, and use large language models to process the text input. All their components are very DL-based.
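
For readers unfamiliar with the term, "modeling the score function" means learning s_θ(x, σ) ≈ ∇_x log p_σ(x). A minimal denoising score matching sketch (score_model here is a hypothetical callable, not anything from the Song et al. codebase):

```python
import numpy as np

rng = np.random.default_rng(0)

def denoising_score_matching_loss(score_model, x0, sigma):
    # Perturb clean data x0 with Gaussian noise at scale sigma.
    eps = rng.standard_normal(x0.shape)
    x_noisy = x0 + sigma * eps
    # The score of the perturbation kernel q(x_noisy | x0) is -(x_noisy - x0) / sigma**2 = -eps / sigma.
    target_score = -eps / sigma
    # The network is trained so that score_model(x, sigma) matches this target
    # (in practice weighted by sigma**2 and averaged over many noise scales).
    pred_score = score_model(x_noisy, sigma)
    return np.mean(np.sum((pred_score - target_score) ** 2, axis=-1))
```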

13

worriedshuffle t1_j6mduii wrote

> I’d say that even calling it “AI” is misleading because it’s not intelligent.

I’d say it’s misleading for a different reason. We don’t know what intelligence is. Every time a computer can perform a task, that task is no longer considered a test of “intelligence”. Well, if every task is reducible to something unintelligent then perhaps intelligence was really a mirage in the first place.

5

andreichiffa t1_j6mdm66 wrote

On a very high level, transformer-derived architectures struggle with the concept of reality because they need the distributions in the token embedding space to remain wide. Especially for larger models, the training data is so sparse that without that they would struggle with generalization and exposure bias.

Repeated prompting and prompt optimization can pull elements of the training set out of them (in some cases), because in the end they do memorize, but the exact mechanism is not yet clear and cannot be counted on.

You can get around it by adding a "critic" post-processor that classifies whether the model is trying to state a fact, looks it up, and forces the model to re-generate until the statement is factually correct. This is very close to GeDi, the guided generation approach introduced by a Salesforce team back in 2020. Given that OpenAI went this route for ChatGPT and InstructGPT to make them less psycho and more useful to end users (+ iterative fine-tuning from user and critic-model input), there is a good chance they will go this route here as well.
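
As a rough sketch of that critic loop (generate, detect_factual_claims, and verify_claim are hypothetical placeholders here, not any real GeDi or OpenAI API):

```python
def generate_with_fact_critic(prompt, generate, detect_factual_claims, verify_claim, max_retries=5):
    """Regenerate until every detected factual claim checks out, or give up."""
    text = generate(prompt)
    for _ in range(max_retries):
        claims = detect_factual_claims(text)      # classify spans that try to state a fact
        if all(verify_claim(c) for c in claims):  # look each one up
            return text
        text = generate(prompt)                   # force a re-generation
    return text                                   # best effort after max_retries
```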

You can also add discrete non-differentiable layers to train the model to distinguish factual statements from other in-text statements and learn to switch between the modes, allowing it to process them differently. However, you lose the nice back-propagation properties and have to do black-box optimization on the discrete layers, which is costly even by LLM standards. That seems to be the Google approach with PaLM.

3

qalis t1_j6mczg1 wrote

Absolutely not! There is still a lot of research going into traditional ML methods. For tabular data, they are typically vastly superior to deep learning. Boosting models in particular receive a lot of attention due to the very good implementations available (a minimal LightGBM sketch follows the list below). See for example:

- SketchBoost, CuPy-based boosting from NeurIPS 2022, aimed at incredibly fast multioutput classification

- A Short Chronology Of Deep Learning For Tabular Data by Sebastian Raschka, a great literature overview of deep learning on tabular data; spoiler: it does not work, and XGBoost or similar models are just better

- in time series forecasting, LightGBM-based ensembles typically beat all deep learning methods while being much faster to train; see e.g. this paper, and you can also see it in Kaggle competitions and other papers; a friend of mine works in this area at NVIDIA, and their internal benchmarks (soon to be published) show that the top 8 models in a large-scale comparison are in fact various LightGBM ensemble variants, not deep learning models (which, in fact, kind of disappointed them, since it's, you know, NVIDIA)

- all domains requiring high interpretability largely ignore deep learning and put their research effort into traditional ML; see e.g. counterfactual examples, an important interpretability method in finance, or rule-based learning, important in medical and legal applications
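
As promised above, here is a minimal boosting baseline on tabular data using LightGBM's scikit-learn interface (the dataset is just a stand-in for your own features/target):

```python
import lightgbm as lgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Placeholder tabular dataset; swap in your own features and target.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = lgb.LGBMClassifier(n_estimators=500, learning_rate=0.05)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```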

292

qalis t1_j6mbu5s wrote

I recently compiled and went through a reading / watching list, going from basic NLP to ChatGPT:

- NLP Demystified to learn NLP, especially transformers

- Medium article nicely summarizing the main points of GPT-1, 2 and 3

- GPT-1 lecture and GPT-1 paper to learn about general idea of GPT-like models

- GPT-2 lecture and GPT-2 paper to learn about large scale self-supervised pretraining that fuels GPT training

- GPT-3 lecture 1 and GPT-3 lecture 2 and GPT-3 paper to learn about GPT-3

- InstructGPT page and InstructGPT paper to learn about InstructGPT, the sibling model of ChatGPT; as far as I understand, this is the same as "GPT-3.5"

- ChatGPT page to learn about differences between InstructGPT and ChatGPT, which are relatively small as far as I understand; it is also sometimes called "fine-tuned GPT-3.5", AFAIK

Bonus reading (heavy math warning, experience with RL required!):

- the main difference between GPT-3 and InstructGPT/ChatGPT is reinforcement learning with human feedback (RLHF)

- RLHF is based on the Proximal Policy Optimization (PPO) algorithm; a minimal sketch of its clipped objective is included after this list

- PPO page and PPO paper
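
For a taste of what PPO actually optimizes, here is a minimal NumPy sketch of the clipped surrogate objective from the paper (not a full RLHF pipeline):

```python
import numpy as np

def ppo_clipped_objective(ratio, advantage, eps=0.2):
    # ratio = pi_new(a|s) / pi_old(a|s), advantage = estimated A_t (both per-sample arrays).
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # PPO maximizes the mean of the elementwise minimum of the two terms.
    return np.mean(np.minimum(unclipped, clipped))
```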

3

antodima OP t1_j6m5aqd wrote

Basically it is about the feasibility of ridge regression with sparse inputs, but I want to select a subset of the units of W acting on A and B. For instance, if I have A of shape (2x5) and B of shape (5x5) and I choose units 2 and 4, then columns [0,1,3] of A are zeroed and the rows and columns of B with indices [0,1,3] are also zeroed. I select units 2 and 4 with some importance mechanism. The question is: is there a way of obtaining a W* from the filtered A and B that is similar to the W computed without filtering A and B?

I asked because filtering A and B breaks the inversion, and hence the computation of W. I don't know if there is some way of decomposing B to make the inversion easier, or something like that.
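
To make the problem concrete, here is a toy NumPy sketch of the filtering I describe (the exact roles of A and B are left abstract; the point is only that zeroing rows/columns of B makes the inversion break):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 5))
B = rng.standard_normal((5, 5)) + 5 * np.eye(5)   # some well-conditioned 5x5 matrix

dropped = [0, 1, 3]          # units NOT selected (the kept units are 2 and 4)
A_f = A.copy()
A_f[:, dropped] = 0.0        # zero the corresponding columns of A
B_f = B.copy()
B_f[dropped, :] = 0.0        # zero the corresponding rows of B...
B_f[:, dropped] = 0.0        # ...and the corresponding columns of B

print(np.linalg.matrix_rank(B_f))   # rank 2 < 5, so B_f is singular
# np.linalg.inv(B_f) now raises LinAlgError: this is the broken inversion I mean.
```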

Anyway thanks for your interest!

1

oh__boy t1_j6m3sah wrote

Interesting, thanks for the detailed answer. This is cool work; I also love working on projects that squeeze out every last ounce of performance possible to solve a problem. I am somewhat skeptical of how much this applies to other architectures / datasets / problems, since you seem to have worked on only one network and one dataset. I hope you find general concepts, show that they apply to more than just that network and dataset, and prove me wrong, though. Good luck with everything!

2

currentscurrents t1_j6m3ik5 wrote

We could make models with trillions of parameters, but we wouldn't have enough data to train them. Multimodality definitely allows some interesting things but all existing multimodal models still require billions of training examples.

More efficient architectures must be possible - evolution has probably discovered one of them.

1

abcdchop t1_j6m17n8 wrote

wait bro, the key benefit is the hierarchical description -- the "language" is just a format for expressing the hierarchical description of the problem in natural language. I think the improvements you're suggesting pretty much describe the paper itself

6

ezelikman t1_j6lx0vm wrote

Hi, author here!

There are a few ways to interpret this question.

The first is, "why generate a bunch of composable small functions - why not generate complete Python/Lean/etc. implementations directly from the high-level sketch?" If you generate 10 complete implementations, you have 10 programs. If you generate 10 implementations of four subfunctions, you have 10,000 programs. By decomposing problems combinatorially, you call the language model less. You can see the benefits in Fig. 6 and our direct compilation ablation. There's also the context window: a hundred 500-token functions from Parsel is a 50,000-token program. You won't get that with Codex alone.
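
To spell the combinatorics out, here is a toy sketch (hypothetical names, not Parsel's actual API):

```python
from itertools import product

# 10 candidate implementations for each of 4 subfunctions combine into
# 10**4 = 10,000 candidate programs.
candidates = {f"subfn_{i}": [f"impl_{i}_{j}" for j in range(10)] for i in range(4)}
programs = list(product(*candidates.values()))
print(len(programs))  # 10000
```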

Another interpretation is, "why do you need to expose intermediate language when you can use a more abstract intermediate representation." You suggest "leveraging the value of LLMs--through a more natural language interface." That's the goal. Parsel is intentionally basically indented natural language w/ unit tests. There's minimal extra syntax for efficiency and generality - ideally, people who've never used Python can understand and write Parsel. The "expert" details here aren't syntax: most people are unfamiliar with the nuances of writing natural language that automatically compiles to code, like the value of comprehensive unit tests.

Another is, "why design a new language instead of writing this as, e.g., a Python library?" My response is we did this too. Internally, Parsel is in Python, and a "Function" class already exists - you can find it on GitHub. Still, you need a process to generate implementations and select one satisfying the constraints, which we call the compiler.

Hope this answers your question!

11

TheCoconutTree t1_j6lu39i wrote

Formatting lat/lng data for neural net feature input:

I've got latitude/longitude columns in a SQL table that I'd like to add as features for a neural net classifier model. In terms of formatting for input, I plan to normalize latitude values to a range between 0 and 1, with 0 mapping to the largest possible negative latitude value and 1 mapping to the largest possible positive latitude value. Then do the same for longitude, and pass them in as separate features.
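
A minimal sketch of that min-max scaling (assuming latitude spans [-90, 90] and longitude spans [-180, 180]):

```python
def normalize_lat_lng(lat, lng):
    # Min-max scale to [0, 1]: -90/-180 map to 0, +90/+180 map to 1.
    lat01 = (lat + 90.0) / 180.0
    lng01 = (lng + 180.0) / 360.0
    return lat01, lng01

print(normalize_lat_lng(40.7, -74.0))  # e.g. New York -> (~0.726, ~0.294)
```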

Does that seem like a reasonable approach? Any other tricks I should know?

1