Recent comments in /f/MachineLearning

Kaleidophon t1_jah1ke1 wrote

I think what you are looking for is the Gumbel-softmax trick, which is essentially differentiable sampling for categorical distributions. In your case, though, the additional problem is that BLEU is not differentiable, and in MT you often find that directly optimizing for some translation quality metric decreases the actual quality as assessed by human judges.
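
A rough sketch of what the trick looks like with PyTorch's built-in `F.gumbel_softmax` (the logits, batch size, and temperature below are just placeholders, not anything MT-specific):

```python
import torch
import torch.nn.functional as F

# Toy decoder logits over a small "vocabulary": (batch, vocab_size)
logits = torch.randn(4, 10, requires_grad=True)

# Soft, differentiable samples: each row is a relaxed one-hot vector
soft = F.gumbel_softmax(logits, tau=0.5, hard=False)

# hard=True returns a discrete one-hot sample in the forward pass while
# keeping the soft gradient in the backward pass (straight-through)
hard = F.gumbel_softmax(logits, tau=0.5, hard=True)

# Gradients flow back to the logits through the relaxed sample
soft.sum().backward()
print(logits.grad.shape)  # torch.Size([4, 10])
```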

1

RaeudigerRaffi t1_jagjc74 wrote

So in general it is not possible to backpropagate directly through any operation that involves sampling from a probability distribution. There are, however, techniques (policy gradient methods from reinforcement learning, the reparameterization trick) that try to circumvent this problem.
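
As a rough illustration of both ideas with `torch.distributions` (toy logits and a stand-in reward, nothing task-specific):

```python
import torch
from torch.distributions import Categorical, Normal

# Score-function (REINFORCE) estimator: the discrete sample itself is not
# differentiable, but log_prob(sample) * reward is an unbiased surrogate
# for the gradient of the expected reward.
logits = torch.randn(8, 5, requires_grad=True)
dist = Categorical(logits=logits)
sample = dist.sample()               # non-differentiable draw
reward = torch.randn(8)              # stand-in for a sequence-level score
loss = -(dist.log_prob(sample) * reward).mean()
loss.backward()                      # gradients reach `logits`

# Reparameterization trick (continuous case): rsample() writes the draw as a
# deterministic function of the parameters plus parameter-free noise, so
# gradients flow through the sample directly.
mu = torch.zeros(8, requires_grad=True)
z = Normal(mu, 1.0).rsample()
z.sum().backward()                   # gradients reach `mu`
```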

4

SaltyStackSmasher OP t1_jagisv7 wrote

Thanks for the response. My main concern with beam sampling and backprop is that the context for the 2nd token will include the 1st token. I believe in the RNN case this wouldn't necessarily matter, since only the hidden state is propagated forward. In transformers, we have to completely redo the forward pass from the 2nd token onwards, and these subsequent forward passes don't share anything, so I'm a bit confused about how exactly the gradients will flow.

Please let me know if I wasn't clear in explaining my problem. Thanks again for your response :)

2

cnapun t1_jage50a wrote

I'm not an expert on this topic, but I've discussed it with coworkers. I do believe you should be able to backprop through sampling, mathematically at least. My suspicion is that you'll run into the same problem as you have with RNNs, where backpropping through many steps leads to high variance in gradients. I'd search for some papers that have explored this; I assume they exist.

5

curiousshortguy t1_jaf3aab wrote

Yeah, about 2-3. You can easily shove layers of the network onto disk and then load even larger models that don't fit in VRAM, but disk I/O will make inference painfully slow.
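
For example, with Hugging Face `transformers` + `accelerate`, this kind of disk offloading looks roughly like the following (the model name is only an illustrative choice):

```python
import torch
from transformers import AutoModelForCausalLM

# Layers that fit are placed on the GPU, then CPU RAM; the rest is
# spilled to disk in the offload folder.
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B",        # illustrative model name
    torch_dtype=torch.float16,
    device_map="auto",            # let accelerate place layers automatically
    offload_folder="offload",     # directory used for the on-disk layers
)
```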

10

fedegarzar OP t1_jaev47v wrote

That's an interesting question. Behind the scenes, BigQuery uses an auto-ARIMA model to extrapolate the trend of the time series after deseasonalizing them (https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create-time-series). I would say the complexity of that pipeline makes it slower (also, our implementation uses numba, which speeds up the fitting time).
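
For comparison, the StatsForecast side looks roughly like this (assuming a long-format dataframe with `unique_id`, `ds`, `y` columns and monthly data; the file path is a placeholder):

```python
import pandas as pd
from statsforecast import StatsForecast
from statsforecast.models import AutoARIMA

# Long-format dataframe with columns: unique_id, ds, y (placeholder path)
df = pd.read_csv("series.csv", parse_dates=["ds"])

# Fit an auto-ARIMA per series in parallel and forecast 12 steps ahead
sf = StatsForecast(models=[AutoARIMA(season_length=12)], freq="M", n_jobs=-1)
forecasts = sf.forecast(df=df, h=12)
```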

3

fedegarzar OP t1_jaeue2g wrote

Yes, I agree with your intuitions. However, we used the datasets from the official BigQuery tutorial (https://cloud.google.com/bigquery-ml/docs/arima-speed-up-tutorial). In general, it isn't easy to generalize in time series forecasting because of the diversity of datasets in the field. The central point of the experiment is that running less sophisticated methods and pipelines first can be a better practice than using AutoML as-is.

3

AnOnlineHandle t1_jaesse4 wrote

The CLIP model in the Stable Diffusion 1.5 package is 480 MB according to my directory where it was unpacked by diffusers, though I don't know how that translates into parameter count.
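
One rough way to check, assuming the standard diffusers layout of the SD 1.5 repo, is to load just the text encoder and count parameters:

```python
from transformers import CLIPTextModel

# Load only the text encoder subfolder from the SD 1.5 repo
text_encoder = CLIPTextModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="text_encoder"
)
n_params = sum(p.numel() for p in text_encoder.parameters())
print(f"{n_params / 1e6:.0f}M parameters")  # roughly 123M for the ViT-L/14 text encoder
```

At fp32 that's about 4 bytes per parameter, so ~480 MB on disk is consistent with roughly 120M parameters.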

2

MyActualUserName99 t1_jaes8gh wrote

My biggest concern with this assessment is the lack of dataset diversity. Sure, you can get one method to outperform another on one or two datasets, but doing so across many datasets of various sizes is much, much harder.

From what I can tell, the open-source StatsForecast was able to outperform BigQuery on an extremely small dataset (Citibike Trips) and one large dataset (Liquor Sales). Granted, outperforming on the much larger dataset is, to me, more impressive than on the smaller one. But making such a definitive conclusion that open source is better than commercial would require testing across a plethora of datasets of different sizes, domains, etc.

7