Recent comments in /f/MachineLearning

Kaleidophon t1_jah1ke1 wrote

I think what you are looking for is the Gumbel-softmax trick, which is essentially differentiable sampling for categorical distributions. In your case, though, the additional problem is that BLEU is not differentiable, and in MT you often find that directly optimizing for some translation quality metric decreases the actual quality as assessed by human judges.
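
A rough sketch of what the trick looks like with PyTorch's built-in `F.gumbel_softmax` (the logits, batch size, and temperature below are just placeholders, not anything MT-specific):

```python
import torch
import torch.nn.functional as F

# Toy decoder logits over a small "vocabulary": (batch, vocab_size)
logits = torch.randn(4, 10, requires_grad=True)

# Soft, differentiable samples: each row is a relaxed one-hot vector
soft = F.gumbel_softmax(logits, tau=0.5, hard=False)

# hard=True returns a discrete one-hot sample in the forward pass while
# keeping the soft gradient in the backward pass (straight-through)
hard = F.gumbel_softmax(logits, tau=0.5, hard=True)

# Gradients flow back to the logits through the relaxed sample
soft.sum().backward()
print(logits.grad.shape)  # torch.Size([4, 10])
```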

1

RaeudigerRaffi t1_jagjc74 wrote

So in general it is not possible to backpropagate directly through any operation that involves sampling from a probability distribution. There are, however, techniques (policy gradient methods from reinforcement learning, the reparameterization trick) that try to circumvent this problem.
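
As a rough illustration of both ideas with `torch.distributions` (toy logits and a stand-in reward, nothing task-specific):

```python
import torch
from torch.distributions import Categorical, Normal

# Score-function (REINFORCE) estimator: the discrete sample itself is not
# differentiable, but log_prob(sample) * reward is an unbiased surrogate
# for the gradient of the expected reward.
logits = torch.randn(8, 5, requires_grad=True)
dist = Categorical(logits=logits)
sample = dist.sample()               # non-differentiable draw
reward = torch.randn(8)              # stand-in for a sequence-level score
loss = -(dist.log_prob(sample) * reward).mean()
loss.backward()                      # gradients reach `logits`

# Reparameterization trick (continuous case): rsample() writes the draw as a
# deterministic function of the parameters plus parameter-free noise, so
# gradients flow through the sample directly.
mu = torch.zeros(8, requires_grad=True)
z = Normal(mu, 1.0).rsample()
z.sum().backward()                   # gradients reach `mu`
```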

4

SaltyStackSmasher OP t1_jagisv7 wrote

Thanks for the response. My main concern with beam sampling and backprop is that the context for the 2nd token will include the 1st token. I believe in the RNN case this wouldn't necessarily matter, since only the hidden state is propagated forward. In transformers, we have to completely redo the forward pass from the 2nd token onwards, and these subsequent forward passes don't share anything, so I'm a bit confused about how exactly the gradients will flow.

Please let me know if I wasn't clear in explaining my problem. Thanks again for your response :)

2

cnapun t1_jage50a wrote

I'm not an expert on this topic, but I've discussed it with coworkers. I do believe you should be able to backprop through sampling, mathematically at least. My suspicion is that you'll run into the same problem as you have with RNNs, where backpropping through many steps leads to high variance in gradients. I'd search for some papers that have explored this; I assume they exist.

5

curiousshortguy t1_jaf3aab wrote

Yeah, about 2-3. You can easily shove layers of the network onto disk and then load even larger models that don't fit in VRAM, but disk I/O will make inference painfully slow.
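
For example, with Hugging Face `transformers` + `accelerate`, this kind of disk offloading looks roughly like the following (the model name is only an illustrative choice):

```python
import torch
from transformers import AutoModelForCausalLM

# Layers that fit are placed on the GPU, then CPU RAM; the rest is
# spilled to disk in the offload folder.
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B",        # illustrative model name
    torch_dtype=torch.float16,
    device_map="auto",            # let accelerate place layers automatically
    offload_folder="offload",     # directory used for the on-disk layers
)
```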

10

fedegarzar OP t1_jaev47v wrote

That's an interesting question. Behind the scenes, BigQuery uses an auto-ARIMA model to extrapolate the trend of the time series after deseasonalizing them (https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create-time-series). I would say the complexity of that pipeline makes it slower (also, our implementation uses numba, which speeds up the fitting time).
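
For comparison, the StatsForecast side looks roughly like this (assuming a long-format dataframe with `unique_id`, `ds`, `y` columns and monthly data; the file path is a placeholder):

```python
import pandas as pd
from statsforecast import StatsForecast
from statsforecast.models import AutoARIMA

# Long-format dataframe with columns: unique_id, ds, y (placeholder path)
df = pd.read_csv("series.csv", parse_dates=["ds"])

# Fit an auto-ARIMA per series in parallel and forecast 12 steps ahead
sf = StatsForecast(models=[AutoARIMA(season_length=12)], freq="M", n_jobs=-1)
forecasts = sf.forecast(df=df, h=12)
```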

3

fedegarzar OP t1_jaeue2g wrote

Yes, I agree with your intuitions. However, we used the datasets from the official BigQuery tutorial (https://cloud.google.com/bigquery-ml/docs/arima-speed-up-tutorial). In general, it isn't easy to generalize in time series forecasting because of the diversity of datasets in the field. The central point of the experiment is that running less sophisticated methods and pipelines first can be a better practice than using AutoML as-is.

3

AnOnlineHandle t1_jaesse4 wrote

The CLIP model in the Stable Diffusion 1.5 package is 480 MB according to my directory where it was unpacked by diffusers, though I don't know how that translates into parameter count.
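
One rough way to check, assuming the standard diffusers layout of the SD 1.5 repo, is to load just the text encoder and count parameters:

```python
from transformers import CLIPTextModel

# Load only the text encoder subfolder from the SD 1.5 repo
text_encoder = CLIPTextModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="text_encoder"
)
n_params = sum(p.numel() for p in text_encoder.parameters())
print(f"{n_params / 1e6:.0f}M parameters")  # roughly 123M for the ViT-L/14 text encoder
```

At fp32 that's about 4 bytes per parameter, so ~480 MB on disk is consistent with roughly 120M parameters.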

2

MyActualUserName99 t1_jaes8gh wrote

My biggest concern with this assessment is the lack of dataset diversity. Sure, you can get one method to outperform another on one or two datasets, but doing so across many datasets of various sizes is much, much harder.

From what I can tell, the open-source StatsForecast was able to outperform BigQuery on an extremely small dataset (Citibike Trips) and one large dataset (Liquor Sales). Granted, outperforming on the much larger dataset is, to me, more impressive than on the smaller one. But making such a definitive conclusion that open source is better than commercial would require testing across a plethora of datasets of different sizes, domains, etc.

7