Recent comments in /f/MachineLearning
Kaleidophon t1_jah1ke1 wrote
I think what you are looking for is the Gumbel-softmax trick, which is basically differentiable sampling for categorical distributions. But in your case the problem will be that BLEU is not differentiable, and often in MT you find that when you try to directly optimize for some translation quality metric, the actual quality as assessed by human judges decreases.
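A minimal NumPy sketch of the Gumbel-softmax idea (function and variable names are made up for illustration; real implementations, e.g. in PyTorch, typically also add a straight-through estimator for hard one-hot samples):

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Draw a differentiable 'soft' sample from a categorical distribution.

    Adds Gumbel(0, 1) noise to the logits and applies a temperature-scaled
    softmax; as tau -> 0 the output approaches a one-hot sample.
    """
    rng = np.random.default_rng() if rng is None else rng
    gumbel_noise = rng.gumbel(size=logits.shape)
    y = (logits + gumbel_noise) / tau
    y = y - y.max()          # subtract max for numerical stability
    e = np.exp(y)
    return e / e.sum()

logits = np.log(np.array([0.1, 0.2, 0.7]))  # unnormalized class scores
sample = gumbel_softmax(logits, tau=0.5, rng=np.random.default_rng(0))
print(sample)  # a soft (relaxed) categorical sample that sums to 1
```

Because the sample is a deterministic, differentiable function of the logits (the randomness lives in the parameter-free Gumbel noise), gradients can flow back into the logits.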
kduyehj t1_jah1cry wrote
Maybe you don’t need a model. Try Fourier analysis to analyse in the frequency domain. Look at the distributions and use something like the Kullback–Leibler divergence to measure your sample distribution against a singing reference and against a talking reference.
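A toy sketch of that comparison, treating the normalized magnitude spectra as probability distributions (the signals and names here are made up stand-ins, not real audio):

```python
import numpy as np

def spectral_kl(signal, reference, eps=1e-12):
    """KL divergence between the normalized magnitude spectra of two signals."""
    p = np.abs(np.fft.rfft(signal)) + eps
    q = np.abs(np.fft.rfft(reference)) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

t = np.linspace(0, 1, 8000, endpoint=False)          # 1 s at 8 kHz
sample = np.sin(2 * np.pi * 220 * t)                 # stand-in "sample"
singing_ref = np.sin(2 * np.pi * 220 * t)            # similar spectrum
talking_ref = np.sign(np.sin(2 * np.pi * 90 * t))    # very different spectrum

# Lower divergence against the singing reference -> classify as singing.
print(spectral_kl(sample, singing_ref) < spectral_kl(sample, talking_ref))  # True
```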
RaeudigerRaffi t1_jagjc74 wrote
So in general it is not possible to backpropagate directly through any operation that involves sampling from a probability distribution. There are, however, techniques from reinforcement learning (policy gradients, the reparameterization trick) that try to circumvent this problem.
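The reparameterization trick is easiest to see for a Gaussian: instead of sampling z ~ N(mu, sigma²) directly (which blocks gradients to mu and sigma), sample parameter-free noise and apply a deterministic, differentiable transform. A minimal NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

mu, sigma = 2.0, 0.5   # distribution parameters we want gradients for

# Reparameterized sample: eps carries all the randomness, and z is a
# deterministic function of mu and sigma, so dz/dmu = 1 and dz/dsigma = eps,
# meaning gradients can flow through the transform.
eps = rng.standard_normal(10000)
z = mu + sigma * eps

print(z.mean(), z.std())  # close to mu and sigma
```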
SaltyStackSmasher OP t1_jagisv7 wrote
Reply to comment by cnapun in [D] backprop through beam sampling ? by SaltyStackSmasher
thanks for the response. my main concern with beam sampling and backprop is that the context for the 2nd token will include the 1st token. I believe in the RNN case this wouldn't necessarily matter, since only the hidden state is propagated forward. In transformers, we have to completely redo the forward pass from the 2nd token onwards, and these subsequent forward passes don't have anything in common, so I'm a bit confused about how exactly the gradients will flow.
please let me know if I wasn't clear in explaining my problem. thanks again for your response :)
cnapun t1_jage50a wrote
I'm not an expert on this topic, but I've discussed it with coworkers. I do believe you should be able to backprop through sampling, mathematically at least. My suspicion is that you'll run into the same problem as you have with RNNs, where backpropping through many steps leads to high variance in gradients. I'd search for some papers that have explored this; I assume they exist.
currentscurrents t1_jagavxg wrote
Reply to comment by blablanonymous in Is there any model that classify singing and speaking? [R] by Stencolino
I can't afford to rent 10x A100s on cloud platforms for very long either.
Pretrained models are pretty great though. Most of my use cases are not particularly unique; models are the software libraries of the future.
blablanonymous t1_jagadpr wrote
Reply to comment by currentscurrents in Is there any model that classify singing and speaking? [R] by Stencolino
I mean, if you have a serious use case you can do it on cloud platforms. I imagine you could do some transfer learning or fine-tuning using a pre-trained model.
currentscurrents t1_jag9v2w wrote
Reply to comment by keph_chacha in Is there any model that classify singing and speaking? [R] by Stencolino
Can I? I don't have a cluster of 10x A100s.
All the interesting stuff in ML seems to require expensive hardware. I guess it'll be cool in 5-10 years when consumer hardware catches up.
keph_chacha t1_jag33wh wrote
you can build one
tblume1992 t1_jaf3zgc wrote
Can you guys add model selection and the results of the chosen method to make it more like what we would do in production?
curiousshortguy t1_jaf3aab wrote
Reply to comment by AnOnlineHandle in [R] Microsoft introduce Kosmos-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context (i.e., few-shot), and follow instructions (i.e., zero-shot) by MysteryInc152
Yeah, about 2-3. You can easily shove layers of the network onto disk and then load even larger models that don't fit in VRAM, but disk I/O will make inference painfully slow.
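A toy illustration of that layer-offloading idea with a made-up 4-layer model and NumPy memmaps — a sketch of the concept only, not how real frameworks (e.g. accelerate) implement it:

```python
import numpy as np
import os
import tempfile

# Toy "model": 4 linear layers whose weights live on disk, loaded one at a
# time so peak memory is a single layer instead of the whole network.
dim, n_layers = 64, 4
tmp = tempfile.mkdtemp()
rng = np.random.default_rng(0)

paths = []
for i in range(n_layers):
    w = (rng.standard_normal((dim, dim)) / np.sqrt(dim)).astype(np.float32)
    path = os.path.join(tmp, f"layer_{i}.npy")
    np.save(path, w)
    paths.append(path)

x = rng.standard_normal(dim).astype(np.float32)
for path in paths:
    w = np.load(path, mmap_mode="r")   # stream this layer's weights from disk
    x = np.maximum(w @ x, 0.0)         # one layer's forward pass (ReLU)

print(x.shape)
```

Every layer's forward pass now pays a disk read, which is exactly why inference gets painfully slow.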
fedegarzar OP t1_jaevmj5 wrote
Reply to comment by More-Horse-3281 in [Discussion] Open Source beats Google's AutoML for Time series by fedegarzar
I agree. Overfitting is a common problem in AutoML solutions. A proper validation strategy should improve performance on unseen data, but in our experience most AutoML solutions lack this feature.
fedegarzar OP t1_jaev47v wrote
Reply to comment by cristianic18 in [Discussion] Open Source beats Google's AutoML for Time series by fedegarzar
That's an interesting question. Behind the scenes, BigQuery uses an auto-ARIMA model to extrapolate the trend of the time series after deseasonalizing them (https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create-time-series). I would say that the complexity of the pipeline makes it slower (our implementations also use numba, which speeds up fitting).
metal079 t1_jaeuymi wrote
Reply to comment by AnOnlineHandle in [R] Microsoft introduce Kosmos-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context (i.e., few-shot), and follow instructions (i.e., zero-shot) by MysteryInc152
Rule of thumb is VRAM needed = 2 GB per billion parameters, though I recall Pygmalion, which is 6B, says it needs 16 GB of RAM, so it depends.
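That rule of thumb as a back-of-the-envelope sketch (assuming fp16 weights, i.e. ~2 bytes per parameter, and ignoring activations, KV cache, and framework overhead — which is why real requirements like Pygmalion's run higher):

```python
def vram_estimate_gb(n_params_billion, bytes_per_param=2):
    """Rough inference footprint for the weights alone: ~2 bytes per
    parameter in fp16, i.e. ~2 GB per billion parameters."""
    return n_params_billion * bytes_per_param

print(vram_estimate_gb(6))   # 6B model -> 12 (GB), before overhead
```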
fedegarzar OP t1_jaeue2g wrote
Reply to comment by MyActualUserName99 in [Discussion] Open Source beats Google's AutoML for Time series by fedegarzar
Yes, I agree with your intuitions. However, we used the datasets from the official BigQuery tutorial (https://cloud.google.com/bigquery-ml/docs/arima-speed-up-tutorial). In particular, it isn't easy to generalize in time series forecasting due to the diversity of datasets in the field. The central intuition of the experiment is that running less sophisticated methods and pipelines can be a better practice before reaching for AutoML as-is.
currentscurrents t1_jaetyg1 wrote
Reply to comment by dancingnightly in [R] Microsoft introduce Kosmos-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context (i.e., few-shot), and follow instructions (i.e., zero-shot) by MysteryInc152
Can't the reward model be discarded at inference time? I thought it was only used for fine-tuning.
currentscurrents t1_jaetvbb wrote
Reply to comment by Beli_Mawrr in [R] Microsoft introduce Kosmos-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context (i.e., few-shot), and follow instructions (i.e., zero-shot) by MysteryInc152
Definitely in the realm of running on your computer. Almost in the realm of running on high-end smartphones with TPUs.
AnOnlineHandle t1_jaesse4 wrote
Reply to comment by pawsibility in [R] Microsoft introduce Kosmos-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context (i.e., few-shot), and follow instructions (i.e., zero-shot) by MysteryInc152
The CLIP model in the Stable Diffusion 1.5 package is 480 MB according to my directory, where it was unpacked by diffusers, though I don't know how that translates into parameter count.
AnOnlineHandle t1_jaeshwf wrote
Reply to comment by curiousshortguy in [R] Microsoft introduce Kosmos-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context (i.e., few-shot), and follow instructions (i.e., zero-shot) by MysteryInc152
Is there a way to convert parameter count into vram requirements? Presuming that's the main bottleneck?
MyActualUserName99 t1_jaes8gh wrote
My biggest concern with this assessment is the lack of dataset diversity. Sure, you can get one method to outperform another on one or two datasets, but doing so across many datasets, all of various sizes, is much, much harder.
From what I can tell, the open source StatsForecast was able to outperform BigQuery for an extremely small dataset (Citibike Trips) and one large dataset (Liquor Sales). Granted, outperforming on the much larger dataset is, to me, more impressive than on the smaller one. But to make such a definitive conclusion that open source is better than commercial would require testing across a plethora of datasets of different sizes, domains, etc.
Donno_Nemore t1_jaeq7oy wrote
Reply to comment by _throw_hawaii in [D] Running a trained k-means clustering on new data with maximum number of iterations equal to zero or not? by _throw_hawaii
This sounds like you are being asked to assign the new data to clusters. Assignment is as simple as computing the distance between each point and each centroid; the centroid with the minimum distance gives the cluster assignment.
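A minimal NumPy sketch of that assignment step (names and example data are hypothetical):

```python
import numpy as np

def assign_clusters(points, centroids):
    """Assign each point to its nearest centroid (Euclidean distance)."""
    # Pairwise distances via broadcasting: shape (n_points, n_centroids)
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=-1)
    return dists.argmin(axis=1)

centroids = np.array([[0.0, 0.0], [10.0, 10.0]])   # from the fitted k-means
new_points = np.array([[1.0, -1.0], [9.0, 11.0]])
print(assign_clusters(new_points, centroids))  # [0 1]
```

This is equivalent to running k-means with zero update iterations: the centroids stay fixed and only the assignment step runs.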
Kaleidophon t1_jah1xwe wrote
Reply to comment by RaeudigerRaffi in [D] backprop through beam sampling ? by SaltyStackSmasher
You can backpropagate through samples of a categorical distribution using Gumbel softmax, and as far as I remember you can apply a reparameterization trick to all distributions in the exponential family.