Recent comments in /f/MachineLearning

danielgafni t1_j7wsnlw wrote

The approach you are describing isn’t the best.

  1. There is no sense in rendering these images: OHLCV data is a time series, not a 2D image, so most of the pixels would just be white. That’s not really wrong, but it is greatly inefficient. Instead of 2D convolutions, 1D convolutions can be applied to the time series directly (as in a WaveNet), which removes the rendering step from your pipeline and greatly speeds up training and inference (see the sketch after this list).

  2. OHLCV data won’t give you enough information either to predict the future or to backtest your trading algorithm accurately, because information is lost in the aggregation.
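Something like this minimal PyTorch sketch of the 1D-convolution idea (the channel counts, kernel sizes, dilations and the single-value prediction head are illustrative assumptions, not a full WaveNet):

```python
import torch
import torch.nn as nn

class OHLCV1DConvNet(nn.Module):
    """Stacked dilated 1D convolutions over an OHLCV time series
    (5 input channels: open, high, low, close, volume)."""
    def __init__(self, in_channels=5, hidden=32, n_layers=4):
        super().__init__()
        layers = []
        ch = in_channels
        for i in range(n_layers):
            # Exponentially growing dilation widens the receptive field,
            # as in WaveNet-style architectures; padding keeps the length fixed.
            layers += [nn.Conv1d(ch, hidden, kernel_size=3,
                                 dilation=2 ** i, padding=2 ** i),
                       nn.ReLU()]
            ch = hidden
        self.backbone = nn.Sequential(*layers)
        self.head = nn.Linear(hidden, 1)  # e.g. a next-step prediction

    def forward(self, x):                  # x: (batch, 5, seq_len)
        h = self.backbone(x)               # (batch, hidden, seq_len)
        return self.head(h[:, :, -1])      # predict from the last time step

model = OHLCV1DConvNet()
dummy = torch.randn(8, 5, 256)             # 8 series, 256 time steps
print(model(dummy).shape)                  # torch.Size([8, 1])
```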

4

YOLOBOT666 t1_j7wrm1z wrote

What about saving the dataset in batches as individual files, then using the data loader to load those files as batches for the transformer? Keep the batch size reasonable for the GPU memory.

Any preprocessing/scaling could be done on the CPU side and would not consume much memory.
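A minimal PyTorch sketch of what I mean (the file naming and the "inputs"/"labels" layout are just assumptions):

```python
import glob
import torch
from torch.utils.data import Dataset, DataLoader

class PreBatchedDataset(Dataset):
    """Each item is one pre-saved batch file, e.g. data/batch_0000.pt,
    containing a dict with 'inputs' and 'labels' tensors."""
    def __init__(self, pattern="data/batch_*.pt"):
        self.files = sorted(glob.glob(pattern))

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        # Loaded lazily, so only one batch lives in memory at a time.
        return torch.load(self.files[idx])

# batch_size=None: each file already is a full batch, so don't re-collate.
loader = DataLoader(PreBatchedDataset(), batch_size=None, shuffle=True)

for batch in loader:
    inputs, labels = batch["inputs"], batch["labels"]
    # CPU-side preprocessing/scaling can happen here before .to("cuda")
```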

2

express_mode_420 t1_j7wizoa wrote

I'm not sure how I'd go about syncing it, but would this be an adequate workaround:

  • break your script into small chunks by time stamp
  • generate a separate TTS recording for each time-stamped chunk
  • generate an audio file that inserts each of the produced recordings at its respective time-stamped location (sketched below)
  • replace the audio of the original recording with this newly produced track
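A rough sketch of the last two steps with pydub, assuming you already have one TTS clip per chunk and its start time (file names and timings here are placeholders):

```python
from pydub import AudioSegment

# (start time in ms, path to the TTS clip generated for that chunk)
clips = [
    (0,      "tts_chunk_00.wav"),
    (4_500,  "tts_chunk_01.wav"),
    (12_300, "tts_chunk_02.wav"),
]

video_audio = AudioSegment.from_file("original_audio.wav")
# Start from silence of the same length so the old speech is fully replaced.
track = AudioSegment.silent(duration=len(video_audio))

for start_ms, path in clips:
    track = track.overlay(AudioSegment.from_file(path), position=start_ms)

track.export("synced_tts_audio.wav", format="wav")
```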

2

CeFurkan OP t1_j7whf7d wrote

I have a VTT file, you know, the subtitles we use for movies,

but I haven't found any text-to-speech that can generate speech with that timing.

Do you know any?

About your suggested approach, is there any way to do it automatically? I mean, we generate the speech and then we sync it, but how?

1

currentscurrents t1_j7wf3u0 wrote

>What is the standard modeling approach to these kinds of problems?

The standard approach is reinforcement learning. It works, but it's not very sample-efficient and takes many iterations to train.

LLMs are probably so good at this because of their strong meta-learning abilities; during the process of pretraining they not only learn the task but also learn good strategies for learning new tasks.

This has some really interesting implications. Pretraining seems to drastically improve sample efficiency even if the pretraining was on a very different task. Maybe we could pretrain on a very large amount of synthetic, generated data before doing our real training on our finitely-sized real datasets.
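As a toy sketch of that recipe (the model, the synthetic data, and the hyperparameters are all placeholder assumptions):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
loss_fn = nn.MSELoss()

def fit(model, loader, epochs, lr):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()

# Stage 1: pretrain on cheap, plentiful synthetic data (here: a toy function).
x_syn = torch.randn(50_000, 16)
y_syn = x_syn.sum(dim=1, keepdim=True) + 0.1 * torch.randn(50_000, 1)
fit(model, DataLoader(TensorDataset(x_syn, y_syn), batch_size=256),
    epochs=3, lr=1e-3)

# Stage 2: fine-tune on the small real dataset, usually with a lower LR.
x_real, y_real = torch.randn(500, 16), torch.randn(500, 1)
fit(model, DataLoader(TensorDataset(x_real, y_real), batch_size=32),
    epochs=10, lr=1e-4)
```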

40

ggf31416 t1_j7waxlu wrote

It will depend on how much preprocessing and augmentation is needed. I don't think text needs much preprocessing or augmentation, but image classification or detection training, for example, needs to create a different augmented image on each iteration and will benefit from a more powerful CPU.
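For example, a typical torchvision pipeline reruns the random augmentations on the CPU every time an item is loaded (the exact transforms and paths here are just illustrative):

```python
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224),      # a different crop every time
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.2, 0.2, 0.2),
    transforms.ToTensor(),
])

# Each __getitem__ call reruns the random transforms on the CPU,
# so several workers help keep the GPU fed.
train_set = datasets.ImageFolder("data/train", transform=train_tf)
loader = DataLoader(train_set, batch_size=64, shuffle=True, num_workers=8)
```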

Note that you can also use cloud services. If you aren't dealing with confidential data, vast.ai is often one of the cheapest; otherwise you can use Lambda Labs, Google Compute Engine, AWS, or other services. At least in the case of Google Compute Engine and AWS you have to request access to GPU instances, which may take some time.

2

zanzagaes2 OP t1_j7w5sr1 wrote

I will try an encoder-decoder architecture, mainly to improve the embedding. Right now the asymptotics of PCA have not proven to be a problem: the sklearn implementation performs PCA on ~1,000 feature vectors almost immediately.

Do you have any reference on any encoder-decoder architecture I can use?
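Concretely, I'm thinking of something like this minimal autoencoder, where the bottleneck would be the embedding (the layer sizes are placeholders):

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """Basic encoder-decoder: the bottleneck output is the embedding."""
    def __init__(self, in_dim=1000, emb_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, emb_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(emb_dim, 256), nn.ReLU(),
            nn.Linear(256, in_dim),
        )

    def forward(self, x):
        z = self.encoder(x)           # embedding, analogous to PCA scores
        return self.decoder(z), z     # reconstruction + embedding

model = AutoEncoder()
x = torch.randn(64, 1000)
recon, emb = model(x)
loss = nn.functional.mse_loss(recon, x)   # train to reconstruct the input
```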

1

vannak139 t1_j7w5otz wrote

So, the simple strategy here, which kind of ignores your variable-length objects, is to classify the CNN receptive fields directly and then max-pool the resulting classification frames.

So, let's say your sequence length is 1024. You build a CNN with a receptive field of 32 and a stride of 16. Applied to the sequence, this network gives you something like 63 "frames". Typically, the CNN would expand this representation with a large number of channels, apply GlobalMaxPooling to merge the frames' information, and then classify the sample.

Instead, you should classify the frames directly, meaning your output looks like 63 separate sigmoid classifications, each associated with a region of the signal. Then you simply take the maximum over the per-frame classification likelihoods and use that as your sample-level classification.

After training, you can remove the GlobalMaxPooling layer, and look at the segment classifications directly.
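A minimal PyTorch sketch of that setup for a 1-channel signal of length 1024 (receptive field 32, stride 16, one sigmoid per frame; the hidden channel count is an assumption):

```python
import torch
import torch.nn as nn

class FrameClassifier(nn.Module):
    """Classifies each receptive-field 'frame', then max-pools the scores."""
    def __init__(self, in_channels=1, hidden=64):
        super().__init__()
        # One conv with kernel 32 / stride 16 produces the 32-wide, stride-16 frames.
        self.frames = nn.Sequential(
            nn.Conv1d(in_channels, hidden, kernel_size=32, stride=16),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=1), nn.ReLU(),
        )
        self.frame_head = nn.Conv1d(hidden, 1, kernel_size=1)  # one logit per frame

    def forward(self, x):                                        # x: (batch, 1, 1024)
        scores = torch.sigmoid(self.frame_head(self.frames(x)))  # (batch, 1, 63)
        sample_score = scores.max(dim=-1).values                 # global max pool
        return sample_score, scores      # sample-level score + per-frame scores

model = FrameClassifier()
x = torch.randn(4, 1, 1024)
sample_score, frame_scores = model(x)
print(sample_score.shape, frame_scores.shape)   # (4, 1) and (4, 1, 63)
```

After training, the per-frame scores returned alongside the max-pooled one are exactly the segment classifications you'd inspect.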

1