Recent comments in /f/MachineLearning
pronunciaai t1_j6l49ij wrote
Reply to comment by blackkettle in [D] What's stopping you from working on speech and voice? by jiamengial
Yeah, I work in the space (mispronunciation detection) and there is no lack of frameworks (SpeechBrain, NeMo, and thunder-speech being the most useful ones for custom stuff, imo). The barrier to entry is everything you have to learn to do audio ML, and all the pain points around things like CTC. In my opinion, tutorials are more needed than frameworks to get more people actively working on speech and voice.
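To illustrate one of those CTC pain points: decoding means collapsing repeated symbols and then dropping blanks, and the blank is the only thing that lets a model emit genuine double letters. A toy sketch (the `-` blank symbol and `ctc_collapse` name are just for illustration):

```python
def ctc_collapse(tokens, blank="-"):
    """Collapse a CTC alignment: merge consecutive repeats, then drop blanks."""
    out, prev = [], None
    for t in tokens:
        if t != prev and t != blank:
            out.append(t)
        prev = t
    return "".join(out)

# Without a blank between the l's, the repeat merges away:
print(ctc_collapse("hh-ee-ll-oo"))    # "helo"
print(ctc_collapse("hh-ee-l-l-oo"))   # "hello"
```

This is exactly the sort of detail that a good tutorial covers in one paragraph but costs a newcomer an afternoon of debugging.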
pronunciaai t1_j6l3unb wrote
Reply to comment by jiamengial in [D] What's stopping you from working on speech and voice? by jiamengial
Have you tried https://github.com/scart97/thunder-speech? It's a smaller repo based on NeMo, but meant for more flexible experimentation, and it's compatible with Hugging Face's transformers library.
pronunciaai t1_j6l3lzv wrote
Reply to comment by TheCoconutTree in [D] Simple Questions Thread by AutoModerator
Your suggested approach is the correct one and is called "one-hot encoding". Your thinking about why an embedding (single learned value) is inappropriate is also accurate.
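For concreteness, a minimal sketch of one-hot encoding (the label set here is made up):

```python
categories = ["cat", "dog", "bird"]              # hypothetical label set
index = {c: i for i, c in enumerate(categories)}

def one_hot(label):
    """Encode a categorical label as a vector with a single 1."""
    vec = [0.0] * len(categories)
    vec[index[label]] = 1.0
    return vec

print(one_hot("dog"))  # [0.0, 1.0, 0.0]
```

Each category gets its own dimension, so no spurious ordering between categories is implied, which is the problem with feeding a single learned scalar.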
Vegetable-Skill-9700 t1_j6l1k7r wrote
Personally, I find collecting and understanding data to be really hard when it comes to speech. With images I can visualise a lot of them at once; with speech I have to listen to them one by one.
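One common workaround is to turn clips into spectrograms so a whole batch can be scanned visually like images. A rough numpy-only sketch (the FFT size, hop, and 440 Hz test tone are arbitrary choices):

```python
import numpy as np

def spectrogram(signal, n_fft=256, hop=128):
    """Magnitude spectrogram: frame the signal, window, and FFT each frame."""
    window = np.hanning(n_fft)
    frames = [signal[i:i + n_fft] * window
              for i in range(0, len(signal) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=1)).T  # (freq, time)

# 1 second of a 440 Hz tone sampled at 16 kHz
t = np.arange(16000) / 16000
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (129, 124): 129 frequency bins, 124 frames
```

Plotting a grid of these (e.g. with matplotlib's `imshow`) makes it much faster to spot clipped, silent, or mislabeled clips than listening one by one.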
starfries t1_j6l0aeq wrote
Reply to comment by anony_sci_guy in [R] SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot by Secure-Technology-78
Thanks for that resource, I've been experimenting with the lottery ticket method but that's a lot of papers I haven't seen! Did you initialize the weights as if training from scratch, or did you do something like trying to match the variance of the old and new weights? I'm intrigued that your method didn't hurt performance - most of the things I've tested were detrimental to the network. I have seen some performance improvements under different conditions but I'm still trying to rule out any confounding factors.
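For reference, the one-shot baseline most of these pruning papers compare against is plain magnitude pruning. A toy numpy sketch (this is not SparseGPT or the lottery-ticket procedure itself, just the simplest version of the idea):

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.5):
    """Zero out the smallest-magnitude fraction of weights in one shot."""
    k = int(weights.size * sparsity)                 # number of weights to drop
    thresh = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    mask = np.abs(weights) > thresh
    return weights * mask, mask

w = np.array([0.1, -2.0, 0.3, 1.5, -0.05, 0.7])
pruned, mask = magnitude_prune(w, sparsity=0.5)
print(pruned)  # only the three largest-magnitude weights survive
```

The lottery-ticket variants then differ in what they do with the surviving weights: rewind to the original initialization, re-initialize from scratch, or keep the trained values.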
memberjan6 t1_j6l082i wrote
I watch ml news channel on YouTube. If it's a white dude always in sunglasses, that's it.
thanks_champagne t1_j6l062p wrote
Reply to [D] Simple Questions Thread by AutoModerator
How do I find someone who has access to medical imaging models? I have found a couple open source resources but not sure if I have the skills/time to install the code. Specifically, I would like machine learning to analyze the scans I have of my left eye. I have a rare eye condition that has so far been deemed idiopathic.
daidoji70 t1_j6kz2kq wrote
It sounds really boring. (edit: I really have never had a need for speech or voice so far in my career haha. Good luck on making tooling).
thevillagersid t1_j6kvatk wrote
Reply to [D] Sparse Ridge Regression by antodima
Are you asking about the feasibility of ridge regression with sparse inputs? Or about regularization to enforce a sparse solution?
should_go_work t1_j6kooe8 wrote
Reply to [D] Sparse Ridge Regression by antodima
If your goal is to do linear regression and enforce hard sparsity constraints on W, then there are several algorithms to do this directly (not guaranteed to recover the true sparse W unless certain conditions are met though). A simple starting point might be orthogonal matching pursuit: https://scikit-learn.org/stable/auto_examples/linear_model/plot_omp.html.
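A minimal scikit-learn sketch of that OMP approach (the data is synthetic and the support indices are made up; with noiseless, well-conditioned data like this, recovery happens to work, but as noted it is not guaranteed in general):

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.default_rng(0)
n_samples, n_features, k = 100, 30, 3

# Build a target from a truly k-sparse weight vector
X = rng.standard_normal((n_samples, n_features))
w_true = np.zeros(n_features)
w_true[[2, 7, 19]] = [1.5, -2.0, 0.8]
y = X @ w_true

# OMP greedily selects at most k nonzero coefficients
omp = OrthogonalMatchingPursuit(n_nonzero_coefs=k)
omp.fit(X, y)
print(np.flatnonzero(omp.coef_))  # indices of the recovered support
```

The hard sparsity constraint is enforced directly via `n_nonzero_coefs`, unlike lasso/ridge penalties, which only encourage small or sparse coefficients.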
umotex12 t1_j6kjlm9 wrote
Reply to comment by ginsunuva in [D] MusicLM: Generating Music From Text by carlthome
Yeah, they are "just" exploiting a new technique that is storming all of media generation. It's not like they are inventing something new every time.
babua t1_j6khgfr wrote
Reply to comment by psma in [D] What's stopping you from working on speech and voice? by jiamengial
I don't think it stops there either; streaming architectures probably break core assumptions of some speech models. E.g. for STT, when do you "try" to infer a word? For TTS, how do you intonate a sentence correctly if you don't know its second half? You'd have to re-train your entire model for the streaming case and create new data augmentations, and you'll probably sacrifice some performance even in the best case, because your model simply has to deal with more uncertainty.
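The "when do you try to infer the word" question is basically endpointing. A toy sketch of the kind of heuristic a streaming STT frontend needs (the frame energies, threshold, and function name are all made up for illustration):

```python
def stream_segments(frame_energies, silence_thresh=0.01, min_silence=3):
    """Toy endpointer: close a segment after `min_silence` consecutive
    low-energy frames, i.e. the point where we'd commit to decoding."""
    segment, silent, segments = [], 0, []
    for e in frame_energies:
        segment.append(e)
        silent = silent + 1 if e < silence_thresh else 0
        if silent >= min_silence and len(segment) > min_silence:
            segments.append(segment[:-min_silence])  # drop trailing silence
            segment, silent = [], 0
    if segment:
        segments.append(segment)
    return segments

frames = [0.5, 0.6, 0.4, 0.0, 0.0, 0.0, 0.7, 0.8, 0.0, 0.0, 0.0]
print(stream_segments(frames))  # [[0.5, 0.6, 0.4], [0.7, 0.8]]
```

A real system would use a trained VAD rather than an energy threshold, but the tradeoff is the same: commit early and risk cutting words, or wait and add latency.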
edunuke t1_j6k9zjc wrote
Reply to [D] Remote PhD by TheRealMrMatt
In the UK there are at least 3 types of PhDs: 1) PhD by thesis, 2) PhD by publication, and 3) professional PhD.
Options 2 and 3 may be compatible with what you want. It really depends on your advisor, the source of funding, and your motivation. There are many reasons to pursue a PhD that don't require an academic motivation, and you are entitled to your own. I've met people who have done PhDs while also holding a job, but they are the exception rather than the norm, and it's challenging to say the least.
like_a_tensor t1_j6k1nde wrote
I feel like there's a lot of signal processing math in speech and voice that I have zero background in. Even though everything is deep learning now, speech and voice architectures seem more complex than in other fields.
fasttosmile t1_j6jzvyw wrote
Reply to comment by uhules in [D] What's stopping you from working on speech and voice? by jiamengial
That does not make sense. You don't need kaldi to use the new libraries. And lhotse can be used totally independently of k2 or icefall.
likenedthus t1_j6jy1mw wrote
Reply to [D] Remote PhD by TheRealMrMatt
You probably won’t find a good PhD program that is “advertised” as being online. There are just too many variables. That said, it’s absolutely possible to work something out with your advisors that is effectively remote, assuming that most of your courses can be taken remotely and your research doesn’t require tools/resources that you cannot access remotely.
I'm doing my PhD in cognitive science at an international university (I live in the States). Plenty of online coursework has been made available to me to meet those requirements, but I am 100% responsible for proposing and executing remote advisement/evaluation. I'm also required to visit the university 1–2 times per year for 1–3 weeks per visit.
nielsrolf t1_j6jvtq0 wrote
Inference time is an issue for me at the moment. I tried OpenAI Whisper on Replicate and hosted it on banana.dev, but both take too long. I would like to use it for a conversational bot, and 50s to transcribe 7s of audio is too slow, but that's the best I've managed so far.
uhules t1_j6juq7x wrote
Reply to comment by fasttosmile in [D] What's stopping you from working on speech and voice? by jiamengial
Lhotse is basically part of the "Kaldi 2.0 ecosystem" (K2/Lhotse/Icefall/Sherpa), you'll probably see people referring to the whole lot as Kaldi as well.
RedditIsDoomed-22 t1_j6ju2p3 wrote
Compute cost: storing and processing speech data is so expensive.
Brudaks t1_j6jqizr wrote
Availability of corpora for other languages.
If you care about much less resourced languages than English or the big ones, then you can generally get sufficient text to do interesting stuff, but working with speech becomes much more difficult due to the very limited quantity of decent-quality data.
MrAcurite t1_j6jlqmi wrote
That I don't want to.
prototypist t1_j6ljszc wrote
Reply to [D] What's stopping you from working on speech and voice? by jiamengial
I just barely got into text NLP when I could run notebooks with a single GPU / Colab and get interesting outputs. I've seen some great community models (such as Dhivehi language) made with Mozilla Common Voice data. But if I were going to collect a chunk of isiXhosa transcription data and try to run it on a single GPU, that's hours of training to reach an initial checkpoint which just makes some muffled noises. At the end of 2022 it became possible to fine-tune OpenAI Whisper, so if I tried again, I might start there: https://huggingface.co/blog/fine-tune-whisper
Also I never use Siri / OK Google / Alexa. I know it's a real industry but I never think of use cases for it.