Recent comments in /f/MachineLearning
pitcher_slayer7 t1_jb5wxsv wrote
Reply to comment by pitcher_slayer7 in What is the future of AI in medicine? [D] by adityyya13
I will also add that I do not see mid-level + AI/ML becoming the replacement for physicians. It will be physicians + AI/ML, with arguably less utility for mid-levels in the future, since the mid-level scope of practice is more likely to be affected by AI/ML before the physician scope of practice is.
pitcher_slayer7 t1_jb5vong wrote
Reply to What is the future of AI in medicine? [D] by adityyya13
One of the largest problems in health care so far concerns the most important ingredient for AI/ML: data. Yes, electronic health records now exist, which is a huge step up from the paper charting of the not-so-distant past. However, the large EHR companies have multiple inputs for data that are not easily accessible and that come in many different forms. Much of the necessary and clinically relevant information is not in a check-box or numerical format but in free text, with myriad ways of describing a feature that may be hard to quantify. Additionally, the purpose behind charting, or in AI/ML terms "collecting and transcribing data," is often billing, which further complicates the problem of having good data.

My $0.02 is that ML methods like NLP will become more useful for chart-digging: collecting and organizing free-text data in meaningful ways (a toy sketch of what that extraction might look like is below). Most of a physician's time is currently spent charting, so the most likely applications of AI/ML will be in automating annoying tasks that physicians do not like doing in the first place. What will happen is that physicians who do not incorporate AI/ML will be replaced by physicians who do use AI/ML to augment their clinical decision-making. Medicine, in my opinion, is a field in which physicians will continue to be people for the long-term future.
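To make the chart-digging idea concrete, here is a minimal, purely illustrative sketch of pulling one structured field (blood pressure) out of free-text notes with a regex. Real clinical NLP uses trained models and handles far messier phrasing; the note strings and patterns here are invented for the example.

```python
import re

# Invented example notes; real charts are far messier than this.
notes = [
    "Pt seen today. BP 142/91, HR 78. Denies chest pain.",
    "blood pressure was 118 over 76 at triage",
    "Followup visit, no vitals recorded.",
]

# Two common ways BP shows up in free text: "142/91" and "118 over 76".
BP_PATTERN = re.compile(
    r"(?:BP|blood pressure)\D{0,10}(\d{2,3})\s*(?:/|over)\s*(\d{2,3})",
    re.IGNORECASE,
)

def extract_bp(note: str):
    """Return (systolic, diastolic) if a BP reading is found, else None."""
    m = BP_PATTERN.search(note)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))

for note in notes:
    print(extract_bp(note))
# (142, 91)
# (118, 76)
# None
```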
PacmanIncarnate t1_jb5vofk wrote
Reply to comment by AuspiciousApple in [R] We found nearly half a billion duplicated images on LAION-2B-en. by von-hust
You sell it short. You could de-duplicate while merging the associated text (rough sketch below), which solves half the problem. And the goal of base SD is to be as generic as possible, so there's little value in letting duplicates skew the weights in most situations, and there's a significant downside in overfitting. Fine-tuning then allows more customized models to choose where the weights get adjusted.
The only downside is if the dataset ends up with fewer quality images overall because 100,000 legit painting dups got removed, leaving a larger percentage of memes and other junk.
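A minimal sketch of the merge idea, using an exact content hash as a stand-in for whatever near-duplicate detector is actually used; the function and its inputs are illustrative, not anyone's real pipeline.

```python
import hashlib
from pathlib import Path

def dedup_merge_captions(samples: list[tuple[str, str]]) -> list[tuple[str, list[str]]]:
    """Collapse duplicate images into one entry, keeping every caption.

    samples: (image_path, caption) pairs.
    Returns (image_path, merged_captions) with one entry per unique image.
    Exact SHA-256 over bytes stands in for a real near-dup detector here.
    """
    groups: dict[str, tuple[str, list[str]]] = {}
    for path, caption in samples:
        digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
        if digest not in groups:
            groups[digest] = (path, [])
        groups[digest][1].append(caption)
    return list(groups.values())
```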
PrimaCora t1_jb5rojz wrote
Reply to comment by JrdnRgrs in [R] We found nearly half a billion duplicated images on LAION-2B-en. by von-hust
The quality issues came more from the fact that they square-cropped everything. A photo of a guy wearing a crown isn't great to learn from when the crop leaves him looking like King Charles I.
The duplication just leads to overfitting. If you train a model on one picture, it's going to make that picture pretty dang good. If you train on millions and have a dozen duplicates, it's going to favor those duplicates pretty heavily. There are other failure modes too: if a duplicated photo carries a unique keyword like Zhanfuur, that picture would be the only thing the model could make if you input that keyword.
If they retrain with the new aspect-ratio bucketing (rough sketch below), it should alleviate the crop issue. Deduplication would help reduce overfitting. Together they should lead to better quality, size variation, and variety of text input (hopefully that last one too).
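For reference, aspect-ratio bucketing just means resizing each image to the closest-matching bucket resolution instead of taking a center square crop. The bucket list below is a small made-up example, not the one any particular trainer uses:

```python
# Candidate (width, height) buckets, all near the same pixel budget.
BUCKETS = [(512, 512), (576, 448), (448, 576), (640, 384), (384, 640)]

def pick_bucket(width: int, height: int) -> tuple[int, int]:
    """Choose the bucket whose aspect ratio best matches the image,
    so training resizes instead of beheading the subject with a crop."""
    ratio = width / height
    return min(BUCKETS, key=lambda b: abs(b[0] / b[1] - ratio))

print(pick_bucket(1200, 800))  # landscape 3:2 image -> (640, 384)
```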
midasp t1_jb5p7v0 wrote
Reply to comment by AuspiciousApple in [R] We found nearly half a billion duplicated images on LAION-2B-en. by von-hust
In my experience, all of these issues can occur; it varies from model to model. To know for certain, you still have to set up an objective test to determine whether the impact is positive or negative, and to measure how significant it is (sketch below).
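One hedged way to phrase that test: score both the original and the deduped model on the same held-out prompts and run a paired test on the per-example scores. The metric and the numbers below are placeholders, not real results.

```python
import numpy as np
from scipy.stats import ttest_rel

# Per-prompt scores for the same held-out prompts under each model
# (placeholder values; in practice use CLIP score, human ratings, etc.).
rng = np.random.default_rng(0)
scores_original = rng.normal(0.60, 0.05, size=200)
scores_deduped = scores_original + rng.normal(0.01, 0.03, size=200)

# Paired t-test: each prompt is its own control, which cuts variance.
res = ttest_rel(scores_deduped, scores_original)
delta = np.mean(scores_deduped - scores_original)
print(f"mean delta = {delta:+.4f}, p = {res.pvalue:.3g}")
```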
starstruckmon t1_jb5ojrl wrote
Reply to comment by SaifKhayoon in [R] We found nearly half a billion duplicated images on LAION-2B-en. by von-hust
No
Yes
alterframe t1_jb5oel8 wrote
Reply to To RL or Not to RL? [D] by vidul7498
RL is one of those concepts where it's easy to fool ourselves that we get it when in reality we don't. We have a fuzzy notion of what RL is and what it is good for, so in our imagination it is going to be a perfect match for our problem. In reality, our problem may look like those RL-friendly tasks on the surface while lacking several important properties, or challenges, that would actually make RL reasonable.
That doesn't mean RL is not useful at all. Quite the opposite. People are wrongly discouraged from RL based on experience with projects where it didn't actually make sense, and they draw false conclusions about its practicality.
von-hust OP t1_jb5l3r0 wrote
Reply to comment by [deleted] in [R] We found nearly half a billion duplicated images on LAION-2B-en. by von-hust
Well, just to be clear, these are actually near duplicates (images should differ only up to compression, small artifacts, or even imperceptible differences). I'll try to be more explicit about what I mean by "duplicate" in the GitHub. (A toy illustration of near-duplicate matching is below.)
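As a toy illustration of what "differ only up to compression or small artifacts" can mean in practice, here is a perceptual-hash check using the imagehash library. The paper's own duplicate detector works on CLIP-style features and is different from this; the file paths are placeholders.

```python
from PIL import Image
import imagehash  # pip install imagehash pillow

def looks_like_near_dup(path_a: str, path_b: str, max_bits: int = 8) -> bool:
    """Compare perceptual hashes; a small Hamming distance survives
    re-compression and minor artifacts that break exact byte equality."""
    hash_a = imagehash.phash(Image.open(path_a))
    hash_b = imagehash.phash(Image.open(path_b))
    return (hash_a - hash_b) <= max_bits  # difference is Hamming distance in bits

# Placeholder paths, for illustration only:
# print(looks_like_near_dup("cat.jpg", "cat_recompressed.jpg"))
```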
NotARedditUser3 t1_jb5kq1q wrote
Reply to comment by AuspiciousApple in [R] We found nearly half a billion duplicated images on LAION-2B-en. by von-hust
Honestly I think this is the answer here
clueless1245 t1_jb5khy8 wrote
Reply to comment by [deleted] in [R] We found nearly half a billion duplicated images on LAION-2B-en. by von-hust
You want this done in a controlled, methodical, and documented manner, unlike what earlier research turned up: SD 1.5 verbatim copying every line and minute contour of the wood grain in a specific copyrighted "wooden table" background, which after training was found to be repeated tens of thousands of times in the input dataset (because websites selling phone cases had photoshopped phones onto it).
frankod281 t1_jb5j6si wrote
Reply to [D] Best way to run LLMs in the cloud? by QTQRQD
Maybe check datacrunch.io; they have a good offering for cloud GPUs.
von-hust OP t1_jb5fjqo wrote
Reply to comment by Albino_Jackets in [R] We found nearly half a billion duplicated images on LAION-2B-en. by von-hust
The Stallone pic is generated by SD, unless I'm misunderstanding something. There are false positives, but they shouldn't be "rotated 90 degrees" as you say. The dupes mostly match raw CLIP-feature duplicates.
trnka t1_jb5f89k wrote
Reply to [D] Best way to run LLMs in the cloud? by QTQRQD
Related: there's a talk on Thursday about running LLMs in production. I think the hosts have deployed LLMs in prod, so they should have good advice.
TikiTDO t1_jb5f4p2 wrote
Reply to comment by JrdnRgrs in [R] We found nearly half a billion duplicated images on LAION-2B-en. by von-hust
Honestly, the biggest problem with the dataset isn't the duplicates. It's the fact that most of the annotations are kinda crap. You know the saying that an image is worth a thousand words. That may be too much for SD, but it will happily chew on 50-75 tokens. SD really wants a bunch of content it can parse in order to learn concepts and how those concepts relate to each other, but most LAION annotations are short and simple.
From my experience, refining the model with a few hundred images carrying proper long-form annotations describing what you want can go a long way, even for complex things like hands. (A sketch of what that training metadata might look like is below.)
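For concreteness, a hedged sketch of long-form-annotation fine-tuning data, written in the metadata.jsonl style that Hugging Face imagefolder datasets accept; the file names and captions are invented examples, not from any real dataset.

```python
import json

# Invented examples: one short LAION-style caption vs. the kind of
# long-form annotation the comment argues for.
records = [
    {"file_name": "portrait_001.jpg",
     "text": "a woman"},  # typical short LAION-style caption
    {"file_name": "portrait_002.jpg",
     "text": ("close-up portrait of a middle-aged woman with curly red hair, "
              "soft window light from the left, hands folded under her chin, "
              "all ten fingers visible, shallow depth of field, 85mm photo")},
]

with open("metadata.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```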
von-hust OP t1_jb5ef3f wrote
Reply to comment by LetterRip in [R] We found nearly half a billion duplicated images on LAION-2B-en. by von-hust
I would, but I don't have the CLIP features. I'll release some training code so that it's possible for others to train their indices. The method should scale to 5B, even on a single node; you'll just need more RAM.
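The actual indexing code hasn't been released yet, so purely as a generic sketch of what "training an index" over CLIP-like features can look like, here is a FAISS IVF-PQ example. The dimensions, index string, and random vectors are assumptions standing in for real embeddings, not the repo's method.

```python
import faiss
import numpy as np

d = 512  # assumed embedding size (e.g., CLIP ViT-B/32)
rng = np.random.default_rng(0)
features = rng.standard_normal((100_000, d)).astype("float32")  # stand-in for real CLIP features

# IVF with product quantization: compressed codes let billions of vectors fit in RAM.
index = faiss.index_factory(d, "IVF1024,PQ64")
index.train(features)  # learn coarse centroids + PQ codebooks
index.add(features)

index.nprobe = 16  # search the 16 nearest clusters per query
distances, neighbors = index.search(features[:5], k=10)
print(neighbors[:, :3])  # each query's nearest neighbor should be itself
```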
Albino_Jackets t1_jb5cq6x wrote
The duplicates aren't perfect duplicates; they are added to create more robust model results. It's like how an image of a giraffe rotated 90 degrees is still a giraffe even though the patterns are no longer the same. The same thing applies to the Stallone pic: the noise and errors help the model deal with suboptimal image quality.
LetterRip t1_jb5bgvj wrote
Greatly appreciated; you might run it on the aesthetics subset and on 5B also.
Elon_Muskoff t1_jb59xxp wrote
Reply to comment by knight1511 in [R] [N] Dropout Reduces Underfitting - Liu et al. by radi-cho
Stack Overload
frequenttimetraveler t1_jb60ifw wrote
Reply to What is the future of AI in medicine? [D] by adityyya13
I mean, you left the biggest blocker for last. It's amazing that in 2023 a visit to the doctor still involves measuring blood pressure and 'listening' to your lungs. My guess is the first mass medical devices will be pirated from some awkward place, because regulators won't approve them for sale. Isn't that the same reason the iPhone can't even measure SpO2?
And then you have the "AI Safety" mob, which will block life-saving devices because they are biased toward the blood samples of rich-country dwellers.
Considering the general lack of progress in how physicians work over the past decades (versus the progress in drugs and diagnostic devices), it seems these blockers will linger for a while.
Also, consider COVID. Despite billions and billions of cases, relatively few studies have emerged that use the same procedures for measuring indicators, because doctors tend to stick to old, incompatible methods despite the availability of more modern alternatives. Or take long COVID, which, despite billions of cases as well, is relatively understudied: records of cases were not kept, for a long time it wasn't even recognized as a condition, and too many MDs rely on their "hunch."
In short, the medical profession has not embraced AI, which is a prerequisite for any of this to happen.