Recent comments in /f/MachineLearning
muwnd t1_j9qvxwf wrote
Better to save yourself all the crawling trouble and use data from Common Crawl, so you can focus on the extraction part.
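For instance, a minimal sketch of pulling records via the Common Crawl index API (the crawl ID below is just an example; pick a current one from commoncrawl.org):

```python
import json
import requests

# Query the Common Crawl CDX index for captures of a domain.
# CC-MAIN-2023-06 is an example crawl ID; substitute a current one.
resp = requests.get(
    "https://index.commoncrawl.org/CC-MAIN-2023-06-index",
    params={"url": "example.com/*", "output": "json"},
)
records = [json.loads(line) for line in resp.text.splitlines()]

# Each record points into a WARC file; fetch one with an HTTP range request.
rec = records[0]
start = int(rec["offset"])
end = start + int(rec["length"]) - 1
warc_gz = requests.get(
    "https://data.commoncrawl.org/" + rec["filename"],
    headers={"Range": f"bytes={start}-{end}"},
).content  # a gzipped WARC record; parse with e.g. warcio
```

From there, the extraction part is parsing the WARC payloads, e.g. with warcio and your HTML parser of choice.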
visarga t1_j9qvckw wrote
Reply to comment by Wiskkey in [N] U.S. Copyright Office decides that Kris Kashtanova's AI-involved graphic novel will remain copyright registered, but the copyright protection will be limited to the text and the whole work as a compilation by Wiskkey
So, all you need to do is use a source image you made yourself, and then the whole image belongs to you. Nobody can extract just the AI contribution.
cantfindaname2take t1_j9qov0f wrote
Reply to comment by Seankala in [D] 14.5M-15M is the smallest number of parameters I could find for current pretrained language models. Are there any that are smaller? by Seankala
For simple NER tasks, simpler models might work too, like conditional random fields. The crfsuite package has a very easy-to-use implementation, and it uses a C lib under the hood for model training.
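Something like this, using the sklearn-crfsuite wrapper (toy features just to show the shape of the API):

```python
import sklearn_crfsuite

# Each sentence is a list of per-token feature dicts; labels are BIO tags.
X_train = [
    [
        {"word.lower()": "acme", "is_title": True},
        {"word.lower()": "hired", "is_title": False},
        {"word.lower()": "alice", "is_title": True},
    ],
]
y_train = [["B-ORG", "O", "B-PER"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X_train, y_train)
print(crf.predict(X_train))  # should reproduce the training tags on this toy data
```

Real feature functions would add context (previous/next word, suffixes, casing, etc.), but the model stays tiny compared to any pretrained transformer.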
cantfindaname2take t1_j9qohtm wrote
Reply to comment by Friktion in [D] 14.5M-15M is the smallest number of parameters I could find for current pretrained language models. Are there any that are smaller? by Seankala
Yeah, I would try FastText before LLMs.
Desticheq t1_j9qo0mu wrote
Reply to [P] What are the latest "out of the box solutions" for deploying the very large LLMs as API endpoints? by johnhopiler
Hugging Face actually allows a fairly easy deployment process for models trained with their framework.
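If you'd rather self-host than use their managed endpoints, a bare-bones sketch wrapping a transformers pipeline in FastAPI (the model name is just a placeholder):

```python
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
generator = pipeline("text-generation", model="gpt2")  # placeholder; use your model id

class Prompt(BaseModel):
    text: str
    max_new_tokens: int = 50

@app.post("/generate")
def generate(prompt: Prompt):
    out = generator(prompt.text, max_new_tokens=prompt.max_new_tokens)
    return {"completion": out[0]["generated_text"]}

# run with: uvicorn app:app --port 8000
```

For the really large models this thread is about, you'd swap the plain pipeline for a sharded/quantized load, but the endpoint shape stays the same.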
koolaidman123 t1_j9qhp9n wrote
Reply to [D] Model size vs task complexity by Fine-Topic-6127
i have rarely encountered situations where scaling up models (e.g. resnet34 -> resnet50, deberta base -> deberta large/xl) doesn't help. whether it's practical to do so may be a different story
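e.g. in torchvision the swap is basically one line, as long as the head matches the new feature width (num_classes here is just a stand-in):

```python
import torch.nn as nn
from torchvision import models

num_classes = 10  # whatever your task needs

# resnet34 -> resnet50 swap; fc.in_features grows from 512 to 2048,
# so rebuild the head from the backbone's reported width.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = nn.Linear(backbone.fc.in_features, num_classes)
```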
thecuteturtle t1_j9qawia wrote
Reply to comment by floppy_llama in [D] Model size vs task complexity by Fine-Topic-6127
Ain't that the truth. On another note, OP can try optimizing via grid search, but there's no avoiding trial and error on this.
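A quick sklearn sketch of what that grid search could look like (toy data and a hypothetical search space):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Stand-in data; replace with your actual features/labels.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

param_grid = {"n_estimators": [100, 300], "max_depth": [None, 10, 30]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```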
MadScientist-1214 t1_j9q1bll wrote
Reply to [D] Model size vs task complexity by Fine-Topic-6127
The only shortcut I can give you is to look on Kaggle and see what competitors have used. Most papers are not suitable for real-world applications. It's not really about the complexity or scale of the task, but rather that the authors leave out important details. For example, in object detection there is DETR, but if you look on Kaggle, nobody uses it. The reason is that the original DETR converges too slowly and was only trained on 640-pixel images. Instead, many people still use YOLO. But you don't realize that until you try it yourself or someone tells you.
sam__izdat t1_j9q0nd6 wrote
Reply to comment by vyasnikhil96 in [R] Provable Copyright Protection for Generative Models by vyasnikhil96
likewise, thanks for sharing your work
dmart89 OP t1_j9po770 wrote
Reply to comment by KPTN25 in [D] Python library to collect structured datasets across the internet by dmart89
Just a library, not a commercial tool. Anyone using it would be doing the scraping themselves, not via a 3rd-party service or something.
JGoodle t1_j9pnn8c wrote
Reply to [D] Simple Questions Thread by AutoModerator
Convolutional neural networks work on images. What’s the equivalent for videos?
tyras_ t1_j9pjkcx wrote
Reply to comment by nikola-b in [D] Large Language Models feasible to run on 32GB RAM / 8 GB VRAM / 24GB VRAM by head_robotics
I finally got some time and was excited to try it out. I haven't seen many LLMs pretrained on biomedical data available anywhere.
Anyway, while I could log in without a problem, both cURL and deepctl return 401. Now I wonder whether it was cut off, or whether I missed some extra registration or authorization step that wasn't mentioned in the docs.
saffronanas t1_j9pho4p wrote
Reply to comment by t35t0r in [P] Introducing arxivGPT: chrome extension that summarizes arxived research papers using chatGPT by _sshin_
Looks like the preview has ended. Where can I use it now?
grumpyp2 t1_j9pbhan wrote
Reply to [D] Simple Questions Thread by AutoModerator
I am going to start my thesis soon.
Any idea on where to start with anomaly detection? We have a huge amount of data, and it's used for site reliability engineering! Any help welcome!
currentscurrents t1_j9pb0by wrote
Reply to comment by 1973DodgeChallenger in [R] Provable Copyright Protection for Generative Models by vyasnikhil96
You had your source code public until you got freaked out by ChatGPT, so you were entirely okay with publishing it for everyone to see.
ChatGPT doesn't even allow direct access to source code; it just learns how to solve problems, using existing source code as training examples.
namuan t1_j9pagp7 wrote
Maybe try this one. Built on top of Manim.
KPTN25 t1_j9p8zgp wrote
Good luck crawling LinkedIn. Not saying it's impossible, but you'll definitely be making your life difficult if you try to publish a tool that scrapes LI.
Friktion t1_j9oxnz6 wrote
Reply to comment by Seankala in [D] 14.5M-15M is the smallest number of parameters I could find for current pretrained language models. Are there any that are smaller? by Seankala
I have some experience with FastText for e-commerce product classification. It's super lightweight and performs well as an MVP.
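For reference, the supervised fastText API is only a few lines (file name and hyperparameters below are just examples):

```python
import fasttext

# Training file has one product per line: "__label__<category> <title text>"
# e.g. "__label__shoes lightweight mesh running sneaker"
model = fasttext.train_supervised(
    input="products_train.txt",  # assumed file path
    epoch=25,
    lr=0.5,
    wordNgrams=2,
)

labels, probs = model.predict("wireless bluetooth headphones", k=3)
print(labels, probs)
```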
dmart89 OP t1_j9oxk6s wrote
Reply to comment by Sal-Hardin in [D] Python library to collect structured datasets across the internet by dmart89
Probably keeping it simple to start with and just using filters during the crawl.
FluffyVista t1_j9ou9b2 wrote
That's useful, thanks
visarga t1_j9qwl8q wrote
Reply to comment by currentscurrents in [R] Provable Copyright Protection for Generative Models by vyasnikhil96
Diffusion models retain about 1 byte of information from each training image: 5B images, ~5 GB of weights. Much less than a thumbnail.
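To spell out that back-of-the-envelope bound: if the trained weights total ~5 GB and the training set is ~5B images, then even if the weights encoded nothing but image content, that's at most 5×10⁹ bytes / 5×10⁹ images = 1 byte per image. The exact numbers vary by model, but the order of magnitude is the point.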