Recent comments in /f/MachineLearning
Seankala OP t1_j9npctd wrote
Reply to comment by adt in [D] 14.5M-15M is the smallest number of parameters I could find for current pretrained language models. Are there any that are smaller? by Seankala
Thanks for the detailed answer! My use case is that the company I work at currently uses image-based models for e-commerce purposes, but we want to use text-based models as well. The image-based model(s) are already taking up around 30-50M parameters so I didn't want to just bring in a 100M+ parameter model. Even 15M seems quite big.
Seankala OP t1_j9np9ae wrote
Reply to comment by chogall in [D] 14.5M-15M is the smallest number of parameters I could find for current pretrained language models. Are there any that are smaller? by Seankala
I guess at least 100M+ parameters? I like to think of the BERT-base model as being the "starting point" of LLMs.
SankarshanaV t1_j9no774 wrote
Reply to comment by Animated-AI in [P] The First Depthwise-separable Convolution Animation by Animated-AI
Thank you!
[deleted] t1_j9nnksc wrote
[removed]
DigThatData t1_j9nlrii wrote
Reply to comment by JackBlemming in [D] "Deep learning is the only thing that currently works at scale" by GraciousReformer
just to be clear: i'm not saying neural networks don't scale, i'm saying they're not the only class of learning algorithm that scales.
dmart89 OP t1_j9nkm2u wrote
Reply to comment by noxiousmomentum in [D] Python library to collect structured datasets across the internet by dmart89
Fair. Thanks for your thoughts. I personally find constructing scrapers and parsing data annoyingly tedious, but it's probably just me (:
[deleted] t1_j9njspc wrote
Reply to [P] MIT Introduction to Data-Centric AI by anishathalye
[deleted]
chogall t1_j9nioqb wrote
Reply to comment by adt in [D] 14.5M-15M is the smallest number of parameters I could find for current pretrained language models. Are there any that are smaller? by Seankala
> but they were not really LLMs.
What's the definition of LLM?
noxiousmomentum t1_j9nil84 wrote
useless. what can easily be done needs no automation and what is hard to do isn't helped by this approach
sam__izdat t1_j9ni8rd wrote
Reply to comment by currentscurrents in [N] U.S. Copyright Office decides that Kris Kashtanova's AI-involved graphic novel will remain copyright registered, but the copyright protection will be limited to the text and the whole work as a compilation by Wiskkey
> They're still AI-assisted
the USCO has (correctly) repeatedly rejected copyright for the raw output of image generators, where you asked the computer to paint you a pretty picture
the parallels with photography are tenuous at best, and it's not about effort but rather the total absence of creative involvement -- it's less photography and more "I found this on google image search" except your database is the model's latent space
it is a good thing that they elected to forego a radical expansion of the already nightmarish, bloated IP regime, where being first-to-access would have granted users (not artists) a blackstonian property right to the results of a text query
i don't need whoever's hoarding the most compute to mine the commons and automatically pump out self-generating, legally-enforceable NFTs, at an industrial scale, in perpetuity... the world has enough parasites as it is, without a new clan of digital landlords, thank you
wjldw12138 t1_j9ni2gq wrote
Reply to [D] Simple Questions Thread by AutoModerator
Hi everyone, I am looking for something like CLIP in the speech area, which could measure the distance between text and speech (Mel-spectrograms).
I found SpeechCLIP before, but unfortunately its speech input is the raw waveform rather than a Mel-spectrogram (same with HuBERT). I would really appreciate it if you could provide some information about that!
[deleted] t1_j9nhjtc wrote
[removed]
ihopeshelovedme t1_j9nhhgs wrote
Reply to comment by chinguetti in [R] Multimodal Chain-of-Thought Reasoning in Language Models - Amazon Web Services Zhuosheng Zhang et al - Outperforms GPT-3.5 by 16% (75%->91%) and surpasses human performance on ScienceQA while having less than 1B params! by Singularian2501
You think r/singularity will be kind enough to grant him a Nobel prize?
DevarshTare OP t1_j9ngofc wrote
Reply to comment by ggf31416 in [D] What matters while running models? by DevarshTare
I've seen the same across multiple threads now, the VRAM does make a difference in being able to run a model or having to optimize it. This has been really helpful, thanks a lot guys!
sam__izdat t1_j9ngakh wrote
Reply to comment by vyasnikhil96 in [R] Provable Copyright Protection for Generative Models by vyasnikhil96
I don't have any technical criticism that would be useful to you (and frankly it's above my pay grade), but to expand on what I meant when I said that it's a game of calvinball, there's some history here worth considering. Copyright has gone through myriad justifications.
If we wanted to detect offending content by the original standards of the Stationers' Company, then it may be useful to look for signs of sedition and heresy, since the stated purpose was "to stem the flow of seditious and heretical texts."
By the justification of the liberals who came after, typesetting, being a costly and error-prone process, forced their hand to protect the integrity of the text. So, if for some reason we wanted to take that goal seriously, it might make sense to look for certain kinds of dissimilarity instead: errors and distortions in reproductions. After all, that was the social purpose of the monopoly right.
If the purpose of the copyright regime today is to secure the profits of private capital in perpetuity, then simple metrics of similarity aren't enough to guarantee a virtual Blackstonian land right either.
For example:
> In our discussions, we refer to C ∈ C abstractly as a “piece of copyrighted data”, but do not specify it in more detail. For example, in an image generative model, does C correspond to a single artwork, or the full collected arts of some artists? The answer is the former. The reason is that if a generative model generates data that is influenced by the full collected artworks of X, but not by any single piece, then it is not considered a copyright violation. This is due to that it is not possible to copyright style or ideas, only a specific expression. Hence, we think of C as a piece of content that is of a similar scale to the outputs of the model.
That sounds reasonable. Is it true?
French and Belgian IP laws, for example, consider taking an original photo of a public space showing protected architecture a copyright violation. Prior to mid-2016, taking a panoramic photo with the Atomium in the background was copyright infringement. Distributing a night photo of the Eiffel Tower is still copyright infringement today. So, how would you guarantee that a diffusion model falls within the boundaries of arbitrary rules when those tests of "substantial similarity" suddenly become a lot more ambiguous than anticipated?
guillaumekln t1_j9nfl9t wrote
Reply to [D] Faster Flan-T5 inference by _learn_faster_
You can also check out the CTranslate2 library which supports efficient inference of T5 models, including 8-bit quantization on CPU and GPU. There is a usage example in the documentation.
Disclaimer: I’m the author of CTranslate2.
thecodethinker t1_j9nf1dk wrote
Reply to comment by athos45678 in [P] MIT Introduction to Data-Centric AI by anishathalye
Yep. Generating and properly preprocessing datasets is always where I feel lost when working on a new project
adt t1_j9neq5w wrote
Reply to [D] 14.5M-15M is the smallest number of parameters I could find for current pretrained language models. Are there any that are smaller? by Seankala
There should be quite a few models smaller than 15M params. What's your use case? A lot of the 2022-2023 optimizations mean that you can squish models onto modern GPUs now (e.g. int8 quantization).
Designed to fit onto a standard GPU, DeepMind's Gato was bigger than I thought, with a starting size of 79M params.
Have you found the BERT compression paper, which compresses the models to 7MB? It lists some 1.2M-6.2M param models:
https://arxiv.org/pdf/1909.11687.pdf
My table shows...
*looks at table*
Smallest seems to be Microsoft Pact, which was ~30M params. Ignore that! Transformer is supposed to be wide and deep, I suppose, so it makes sense...
Many of the text-to-image models use smaller LLMs.
Also check HF, they now have 130,000 models of different sizes (to Feb/2023):
Includes a tiny-gpt2: https://huggingface.co/sshleifer/tiny-gpt2
And t5-efficient tiny ('has 15.58 million parameters and thus requires ca. 62.32 MB of memory in full precision (fp32) or 31.16 MB of memory in half precision (fp16 or bf16).'):
https://huggingface.co/google/t5-efficient-tiny
Edit: I thought of Anthropic's toy models, but they were not really LLMs. They did train a 10M model during scaling research (paper), but the model hasn't been released.
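For a rough sanity check when browsing model cards like these, you can back-of-the-envelope the parameter count from the architecture. Sketch below assumes a standard transformer with the usual 4x FFN expansion and tied embeddings; the config numbers are made up, not any particular model's:

```python
def transformer_params(n_layers, d_model, vocab_size):
    """Rough parameter estimate for a standard transformer stack.

    Per layer: ~4*d^2 for attention (Q, K, V, output projections)
    plus ~8*d^2 for the FFN (d -> 4d -> d). Embeddings add V*d
    (assumed tied with the output head). Biases/LayerNorms ignored.
    """
    per_layer = 4 * d_model**2 + 8 * d_model**2
    return n_layers * per_layer + vocab_size * d_model

# Hypothetical tiny config: 4 layers, d_model=256, 30k vocab.
print(transformer_params(4, 256, 30_000))  # ~10.8M params
```

Embeddings dominate at this scale (here ~7.7M of the ~10.8M), which is why the genuinely tiny models tend to shrink the vocab or the embedding dim first.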
currentscurrents t1_j9n8in9 wrote
Reply to comment by GraciousReformer in [D] "Deep learning is the only thing that currently works at scale" by GraciousReformer
In theory, either structure can express any solution. But in practice, every structure is better suited to some kinds of data than others.
A decision tree is a bunch of nested if statements. Imagine the complexity required to write an if statement to decide if an array of pixels is a horse or a dog. You can technically do it by building a tree with an optimizer; but it doesn't work very well.
On the other hand, a CNN runs a bunch of learned convolutional filters over the image. This means it doesn't have to learn the 2D structure of images, or that pixels tend to be related to nearby pixels; it's already working on a 2D plane. A tree doesn't know that adjacent pixels are likely related, and would have to learn it.
It also has a bias towards hierarchy. As the layers stack upwards, each layer builds higher-level representations to go from pixels > edges > features > objects. Objects tend to be made of smaller features, so this is a good bias for working with images.
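To make the weight-sharing point concrete, here's a toy sketch (pure Python, made-up 4x5 image): one 3x3 filter's nine weights are reused at every position, so "nearby pixels are related" is baked in rather than learned, while a tree would need separate splits per pixel location.

```python
def conv2d(image, kernel):
    """Valid 2D convolution: the SAME kernel weights slide over every position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(kernel[i][j] * image[r + i][c + j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]

# A vertical-edge filter: responds wherever dark meets light, at any position.
edge = [[-1, 0, 1],
        [-1, 0, 1],
        [-1, 0, 1]]

# Toy 4x5 image with a vertical edge between columns 1 and 2.
img = [[0, 0, 9, 9, 9],
       [0, 0, 9, 9, 9],
       [0, 0, 9, 9, 9],
       [0, 0, 9, 9, 9]]

print(conv2d(img, edge))
# -> [[27, 27, 0], [27, 27, 0]]: big response at the edge, zero on flat regions
```

The same nine weights detect the edge wherever it appears; a decision tree would have to rediscover that equivalence one pixel coordinate at a time.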
txhwind t1_j9n63wz wrote
Reply to comment by currentscurrents in [D] Bottleneck Layers: What's your intuition? by _Arsenie_Boca_
One of the keys to intelligence is learning to forget noncritical information. I think it might be a weak point of large language models.
JackBlemming t1_j9n4bp2 wrote
Reply to comment by DigThatData in [D] "Deep learning is the only thing that currently works at scale" by GraciousReformer
This is true. Netflix famously didn't use some complex neural net for choosing shows you'd like, exactly because it didn't scale. Neural nets are expensive, and if you can sacrifice a few percentage points of accuracy to save hundreds of millions in server fees, it's probably a good trade.
currentscurrents t1_j9n3o7u wrote
Reply to comment by activatedgeek in [D] "Deep learning is the only thing that currently works at scale" by GraciousReformer
Sounds like ideally we'd want a model with good inductive biases for meta-learning new inductive biases, since every kind of data requires different biases.
vyasnikhil96 OP t1_j9mu7ak wrote
Reply to comment by sam__izdat in [R] Provable Copyright Protection for Generative Models by vyasnikhil96
Thanks for the interesting link! For the kind of copyright in your link, deduplication of the data might be hard, i.e., ensuring that works with access to this copyrighted material occur only once or a few times, since it is a character and we don't really know which characters are copyrighted and which are not. But our paper assumes deduplication has been done beforehand.
Coming back to our notion: I am not trying to say that there is an already-established information-theoretic notion of copyright.
Copyright law (as we state in the paper) relies on two things: 1. access to the copyrighted work must be proved, and 2. substantial similarity to the copyrighted work must be established. Our notion cleanly separates these two, and we come up with an information-theoretic way to quantify the "substantial similarity" aspect (denoted by k-NAF). For example, the strongest setting of k = 0 will never violate copyright (but also might degrade performance or be impossible to achieve) because it is equivalent to not having access. Larger values of k trade off model performance against a possible increase in "substantial similarity". Which k is valid in which setting to prevent copyright violation is not something we are establishing; that depends on the specific setting and must be determined by the law. The user can tune the value of k (assuming the value is feasible) to whatever the law considers acceptable.
[deleted] t1_j9miyp3 wrote
[removed]
thomasahle OP t1_j9nprmt wrote
Reply to comment by activatedgeek in Unit Normalization instead of Cross-Entropy Loss [Discussion] by thomasahle
Great example! With Brier scoring we have

    loss = ||x − y||² = ||x||² − 2·x_y + 1   (for a one-hot target y)

which is basically equivalent to replacing `logsumexp` with `norm^2` in the first code. This actually works just as well as my original method! The Wikipedia article for proper scoring rules also mentions the "spherical score", which seems to be equivalent to my method of dividing by the norm. So maybe that's the explanation?
Note though that I applied Brier Loss directly on the logits, which is probably not how they are meant to be used...
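For anyone curious, a tiny self-contained sketch of the comparison (pure Python; the logits are made up and the target is assumed to be class 0):

```python
import math

def cross_entropy(logits, y):
    # Standard CE on logits: -x_y + logsumexp(x)
    lse = math.log(sum(math.exp(x) for x in logits))
    return -logits[y] + lse

def brier_on_logits(logits, y):
    # Brier score applied directly to the logits (not probabilities):
    # sum_i (x_i - y_i)^2 = ||x||^2 - 2*x_y + 1 for a one-hot target,
    # i.e. the logsumexp term is traded for a squared-norm term.
    return sum((x - (1.0 if i == y else 0.0)) ** 2
               for i, x in enumerate(logits))

logits = [2.0, 0.5, -1.0]  # made-up logits, true class 0
print(cross_entropy(logits, 0))
print(brier_on_logits(logits, 0))
```

Both losses reward pushing x_y up, but they penalize the other logits differently: logsumexp softly, the squared norm uniformly.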