Recent comments in /f/MachineLearning

thomasahle OP t1_j9nprmt wrote

Great example! With Brier scoring we have

loss = norm(x)**2 - x[label]**2 + (1-x[label])**2
     = norm(x)**2 - 2*x[label] + 1

which is basically equivalent to replacing logsumexp with norm^2 in the first code

def label_cross_entropy_on_logits(x, labels):
    return (-2*x.select(labels) + x.norm(axis=1)**2).sum(axis=0)

This actually works just as well as my original method! The Wikipedia article on proper scoring rules also mentions the "spherical score", which seems to be equivalent to my method of dividing by the norm. So maybe that's the explanation?

Note though that I applied the Brier loss directly to the logits, which is probably not how it is meant to be used...
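For reference, both variants can be sketched in plain NumPy (a rough sketch of the idea above, not the exact code from the post; `brier_on_logits` and `spherical_on_logits` are names I made up):

```python
import numpy as np

def brier_on_logits(x, labels):
    # Brier-style loss applied directly to the logits:
    # ||x||^2 - 2*x[label] + 1, summed over the batch
    picked = x[np.arange(len(labels)), labels]
    return ((x ** 2).sum(axis=1) - 2 * picked + 1).sum()

def spherical_on_logits(x, labels):
    # Spherical-score-style loss: the chosen logit divided by the norm
    picked = x[np.arange(len(labels)), labels]
    return (-picked / np.linalg.norm(x, axis=1)).sum()

x = np.array([[2.0, 0.5, -1.0], [0.1, 3.0, 0.2]])
labels = np.array([0, 1])
print(brier_on_logits(x, labels))      # ≈ 6.3 on this toy batch
print(spherical_on_logits(x, labels))
```

Both reward a large logit at the label index; the spherical variant normalizes by the logit norm instead of penalizing it quadratically.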

2

Seankala OP t1_j9npctd wrote

Thanks for the detailed answer! My use case is that the company I work at currently uses image-based models for e-commerce purposes, but we want to use text-based models as well. The image-based model(s) are already taking up around 30-50M parameters so I didn't want to just bring in a 100M+ parameter model. Even 15M seems quite big.

5

sam__izdat t1_j9ni8rd wrote

> They're still AI-assisted

the USCO has (correctly) repeatedly rejected copyright for the raw output of image generators, where you asked the computer to paint you a pretty picture

the parallels with photography are tenuous at best, and it's not about effort but rather the total absence of creative involvement -- it's less photography and more "I found this on google image search" except your database is the model's latent space

it is a good thing that they elected to forego a radical expansion of the already nightmarish, bloated IP regime, where being first-to-access would have granted users (not artists) a blackstonian property right to the results of a text query

i don't need whoever's hoarding the most compute to mine the commons and automatically pump out self-generating, legally-enforceable NFTs, at an industrial scale, in perpetuity... the world has enough parasites as it is, without a new clan of digital landlords, thank you

3

wjldw12138 t1_j9ni2gq wrote

Hi everyone, I am looking for something like CLIP in the speech domain, which could measure the distance between text and speech (Mel-spectrograms).

I found SpeechCLIP before, but unfortunately its speech input is the raw waveform rather than a Mel-spectrogram (same with HuBERT). I would really appreciate it if you could provide some information about that!

1

sam__izdat t1_j9ngakh wrote

I don't have any technical criticism that would be useful to you (and frankly it's above my pay grade), but to expand on what I meant when I said that it's a game of calvinball, there's some history here worth considering. Copyright has gone through myriad justifications.

If we wanted to detect offending content by the original standards of the Stationers' Company, then it may be useful to look for signs of sedition and heresy, since the stated purpose was "to stem the flow of seditious and heretical texts."

By the justification of the liberals who came after, typesetting, being a costly and error-prone process, forced their hand to protect the integrity of the text. So, if for some reason we wanted to take that goal seriously, it might make sense to look for certain kinds of dissimilarity instead: errors and distortions in reproductions. After all, that was the social purpose of the monopoly right.

If the purpose of the copyright regime today is to secure the profits of private capital in perpetuity, then simple metrics of similarity aren't enough to guarantee a virtual Blackstonian land right either.

For example:

> In our discussions, we refer to C ∈ C abstractly as a “piece of copyrighted data”, but do not specify it in more detail. For example, in an image generative model, does C correspond to a single artwork, or the full collected arts of some artists? The answer is the former. The reason is that if a generative model generates data that is influenced by the full collected artworks of X, but not by any single piece, then it is not considered a copyright violation. This is due to that it is not possible to copyright style or ideas, only a specific expression. Hence, we think of C as a piece of content that is of a similar scale to the outputs of the model.

That sounds reasonable. Is it true?

French and Belgian IP laws, for example, consider taking an original photo of a public space showing protected architecture a copyright violation. Prior to mid-2016, taking a panoramic photo with the Atomium in the background was copyright infringement. Distributing a night photo of the Eiffel Tower is still copyright infringement today. So, how would you guarantee that a diffusion model falls within the boundaries of arbitrary rules when those tests of "substantial similarity" suddenly become a lot more ambiguous than anticipated?

3

adt t1_j9neq5w wrote

There should be quite a few models smaller than 15M params. What's your use case? A lot of the 2022-2023 optimizations mean that you can squish models onto modern GPUs now (i.e. int8 etc.).

Despite being designed to fit onto a standard GPU, DeepMind's Gato was bigger than I thought, with a starting size of 79M params.

Have you found the BERT compression paper, which compresses the models to 7MB? It lists some 1.2M-6.2M param models:

https://arxiv.org/pdf/1909.11687.pdf

My table shows...

https://docs.google.com/spreadsheets/d/1O5KVQW1Hx5ZAkcg8AIRjbQLQzx2wVaLl0SqUu-ir9Fs/edit#gid=1158069878

*looks at table*

Smallest seems to be Microsoft Pact, which was ~30M params. Ignore that! Transformer is supposed to be wide and deep, I suppose, so it makes sense...

Many of the text-to-image models use smaller LLMs.

Also check HF, they now have 130,000 models of different sizes (to Feb/2023):

https://huggingface.co/models

Includes a tiny-gpt2: https://huggingface.co/sshleifer/tiny-gpt2

And t5-efficient tiny ('has 15.58 million parameters and thus requires ca. 62.32 MB of memory in full precision (fp32) or 31.16 MB of memory in half precision (fp16 or bf16).'):

https://huggingface.co/google/t5-efficient-tiny
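The memory figures quoted for t5-efficient-tiny follow directly from the parameter count, and the same back-of-the-envelope check works for any model on the list (a quick sketch; `model_memory_mb` is just a helper name I made up):

```python
def model_memory_mb(n_params, bytes_per_param):
    # memory = parameter count * bytes per parameter
    # fp32 -> 4 bytes, fp16/bf16 -> 2 bytes, int8 -> 1 byte
    return n_params * bytes_per_param / 1e6

print(model_memory_mb(15.58e6, 4))  # t5-efficient-tiny in fp32: 62.32 MB
print(model_memory_mb(15.58e6, 2))  # in fp16/bf16: 31.16 MB
```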

Edit: I thought of Anthropic's toy models, but they were not really LLMs. They did train a 10M model during scaling research (paper), but the model hasn't been released.

25

currentscurrents t1_j9n8in9 wrote

In theory, either structure can express any solution. But in practice, every structure is better suited to some kinds of data than others.

A decision tree is a bunch of nested if statements. Imagine the complexity required to write an if statement that decides whether an array of pixels is a horse or a dog. You can technically do it by building a tree with an optimizer, but it doesn't work very well.

On the other hand, a CNN runs a bunch of learned convolutional filters over the image. This means it doesn't have to learn the 2D structure of images, or the fact that pixels tend to be related to nearby pixels; it's already working on a 2D plane. A tree doesn't know that adjacent pixels are likely related, and would have to learn it.

It also has a bias towards hierarchy. As the layers stack upwards, each layer builds higher-level representations to go from pixels > edges > features > objects. Objects tend to be made of smaller features, so this is a good bias for working with images.
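The locality bias is easy to see in code. A minimal sketch (plain NumPy, naive loops for clarity): each output value of a convolution depends only on a small neighbourhood of pixels, structure a decision tree would have to discover on its own:

```python
import numpy as np

def conv2d(img, kernel):
    # Naive 2D cross-correlation with valid padding: each output pixel
    # only depends on its local kh x kw neighbourhood, so 2D locality
    # is built into the architecture rather than learned.
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (img[i:i+kh, j:j+kw] * kernel).sum()
    return out

# A vertical-edge filter responds only where adjacent pixels differ
img = np.zeros((5, 5)); img[:, 2:] = 1.0
edge = np.array([[-1.0, 0.0, 1.0]] * 3)
print(conv2d(img, edge))
```

The filter fires along the column where the image jumps from 0 to 1 and is silent elsewhere, which is exactly the "nearby pixels are related" prior the comment describes.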

1

vyasnikhil96 OP t1_j9mu7ak wrote

Thanks for the interesting link! For the kind of copyrighted work in your link, deduplication of the data might be hard, i.e., ensuring that works containing this copyrighted material occur only once or a few times, since it is a character and we don't really know which characters are copyrighted and which are not. Our paper, however, assumes deduplication has been done beforehand.

Coming back to our notion: I am not trying to say that there is an already-established information-theoretic notion of copyright.

Copyright law (as we state in the paper) relies on two things: 1. access to the copyrighted work must be proved, and 2. substantial similarity to the copyrighted work must be established. Our notion cleanly separates these two, and we come up with an information-theoretic way to quantify the "substantial similarity" aspect (denoted by k-NAF). For example, the strongest setting of k = 0 will never violate copyright (but also might degrade performance or be impossible to achieve) because it is equivalent to not having access. Larger values of k trade off model performance against a possible increase in "substantial similarity". Which k is valid for which setting to prevent copyright violation is not something we are establishing; rather, that depends on the specific setting and must be determined by the law. The user can tune the value of k (assuming feasibility of the value) to the value considered acceptable by the law.
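To make the two-part structure concrete, here is a rough sketch of the condition in math (my notation, reconstructed from the description above, not verbatim from the paper):

```latex
% Hedged sketch of the k-NAF idea (notation mine, not verbatim from the paper).
% For every copyrighted work C, let safe_C denote a model trained without
% access to C. A model p is k-near access-free if, for some divergence
% \Delta and every prompt x,
\Delta\bigl(\, p(\cdot \mid x) \,\big\|\, \mathrm{safe}_C(\cdot \mid x) \,\bigr) \le k .
% With k = 0, the output distribution of p matches that of a model that
% never accessed C, so no extra "substantial similarity" can arise from
% access; larger k relaxes this bound in exchange for performance.
```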

2