Recent comments in /f/MachineLearning
jturp-sc t1_jaj45ek wrote
Reply to comment by JackBlemming in [D] OpenAI introduces ChatGPT and Whisper APIs (ChatGPT API is 1/10th the cost of GPT-3 API) by minimaxir
The entry costs have always been so high that LLMs-as-a-service was going to be a winner-take-most market.
I think the best hope is to see other major players enter the space, either commercially or as FOSS. I think the former is more likely; I was really hoping we would see PaLM on GCP, or even something crazier like a Meta-Amazon partnership putting LLaMA on AWS.
Unfortunately, I don't think any of those orgs will pivot fast enough before some damage is done.
Timdegreat t1_jaj3gpr wrote
Reply to [D] OpenAI introduces ChatGPT and Whisper APIs (ChatGPT API is 1/10th the cost of GPT-3 API) by minimaxir
Will we be able to generate embeddings using the ChatGPT API?
jturp-sc t1_jaj2w4j wrote
Reply to [D] OpenAI introduces ChatGPT and Whisper APIs (ChatGPT API is 1/10th the cost of GPT-3 API) by minimaxir
Glad to see them make ChatGPT accessible via API and go back to update their documentation to be more clear on which model is which.
I had an exhausting number of conversations with confused product managers, engineers, and marketing managers explaining, "No, we're not using ChatGPT."
bo_peng OP t1_jaj2pr2 wrote
Reply to comment by KerfuffleV2 in [P] ChatRWKV v2 (can run RWKV 14B with 3G VRAM), RWKV pip package, and finetuning to ctx16K by bo_peng
Strange: all spaces are lost even when I add 4 spaces in front of every code line.
UPDATE: works in markdown editor :)
LetterRip t1_jaj1kp3 wrote
Reply to [D] OpenAI introduces ChatGPT and Whisper APIs (ChatGPT API is 1/10th the cost of GPT-3 API) by minimaxir
> I have no idea how OpenAI can make money on this.
Quantizing to mixed int8/int4 gives roughly a 70% hardware reduction and a 3x speed increase compared to float16, with essentially no loss in quality.
A × 0.3 / 3 = 10% of the cost.
Switching from quadratic to memory-efficient attention gives a 10x-20x increase in batch size.
So we are talking about it taking roughly 1% of the resources alongside a 10x price reduction - they should be 90% more profitable than when they introduced GPT-3.
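As a quick back-of-the-envelope in Python (using the percentages claimed above, which are estimates, not measurements):

    fp16_cost = 1.0                     # baseline resource cost per token at float16
    quantized = fp16_cost * 0.3 / 3     # 70% less hardware, 3x faster -> 0.1
    with_mem_eff_attn = quantized / 10  # ~10x larger batches -> 0.01
    print(with_mem_eff_attn)            # ~1% of the original resources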
edit - see MS DeepSpeed-MII, showing a 40x per-token cost reduction for BLOOM-176B vs. the default implementation:
https://github.com/microsoft/DeepSpeed-MII
There are also additional ways to reduce cost not covered above: pruning, graph optimization, and teacher-student distillation. I think teacher-student distillation is extremely likely, given reports that it has difficulty with more complex prompts.
lostmsu t1_jaj0dw2 wrote
Reply to comment by Educational-Net303 in [D] OpenAI introduces ChatGPT and Whisper APIs (ChatGPT API is 1/10th the cost of GPT-3 API) by minimaxir
I would love an electricity estimate for running GPT-3-sized models with optimal configuration.
According to my own estimate, the lifetime (~5y) electricity cost of a 350W GPU is between $1k and $1.6k, which means that for enterprise-class GPUs, electricity is dwarfed by the cost of the GPU itself.
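The arithmetic behind that range, as a quick sketch (the $/kWh bounds are assumptions):

    watts = 350
    hours = 24 * 365 * 5                 # ~5-year lifetime, running 24/7
    kwh = watts * hours / 1000           # ~15,330 kWh
    for usd_per_kwh in (0.07, 0.10):     # assumed electricity price bounds
        print(round(kwh * usd_per_kwh))  # ~1073 .. ~1533 USD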
KerfuffleV2 t1_jaiz1k8 wrote
Reply to comment by bo_peng in [P] ChatRWKV v2 (can run RWKV 14B with 3G VRAM), RWKV pip package, and finetuning to ctx16K by bo_peng
Unfortunately, that doesn't work on the old reddit layout. We just see a garbled mess.
Here's a fixed version of the code/examples:
(not my content)
Example strategies:

    'cuda:0 fp16 *10 -> cuda:1 fp16 *8 -> cpu fp32' = first 10 layers on cuda:0 in fp16, then 8 layers on cuda:1 in fp16, then the rest on the CPU in fp32
    'cuda fp16 *20+' = first 20 layers on cuda in fp16, then stream the remaining layers through it
    import os
    os.environ['RWKV_JIT_ON'] = '1'
    os.environ['RWKV_CUDA_ON'] = '0'  # if '1', compile a CUDA kernel for seq mode (much faster)

    from rwkv.model import RWKV
    from rwkv.utils import PIPELINE, PIPELINE_ARGS

    # download models: https://huggingface.co/BlinkDL
    model = RWKV(model='/fsx/BlinkDL/HF-MODEL/rwkv-4-pile-169m/RWKV-4-Pile-169M-20220807-8023', strategy='cpu fp32')
    # the model must exist before the pipeline that wraps it
    pipeline = PIPELINE(model, "20B_tokenizer.json")  # find it in https://github.com/BlinkDL/ChatRWKV

    ctx = "\nIn a shocking finding, scientist discovered a herd of dragons living in a remote, previously unexplored valley, in Tibet. Even more surprising to the researchers was the fact that the dragons spoke perfect Chinese."
    print(ctx, end='')

    def my_print(s):
        print(s, end='', flush=True)

    # For alpha_frequency and alpha_presence, see "Frequency and presence penalties":
    # https://platform.openai.com/docs/api-reference/parameter-details
    args = PIPELINE_ARGS(temperature=1.0, top_p=0.7,
                         alpha_frequency=0.25,
                         alpha_presence=0.25,
                         token_ban=[0],   # ban the generation of some tokens
                         token_stop=[])   # stop generation whenever any of these tokens appear

    pipeline.generate(ctx, token_count=512, args=args, callback=my_print)
I kind of want to know what happens in the story...
bo_peng OP t1_jaixxp5 wrote
Reply to comment by satireplusplus in [P] ChatRWKV v2 (can run RWKV 14B with 3G VRAM), RWKV pip package, and finetuning to ctx16K by bo_peng
Thank you :) I was using markdown mode instead because I didn't know about this.
satireplusplus t1_jaiwxlo wrote
Reply to [P] ChatRWKV v2 (can run RWKV 14B with 3G VRAM), RWKV pip package, and finetuning to ctx16K by bo_peng
Wow, nice, I will try it out!
Btw: if you want to format the code in your post, you need to add 4 spaces in front of every line of code; otherwise all newlines are lost. Lines starting with four spaces are treated as code:

    if 1 * 2 < 3:
        print("hello, world!")
JackBlemming t1_jaisvp4 wrote
Reply to comment by Educational-Net303 in [D] OpenAI introduces ChatGPT and Whisper APIs (ChatGPT API is 1/10th the cost of GPT-3 API) by minimaxir
Definitely. This is so they can become entrenched and collect massive amounts of data. It also discourages competition, since competitors won't be able to match these artificially low prices. This is not good for the community. It's the equivalent of opening a restaurant and giving away food for free, then jacking up prices once the adjacent restaurants go bankrupt. OpenAI are not the good guys.
I will rescind my comment and personally apologize if they release the ChatGPT code, but we all know that will never happen, unless they have a better product lined up.
harharveryfunny t1_jairuhd wrote
Reply to [D] OpenAI introduces ChatGPT and Whisper APIs (ChatGPT API is 1/10th the cost of GPT-3 API) by minimaxir
It says they've cut their costs by 90% and are passing that saving on to the user. I'd have to guess they are making money on this, not just treating it as a loss-leader for other, more expensive models.
The way the API works, you have to send the entire conversation each time, and the tokens you're billed for include both those you send and the API's response (which you'll likely append to the conversation and send back, getting billed for it again and again as the conversation progresses). By the time you've hit the 4K-token limit of this API, there will have been a lot of back and forth, and you'll have paid far more than 4K * 0.2c/1K for the conversation. It's easy to imagine chat-based APIs becoming very widespread and the billable volume becoming huge. OpenAI runs on Microsoft Azure compute, so Microsoft may see a large spike in usage/profits from this.
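To make that concrete, a rough sketch of how billing compounds over a conversation (turn sizes are hypothetical; the price is the announced $0.002/1K tokens):

    price_per_1k = 0.002                     # $ per 1K tokens, prompt and completion alike
    history = 0                              # tokens accumulated in the conversation
    billed = 0
    for user_msg, reply in [(50, 200), (40, 250), (60, 300)]:  # hypothetical turn sizes
        history += user_msg
        billed += history + reply            # every call resends the entire history
        history += reply
    print(billed, history)                   # 1690 tokens billed for a 900-token conversation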
It'll be interesting to see how this pricing, and that of competitors, evolves. Also interesting are some of OpenAI's annual price plans outlined elsewhere, such as $800K/yr for the 8K-token-limit "DV" model (DaVinci 4.0?) and $1.5M/yr for the 32K-token-limit "DV" model.
Educational-Net303 t1_jair4wf wrote
Reply to [D] OpenAI introduces ChatGPT and Whisper APIs (ChatGPT API is 1/10th the cost of GPT-3 API) by minimaxir
Definitely a loss-leader to cut off Claude/Bard; electricity alone would cost more than that. Expect a price rise in 1-2 months.
limpbizkit4prez t1_jai7l96 wrote
Reply to comment by _Arsenie_Boca_ in [R] EvoPrompting: Language models can create novel and effective deep neural architectures. These architectures are also able to outperform those designed by human experts (with few-shot prompting) by MysteryInc152
It matters because the authors just kept increasing model capacity to do better on a single task, and that's it. That strategy was also chosen by the authors, not the LLM. It would be far cooler if they had constrained the problem to roughly the same number of parameters and shown generalization across multiple tasks. Again, it's neat, just not innovative or sexy.
_Arsenie_Boca_ t1_jai5zgz wrote
Reply to comment by limpbizkit4prez in [R] EvoPrompting: Language models can create novel and effective deep neural architectures. These architectures are also able to outperform those designed by human experts (with few-shot prompting) by MysteryInc152
The final evaluation is done on test metrics, right? If so, why does it matter?
currentscurrents t1_jai5dk2 wrote
Basically all of the text-to-image generators available today are diffusion models based around convolutional U-Nets. Google has an (unreleased) one that uses vision transformers.
There is more variety in the text encoder, which turns out to be more important than the diffuser. CLIP is very popular, but large language models like T5 show better performance and are probably the future.
cnapun t1_jai24sf wrote
Reply to comment by SaltyStackSmasher in [D] backprop through beam sampling ? by SaltyStackSmasher
What I was trying to say is that this sampling approach (in a transformer) seems like it would have similar issues to an RNN, in that your computational graph is repeated N times, where N is the rollout size. This makes me suspect you'd get a lot of noise in your gradient estimates if N is large (also, IIRC, Gumbel-Softmax gradients are biased, which might cause further issues when chaining them).
RaeudigerRaffi t1_jahpbod wrote
Reply to comment by RaeudigerRaffi in [D] backprop through beam sampling ? by SaltyStackSmasher
To add to this: I thought a bit about it, and technically this should be possible in PyTorch with some trickery using custom autograd functions. You can sample with Gumbel-Softmax and return the argmax; in the custom backward you just skip the argmax part and backprop as if the Gumbel-Softmax output had been returned rather than the argmax.
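A minimal sketch of that straight-through trick. Note that PyTorch's torch.nn.functional.gumbel_softmax with hard=True already behaves this way (one-hot forward pass, gradients as if the soft sample were returned), so a fully custom autograd function may not even be needed:

    import torch
    import torch.nn.functional as F

    logits = torch.randn(1, 50257, requires_grad=True)  # hypothetical vocab size

    # hard=True returns a one-hot (argmax) sample in the forward pass,
    # but gradients flow through the underlying soft sample.
    one_hot = F.gumbel_softmax(logits, tau=1.0, hard=True)

    embedding = torch.randn(50257, 768)  # hypothetical embedding matrix
    next_input = one_hot @ embedding     # differentiable path into the next step

    next_input.sum().backward()
    print(logits.grad is not None)       # True: gradients reached the logits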
limpbizkit4prez t1_jahhmhd wrote
Reply to comment by MysteryInc152 in [R] EvoPrompting: Language models can create novel and effective deep neural architectures. These architectures are also able to outperform those designed by human experts (with few-shot prompting) by MysteryInc152
Lol, I strongly disagree. There are already methods out there that provide architecture design. This is a "that's neat" type of project, but I'd be really disappointed to see this anywhere other than arxiv.
MysteryInc152 OP t1_jahgb2n wrote
Reply to comment by limpbizkit4prez in [R] EvoPrompting: Language models can create novel and effective deep neural architectures. These architectures are also able to outperform those designed by human experts (with few-shot prompting) by MysteryInc152
Overfitting comes with the necessary connotation that the model does not generalize well to instances of the task outside the training data.
As long as what the model creates is novel and works, "overfitting" seems like an unimportant, if not misleading, distinction.
limpbizkit4prez t1_jahaq8v wrote
Reply to [R] EvoPrompting: Language models can create novel and effective deep neural architectures. These architectures are also able to outperform those designed by human experts (with few-shot prompting) by MysteryInc152
The authors kept increasing model size until the model overfit the task. I'm not sure that's high impact. It's cool and everything, but overfitting a dataset is never really valuable.
Emergency_Apricot_77 t1_jah9rb7 wrote
Reply to comment by Kaleidophon in [D] backprop through beam sampling ? by SaltyStackSmasher
Why go with BLEU, though? OP didn't particularly mention optimizing sequence-level metrics. Can't we still use cross-entropy? Something like the following:
Sample first token, calculate cross-entropy with first token of gold
Sample second token, calculate cross-entropy with second token of gold
Sample third token, calculate cross-entropy with third token of gold
... and so on ?
This way we still have a differentiable metric, but with much better alignment between the training and inference scenarios (as opposed to the current teacher-forcing training and sampling inference), which I thought OP was going for. A rough sketch is below.
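A minimal sketch of that scheme, assuming an HF-style model whose output exposes .logits (note the sampling step itself remains non-differentiable; gradients only flow through each step's logits, which is the issue raised elsewhere in this thread):

    import torch
    import torch.nn.functional as F

    def sampled_ce_loss(model, input_ids, gold_ids):
        """Feed back *sampled* tokens, but score each step against the gold token."""
        loss = 0.0
        tokens = input_ids
        for t in range(gold_ids.size(1)):
            logits = model(tokens).logits[:, -1, :]             # next-token logits
            loss = loss + F.cross_entropy(logits, gold_ids[:, t])
            sampled = torch.multinomial(logits.softmax(-1), 1)  # sample, not gold
            tokens = torch.cat([tokens, sampled], dim=1)
        return loss / gold_ids.size(1)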
[deleted] t1_jah7ias wrote
Reply to comment by currentscurrents in Is there any model that classify singing and speaking? [R] by Stencolino
[deleted]
MysteryInc152 OP t1_jah5w3t wrote
Reply to [R] EvoPrompting: Language models can create novel and effective deep neural architectures. These architectures are also able to outperform those designed by human experts (with few-shot prompting) by MysteryInc152
>Given the recent impressive accomplishments of language models (LMs) for code generation, we explore the use of LMs as adaptive mutation and crossover operators for an evolutionary neural architecture search (NAS) algorithm. While NAS still proves too difficult a task for LMs to succeed at solely through prompting, we find that the combination of evolutionary prompt engineering with soft prompt-tuning, a method we term EvoPrompting, consistently finds diverse and high performing models. We first demonstrate that EvoPrompting is effective on the computationally efficient MNIST-1D dataset, where EvoPrompting produces convolutional architecture variants that outperform both those designed by human experts and naive few-shot prompting in terms of accuracy and model size. We then apply our method to searching for graph neural networks on the CLRS Algorithmic Reasoning Benchmark, where EvoPrompting is able to design novel architectures that outperform current state-of-the-art models on 21 out of 30 algorithmic reasoning tasks while maintaining similar model size. EvoPrompting is successful at designing accurate and efficient neural network architectures across a variety of machine learning tasks, while also being general enough for easy adaptation to other tasks beyond neural network design.
Between this and being able to generate novel functioning protein structures, I hope the "it can't truly create anything new!" argument for LLMs dies, but I'm sure we'll find more goalposts to move lol
RaeudigerRaffi t1_jah39t7 wrote
Reply to comment by Kaleidophon in [D] backprop through beam sampling ? by SaltyStackSmasher
You are right that Gumbel-Softmax is a possibility that lets you backprop. But given that he is trying to do beam sampling and backprop through it, at some point you need to argmax over your Gumbel-Softmax vector to actually pick the token (assuming there is no way to keep working with the vector representations further down the line; correct me if I'm wrong), and then this becomes non-differentiable.
visarga t1_jaj4bqs wrote
Reply to comment by harharveryfunny in [D] OpenAI introduces ChatGPT and Whisper APIs (ChatGPT API is 1/10th the cost of GPT-3 API) by minimaxir
> $1.5M/yr
The inference cost is probably 10% of that.