
$0.12 per 1k completion tokens is high enough to make the 32k context model prohibitively expensive. Especially in a chatbot use case with cumulative prompting, which is the best use case for such a large context vs. the default cheaper 8k window.

In contrast, GPT-3.5 text-davinci-003 was $0.02/1k tokens, and let's not get into the ChatGPT API.
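
For a rough back-of-the-envelope (a sketch; the $0.06/1k prompt and $0.12/1k completion rates are the GPT-4-32k prices discussed in this thread, and the per-turn token counts are made-up assumptions):

    # Cost of a chatbot that re-sends the whole history every turn,
    # at GPT-4-32k rates. Turn sizes are illustrative assumptions.
    PROMPT_RATE = 0.06 / 1000       # $ per prompt token
    COMPLETION_RATE = 0.12 / 1000   # $ per completion token

    history, total_cost = 0, 0.0
    for turn in range(20):
        user_msg, reply = 200, 300       # assumed tokens per turn
        prompt = history + user_msg      # cumulative prompting: full history re-sent
        total_cost += prompt * PROMPT_RATE + reply * COMPLETION_RATE
        history = prompt + reply

    print(f"${total_cost:.2f}")  # about $6.66 for 20 turns

The prompt side dominates as the history grows, so the per-prompt-token price ends up mattering at least as much as the completion price.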



I disagree that, out of all possible use cases for a large-context model, a chatbot is really the "best" one.


> $0.12 per 1k completion tokens is high enough to make the 32k context model prohibitively expensive.

This is a lot. I bet there's quite a bit of profit in there.


> I bet there's quite a bit of profit in there

Is this profit-seeking pricing, or pricing meant to induce folks to self-select out?

Genuine question — I don’t know enough about this area of pricing to have any idea.


Gotta pay back M$


> Especially in a chatbot use case with cumulative prompting, which is the best use case for such a large context vs. the default cheaper 8k window.

Depends on what is up with the images and how they translate into tokens. I really have no idea, but it could be that 32k tokens (a lot of text) translates to only a few images for few-shot prompting.

The paper doesn't seem to mention image tokenization, but I guess it should be possible to infer something about the token rate by actually using the API and looking at how one is charged.
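
The chat endpoint does report usage, so a sketch like this (assuming the current openai Python client; image inputs aren't in the public API yet, so only the text case is shown) would let you compare the billed prompt_tokens with and without an image attached:

    import openai

    # The response includes a `usage` block with exact token counts,
    # so comparing prompt_tokens across requests would reveal how
    # much an attached image costs once image input is available.
    resp = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": "Hello"}],
    )
    print(resp["usage"])
    # e.g. {'prompt_tokens': 8, 'completion_tokens': 9, 'total_tokens': 17}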


Currently, the largest CLIP variant uses 14-pixel patches on 336x336 images, which translates to 577 ViT tokens [(336/14)^2 + 1]. It might end up being token-efficient depending on how it's implemented. (The paper doesn't elaborate.)
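
The arithmetic, spelled out (a trivial sketch; the patch size and resolution are CLIP ViT-L/14@336px's published values, and the +1 is the class token):

    # One ViT token per image patch, plus the [CLS] token.
    image_size = 336                             # ViT-L/14@336px input resolution
    patch_size = 14
    patches = (image_size // patch_size) ** 2    # 24 * 24 = 576
    print(patches + 1)                           # 577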


I would imagine most use cases for the 32k model have much longer prompts than completions, so the $0.06 per 1k prompt tokens will be the real problem. I can't think of a use case yet, but that might be because I haven't got a sense of how smart it is.


Can't you combine multiple 4k-token contexts in 3.5 to fake it? Having one GPT context per code file, for instance, plus some sort of index?

I'm not super versed in LangChain, but that might be kind of what it solves...
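
Something like this is the gist (a minimal sketch of the per-file index idea, not LangChain itself; the file contents and helper names are hypothetical, but the embedding call is OpenAI's standard one):

    import numpy as np
    import openai

    def embed(text):
        resp = openai.Embedding.create(
            model="text-embedding-ada-002", input=text)
        return np.array(resp["data"][0]["embedding"])

    # Hypothetical corpus: one entry per code file.
    files = {"parser.py": "...", "config.py": "..."}
    index = {name: embed(src) for name, src in files.items()}

    def top_files(question, k=2):
        # Cosine similarity between the question and each file.
        q = embed(question)
        score = lambda v: float(q @ v) / (np.linalg.norm(q) * np.linalg.norm(v))
        return sorted(index, key=lambda n: score(index[n]), reverse=True)[:k]

    # Only the top-k files get stuffed into the 4k-token prompt.
    print(top_files("Where is the config parsed?"))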


LangChain/context prompting can theoretically allow compressing a longer conversation, which will likely be the best business strategy.
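
For what it's worth, LangChain already ships a memory class along these lines: it keeps an LLM-written summary of the conversation instead of replaying it verbatim. A sketch, assuming the current LangChain API:

    from langchain.llms import OpenAI
    from langchain.chains import ConversationChain
    from langchain.memory import ConversationSummaryMemory

    llm = OpenAI(temperature=0)
    # The summary is rewritten each turn, so the prompt stays short
    # no matter how long the conversation runs.
    chain = ConversationChain(
        llm=llm,
        memory=ConversationSummaryMemory(llm=llm),
    )
    chain.predict(input="Hi! Let's keep this chat cheap.")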



