$0.12 per 1k completion tokens is high enough to make the 32k context model prohibitively expensive. Especially in a chatbot use case with cumulative prompting, which is the best use case for such a large context vs. the default cheaper 8k window.
In contrast, GPT-3.5 text-davinci-003 was $0.02/1k tokens, and let's not get into the ChatGPT API.
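To put numbers on the cumulative-prompting point: the whole history gets resent as the prompt on every turn, so prompt tokens grow roughly quadratically with the number of turns. Rough sketch with made-up per-turn token counts, at the 32k prices above:

    PROMPT_PRICE = 0.06 / 1000      # $ per prompt token, 32k model
    COMPLETION_PRICE = 0.12 / 1000  # $ per completion token, 32k model

    def conversation_cost(turns, user_tokens=200, reply_tokens=300, system_tokens=500):
        """Total cost when every turn resends the entire history as the prompt."""
        history = system_tokens
        total = 0.0
        for _ in range(turns):
            history += user_tokens                    # user message joins the history
            total += history * PROMPT_PRICE           # whole history sent as the prompt
            total += reply_tokens * COMPLETION_PRICE  # model reply
            history += reply_tokens                   # reply kept for the next turn
        return total

    for n in (10, 30, 60):
        print(f"{n:>3} turns: ${conversation_cost(n):.2f}")

With those assumptions a 60-turn conversation just about fills the 32k window and runs somewhere around $55-60, almost all of it prompt tokens.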
> Especially in a chatbot use case with cumulative prompting, which is the best use case for such a large context vs. the default cheaper 8k window.
Depends on what is up with the images and how they translate into tokens. I really have no idea, but could be that 32k tokens (lots of text) translates to only a few images for few-shot prompting.
The paper doesn't seem to mention image tokenization, but it should be possible to infer something about the token rate once people can actually use the API and see how they're charged.
Currently, CLIP's largest model is ViT-L/14 at 336x336 images, which translates to 577 ViT tokens [(336/14)^2 + 1]. It might end up being token-efficient depending on how it's implemented (the paper doesn't elaborate).
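For reference, that 577 figure is just the standard ViT patch count plus the [CLS] token; whether GPT-4 charges anything like this per image is pure speculation on my part:

    def vit_token_count(image_size: int, patch_size: int) -> int:
        # A ViT splits the image into (image_size / patch_size)^2 patches,
        # plus one [CLS] token.
        assert image_size % patch_size == 0
        return (image_size // patch_size) ** 2 + 1

    print(vit_token_count(336, 14))  # 577, CLIP ViT-L/14 at 336px
    print(vit_token_count(224, 14))  # 257, the 224px variant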
I would imagine most use cases for the 32k model have much longer prompts than completions, so the $0.06 per 1k prompt tokens will be the real problem. I can't think of a use case yet, but that might be because I haven't got a sense of how smart it is.
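Quick arithmetic on a prompt-heavy request (the token counts are just an illustration):

    PROMPT_PRICE = 0.06 / 1000      # $ per prompt token
    COMPLETION_PRICE = 0.12 / 1000  # $ per completion token

    prompt_tokens, completion_tokens = 30_000, 500
    print(f"prompt:     ${prompt_tokens * PROMPT_PRICE:.2f}")          # $1.80
    print(f"completion: ${completion_tokens * COMPLETION_PRICE:.2f}")  # $0.06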