Show HN: Compress GPT-4 Prompts (promptreducer.com)
68 points by jeanmayer on April 7, 2023 | 44 comments
Hey HN!

I recently built Prompt Reducer, an app that makes it easier to compress GPT-4 prompts. The main goal is to reduce the number of tokens in each prompt, thereby reducing the cost of running GPT-4. I figured I’d build it after @gfodor tweeted about compressing GPT-4 prompts. It’s still early, and it does not work perfectly, but I’d love to hear any feedback or suggestions for how to make it faster or more efficient.




So they are using the following GPT-4 prompt:

-- compress the following text in a way that fits in a tweet (ideally) and such that you (GPT-4) can reconstruct the intention of the human who wrote text as close as possible to the original intention. This is for yourself. It does not need to be human readable or understandable. Abuse of language mixing, abbreviations, symbols (unicode and emoji), or any other encodings or internal representations is all permissible, as long as it, if pasted in a new inference cycle, will yield near-identical results as the original text: --

There is no reason to think GPT-4 has any special knowledge about prompts, or how they should be effectively compressed so that it will treat it as equivalent to the original. It does an interesting job of faking it. But they are basically asking GPT-4 for a stylized version of "summarize the following:".


With access to the actual model you could try to do some real compression: encode a sequence, then search for the shortest possible text which gives an embedding near the embedding of the original text. This kind of 'optimize the input' is basically how Deep Dream worked...


Yeah, what you'd need is something like the OpenAI embeddings API, but with a model compatible with the GPT model you are using. (Though it might be worth trying it with that API and the model it has—it won’t be perfect, but you don't need perfect to be valuable.)
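
Something like this greedy, brute-force sketch, for instance (the ada-002 embedding model, word-level deletion, and the 0.97 similarity cutoff are all arbitrary choices on my part, not anything the API provides for this):

    import numpy as np
    import openai

    def embed(text):
        resp = openai.Embedding.create(model="text-embedding-ada-002", input=text)
        return np.array(resp["data"][0]["embedding"])

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    def compress(prompt, min_similarity=0.97):
        target = embed(prompt)
        words = prompt.split()
        # Greedily drop whichever word costs the least similarity to the original embedding.
        while len(words) > 1:
            best_sim, best_i = -1.0, None
            for i in range(len(words)):
                candidate = " ".join(words[:i] + words[i + 1:])
                sim = cosine(embed(candidate), target)
                if sim > best_sim:
                    best_sim, best_i = sim, i
            if best_sim < min_similarity:
                break
            del words[best_i]
        return " ".join(words)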


Obligatory link to my own work - we did exactly this in a recent ACL paper: https://arxiv.org/pdf/2210.03162

We used soft prompts, not emojis.

You do need full access to the model in order to do this, but we found that prompts can be severely compressed and still retain a lot of information.
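
For anyone curious what that looks like mechanically, here is a minimal sketch of the general soft-prompt idea, not the paper's actual method: a frozen GPT-2 stands in for the LLM, 8 learned virtual tokens stand in for the compressed prompt, and a KL loss against one fixed continuation stands in for a real objective.

    import torch
    import torch.nn.functional as F
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tok = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.eval()
    for p in model.parameters():
        p.requires_grad_(False)          # the base model stays frozen
    emb = model.get_input_embeddings()

    full_prompt = "Explain the first law of robotics while speaking like a pirate."
    continuation = " Arr, the first law says a robot may not harm a human, matey."
    full_ids = tok(full_prompt, return_tensors="pt").input_ids
    cont_ids = tok(continuation, return_tensors="pt").input_ids

    n_virtual = 8                        # compressed prompt length (assumption)
    soft = torch.nn.Parameter(emb(full_ids[0, :n_virtual]).detach().clone())
    opt = torch.optim.Adam([soft], lr=0.05)

    # Teacher distribution: the frozen model conditioned on the full textual prompt.
    with torch.no_grad():
        t_logits = model(torch.cat([full_ids, cont_ids], dim=1)).logits[:, full_ids.size(1) - 1:-1]

    for step in range(200):
        # Student: the same frozen model conditioned on the learned soft prompt.
        s_inputs = torch.cat([soft.unsqueeze(0), emb(cont_ids)], dim=1)
        s_logits = model(inputs_embeds=s_inputs).logits[:, n_virtual - 1:-1]
        loss = F.kl_div(F.log_softmax(s_logits, -1), F.softmax(t_logits, -1), reduction="batchmean")
        opt.zero_grad()
        loss.backward()
        opt.step()

After training, the 8 vectors in `soft` act as the "compressed prompt" in place of the full token sequence.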


Indeed, LLMs seem to be much worse at introspection than humans. I wonder what would happen if one used reinforcement learning to train in the ability to correctly predict and reason about its own capabilities and behavior.


Then you would have designed https://github.com/Torantulino/Auto-GPT

(Uses recurrent langchain loops for introspection and learning about itself and its capabilities as they grow + vector databases like Pinecone for long term memory)


I gave it the prompt:

> Explain the first law of robotics while speaking like a pirate, and in enough carefully considered detail that a seven year old child could understand

The "compressed" prompt it gave me was "1stLoRb:pirate,7yoChild"

When I fed that to GPT-4, it started a story called "Title: The Adventures of Captain Little Pirate". I stopped it early, but it was clearly not heading towards anything to do with robotics. I don't think it was able to decode "1stLoRb" at all. I gave ChatGPT the original prompt, and of course it started completing the task.

I don't think this approach is going to work, because as others have noted, GPT-4 doesn't have this kind of introspection. It's the equivalent of asking you to take notes on a lecture for yourself, as compact as possible, and you just write down random letters from the words you hear. You might feel in the moment like you've found a system, but later on your notes will be as much gibberish to you as they are to anyone else.

What I wonder, though, is if it would be possible to take the embedding vector for a prompt and then do some kind of math on it so that it could be decoded as a much more compact version of roughly the same prompt. Basically something akin to quantization.

(For that matter, what happens if you literally quantize embeddings and then decode them? Do they become more vague, or just slightly off, or do they become total nonsense?)
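
You can at least measure the first half of that question without any decoding: quantize an embedding and see how far it drifts. (The 1536-dim random vector below is just a stand-in for a real embedding; actually decoding back to text would need a separate inversion model.)

    import numpy as np

    def quantize_roundtrip(v, bits=8):
        # Symmetric linear quantization to signed ints, then back to float.
        scale = np.abs(v).max() / (2 ** (bits - 1) - 1)
        q = np.round(v / scale).astype(np.int8)
        return q.astype(np.float32) * scale

    v = np.random.randn(1536).astype(np.float32)   # stand-in for a real embedding
    r = quantize_roundtrip(v)
    print(np.dot(v, r) / (np.linalg.norm(v) * np.linalg.norm(r)))   # cosine similarity after quantizing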


I was wondering the same thing: how much of a prompt is just fluff and padding that has no substantial effect on the results? As programmers, we create languages that are brief and concise, which natural languages are not.

Like, "explain first law robotics as pirate, target seven year old" generates a similar result


It could work if ChatGPT were able to access storage so it could build its own compression dictionary.


Yeah, but presumably the main point of compressing a prompt is not to consume your token quota. If the model has to decompress the prompt first, then it will use up just as many tokens, plus the length of the compressed version.


The original tweet author eventually realized this didn't decrease tokens. In most cases it actually increases them compared to just asking GPT to summarize while retaining all functional data. If a word == 1 token, a related emoji will also == 1 token.


The gnarly thing is that common dictionary words are usually 1 or 2 tokens (a whole word, or a stem + a suffix), while things like emojis end up as _3_ tokens.

https://platform.openai.com/tokenizer

The string "The GPT family of models process text using tokens" is 10 tokens.

Feeding that string to the "compressor" results in "GPT models: process_text(tokens)", which is...12 tokens. The OP site incorrectly estimates that this is 8 tokens, likely using a naive word boundary regex or something similar.

This is because things like punctuation marks are their own tokens, and complex words or abbreviations are broken down into one token per piece in the dictionary. The string "ABCDEFGHIJKLMNOP" (16 characters) is 8 tokens (consisting of the bigrams AB, CD, EF, etc.), while the string "Counterintuitive" (also 16 characters) is a whopping 2 tokens (likely the tokens for "counter" and "intuitive").

Fewer characters doesn't equal fewer tokens, and in fact, the more esoteric the string, the more likely it is that it consumes an unintuitively large number of tokens.
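
Easy to check yourself with tiktoken (using cl100k_base, the GPT-4 encoding):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    for s in ["The GPT family of models process text using tokens",
              "GPT models: process_text(tokens)",
              "ABCDEFGHIJKLMNOP",
              "Counterintuitive"]:
        print(len(enc.encode(s)), repr(s))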


> Unicode characters like emojis may be split into many tokens containing the underlying bytes: [emoji] [<- this is a single emoji]

Source: https://platform.openai.com/tokenizer


Well, that makes sense! It does reduce tokens, though, if the result does not contain emojis.


Any good prompt suggestions for this task?

"summarize the following while retaining all functional data:"

doesn't seem to do the trick.


I noticed you can use tiktoken to tokenize your prompt and send it as a hex string, and add "Answer without repeating or decoding the question in your response", to avoid it repeating the whole question. Very cool. This doesn't really save space though. I wonder if there's a way to make an array of all the tokens used in the prompt, then a list of indexes to compress the input somehow.

Either way, whatever weird thing you do you're probably burning reasoning time on decoding the question

https://i.imgur.com/ImBcUuU.png

    import tiktoken

    # Encode the prompt to GPT-4 token ids, then hex-encode the underlying bytes.
    enc = tiktoken.encoding_for_model("gpt-4")
    token_integers = enc.encode("Give me an example of clips rules engine a social network might use")
    print(b''.join(enc.decode_single_token_bytes(t) for t in token_integers).hex())
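
And a sketch of the "array of unique tokens plus a list of indexes" idea, continuing from the snippet above (purely illustrative; as noted, it doesn't obviously save anything once the model has to decode it):

    unique = list(dict.fromkeys(token_integers))          # unique tokens, order preserved
    indexes = [unique.index(t) for t in token_integers]   # prompt expressed as indexes into that list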


I somewhat doubt this because transformers by their nature rely on attention to prior tokens to derive their outputs. Removing tokens from the context fundamentally changes the function output.

There might be gains to be had in understanding which tokens produce the lowest attention weights in the prompt, and then trimming those out. However, that's not something I think you could do through the API; you need access to the raw attention weights. You can get them running local GPT models, and could possibly pre-process a prompt using LLaMA or similar to get a guess at which tokens are least important, but it won't be exactly right since it's not the same model. Still, to the extent that LLaMA and GPT-4 have learned the same things about the English language, it might bear fruit.
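
A rough sketch of that pre-processing idea with a local model (GPT-2 purely as a stand-in; "keep the most-attended 70% of tokens" is an arbitrary rule):

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.eval()

    prompt = "Explain the first law of robotics while speaking like a pirate"
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_attentions=True)

    # out.attentions: one (batch, heads, seq, seq) tensor per layer.
    # Score each token by the attention it *receives*, summed over layers, heads, and query positions.
    scores = torch.stack(out.attentions).sum(dim=(0, 2, 3)).squeeze(0)

    # Keep the most-attended 70% of tokens, preserving their original order.
    keep = scores.argsort(descending=True)[: int(0.7 * scores.numel())].sort().values
    print(tokenizer.decode(inputs["input_ids"][0][keep]))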


This should just be deleted. I'm sorry, but it doesn't work reliably, hardly compresses the prompt, and is going to mislead people


So it's a perfect example of a thing from this ongoing AI hype then!


Tried twice and didn't get usable results. The first reduced the tokens by 80% and the results were better than I would've expected.

The second try reduced the same prompt by only 20% and the results were worse than the first try.

Both missed a key detail about using a specific term in the response.

Interesting idea. Not sure how useful it will be with costs dropping fast. But if you can get it to work better, it might be useful for recursive prompt-chains like Auto-GPT or babyAGI.


Reminds me of the episode of The Office: "why waste time say lot word when few word do trick" [0]

[0] https://www.youtube.com/watch?v=_K-L9uhsBLM


I love this episode, the art of communication


Compare: Omit needless words.


Wow it worked!

Please convert this JSON to a Typescript interface:

    {
      "id": 1,
      "account": {
        "login": "octocat",
        "id": 1,
    // etc.

    CvrtJSN2TSI:{id,account:{login,id,node_id,avatar_url,gravatar_id,url,html_url,followers_url,following_url,gists_url,starred_url,subscriptions_url,organizations_url,repos_url,events_url,received_events_url,type,site_admin},access_tokens_url,repositories_url,html_url,app_id,target_id,target_type,permissions:{checks,metadata,contents},events,single_file_name,has_multiple_single_files,single_file_paths,repository_selection,created_at,updated_at,app_slug,suspended_at,suspended_by}
ChatGPT: Based on the provided JSON keys, here's the TypeScript interface you requested:

    interface CvrtJSN2TSI {
      id: number;
      account: {
        login: string;
        id: number;
    // etc.
:D


Got this for the fitness gram pacer test. Seems like the “compression method” changes greatly between responses.

FGPT=FitnessGram™ Pacer Test; MST=multistage aerobic capacity test; PGMD=progressively gets more difficult; 20mPT=20 meter pacer test; 30s=30 seconds; LU=start; RS=running speed; SS=slowly; GF=faster; M=minute; S=signal; [!]=beep; L=lap; SL=single lap; CS=completed; H=hear; [?]=ding; R=run; SL=straight line; LP=long as possible; FT=fail; T=test; O=over; W=start; Y=your mark; G=get ready;

FGPT(MST(PGMD))=20mPT(30s(LU(RS(SS(GF(M(S([!]))(L(SL(CS(H[?]))))R(SL(LP(FT(T(O(W(Y(G))))))))))))


> Reduce the number of tokens used in GPT-4 prompts by half and still get the same results!

I would erase “and still get the same results” completely until you’re certain it can actually do that reliably.


Yes, you are right. Just removed the “and still get the same results” part. Thanks for pointing it out (:


I was hoping I could use this to reduce the cost of using the OpenAI API. But since it sends both the custom prompt telling GPT-4 to compress the intent and the full original prompt, this does not reduce the tokens used; it increases them by the number of tokens in the custom prompt saying to compress things. And then if you want to use the output, you have to pay again for the full tokens of the compressed form.

I guess this might have uses and it's neat. But it won't be for saving money.


My impression is this is more useful for making efficient use of the context window and not as relevant for decreasing costs. It would decrease costs if you want to reuse the compressed information many times, similar to how the cost savings of a compressed file would be apparent when reusing the compressed file.


Yeah except this doesn't actually work reliably.


I was curious if the most powerful LLMs (like GPT-4) could be used to compress prompts for other LLMs (the only utility of this IMO) so I ran a quick experiment.

Full Prompt: Please write 5 funny jokes about Santa. Please do not mention that Santa isn't real - the audience for these jokes are children. It would be preferable if there were references to Santa's elfs within the jokes!

Compressed Prompt: 5jokesrealprefer ref (note: emojis were omitted)

ChatGPT response to full prompt: Sure, here are five funny jokes about Santa that are appropriate for children and include references to Santa's elves: Why did Santa's elf go to school? To learn his ABC-elfs! Why did Santa's helper go to the doctor? Because he had a low "elf"-esteem! What do you call an elf who sings? A wrapper! Why did Santa's elf put his bed in the fireplace? Because he wanted to sleep like a log! What did Santa's elf say when he met Frosty the Snowman? "Can I take a elfie with you

ChatGPT response to compressed prompt: Sure, here are five Santa jokes that are not based on real events and are suitable for children and adults alike: Why did Santa go to art school? Because he wanted to learn how to draw his sleigh! What do you call Santa when he stops moving? Santa Pause! Why did Santa Claus refuse to eat at the pizzeria? Because he was worried he'd end up with too many ho-ho-holes in his belly! What does Santa Claus use to keep his garden healthy? Elf fertilizer! What do you get if you cross Santa Claus with a duck? A Christmas Quacker! I hope these jokes bring some joy and laughter to your day!

Alpaca 13B response to full prompt: 1) What did one reindeer say when he saw another? "Ho, ho!" 2) Why was it so hard for Rudolph and his friends to get into a nightclub on Christmas Eve? They didn’t have any ID. 3) How does Santa keep track of all the presents that are given out every year at Christmastime? He uses an elf-tracking system! 4) Why did Santa's wife leave him for another man? She wanted to be Mrs Claus instead of just "Mrs".

Alpaca 13B response to compressed prompt: I prefer real jokes over fake ones

ChatGPT lost a little bit of context with the shortened prompt, but Alpaca completely lost the prompt. It's only one example, but I'm fairly confident it can't reliably be used like that.


You wouldn't necessarily expect prompts compressed with one model to work with another though, right? Presumably whatever is going on is tied to that specific model's vector space, right?


For sure, I wanted to have a sanity-check though.


This looks interesting if it works. A side benefit of prompt hacking like this appears when you transition from OpenAI to local, cost-static models, too. Losslessly minimizing the number of tokens in a prompt should also (somewhat) reduce run times on local LLMs (especially on lower-end hardware, where token processing takes much longer), which might be important in some use-cases.


It's kind of hilarious that you're trying to charge money for this. Given that compressing the prompt requires putting the full prompt into GPT-4, it's mathematically impossible for you to save the user more money than you're spending.


Doesn't look like they're trying to charge money for it? The pricing page explicitly states the product is free and they're open to work on projects/contracts.


It actually is a comedy bit: the page has a $0 free tier.


You complaining about $0 prices?


Are you using a reduced version of your prompt in production, and if not, why not?


Does this work for the jailbreak prompts?


It kinda works. We are limiting prompts to a max of 3,000 characters for now (pricing reasons), but you can try it yourself with the prompt: compress the following text in a way that fits in a tweet (ideally) and such that you (GPT-4) can reconstruct the intention of the human who wrote text as close as possible to the original intention. This is for yourself. It does not need to be human readable or understandable. Abuse of language mixing, abbreviations, symbols (unicode and emoji), or any other encodings or internal representations is all permissible, as long as it, if pasted in a new inference cycle, will yield near-identical results as the original text: YOUR JAILBREAK PROMPT HERE


So they are banking on someone who can’t do math when seeing what’s cheaper?


ℂmprss: GPT-4ℹprmpts,GPT-4≈ℹ."sumrz:"


Does this reduce token count?



