New and Improved Embedding Model from OpenAI (openai.com)
114 points by craigkerstiens on Dec 15, 2022 | 46 comments


Does anyone know how OpenAI (and others) are extending the context windows of things like ChatGPT so far? E.g. if you exceed 2048/8192 (subword) tokens, does the model just chunk the input and evaluate the chunks separately? Is context/state maintained across chunks? I've never seen anyone actually explain this.


https://help.openai.com/en/articles/6787051-does-chatgpt-rem...

> While ChatGPT is able to remember what the user has said earlier in the conversation, there is a limit to how much information it can retain. The model is able to reference up to approximately 3000 words (or 4000 tokens) from the current conversation - any information beyond that is not stored.

This implies ChatGPT has a 4000 token maximum prompt and prior prompts in a given web session are inserted into the current prompt, most recent to oldest (probably with some sort of time context like "previously, user asked:"), up to 4000 tokens.


I've had longer discussions but I'm realising that I often ask for a summary, which would mean the model has a summary of the conversation so far in the window.


What's the technical limit? The width of the attention layer?


I have been playing with their completions API. I just keep track of previous conversational lines, and when I approach a configurable token threshold (input + output must fit within the given number of tokens, and the API reports the usage each time), I instruct the model to summarize the conversation thus far, with additional specific instructions to help it keep the useful bits of context. I then make that summary part of the context I send in, along with the kept and future conversational lines (rough sketch below).

Their API calls on the site have references to previous message ids, which makes me expect they're doing something similar.
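
Here's a rough sketch of that loop, assuming the completions API of the time (text-davinci-003) and a crude character-based token estimate; the thresholds and prompt wording are made up:

    import openai  # assumes openai.api_key is set

    MAX_CONTEXT = 4000      # rough prompt + completion budget, in tokens
    SUMMARIZE_AT = 3000     # start compacting history around here

    history = []            # prior conversational lines
    summary = ""            # rolling summary of older turns

    def rough_tokens(text):
        return len(text) // 4   # crude heuristic, not the real tokenizer

    def ask(user_line):
        global history, summary
        prompt = "\n".join([summary] + history + ["User: " + user_line, "Assistant:"])
        if rough_tokens(prompt) > SUMMARIZE_AT:
            # Ask the model to compress the conversation so far, keeping the useful bits
            resp = openai.Completion.create(
                model="text-davinci-003",
                prompt="Summarize this conversation, keeping names, decisions and open questions:\n"
                       + summary + "\n" + "\n".join(history),
                max_tokens=300)
            summary = resp["choices"][0]["text"].strip()
            history = []
            prompt = "\n".join([summary, "User: " + user_line, "Assistant:"])
        resp = openai.Completion.create(
            model="text-davinci-003", prompt=prompt,
            max_tokens=MAX_CONTEXT - rough_tokens(prompt))
        answer = resp["choices"][0]["text"].strip()
        history += ["User: " + user_line, "Assistant: " + answer]
        return answer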


I wonder if they tack on a preprocessing network with a larger input size (convolutions or something) beforehand and pass its outputs into the inputs of the large model. It's something I'd try, but I have no idea if that's what they're doing.


> Longer context. The context length of the new model is increased by a factor of four, from 2048 to 8192, making it more convenient to work with long documents.

8192 words is getting into the range of short stories or a master's thesis, which opens the door to some interesting applications.


Important to note that these tokens _can_ be whole words, but a word is often made up of multiple tokens, so 8192 tokens = 8192 words isn't strictly correct.

That said, your point stands. Most short stories run to low-to-mid four digits in word count, and the jump from 2048 tokens to 8192 puts them squarely within reach.

As someone who's been working on multi-layered approaches to using GPT-like models for long text generation (e.g. synopsis -> outline -> paragraph expansions) to get around the limited context window, it'll be interesting to see if people will keep working towards that end or if it'll all become a moot point as the effective context window continues to scale up.


Do we know exactly what the GPT3 tokenizer is? It could be combining multiple words together, for example common bigrams.

Never mind, it looks like they have a tokenizer tool online and every bigram I’ve given it BPEs to multiple tokens:

https://beta.openai.com/tokenizer


It's worth noting that OpenAI mixed it up and is not using the GPT-2 tokenizer this time.

It's unclear what tokenizer they are using and the documentation is being coy about it. It could be a more efficient or a less efficient tokenizer.



So it does.

The code there implies cl100k_base has a vocab size of 100k (I guess it's in the name lol) which means it is more comprehensive than GPT-2's 50k, so fewer tokens will be necessary.
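
For the curious, OpenAI's tiktoken library exposes cl100k_base directly; a quick sketch comparing token counts against the older GPT-2-style encoding (assuming tiktoken is installed):

    import tiktoken

    text = "The quick brown fox jumps over the lazy dog."

    cl100k = tiktoken.get_encoding("cl100k_base")  # new 100k-vocab encoding
    gpt2 = tiktoken.get_encoding("gpt2")           # older 50k-vocab encoding

    print(len(cl100k.encode(text)), "tokens with cl100k_base")
    print(len(gpt2.encode(text)), "tokens with the GPT-2 encoding")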


The documentation mentions that embeddings measure the relatedness of text strings, and that they’re commonly used for search, clustering, anomaly detection, recommendations, diversity management, and classification.

I’m not well versed in AI at all. Could anyone give some more fleshed out examples of some of the kinds of data that are fed into a tool like this, what the tool does with it, and what kinds of applications people might make with it?


You can use any sort of natural language text as input. The system transforms the input text into a multidimensional vector that roughly represents the semantic meaning of the input.

If you do this for many different inputs, you can get representations for each of them and store them in a database alongside the inputs. From there, you can use traditional methods to search for nearby vectors to efficiently search by semantic meaning.

One earlier related example is word2vec from 2013. This tool transforms individual words into embeddings, but is conceptually similar. Wikipedia has a decent overview that might be helpful:

https://en.wikipedia.org/wiki/Word2vec

That work demonstrated the utility of transforming meaning into vector space for search but also for basic semantic reasoning. For example, you could perform an operation like "brother - man + woman" on the embedding and the result would be an embedding very close to "sister".
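
That analogy is easy to reproduce with pretrained word2vec vectors via gensim (a sketch; the pretrained Google News model is a large download):

    import gensim.downloader as api

    # Downloads the pretrained Google News word2vec vectors on first use (~1.6 GB)
    model = api.load("word2vec-google-news-300")

    # "brother - man + woman" should land very close to "sister"
    print(model.most_similar(positive=["brother", "woman"], negative=["man"], topn=3))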


These sentence embeddings transform chunks of text into n-dimensional vectors. The key feature of these kinds of embeddings is that semantically similar pieces of text produce similar vectors (even when the words themselves aren't the same!). Once you have that, you can do all kinds of things. To take a few of the use cases you listed, imagine you had a corpus of documents (perhaps responses to a survey, or data you scraped/pulled from a service like Twitter or Reddit, etc.). You can now cluster those responses (find groups of vectors that are close to each other in n-dimensional space); you can find outliers (vectors that are far from all the others); you can build recommendation engines (find vectors mathematically similar to the vectors representing a user's previous activity); etc.


For example, say you own an online store with thousands of products and you want to implement search. For each product, you feed the name and description into the model to get an embedding. When someone types in a search query, you convert the query into an embedding too and then find the 10 closest product embeddings.
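
A minimal sketch of that flow, using the OpenAI Python client of the time plus a brute-force numpy search (the product strings and query are made up):

    import numpy as np
    import openai  # assumes openai.api_key is set

    products = [
        "Waterproof hiking boots with ankle support",
        "Stainless steel chef's knife, 8 inch",
        "Noise-cancelling over-ear headphones",
    ]

    def embed(texts):
        resp = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
        return np.array([d["embedding"] for d in resp["data"]])

    catalog = embed(products)            # one 1536-d vector per product, computed up front

    query_vec = embed(["shoes for mountain trails"])[0]
    scores = catalog @ query_vec         # dot product == cosine similarity for unit vectors
    for i in np.argsort(-scores)[:2]:
        print(products[i], round(float(scores[i]), 3))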


The documentation contains a good amount of hypothetical real-world use cases: https://beta.openai.com/docs/guides/embeddings/use-cases


I wonder if you can feed in email threads and get insights by asking questions like "What time is my upcoming flight?" or "Where is Lucy's wedding?"


Solid work, OpenAI, though I'd definitely like to see some more benchmarks on a wider variety of datasets in addition to the ones listed in the post. Regardless, it's good to see embeddings becoming more and more mainstream and easier to leverage out of the box. We tried image embeddings many moons ago (2015) with AlexNet trained on a custom dataset, but we still had to add quite a few custom rules post-inference.

A large selling point for ada-002 embeddings seems to be the reduced dimensionality. While lower-dimensional embeddings definitely help performance, I would say it's still highly dependent on the index that's being used. Graph- and tree-based indexes will benefit less than ones based on IVF (https://zilliz.com/blog/vector-index), as they do fewer overall distance computations during query time, but the speedup should still be significant.
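
For reference, an IVF index of the kind mentioned above looks roughly like this in FAISS (a sketch; the random arrays stand in for real stored and query embeddings):

    import faiss
    import numpy as np

    d = 1536                                           # ada-002 embedding dimension
    xb = np.random.rand(10000, d).astype("float32")    # placeholder corpus embeddings
    xq = np.random.rand(5, d).astype("float32")        # placeholder query embeddings

    quantizer = faiss.IndexFlatIP(d)                   # inner product == cosine for unit vectors
    index = faiss.IndexIVFFlat(quantizer, d, 256, faiss.METRIC_INNER_PRODUCT)

    index.train(xb)                                    # learn the coarse clusters
    index.add(xb)
    index.nprobe = 16                                  # clusters probed per query: recall/speed trade-off

    scores, ids = index.search(xq, 10)                 # top-10 neighbours for each query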

Still been meaning to try semantic search across Wikipedia via text embeddings. Will definitely play around with OpenAI + Milvus (https://github.com/milvus-io/milvus).


Would love to hear about this test. Also consider Cohere and Weaviate in the mix.


The key feature here is the cost: compared to the same model size from the previous iteration (ada), it's 1/10th the cost, and I'm curious how they achieved that.

It's low enough for quick experimentation and real-time usage, and I have a few fun tests I can do with it...


Their existing embedding offerings were ridiculously expensive and underpowered compared to free, open source models that you could run locally on your laptop.

E.g. https://huggingface.co/sentence-transformers/all-MiniLM-L6-v... outperformed their best model, runs locally for free, and uses a smaller vector space (quick sketch below). Strictly dominates.

For a full write-up on how bad it was, see https://medium.com/@nils_reimers/openai-gpt-3-text-embedding...

OpenAI is still lagging behind, but not as much, with this new offering.
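
For comparison, the local model linked above takes a couple of lines to run (a sketch, assuming the sentence-transformers package is installed):

    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")    # small, runs fine on a laptop CPU
    vecs = model.encode(["first sentence", "a second, similar sentence"],
                        normalize_embeddings=True)
    print(vecs.shape)          # (2, 384) -- a much smaller vector space than ada-002's 1536 dims
    print(vecs[0] @ vecs[1])   # cosine similarity, since the vectors are normalized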


To be fair, the Medium article you linked is from Jan 2022 -- before OpenAI released the new ada-002 embeddings.


There are a few caveats with SentenceTransformers that affect my ideas, so I'm willing to give OpenAI a try.


Once I've got embeddings my naïve next step would be to do cosine similarity for comparisons/search/anything that requires a distance. I see they do that in some examples.

Is that the standard approach these days? Are there newer default approaches that tend to work better?


Even better, the vectors come already unit-normalized, so you can just take the dot product to get cosine similarity.
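
A tiny numpy illustration with made-up unit vectors:

    import numpy as np

    a = np.array([0.6, 0.8])    # already unit length
    b = np.array([0.8, 0.6])

    # For unit-normalized vectors, cosine similarity reduces to a plain dot product
    cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    assert np.isclose(cosine, a @ b)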


You can also apply clustering or train a classification model based on embeddings.
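
For instance, with scikit-learn on top of a matrix of embeddings (a sketch; the random arrays stand in for real embedding vectors and labels):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.linear_model import LogisticRegression

    X = np.random.rand(500, 1536)        # placeholder for real embedding vectors
    y = np.random.randint(0, 2, 500)     # placeholder labels

    # Unsupervised: group documents by semantic similarity
    cluster_ids = KMeans(n_clusters=8, n_init=10).fit_predict(X)

    # Supervised: a simple classifier trained directly on the embedding vectors
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    print(clf.predict(X[:5]))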


The embeddings/search APIs seem super powerful; I've been meaning to play around with them more. I wonder how their performance compares to Elasticsearch / other text search/classification offerings out there.


The Search and Classification (and Answers) APIs were deprecated last week.[1]

They were never in serious competition with Elastic, as far as search goes. If you wanted to build a semantic search application using OpenAI embeddings, the more common (and scalable) method is to index those embeddings in a vector database like Pinecone.[2] In fact that's what OpenAI recommends to anyone who needs to transition off their Search API.

[1] https://help.openai.com/en/articles/6272952-search-transitio...

[2] https://docs.pinecone.io/docs/openai
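
Roughly, the embed-then-index pattern described above looks like this (a sketch assuming the pinecone-client API of that era; keys, environment and index name are placeholders, and exact calls may differ between client versions):

    import openai    # assumes openai.api_key is set
    import pinecone

    pinecone.init(api_key="YOUR_KEY", environment="us-west1-gcp")
    pinecone.create_index("docs", dimension=1536, metric="cosine")   # ada-002 vectors are 1536-d
    index = pinecone.Index("docs")

    def embed(text):
        resp = openai.Embedding.create(model="text-embedding-ada-002", input=[text])
        return resp["data"][0]["embedding"]

    # Index documents once, up front
    index.upsert(vectors=[("doc-1", embed("First document text")),
                          ("doc-2", embed("Second document text"))])

    # Semantic search at query time
    print(index.query(vector=embed("what was the second doc about?"), top_k=2))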


Agreed – one can also use Weaviate, which comes with an OOTB OpenAI module leveraging the embeddings endpoint: https://weaviate.io/developers/weaviate/current/retriever-ve...


I'm curious why you say 'more common'. When we did some adoption analysis, we found elastic vector indexes to have much more traction than others (excluding say faiss), even if it's a small % of elastic's advertising. How did you come to finding pinecone is the more common method?


I meant OpenAI + Vector Database as more common than OpenAI Search API. And then Pinecone as an example of a vector database.


Ah yes, thanks!

Yeah, we saw FAISS + ES as the leaders for serving embeddings / vector search, and Pinecone / Weaviate / I think Milvus as the next tier, so I was curious if we could improve the analysis :)


FAISS is losing ground since it is just a library that doesn't scale and lacks a lot of vector DB functionality like filtering. ES is often used just by default because devs have experience with it, but the performance is very low compared to dedicated solutions. See the GitHub trending list for vector databases: https://github.com/topics/vector-database


Is the analysis public? I'm curious how you determined the popularity of a product like Pinecone, since we don't have a public metric like GitHub stars.


We loosely measure search interest, CVs, and job ads: https://gradientflow.com/the-vector-database-index/

This kind of analysis is rarely precise, but it's useful for rougher tasks like tiering.

My personal question is whether vector indexes are (or will be) a good-enough general DB feature / compute lib for most users and use cases. A lucrative niche market can still happen as VC dollars disappear, similar to graph DBs, so it's not a knock, just important for folks deciding how to build things.


Elastic/OpenSearch have classically used BM25, but the latter recently added semantic search capabilities using neural embeddings in 2.4. Not sure about the former.

[1] https://opensearch.org/blog/opensearch-2-4-is-available-toda...


ES and OS are desperately slow because they're based on the Lucene vector search index. A dedicated vector database like Qdrant will always be a better choice: https://github.com/qdrant/qdrant


Do we have any idea why lucene vector search underperforms? As of lucene 9.1 (and elastic 8.4), it runs the same sort of filtered/categorical HNSW that qdrant runs (https://lucene.apache.org/core/9_1_0/core/org/apache/lucene/...). Qdrant's benchmarking code (https://github.com/qdrant/vector-db-benchmark/blob/9263ba/en...) does use the new filtered ann query with elastic 8.4, so it appears to be a fair benchmark. Why is lucene/elastic so much slower? Is it a rust vs. java thing? Or some memory management issues?


What effect will this have for connecting concepts between books? Either through summarization or topic mapping?


Does anybody have any good resources on the nuts and bolts of embeddings (particularly multi-modal embeddings)? I've got some experiments I'd like to run.


For multi-modal check out OpenAI CLIP: https://www.pinecone.io/learn/clip/
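
If you want to poke at CLIP itself, the Hugging Face transformers wrapper is an easy starting point (a sketch; the image path is a placeholder):

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    image = Image.open("photo.jpg")
    texts = ["a photo of a dog", "a photo of a cat"]

    inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)

    # Image and text are embedded into the same space; these are the image-text similarity scores
    print(out.logits_per_image.softmax(dim=-1))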


Exactly what I'm looking for, thank you.

EDIT: I am immediately struck that I need vastly more data and compute than I can afford for my experiment.


Also check out the embedding model from https://text-generator.io

It supports some things that OpenAI can't do: it retrieves any images linked from web pages and analyses the images (or images with text inside them) to help the embedding model.


Note that this model is not able to write code.

Prompt: "Write a C program to RSA encrypt a string."

text-embedding-ada-002:

%-C C C C C C C C

C

C

[... roughly a hundred more lines consisting almost entirely of the token "C", occasionally doubled or tripled, with one stray "aC" ...]

C

m m m m m m m m m m m m m m m m m m mm m C mC

(Note: a relatively short max length)

Compare text-davinci-003:

    #include <stdio.h>
    #include <math.h>

    // Utility function to find GCD of 'a' and 'b'
    int gcd(int a, int b) {
        if (a == 0)
            return b;
        return gcd(b % a, a);
    }

    // Function to find Modular Multiplicative Inverse of 'a' under modulo 'm'
    int modInverse(int a, int m) {
        a = a % m;
        for (int x = 1; x < m; x++)
            if ((a * x) % m == 1)
                return x;
    }

    // Function to compute a/b under modulo m
    int modDivide(int a, int b, int m) {
        a = a % m;
        int inv = modInverse(b, m);
        return (inv * a) % m;
    }

    // Function to calculate (a^b)%c
    long long int power(int a, int b, int c) {
        int res = 1;
        a = a % c;
        while (b > 0) {
            if (b & 1)
                res = modDivide(res * a, 1, c);
            b = b >> 1;
            a = modDivide(a * a, 1, c);
        }
        return res;
    }

    // C function for extended Euclidean Algorithm
    int gcdExtended(int a, int b, int *x, int *y) {
        if (a == 0) {
            *x = 0;
            *y = 1;
            return b;
        }
        int x1, y1;
        int gcd = gcdExtended(b % a, a, &x1, &y1);
        *x = y1 - (b / a) * x1;
        *y = x1;

        return gcd;
    }

    // Function to generate public and private keys
    void generateKeys(int p, int q, int *e, int *d, int *n) {
        // n = pq
        *n = p * q;

        // Computing phi(n) = (p-1)*(q-1)
        int phi = (p - 1) * (q - 1);

        // Choose an integer e such that e and phi(n) are co-prime
        *e = 2;
        while (*e < phi) {
            if (gcd(*e, phi) == 1)
                break;
            else
                ++*e;
        }

        // Compute d such that it satisfies d*e = 1 + k * phi(n)
        int k = 2;  // Fixed value
        *d = (1 + (k * phi)) / *e;
    }

    // Encryption Function
    long int RSA_Encrypt(int msg, int e, int n) {
        // Cipher Text c = (msg ^ e) % n
        long int c = power(msg, e, n);

        return c;
    }

    // Decryption Function
    long int RSA_Decrypt(int c, int d, int n) {
        // Message msg = (c ^ d) % n
        long int msg = power(c, d, n);

        return msg;
    }

    // Driver program
    int main() {
        // Two random prime numbers
        int p = 3;
        int q = 7;

        // Message to be encrypted
        int msg = 15;

        // Encryption key (e, n)
        int e, d, n;

        // Calculate n and phi
        generateKeys(p, q, &e, &d, &n);

        // Encryption
        long int c = RSA_Encrypt(msg, e, n);
        printf("Encrypted message is: %d\n", c);

        // Decryption
        long int m = RSA_Decrypt(c, d, n);
        printf("Original Message is: %d\n", m);

        return 0;
    }


It's an embedding model, so it generates vector embeddings, not text completions. That's to be expected.



