New and Improved Embedding Model from OpenAI (openai.com)
114 points by craigkerstiens on Dec 15, 2022 | 46 comments


Does anyone know how OpenAI (and others) are extending the context windows of things like ChatGPT so far? E.g. if you exceed 2048/8192 (subword) tokens, does the model just chunk the input and evaluate the chunks separately? Is context/state maintained across chunks? I've never seen anyone actually explain this.


https://help.openai.com/en/articles/6787051-does-chatgpt-rem...

> While ChatGPT is able to remember what the user has said earlier in the conversation, there is a limit to how much information it can retain. The model is able to reference up to approximately 3000 words (or 4000 tokens) from the current conversation - any information beyond that is not stored.

This implies ChatGPT has a 4000 token maximum prompt and prior prompts in a given web session are inserted into the current prompt, most recent to oldest (probably with some sort of time context like "previously, user asked:"), up to 4000 tokens.


I've had longer discussions but I'm realising that I often ask for a summary, which would mean the model has a summary of the conversation so far in the window.


What's the technical limit? The width of the attention layer?


I have been playing with their completions API. I just keep track of previous conversational lines, and when I approach a configurable token threshold (input + output must fit within the given number of tokens, and the API reports the usage each time), I instruct the model to summarize the conversation thus far, with additional specific instructions to help it keep the useful bits of context. I then make that summary part of the context I send in, along with the kept and future conversational lines (rough sketch below).

Their API calls on the site have references to previous message ids, which makes me expect they're doing something similar.
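
Here's a rough sketch of that loop, assuming the completions API of the time (text-davinci-003) and a crude character-based token estimate; the thresholds and prompt wording are made up:

    import openai  # assumes openai.api_key is set

    MAX_CONTEXT = 4000      # rough prompt + completion budget, in tokens
    SUMMARIZE_AT = 3000     # start compacting history around here

    history = []            # prior conversational lines
    summary = ""            # rolling summary of older turns

    def rough_tokens(text):
        return len(text) // 4   # crude heuristic, not the real tokenizer

    def ask(user_line):
        global history, summary
        prompt = "\n".join([summary] + history + ["User: " + user_line, "Assistant:"])
        if rough_tokens(prompt) > SUMMARIZE_AT:
            # Ask the model to compress the conversation so far, keeping the useful bits
            resp = openai.Completion.create(
                model="text-davinci-003",
                prompt="Summarize this conversation, keeping names, decisions and open questions:\n"
                       + summary + "\n" + "\n".join(history),
                max_tokens=300)
            summary = resp["choices"][0]["text"].strip()
            history = []
            prompt = "\n".join([summary, "User: " + user_line, "Assistant:"])
        resp = openai.Completion.create(
            model="text-davinci-003", prompt=prompt,
            max_tokens=MAX_CONTEXT - rough_tokens(prompt))
        answer = resp["choices"][0]["text"].strip()
        history += ["User: " + user_line, "Assistant: " + answer]
        return answer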


I wonder if they tack on a preprocessing network with a larger input size (convolutions or something) beforehand and pass its outputs into the inputs of the large model. It's something I'd try, but I have no idea if that's what they're doing.


> Longer context. The context length of the new model is increased by a factor of four, from 2048 to 8192, making it more convenient to work with long documents.

8192 words is getting into the range of short stories or a master's thesis, which opens the door to some interesting applications.


Important to note that these tokens _can_ be whole words, but a word is often made up of multiple tokens, so 8192 tokens = 8192 words isn't strictly correct.

That said, your point stands. Most short stories run to low-to-mid four digits in word count, and the jump from 2048 tokens to 8192 puts them squarely within reach.

As someone who's been working on multi-layered approaches to using GPT-like models for long text generation (e.g. synopsis -> outline -> paragraph expansions) to get around the limited context window, it'll be interesting to see if people will keep working towards that end or if it'll all become a moot point as the effective context window continues to scale up.


Do we know exactly what the GPT3 tokenizer is? It could be combining multiple words together, for example common bigrams.

Never mind, it looks like they have a tokenizer tool online and every bigram I’ve given it BPEs to multiple tokens:

https://beta.openai.com/tokenizer


It's worth noting that OpenAI mixed it up and is not using the GPT-2 tokenizer this time.

It's unclear what tokenizer they are using and the documentation is being coy about it. It could be a more efficient or a less efficient tokenizer.



So it does.

The code there implies cl100k_base has a vocab size of 100k (I guess it's in the name lol) which means it is more comprehensive than GPT-2's 50k, so fewer tokens will be necessary.
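
For the curious, OpenAI's tiktoken library exposes cl100k_base directly; a quick sketch comparing token counts against the older GPT-2-style encoding (assuming tiktoken is installed):

    import tiktoken

    text = "The quick brown fox jumps over the lazy dog."

    cl100k = tiktoken.get_encoding("cl100k_base")  # new 100k-vocab encoding
    gpt2 = tiktoken.get_encoding("gpt2")           # older 50k-vocab encoding

    print(len(cl100k.encode(text)), "tokens with cl100k_base")
    print(len(gpt2.encode(text)), "tokens with the GPT-2 encoding")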


The documentation mentions that embeddings measure the relatedness of text strings, and that they’re commonly used for search, clustering, anomaly detection, recommendations, diversity management, and classification.

I’m not well versed in AI at all. Could anyone give some more fleshed out examples of some of the kinds of data that are fed into a tool like this, what the tool does with it, and what kinds of applications people might make with it?


You can use any sort of natural language text as input. The system transforms the input text into a multidimensional vector that roughly represents the semantic meaning of the input.

If you do this for many different inputs, you can get representations for each of them and store them in a database alongside the inputs. From there, you can use traditional methods to search for nearby vectors to efficiently search by semantic meaning.

One earlier related example is word2vec from 2013. This tool transforms individual words into embeddings, but is conceptually similar. Wikipedia has a decent overview that might be helpful:

https://en.wikipedia.org/wiki/Word2vec

That work demonstrated the utility of transforming meaning into vector space for search but also for basic semantic reasoning. For example, you could perform an operation like "brother - man + woman" on the embedding and the result would be an embedding very close to "sister".
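
That analogy is easy to reproduce with pretrained word2vec vectors via gensim (a sketch; the pretrained Google News model is a large download):

    import gensim.downloader as api

    # Downloads the pretrained Google News word2vec vectors on first use (~1.6 GB)
    model = api.load("word2vec-google-news-300")

    # "brother - man + woman" should land very close to "sister"
    print(model.most_similar(positive=["brother", "woman"], negative=["man"], topn=3))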


These sentence embeddings transform chunks of text into n-dimensional vectors. The key feature of these kinds of embeddings is that semantically similar pieces of text produce similar vectors (even when the words themselves aren't the same!). Once you have that, you can do all kinds of things. To take a few of the use cases you listed, imagine you had a corpus of documents (perhaps responses to a survey, or data you scraped/pulled from a service like Twitter or Reddit, etc.). You can now cluster those responses (find groups of vectors that are close to each other in n-dimensional space); you can find outliers (vectors that are far from all the others); you can build recommendation engines (find vectors mathematically similar to the vectors representing a user's previous activity); etc.


For example, say you own an online store with thousands of products and you want to implement search. For each product, you feed the name and description into the model to get an embedding. When someone types in a search query, you convert the query into an embedding too and then find the 10 closest product embeddings.
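
A minimal sketch of that flow, using the OpenAI Python client of the time plus a brute-force numpy search (the product strings and query are made up):

    import numpy as np
    import openai  # assumes openai.api_key is set

    products = [
        "Waterproof hiking boots with ankle support",
        "Stainless steel chef's knife, 8 inch",
        "Noise-cancelling over-ear headphones",
    ]

    def embed(texts):
        resp = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
        return np.array([d["embedding"] for d in resp["data"]])

    catalog = embed(products)            # one 1536-d vector per product, computed up front

    query_vec = embed(["shoes for mountain trails"])[0]
    scores = catalog @ query_vec         # dot product == cosine similarity for unit vectors
    for i in np.argsort(-scores)[:2]:
        print(products[i], round(float(scores[i]), 3))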


The documentation contains a good amount of hypothetical real-world use cases: https://beta.openai.com/docs/guides/embeddings/use-cases


I wonder if you can feed in email threads and get insights by asking questions like "What time is my upcoming flight?" or "Where is Lucy's wedding?"


Solid work, OpenAI, though I'd definitely like to see some more benchmarks on a wider variety of datasets in addition to the ones listed in the post. Regardless, it's good to see embeddings becoming more and more mainstream and easier to leverage out of the box. We tried image embeddings many moons ago (2015) with AlexNet trained on a custom dataset, but we still had to add quite a few custom rules post-inference.

A large selling point for ada-002 embeddings seems to be the reduced dimensionality. While lower-dimensional embeddings definitely help performance, I would say it's still highly dependent on the index that's being used. Graph- and tree-based indexes will benefit less than ones based on IVF (https://zilliz.com/blog/vector-index), as they do fewer overall distance computations during query time, but the speedup should still be significant.
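
For reference, an IVF index of the kind mentioned above looks roughly like this in FAISS (a sketch; the random arrays stand in for real stored and query embeddings):

    import faiss
    import numpy as np

    d = 1536                                           # ada-002 embedding dimension
    xb = np.random.rand(10000, d).astype("float32")    # placeholder corpus embeddings
    xq = np.random.rand(5, d).astype("float32")        # placeholder query embeddings

    quantizer = faiss.IndexFlatIP(d)                   # inner product == cosine for unit vectors
    index = faiss.IndexIVFFlat(quantizer, d, 256, faiss.METRIC_INNER_PRODUCT)

    index.train(xb)                                    # learn the coarse clusters
    index.add(xb)
    index.nprobe = 16                                  # clusters probed per query: recall/speed trade-off

    scores, ids = index.search(xq, 10)                 # top-10 neighbours for each query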

Still been meaning to try semantic search across Wikipedia via text embeddings. Will definitely play around with OpenAI + Milvus (https://github.com/milvus-io/milvus).


Would love to hear about this test. Also consider Cohere and Weaviate in the mix.


The key feature here is the cost: compared to the same model size from the previous iteration (ada), it's 1/10th the cost, and I'm curious how they achieved that.

It's low enough for quick experimentation and real-time usage, and I have a few fun tests I can do with it...


Their existing embedding offerings were ridiculously expensive and underpowered compared to free, open source models that you could run locally on your laptop.

E.g. https://huggingface.co/sentence-transformers/all-MiniLM-L6-v... outperformed their best model, runs locally for free, and uses a smaller vector space (quick sketch below). Strictly dominates.

For a full write-up on how bad it was, see https://medium.com/@nils_reimers/openai-gpt-3-text-embedding...

OpenAI is still lagging behind, but not as much, with this new offering.
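
For comparison, the local model linked above takes a couple of lines to run (a sketch, assuming the sentence-transformers package is installed):

    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")    # small, runs fine on a laptop CPU
    vecs = model.encode(["first sentence", "a second, similar sentence"],
                        normalize_embeddings=True)
    print(vecs.shape)          # (2, 384) -- a much smaller vector space than ada-002's 1536 dims
    print(vecs[0] @ vecs[1])   # cosine similarity, since the vectors are normalized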


To be fair, the Medium article you linked is from Jan 2022 -- before OpenAI released the new ada-002 embeddings.


There are a few caveats with SentenceTransformers that affect my ideas, so I'm willing to give OpenAI a try.


Once I've got embeddings my naïve next step would be to do cosine similarity for comparisons/search/anything that requires a distance. I see they do that in some examples.

Is that the standard approach these days? Are there newer default approaches that tend to work better?


Even better, the vectors come already unit-normalized, so you can just take the dot product to get cosine similarity.
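
A tiny numpy illustration with made-up unit vectors:

    import numpy as np

    a = np.array([0.6, 0.8])    # already unit length
    b = np.array([0.8, 0.6])

    # For unit-normalized vectors, cosine similarity reduces to a plain dot product
    cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    assert np.isclose(cosine, a @ b)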


You can also apply clustering or train a classification model based on embeddings.
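
For instance, with scikit-learn on top of a matrix of embeddings (a sketch; the random arrays stand in for real embedding vectors and labels):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.linear_model import LogisticRegression

    X = np.random.rand(500, 1536)        # placeholder for real embedding vectors
    y = np.random.randint(0, 2, 500)     # placeholder labels

    # Unsupervised: group documents by semantic similarity
    cluster_ids = KMeans(n_clusters=8, n_init=10).fit_predict(X)

    # Supervised: a simple classifier trained directly on the embedding vectors
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    print(clf.predict(X[:5]))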


The embeddings/search APIs seem super powerful; I've been meaning to play around with them more. I wonder how their performance compares to Elasticsearch / other text search/classification offerings out there.


The Search and Classification (and Answers) APIs were deprecated last week.[1]

They were never in serious competition with Elastic, as far as search goes. If you wanted to build a semantic search application using OpenAI embeddings, the more common (and scalable) method is to index those embeddings in a vector database like Pinecone.[2] In fact that's what OpenAI recommends to anyone who needs to transition off their Search API.

[1] https://help.openai.com/en/articles/6272952-search-transitio...

[2] https://docs.pinecone.io/docs/openai
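
Roughly, the embed-then-index pattern described above looks like this (a sketch assuming the pinecone-client API of that era; keys, environment and index name are placeholders, and exact calls may differ between client versions):

    import openai    # assumes openai.api_key is set
    import pinecone

    pinecone.init(api_key="YOUR_KEY", environment="us-west1-gcp")
    pinecone.create_index("docs", dimension=1536, metric="cosine")   # ada-002 vectors are 1536-d
    index = pinecone.Index("docs")

    def embed(text):
        resp = openai.Embedding.create(model="text-embedding-ada-002", input=[text])
        return resp["data"][0]["embedding"]

    # Index documents once, up front
    index.upsert(vectors=[("doc-1", embed("First document text")),
                          ("doc-2", embed("Second document text"))])

    # Semantic search at query time
    print(index.query(vector=embed("what was the second doc about?"), top_k=2))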


Agreed – one can also use Weaviate, which comes with an OOTB OpenAI module leveraging the embeddings endpoint: https://weaviate.io/developers/weaviate/current/retriever-ve...


I'm curious why you say 'more common'. When we did some adoption analysis, we found elastic vector indexes to have much more traction than others (excluding say faiss), even if it's a small % of elastic's advertising. How did you come to finding pinecone is the more common method?


I meant OpenAI + Vector Database as more common than OpenAI Search API. And then Pinecone as an example of a vector database.


Ah yes, thanks!

Yeah, we saw FAISS + ES as the leaders for serving embeddings / vector search, and Pinecone / Weaviate / I think Milvus as the next tier, so I was curious if we could improve the analysis :)


FAISS is losing ground since it is just a library that doesn't scale and lacks a lot of vector DB functionality like filtering. ES is often used just by default because devs have experience with it, but the performance is very low compared to dedicated solutions. See the GitHub trending list for vector databases: https://github.com/topics/vector-database


Is the analysis public? I'm curious how you determined the popularity of a product like Pinecone, since we don't have a public metric like GitHub stars.


We loosely measure search interest, CVs, and job ads: https://gradientflow.com/the-vector-database-index/

This kind of analysis is rarely precise, but it's useful for rougher tasks like tiering.

My personal question is whether vector indexes are (or will be) a good-enough general DB feature / compute lib for most users and use cases. A lucrative niche market can still happen as VC dollars disappear, similar to graph DBs, so it's not a knock, just important for folks deciding how to build things.


Elastic/OpenSearch have classically used BM25, but the latter recently added semantic search capabilities using neural embeddings in 2.4. Not sure about the former.

[1] https://opensearch.org/blog/opensearch-2-4-is-available-toda...


ES and OS are desperately slow because they're based on the Lucene vector search index. A dedicated vector database like Qdrant will always be a better choice: https://github.com/qdrant/qdrant


Do we have any idea why lucene vector search underperforms? As of lucene 9.1 (and elastic 8.4), it runs the same sort of filtered/categorical HNSW that qdrant runs (https://lucene.apache.org/core/9_1_0/core/org/apache/lucene/...). Qdrant's benchmarking code (https://github.com/qdrant/vector-db-benchmark/blob/9263ba/en...) does use the new filtered ann query with elastic 8.4, so it appears to be a fair benchmark. Why is lucene/elastic so much slower? Is it a rust vs. java thing? Or some memory management issues?


What effect will this have for connecting concepts between books? Either through summarization or topic mapping?


Does anybody have any good resources on the nuts and bolts of embeddings (particularly multi-modal embeddings)? I've got some experiments I'd like to run.


For multi-modal check out OpenAI CLIP: https://www.pinecone.io/learn/clip/
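
If you want to poke at CLIP itself, the Hugging Face transformers wrapper is an easy starting point (a sketch; the image path is a placeholder):

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    image = Image.open("photo.jpg")
    texts = ["a photo of a dog", "a photo of a cat"]

    inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)

    # Image and text are embedded into the same space; these are the image-text similarity scores
    print(out.logits_per_image.softmax(dim=-1))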


Exactly what I'm looking for, thank you.

EDIT: I am immediately struck that I need vastly more data and compute than I can afford for my experiment.


Also check out the embedding model from https://text-generator.io

It supports some things that OpenAI can't do: it retrieves any images linked from web pages and analyses the images (or images with text inside them) to help the embedding model.


Note that this model is not able to write code.

Prompt: "Write a C program to RSA encrypt a string."

text-embedding-ada-002:

%-C C C C C C C C

C

C

[... roughly a hundred more lines consisting almost entirely of the token "C", occasionally doubled or tripled, with one stray "aC" ...]

C

m m m m m m m m m m m m m m m m m m mm m C mC

(Note: a relatively short max length)

Compare text-davinci-003:

    #include <stdio.h>
    #include <math.h>

    // Utility function to find GCD of 'a' and 'b'
    int gcd(int a, int b) {
        if (a == 0)
            return b;
        return gcd(b % a, a);
    }

    // Function to find Modular Multiplicative Inverse of 'a' under modulo 'm'
    int modInverse(int a, int m) {
        a = a % m;
        for (int x = 1; x < m; x++)
            if ((a * x) % m == 1)
                return x;
    }

    // Function to compute a/b under modulo m
    int modDivide(int a, int b, int m) {
        a = a % m;
        int inv = modInverse(b, m);
        return (inv * a) % m;
    }

    // Function to calculate (a^b)%c
    long long int power(int a, int b, int c) {
        int res = 1;
        a = a % c;
        while (b > 0) {
            if (b & 1)
                res = modDivide(res * a, 1, c);
            b = b >> 1;
            a = modDivide(a * a, 1, c);
        }
        return res;
    }

    // C function for extended Euclidean Algorithm
    int gcdExtended(int a, int b, int *x, int *y) {
        if (a == 0) {
            *x = 0;
            *y = 1;
            return b;
        }
        int x1, y1;
        int gcd = gcdExtended(b % a, a, &x1, &y1);
        *x = y1 - (b / a) * x1;
        *y = x1;

        return gcd;
    }

    // Function to generate public and private keys
    void generateKeys(int p, int q, int *e, int *d, int *n) {
        // n = pq
        *n = p * q;

        // Computing phi(n) = (p-1)*(q-1)
        int phi = (p - 1) * (q - 1);

        // Choose an integer e such that e and phi(n) are co-prime
        *e = 2;
        while (*e < phi) {
            if (gcd(*e, phi) == 1)
                break;
            else
                ++*e;
        }

        // Compute d such that it satisfies d*e = 1 + k * phi(n)
        int k = 2;  // Fixed value
        *d = (1 + (k * phi)) / *e;
    }

    // Encryption Function
    long int RSA_Encrypt(int msg, int e, int n) {
        // Cipher Text c = (msg ^ e) % n
        long int c = power(msg, e, n);

        return c;
    }

    // Decryption Function
    long int RSA_Decrypt(int c, int d, int n) {
        // Message msg = (c ^ d) % n
        long int msg = power(c, d, n);

        return msg;
    }

    // Driver program
    int main() {
        // Two random prime numbers
        int p = 3;
        int q = 7;

        // Message to be encrypted
        int msg = 15;

        // Encryption key (e, n)
        int e, d, n;

        // Calculate n and phi
        generateKeys(p, q, &e, &d, &n);

        // Encryption
        long int c = RSA_Encrypt(msg, e, n);
        printf("Encrypted message is: %d\n", c);

        // Decryption
        long int m = RSA_Decrypt(c, d, n);
        printf("Original Message is: %d\n", m);

        return 0;
    }


It's an embedding model, so it generates vector embeddings, not text completions. That's to be expected.



