Probably pay attention to tokenizers (cybernetist.com)
321 points by ingve 4 months ago | 94 comments



Tokenizers aren't considered the "sexy" part of LLMs, but where others see boring, I see opportunity. Papers like xVal [1] point toward specialization strategies in tokenization. Spelling and letter-level tasks are another problem that could benefit from innovation in tokenization.

LLMs are notoriously bad at counting letters in words or performing simple oulipos of letter omission. GPT-4o, for example, writes a small Python program and executes it in order to count letter instances. We all know that tokenization effectively erases knowledge about letters in prompts and directly hurts performance on these tasks, yet we haven't found a way to solve it.

1. https://ar5iv.labs.arxiv.org/html/2310.02989


This was ages ago, in the pre-transformer era, and I can't find the link anymore. But once upon a time I read a great paper that demonstrated that most of the performance differences being reported among popular embedding models of the time were better explained by text cleaning and tokenization than they were by the embedding model itself.

In other words, if you train a model using word2vec's preprocessing and GloVe's algorithm, the result looks more like a "standard-issue" word2vec model than a "standard-issue" GloVe model.


Yes, those models were sensitive to the preprocessing, far more than transformers.

However, Word2Vec and GloVe were fundamentally different; when used as designed, GloVe worked better pretty uniformly.


Tokenizers face an odd compute issue.

Since they're part of the pre-processing pipeline, you can't quickly test them out for effectiveness. You have to restart a pretraining run to test downstream effectiveness.

Separately,

As much as an attention module can do universal nonlinear transformations... I wonder if it makes sense to add specific modules for some math primitives as well. I remember that the executor paper [1] (a slight precursor to the "Attention Is All You Need" paper) created self-contained modules for operations like less-than, count, and sum, and then explicitly orchestrated them in the decoder.

I'm surprised we haven't seen such solutions produce SOTA results from the math-AI or code-AI research communities.

[1] https://arxiv.org/abs/1705.03633


What's the issue with character-level tokenization? (I assume it would be much better at count-the-letter tasks.) The article mentions it as an option but doesn't talk about why subword tokenization is preferred by most of the big LLMs out there.


Using subwords makes your sequences shorter, which makes them cost less.

Besides that, for alphabetic languages, there exists almost no relation between form and meaning. For example, “ring” and “wing” differ by one letter but have no real common meaning. By picking the character or byte as your unit of representation, the model basically has to learn to distinguish ring and wing in context. This is a lot of work!

So, while working on the character or byte level saves you some embeddings and thus makes your model smaller, it puts all of the work of distinguishing similar sequences with divergent meanings on the model itself, which means you need a larger model.

By having subwords, a part of this distinguishing work already has been done by the vocabulary itself. As the article points out, this sometimes fails.


> Besides that, for alphabetic languages, there exists almost no relation between form and meaning.

Also true for abugida-based languages, e.g. சரம் (saram = string) vs மரம் (maram = tree), and many more. I think your intention with specifying "alphabetic languages" was to say "non-logographic languages", right?


I'll do you one more and say "non-Chinese languages". Written Japanese - including the kanji portion of the script - has the same characteristic.

And even in Chinese it's a fairly weak relationship. A large portion of the meanings of individual characters come from sound loans. For example, the 英 in 英雄 means "hero", in 英语 means "England", and in 精英 means "flower". The relationship there is simple homophony.

On the other hand, one thing you do get with written Chinese is that "1 character = 1 morpheme" very nearly works. So mechanistically breaking a text into a sequence of morphemes can be done pretty reliably without the aid of a semantic model or exhaustive hard-coded mapping. I think that for many other languages you can't even get close using only syntactic analysis.


> I'll do you one more and say "non-Chinese languages". Written Japanese - including the kanji portion of the script - has the same characteristic.

Written Japanese is much more ideographic than written Chinese. Japanese spelling is determined, such as it is, by semantics. Chinese spelling is determined by sound. Thus, 女的, 娘们, and 妮子, all meaning 'girl' or 'woman', have no spelling in common because they are different words, while Japanese uses 女 for "jo" and "onna" despite a total lack of any relationship between those words.


I was trying to say “at least for alphabetic languages”. I don’t like to say things about languages I can’t speak or write. So, no, it wasn’t my intention to say “non-logographic languages”


I suspect that the holy grail here is figuring out how to break the input into a sequence of morphemes and non-morpheme lexical units.


What do you mean by non-morpheme lexical units? Syntactic particles, units too small to be morphemes? Lexical items that contain multiple morphemes?

In either case, isn't this something we already do well?


Punctuation, for example.

And no, at least for the languages with which I'm familiar, SOTA tokenizers tend to capture only the easy cases.

For example, the GPT-4 tokenizer breaks the first sentence of your post like so:

  What/ do/ you/ mean/ by/ non/-m/orp/heme/ lexical/ units/?
Notice how "morpheme" gets broken into three tokens, and none of them matches "morpheme"'s two morphemes. "Lexical" and "units" are each a single token, when they have three and two morphemes respectively.

Or in French, the word "cafetière" gets chopped willy-nilly into "c/afet/ière". The canonical breakdown is "cafe/t/ière".
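
This is easy to reproduce; a minimal sketch, assuming the tiktoken library is installed (cl100k_base is the GPT-4 encoding, and the exact splits are whatever the tokenizer produces rather than anything guaranteed):

  import tiktoken

  enc = tiktoken.get_encoding("cl100k_base")  # GPT-4's tokenizer
  for word in ["non-morpheme", "lexical", "units", "cafetière"]:
      pieces = [enc.decode([t]) for t in enc.encode(word)]
      print(word, "->", "/".join(pieces))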


Has anyone tried to combine a token embedding with some representation of the characters in the (sub)word? For example, use a 512-dimensional vector to represent a token, and reserve the last 12 values to spell out the word.


I'm not following - spell out the word how? Like put the actual bytes as numerical input to the transformer layer?


Yes


Actually adding the numerical value is, I think, not the right way to do it because of what happens when you matmul those values. Usually the right way would be to have low-dimensional character embeddings that are learnable, in addition to the token embeddings.

The problem with a pure numerical value representing a category as input to a neural network layer is that the byte-encoding number is going to be very hard for the transformer to memorize exactly, especially when numbers relatively close to each other often do not share much meaning. Categories are usually encoded somehow, like a one-hot embedding layer or, more recently, a learned embedding, so that different categories can be easily distinguished (different categories are close to orthogonal).

My prediction would be that using the numerical value directly would not work at all, and that using learnable embeddings would work, but you would have to reserve that part of the token embedding for every token, which would hurt performance a lot on non-spelling tasks relative to just letting the whole embedding represent the token however the model sees fit.

But, IDK! It would be easy enough to try! You should try it on a small toy model, and then try a small, learnable, reusable character embedding, and write a blog post. I would be happy to coach / offer some GPU time / answer any other questions you have while building it.
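
A rough PyTorch sketch of that second variant (a small learnable character table concatenated onto the token embedding); the sizes, the byte-padding scheme, and the module name are my own assumptions to make the idea concrete, not a tested recipe:

  import torch
  import torch.nn as nn

  class SpellingAwareEmbedding(nn.Module):
      def __init__(self, vocab_size, d_model=512, d_char=4, max_chars=12, n_bytes=257):
          super().__init__()
          d_tok = d_model - d_char * max_chars              # e.g. 512 - 48 = 464 left for the token itself
          self.tok_emb = nn.Embedding(vocab_size, d_tok)
          self.char_emb = nn.Embedding(n_bytes, d_char, padding_idx=0)  # small reusable character table

      def forward(self, token_ids, char_ids):
          # token_ids: (batch, seq); char_ids: (batch, seq, max_chars), 0-padded bytes of each token
          tok = self.tok_emb(token_ids)
          chars = self.char_emb(char_ids).flatten(-2)       # concatenate the per-character slots
          return torch.cat([tok, chars], dim=-1)            # (batch, seq, d_model)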


Which tasks would you test the expected improvements on from this addition?


Maybe make a metric of "count letter $L in word $word" - if you want to game it, you can choose words so that they are all tokenized into 1-2 tokens and each token has multiple instances of letter $L.

And then use something like HellaSwag to measure how much you've lost on general text completion compared to a vanilla LLM with the same embedding size, all dedicated to just the token.
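
A sketch of how such eval items could be built; the word list, the letter, and the GPT-2 encoding are illustrative assumptions (tiktoken assumed installed):

  import tiktoken

  enc = tiktoken.get_encoding("gpt2")
  words = ["strawberry", "berry", "mirror", "occurrence", "referrer"]
  letter = "r"

  # Keep only words that tokenize into 1-2 tokens, per the "game it" suggestion above,
  # and store the ground-truth letter count for each.
  eval_items = [(w, letter, w.count(letter)) for w in words if len(enc.encode(w)) <= 2]
  print(eval_items)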


Which model would you recommend to try it on? Would you train it from scratch, or finetune an existing one?


You would have to train the new model from scratch, since it would have all-new token embeddings with whatever character-encoding scheme you come up with. It would probably make sense to train a vanilla GPT from scratch with the same total embedding size as your control. I would start with https://github.com/karpathy/nanoGPT as a baseline, since you can train a toy (GPT-2-sized) LLM in a couple of days on an A100, which is pretty easy to come by.


Not that I know of, but encoding orthography in a fixed width vector usually carries the assumption that words with the same prefix are more similar. So there’s an alignment problem. You usually solve this using dynamic programming, but that doesn’t work in a vector.

For example “parent” and “parents” are aligned, they share letters in the same position, but “skew” and “askew” share no letters in the same position.


The other 500 values in the skew/askew vectors will be similar, though. The 12 character values don't need to be aligned; their function is to provide spelling. Adding such info will probably help the LLM answer questions requiring character-level knowledge (e.g. counting 'r's in 'strawberry').


Well, fastText uses character n-grams to compute embeddings for out-of-vocabulary words. This is pre-transformers work BTW.


IIRC, overlapping n-gram vectors are summed to form the token embedding - doesn't that effectively destroy any character-level representation of the token? Doesn't really make sense to me.


It works because they use really large ngram values, up to 6. So most character-level information is in these subwords.


Let’s say we want to use 6-grams and build an embedding vector for the word “because”: we add integer vectors for “becaus” and “ecause”, right? For example: [1,2,3,4,5,6] + [2,3,4,5,6,2] = [3,5,7,9,11,8]. Obviously we cannot use this resulting numerical vector to spell the input word. Pretty much all character level info is lost.
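
For reference, fastText pads the word with boundary markers and takes all n-grams in a size range (3-6 by default); a quick sketch of the subwords it would hash and sum, illustrative rather than the exact library internals:

  def char_ngrams(word, minn=3, maxn=6):
      marked = f"<{word}>"                       # fastText adds boundary markers
      return [marked[i:i + n]
              for n in range(minn, maxn + 1)
              for i in range(len(marked) - n + 1)]

  # The word's vector is the sum of the vectors for these subwords (plus the whole word),
  # which is where positional/spelling information gets blurred.
  print(char_ngrams("because"))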


Tokens are on average four characters, and the number of residual streams (and therefore the RAM) the LLM allocates to a given sequence is proportional to the number of units of input. The FLOPs are proportional to their square in the attention calculation.

You can hypothetically try to ameliorate this by other means, but if you just naively drop from tokenization to character- or byte-level models, this is what goes wrong.


4x seq length expansion doesn’t sound that bad.


I mean, it's not completely fatal, but it means an approximately 16x increase in runtime cost, if I'm not mistaken. That's probably not worth trying to solve letter counting in most applications.


It is not necessarily 16x if you, e.g., also decrease the model width by a factor of 4 or so, but yeah, naively the RAM and FLOPs scale up by n^2.


I think it has to do with both performance (smaller tokens means more tokens per sentence read and more runs per sentence generated) and with how embeddings work. You need a token for "dog" and a token for "puppy" to represent the relationship between the two as a dimension in latent space.


Context-length performance and memory scale as N^2. Smaller tokens mean worse scaling, up to a point.


I wrote a whole paper about this exact topic! (Syntactic, phonetic, and related constraints)

https://aclanthology.org/2022.cai-1.2/


> but where others see boring, I see opportunity

I feel this way about embeddings

This line of thought seems related to the old wisdom of finding innovative solutions by mucking around in the layer below whatever the "tools of the trade" are for your domain


> LLMs are notoriously bad at counting letters in words or performing simple oulipos of letter omission.

If it were so simple, why hasn’t this already been dealt with?

Multimodal VQA models also have had a hard time generalizing counting. Counting is not as simple as changing the tokenizer.


There is a paper about this: https://arxiv.org/pdf/2405.18719


I'm saying the oulipo rule is simple, not the task, given current tokenization methods.


Should the number 23 be tokenized as one token or two tokens?


It doesn’t matter. The challenge with counting doesn’t have to do with tokenization. Why this got into the zeitgeist, I don’t know.


No LLM struggles with two digit arithmetic. 100 digit addition is possible with the use of state of the art position encodings. Counting is not bottlenecked by arithmetic at all.

When you ask an LLM to count the number of "r" in the word Strawberry, the LLM will output a random number. If you ask it to separate the letters into S t r a w b e r r y, then each letter is tokenized independently and the attention mechanism is capable of performing the task.

What you are doing is essentially denying that the problem exists.


GPT-4o correctly answered:

"How many letters "r" are in the word Frurirpoprar"

And it didn't use program execution (at least it didn't show the icon, and the answer was very fast, so it's unlikely it generated and executed a program to count).


I wouldn't consider that a thing that's going to work generally. That word may tokenize to one per char and have seen relevant data, or it may be relatively close to some other word and it's seen data which gives the answer.


Why would you confidently say such a lie? It's exactly the opposite: it's mostly due to tokenization. Show NeurIPS papers which give evidence of the opposite, because I can square up with NeurIPS papers to substantiate that it is tokenization that causes these issues.


Tokenization absolutely screws with math and counting.


Two. That's the reality.

You interpret the token sequence by constructing a parse tree, but that doesn't require you to forget that the tokens exist.


If you use standard BPE, you likely won't tokenize every number by its digits; it depends on the data set used to create the tokenizer.

The point is, you have a choice. You can do the tokenization however you like. The reason 23 is interesting is that there is a case to be made that a model will more likely understand 23 is related to Jordan if it's one token, and if it's two tokens it's more difficult. The opposite is true for math problems.

The reality is whatever we want to make it. It's likely that current schemes are... suboptimal. In practice it would be great if every token were geometrically well spaced after embedding and preserved semantic information, among other things. The "other things" have taken precedence thus far.
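
You can see the current choice directly; a small sketch assuming tiktoken is installed (cl100k_base is the GPT-4 encoding, and the splits are whatever that tokenizer happens to produce):

  import tiktoken

  enc = tiktoken.get_encoding("cl100k_base")
  for num in ["23", "2024", "1234567"]:
      ids = enc.encode(num)
      print(num, "->", [enc.decode([i]) for i in ids])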


We already solved that with binary representation ;-)


And decoders.


>oulipos

?


> You need to understand [the input data] before you can do anything meaningful with it.

IMHO that's the main reason people turn to any sort of automated data-processing tools in the first place: they don't want to look at the input data. They'd rather have "the computer" look at it and maybe query them back with some additional info gathering requests. But thinking on their own? Ugh.

So I boldly propose the new definition of AGI: it's the data-processing entity that will (at last!) reliably liberate you from having to look at your data before you start shoving this data into that processing entity.


Over the past year I've encountered so many situations where a person's opinion of how well an LLM accomplishes a task actually says more about that person's reading comprehension skills than it does the LLM's performance. This applies to both positive and negative opinions.


I think I take something different away from the article: yes, tokenizers are important, but they're a means to get at something much bigger, which is how to clean up and normalize unstructured data. It's a current endeavor of mine at $dayjob to do this in a way that works reasonably well even for badly mangled documents. I don't have any silver bullets, at least nothing worthy of a blog post yet, but this is needed when dealing with OCR'd documents, so "post-OCR correction" turns up quite a few different approaches.

And this is an aside, but I see folks using LLMs to do this correction in the first place. I don't think using LLMs to do correction in a multi-pass system is inherently bad, but I haven't been able to get good results out of "call/response" (i.e. a prompt to clean up this text). The best results come when you're running an LLM locally and cleaning incrementally, using token probabilities to guide you. You get some candidate words from your wordlist based on a fuzzy match of the text you do have, and candidate words predicted from the previous text, and when both align -- ding! It's (obviously) not the fastest method, however.


You might have better luck giving the LM the original document and having it generate its own OCR independently, then asking the LLM to tie-break between its own generation and the OCR output, while the image is still in the context window, until it is satisfied that it got things correct.


This is interesting. What types of content are you using this approach on and how does it handle semi structured data? For instance, embedded tables.


I really appreciated this blog post, and in particular I appreciated the segment talking about typos.

We were discussing this earlier this week -- I'm helping with a RAG-like application for a project right now, and we're concerned with how much small typos or formatting differences in users' queries can throw off our embedding distances.

One thought was: should we be augmenting our training data (or at the very least, our pretraining data) with intentional typos / substitutions / capitalizations, just to help it learn that "wrk" and "work" are probably synonyms? I looked briefly around for typo augmentation for (pre)training and didn't see anything at first blush, so I'm guessing that if this is a common practice, it's called something else.
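
For what it's worth, a minimal sketch of that kind of augmentation; the perturbation ops and rate here are my own assumptions, not an established recipe:

  import random

  def add_typos(text, rate=0.1, seed=0):
      rng = random.Random(seed)
      out = []
      for ch in text:
          if ch.isalpha() and rng.random() < rate:
              op = rng.choice(["drop", "double", "case"])
              if op == "drop":
                  continue                      # "work" -> "wrk"-style omissions
              out.append(ch + ch if op == "double" else ch.swapcase())
          else:
              out.append(ch)
      return "".join(out)

  print(add_typos("I have received the wrong package"))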


I work with full-text search, where this is common. Here are some points:

Stemming: Reducing words to their base or root form (e.g., “working,” “worked” becoming “work”).

Lemmatization: Similar to stemming, but more sophisticated, accounting for context (e.g., “better” lemmatizes to “good”).

Token normalization: Standardizing tokens, such as converting “wrk” to “work” through predefined rules (case folding, character replacement).

Fuzzy matching: Allowing approximate matches based on edit distance (e.g., “wrk” matches “work” due to minimal character difference).

Phonetic matching: Matching words that sound similar, sometimes used to match abbreviations or common misspellings.

Thesaurus-based search: Using a predefined list of synonyms or alternative spellings to expand search queries.

Most of these are open and free lists you can use; check the sources of Manticore Search, for example.
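
A small sketch of a few of these, assuming NLTK is installed and its WordNet data downloaded (nltk.download("wordnet")); the expected outputs noted in the comments may vary by version:

  import difflib
  from nltk.stem import PorterStemmer, WordNetLemmatizer

  stemmer = PorterStemmer()
  print(stemmer.stem("working"), stemmer.stem("worked"))   # stemming: both reduce to "work"

  lemmatizer = WordNetLemmatizer()
  print(lemmatizer.lemmatize("better", pos="a"))           # lemmatization: "good" (needs the POS hint)

  # Fuzzy matching by string similarity: "wrk" resolves to "work".
  print(difflib.get_close_matches("wrk", ["work", "word", "worm"], n=1, cutoff=0.6))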


> Lemmatization: Similar to stemming, but more sophisticated, accounting for context (e.g., “better” lemmatizes to “good”).

I don't understand. How is that different from stemming? What's the base form of "better" if not "good"? The nature of the relationship between "better" and "good" is no different from that between "work" and "worked".


Stemming is basically rules based on the characters. It came first.

This is because most words in most languages follow patterns of affixes/prefixes (e.g. worse/worst, harder/hardest), but not always (good/better/best)

The problem was that word/term-frequency-based modelling would inappropriately fail to link terms that actually had the same root (stem).

Stemming removed those affixes, so it turned "worse and worst" into "wor and wor", "harder/hardest" into "hard", etc.

However, it failed for cases like good/better.

Lemmatizing took a larger context and built up databases of word senses, linking such cases so as to process words more reliably. So lemmatizing is rules-based, plus more.


> So lemmatizing is rules based, plus more.

Fundamentally, the rule of lemmatizing is that you encounter a word, you look it up in a table, and your output is whatever the table says. There are no other rules. Thus, the lemma of seraphim is seraph and the lemma of interim is interim. (I'm also puzzled by your invocation of "context", since this is an entirely context-free process.)

There has never been any period in linguistic analysis or its ancestor, philology, in which this wasn't done. The only reason to do it on a computer is that you don't have a digital representation of the mapping from token to lemma. But it's not an approach to language processing, it's an approach to lack of resources.


We don't disagree. A lookup table with exact rules is a rules system to me, from an NLP/GOFAI perspective. I was aware of how the libraries tend to work because I had often used things like looking up lemmas/word senses/POS in NLTK and spaCy in the past, and I know those libraries' code fairly well.

Context today may mean more (e.g. the whole sentence, or string, or the prompt context for an LLM), and obviously context has a meaning in computational linguistics (e.g. "context-free grammar"), but the point here is that stemmers arbitrarily follow the same process without a second stage. If a stemmer encounters "best" and "good", it by definition does not have a stage that maps them to the same lemma. Context is just one of those overloaded terms, unfortunately.

Lemmatizing, in terms of how it works in simple scenarios (let's imagine reviews), helps to lump those words together and correctly identify the proportion of term frequencies for words we might be interested in, more consistently than stemming can. It's still limited by using word breaks like spaces or punctuation, of course.


I see your point about context-free table lookup, but it looks to me as though authorfly's distinctions would apply to how the tables get written in the first place.


Porter stemming is currently widely used in adtech for keywords.


I'm glad this is mentioned. I've suspected that using correct grammar, punctuation, and spelling greatly impacts response quality. It's hard to measure objectively, so I've just decided to write my prompts in perfect English to be sure. I have a friend who prompts like he texts, and I've always felt he was getting lower-quality responses. Not unusable, just a little worse, and he needs to correct it more.


Check out the training data. Sentence-transformer models' training data includes lots of typos, and this is desirable. There was debate around training/inferencing with stemmed/postprocessed words for a long time.

Typos should minimally impact your RAG.


It depends if they are using a “vanilla” instruction-tuned model or are applying additional task-specific fine-tuning. Fine-tuning with data that doesn’t have misspellings can make the model “forget” how to handle them.

In general, fine-tuned models often fail to generalize well on inputs that aren’t very close to examples in the fine-tuning data set.


Yes, but you can control that.

You can use SetFit, fewer examples, an SVM, etc., depending on how much separation, recall, and other aspects matter to you for the task at hand.

Sensitivity to biasing toward the dataset is a choice of training method, not a fixed attribute.

It's just not really a major issue these days unless you fine-tune with an entirely new or unseen language.


This is very helpful, thank you!

We are doing a fair bit of task-specific fine-tuning for an asymmetric embeddings model (connecting user-entered descriptions of symptoms with the service solutions that resolved their issues).

I would like to run more experiments with this and see if introducing typos into the user-entered descriptions will help it not forget as much.

Thank you again!


For queries there is an easy solution: give the question/search term to an LLM and let it rephrase it. A lot of basic RAG examples do that.

This might also work for indexing your data, but has the potential to get really expensive quickly.


I used to work on an app that very heavily leaned on Elasticsearch to do advanced text querying for similarities between a 1-2 sentence input and a corpus of paragraph+ length documents.

It was fascinating how much tokenization strategies could affect a particular subset of queries. A really great example is "W-4" or "W4". Standard tokenization might split on the "-" or split on letter/number boundaries. That input then becomes completely unidentifiable in the index, when it otherwise would have been a very rich factor in matching HR / salary / tax related content.

Different domain, but this doesn't shock me at all.


The trained embedding vectors for the token equivalents of W4 and W-4 would be mapped to a similar space due to their appearance in the same contexts.


The point of the GP post is that the "w-4" token had very different results from ["w", "-4"] or similar splits where the "w" and "4" wound up in separate tokens.


Yes, I used to work on a system that had Elasticsearch and also some custom Word2Vec models. What had the most impact on the quality of the search in ES and on the quality of our W2V model was tokenization and a custom n-grams system.


I finally understood the weirdness of tokenizers after watching the video Andrej Karpathy made: "Let's build the GPT Tokenizer" (https://www.youtube.com/watch?v=zduSFxRajkE).

He goes through why we need them instead of raw byte sequences (too expensive) and how the Byte Pair Encoding algorithm works. Worth spending 2h for the deeper understanding if you deal with LLMs.


> One of the things I noticed over the past year is how a lot of developers who are used to developing in the traditional (deterministic) space fail to change the way they should think about problems in the statistical space which is ultimately what LLM apps are.

I’m a developer and don’t struggle with this, where I really struggle is trying to explain this to users.


It's kind of interesting because I think most people implementing RAG aren't even thinking about tokenization at all. They're thinking about embeddings:

1. chunk the corpus of data (various strategies but they're all somewhat intuitive)

2. compute embedding for each chunk

3. generate search query/queries

4. compute embedding for each query

5. rank corpus chunks by distance to query (vector search)

6. construct return values (e.g chunk + surrounding context, or whole doc, etc)

So this article really gets at the importance of a hidden, relatively mundane-feeling operation that can have an outsized impact on the performance of the system. I do wish it had more concrete recommendations in the last section, plus a code sample of a robust project with normalization, fine-tuning, and evals.
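
A minimal sketch of steps 1-6 above with sentence-transformers (the fixed-size chunking and the example strings are illustrative assumptions):

  from sentence_transformers import SentenceTransformer, util

  model = SentenceTransformer("all-MiniLM-L6-v2")

  document = "Tokenizers split text into subwords. Embedding models map chunks to vectors. ..."
  chunks = [document[i:i + 200] for i in range(0, len(document), 200)]  # 1. naive chunking
  chunk_embeddings = model.encode(chunks)                               # 2. embed each chunk

  query = "how does text get split into tokens?"                        # 3. the search query
  query_embedding = model.encode([query])                               # 4. embed the query

  scores = util.cos_sim(query_embedding, chunk_embeddings)[0]           # 5. rank chunks by similarity
  best = int(scores.argmax())
  print(chunks[best])                                                   # 6. return the best chunk (+ context)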


Can someone expand on this?

> Chunking is more or less a fixable problem with some clever techniques: these are pretty well documented around the internet;

Curious about what chunking solutions are out there for different sets of data/problems


It's only "solved" if you're okay with a 50-90% retrieval rate or have particularly nice data. There's a lot of stuff like "referencing the techniques from Chapter 2 we do <blah>" in the wild, and any chunking solution is unlikely to correctly answer queries involving both Chapter 2 and <blah>, at least not without significant false positive rates.

That said, the chunking people are doing is worse than the SOTA. The core thing you want to do is understand your data well enough to ensure that any question, as best as possible, has relevant data within a single chunk. Details vary (maybe the details are what you're asking for?).


Most data has semantic boundaries: whether tokens, words, lines, paragraphs, blocks, sections, articles, chapters, versions, etc. and ideally the chunking algorithm will align with those boundaries in the actual data. But there is a lot of variety.


I had some success with simple aliasing at the beginning and end of chunks. In my next project, I'll try an idea that I saw somewhere:

1. do naive chunking like before

2. calculate the embeddings of each chunk

3. cluster the chunks by their embeddings to see which chunks actually bring new information to the corpus

4. summarize similar chunks into smaller chunks

Sounds like a smart way of using embeddings to reduce the amount of context misses. I'm not sure it works well, though :)
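
A rough sketch of steps 2-3 above; KMeans and the number of clusters are assumptions on my part, the idea doesn't prescribe a particular clustering algorithm:

  from sentence_transformers import SentenceTransformer
  from sklearn.cluster import KMeans

  model = SentenceTransformer("all-MiniLM-L6-v2")
  chunks = [
      "Tokenizers split text into subword units.",
      "A tokenizer breaks text up into subword pieces.",   # near-duplicate of the first chunk
      "Embedding distance is sensitive to typos.",
  ]
  embeddings = model.encode(chunks)

  labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
  for label, chunk in zip(labels, chunks):
      print(label, chunk)   # chunks sharing a label are candidates for merging/summarizing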


Very nicely written article. Personally, I find RAG (and more abstractly, vector search) the only mildly interesting development in the latest LLM fad, and have always felt that LLMs sit way too far down the diminishing-returns curve to be interesting. However, I can't believe that tokenization, and embeddings in general, are not broadly considered the most paramount aspect of all deep learning. The latent space your model captures is the most important part of the whole pipeline; otherwise, what is any deep learning model even doing?


An issue I've seen in several RAG implementations is assuming that the target documents, however cleverly they're chunked, will be good search keys for incoming queries. Unless your incoming search text looks semantically like the documents you're searching over (not the case in general), you'll get bad hits. On a recent project, we saw a big improvement in retrieval relevance when we separated the search keys from the returned values (chunked documents) and used an LM to generate appropriate keys, which were then embedded. Appropriate in this case means "sentences like what the user might input if they're expecting this chunk back".


Interesting! So you basically got a LM to rephrase the search phrase/keys into the style of the target documents, then used that in the RAG pipeline? Did you do an initial search first to limit the documents?


IIUC they're doing some sort of "q/a" for each chunk from documents, where they ask an LLM to "play the user role and ask a question that would be answered by this chunk". They then embed those questions, and match live user queries with those questions first, then maybe re-rank on the document chunks retrieved.


This is an awesome article, but I’m missing the part where solutions for each of the problems were discussed.

Run a spell check before tokenizing? Maybe even tokenize the misspelled word and the potential corrected word next to each other, like "misspld (misspelled)"?

For the issue with the brand names the tokenizer doesn’t know, I have no idea how to handle it. This problem is probably even worse in less common languages, or in languages which use a lot of compound words.


Is this true?

>> Do not panic! A lot of the large LLM vocabularies are pretty huge (30k-300k tokens large)

Seems small by an order of magnitude (at least). English alone is 1+ millions words


Most of these 1+ million words are almost never used, so 200k is plenty for English. Optimistically, we hope that rarer words would be longer and to some degree compositional (optim-ism, optim-istic, etc.), but unfortunately this is not what tokenisers arrive at (and you are more likely to get "opt-i-mis-m" or something like that). People have tried to optimise tokenisation and the main part of LLM training jointly, which leads to more sensible results, but this is unworkable for larger models, so we are stuck with inflated basic vocabularies.

It is also probably possible now to go even for larger vocabularies, in the 1-2 million range (by factorising the embedding matrix, for example), but this does not lead to noticeable improvements in performance, AFAIK.


Performance would be massively improved on constrained text tasks. That alone makes it worth it to expand the vocabulary size.


Tokens are often sub-word, all the way down to bytes (which are implicitly understood as UTF-8, but models will sometimes generate invalid UTF-8...).


BPE is complete. Every valid Unicode string can be encoded with any BPE tokenizer.

BPE basically starts with a token for every possible byte value and then creates new tokens by looking at common pairs of existing tokens ('t' followed by 'h' becomes a new token 'th').
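
A toy sketch of that training loop (real tokenizers add pre-tokenization, special tokens, and much more):

  from collections import Counter

  def train_bpe(text, num_merges=10):
      seq = [bytes([b]) for b in text.encode("utf-8")]   # start with one token per byte
      merges = []
      for _ in range(num_merges):
          pairs = Counter(zip(seq, seq[1:]))
          if not pairs:
              break
          (a, b), _ = pairs.most_common(1)[0]            # most frequent adjacent pair
          merges.append((a, b))
          merged, i = [], 0
          while i < len(seq):
              if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
                  merged.append(a + b)                   # replace the pair with a new token
                  i += 2
              else:
                  merged.append(seq[i])
                  i += 1
          seq = merged
      return merges, seq

  print(train_bpe("the theory of the thing", num_merges=5)[1])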


The difference in the dates example seems right to me: 20 October 2024 and 2024-20-10 are not the same.

Dates in different locales can be written as yyyy-MM-dd. It can also be a catalog/reference number. So it seems right that their embedding similarity is not perfectly aligned.

So, it's not a tokenizer problem. The text meant different things according to the LLM.


Can't repro some of the numbers in this blog post, for example:

  from sentence_transformers import SentenceTransformer
  from sentence_transformers import util

  model = SentenceTransformer('all-MiniLM-L6-v2')

  data_to_check = [
    "I have recieved wrong package",
    "I hve recieved wrong package"
  ]
  embeddings = model.encode(data_to_check)
  util.cos_sim(embeddings, embeddings)
Outputs:

  tensor([[1.0000, 0.9749],
        [0.9749, 1.0000]])


Your data differs from theirs: they have "I have received wrong package" vs "I hve received wrong pckage". You misspelled "received" in both and didn't omit an "a" from "package" in the "bad" data.
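
Re-running the snippet above with those exact strings should change the number (whether it then matches the blog's figure will depend on the model version); a hedged tweak reusing model and util from the snippet above:

  data_to_check = [
      "I have received wrong package",
      "I hve received wrong pckage"
  ]
  embeddings = model.encode(data_to_check)
  print(util.cos_sim(embeddings, embeddings))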


Do pictograms represent a way to reduce tokens?



