LLM Tokenizer Compression

I woke up this morning and thought “when I send an API call to ChatGPT, is it generating text one token at a time? That can’t be the most efficient way to do it.”

It turns out I was right that that isn't the most efficient way to do it. I probably wasn't right in thinking that ChatGPT does it that way, but it's hard to tell, because the tools I discovered for doing it better are from Google, and Google is much better at energy-efficient compute than anyone else.

What I thought this morning

A naive implementation of a text prediction model predicts the next single token in sequence, which requires one model prediction for every token in the response. However, not all of these token predictions are equally difficult. If there are certain high-probability sequences of tokens, we would prefer to run the full, expensive LLM prediction only once and supplement it with a run of a much faster compression algorithm.


From a power consumption perspective, the advantages here are clear: language is usually highly compressible, and LLM inference is vastly more expensive than compression/decompression.

That is, replacing the setup:

Text -> token sequence -> next token prediction -> output token

With the setup

Text -> token sequence -> compressed token sequence -> next compressed token prediction -> next compressed output token -> decompressed output token sequence

Changes the computational cost of a sequence of output tokens from:

(next token prediction cost) * (num output tokens)

To 

(next compressed token prediction cost + compression/decompression cost) * (num compressed output tokens)

Where we would expect

compressed token prediction cost ≈ token prediction cost >> compression/decompression cost

And

num compressed output tokens ≈ (1/10?) * num output tokens

Depending on the perplexity of the compression model in practice.
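
To make the arithmetic concrete, here is a tiny runnable sketch of that cost model. Everything in it is made up for illustration: the 10x compression ratio, a codec that costs 1% of a model call, and the assumption that the big-model call per compressed token dominates everything else.

```python
# Back-of-envelope version of the cost comparison above.
# All numbers are illustrative assumptions, not measurements.

def plain_cost(n_output_tokens, predict_cost=1.0):
    # one expensive model prediction per output token
    return predict_cost * n_output_tokens

def compressed_cost(n_output_tokens, compression_ratio=10,
                    predict_cost=1.0, codec_cost=0.01):
    # one expensive prediction per *compressed* token, plus a cheap
    # compression/decompression step for each of them
    n_compressed = n_output_tokens / compression_ratio
    return (predict_cost + codec_cost) * n_compressed

print(plain_cost(1000))       # 1000.0 units of compute
print(compressed_cost(1000))  # 101.0 units: roughly the hoped-for 10x saving
```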

Looking at these equations, I do wonder whether increasing the complexity of the compression/decompression algorithm would actually be a practically helpful step; e.g. how well would a model function if it were made up of two equally sized language models, one handling compression and the other prediction? Can this scale to a larger number of sub-models? Is there a continuum here with a black-box language model on one side and something like a search engine on the other? This is getting way ahead of myself, though.

So I did a little look around and found the paper Training LLMs over Neurally Compressed Text, published April 4, 2024, just under 8 weeks ago.

Questions I had while reading the (first relevant) paper (I found)

P4: “Training models to mimic other models can be difficult [41], …” Is this not a perfect opportunity to use transfer learning? Since the compression model M1 is also trained “by us” and the model M2 has larger capacity, couldn't one just stick the weights of M1 into a subset of M2? Perhaps with some additional scaffolding, like initializing the rest of the tensor as an identity map rather than randomly? (A toy sketch of what I mean is below.)

Is there a term for this sort of process, which is not so much “transfer learning” as model augmentation after completion of a learning phase?
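
Nothing from the paper here, but this is a minimal sketch of the weight surgery I have in mind, for a single square weight matrix only. Real transformer layers have attention heads, layer norms, and non-square projections, so this is just the block-plus-identity idea, not an initialization recipe.

```python
import numpy as np

def embed_small_weights(w_small, d_large):
    """Place a trained (d_small x d_small) weight matrix in the top-left
    block of a (d_large x d_large) matrix, and make the remaining
    dimensions an identity map instead of random noise."""
    d_small = w_small.shape[0]
    assert w_small.shape == (d_small, d_small)
    assert d_large >= d_small

    w_large = np.zeros((d_large, d_large), dtype=w_small.dtype)
    w_large[:d_small, :d_small] = w_small          # copy M1's weights
    idx = np.arange(d_small, d_large)
    w_large[idx, idx] = 1.0                        # identity on the rest
    return w_large

# toy usage
w1 = np.random.randn(4, 4).astype(np.float32)
w2 = embed_small_weights(w1, 8)
x = np.random.randn(8).astype(np.float32)
# the first 4 dims behave like M1, the rest pass through unchanged
assert np.allclose((w2 @ x)[:4], w1 @ x[:4])
assert np.allclose((w2 @ x)[4:], x[4:])
```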

P5: “In this work, we use context-aware transformer-based language models [for M1]” Why is this decision made/appropriate? Maybe non-contextual compression does not yield perplexity gains over subword tokenization? That seems odd, since BPE over tokens naively looks like a strong baseline for comparison. Or is the use of transformer-based models optimal because of specialized compute?
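
As a footnote on what I mean by “BPE over tokens”: a toy merge loop like the one below, which is purely frequency-based and non-contextual. This is my own illustration of the baseline I was imagining, not anything the paper evaluates.

```python
from collections import Counter

def bpe_over_tokens(tokens, n_merges):
    """Toy byte-pair-encoding applied to an existing token sequence:
    repeatedly replace the most frequent adjacent pair with a single
    new symbol. Purely frequency-based and non-contextual."""
    seq = list(tokens)
    merges = []
    for _ in range(n_merges):
        pair_counts = Counter(zip(seq, seq[1:]))
        if not pair_counts:
            break
        (a, b), count = pair_counts.most_common(1)[0]
        if count < 2:
            break                      # no pair repeats; nothing left to gain
        merges.append((a, b))
        merged, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                merged.append((a, b))  # the merged pair becomes one symbol
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged
    return seq, merges

tokens = ["the", "cat", "sat", "on", "the", "cat", "mat", "the", "cat", "sat"]
compressed, merges = bpe_over_tokens(tokens, n_merges=3)
print(len(tokens), "->", len(compressed), "symbols; merges:", merges)
```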

P7: “The feed-forward layers use ReLU activations” Exclusively? Or is there input augmentation with some non-piecewise-linear functions like sine(a*x) at an early layer? The reference I've seen for this previously is cited here as [67].

P10: In table 2, it looks like b is byte size (larger bytes mean smaller proportions of the byte are wasted) and v is vocabulary size; but v=256 is used consistently, and never v=65k+, which seemed better in table 1 given that token compression was a larger goal than bit compression? Table 3 also uses only v=256.

P10: In table 3, it looks like everything except Bytes and SentencePiece (which are semantic encodings and not compression encodings) shows very small differences between uniform and unigram; I think this is the intended takeaway?

P13: Figure 4: What motivates the shape of the pareto frontier on this graph? Is there going to be a graph in this paper that gives the total compute cost of achieving a given training performance and then serving that model a given number of times? See also “This has the potential to reduce generation latency, at the cost of reduced compute efficiency.” What is compute efficiency measuring here? Relatedly, is v={2^12, 2^24} tried at any point? Again, I want to know how these things scale with the quantity of output data requested.
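
To be explicit about the accounting I'm asking for: something like the toy calculation below, where the training cost and per-request serving cost are made-up numbers, and the only point is that serving volume decides which model is cheaper overall.

```python
# Hypothetical accounting: a model that is pricier to train but cheaper to
# serve (e.g. via fewer model calls per request) wins once enough requests
# are served. All numbers below are made up.

def total_cost(train_cost, cost_per_request, n_requests):
    return train_cost + cost_per_request * n_requests

for n in (1e3, 1e6, 1e9):
    a = total_cost(train_cost=1e6, cost_per_request=10.0, n_requests=n)  # cheap to train
    b = total_cost(train_cost=3e6, cost_per_request=1.0, n_requests=n)   # cheap to serve
    print(f"{n:.0e} requests: A={a:.2e}  B={b:.2e}")
# A wins at 1e3 requests; B is already cheaper overall by 1e6 requests.
```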

P14: “Short Windows are best…the shortest…windows perform the best.” What about even shorter windows? Do 8-bit windows lose too much in compression, e.g. by being unable to encompass the token vocabulary? Is there a way to make 12-bit windows work, or is this fundamentally a hardware issue that really wants 8-bit bytes?

P15: “unlikely to deliver an ‘emergent’ ability…” It seems less interesting whether an ability is “emergent” than whether it can significantly improve the pareto frontier for serving. Can we project whether a larger M2 model would be able to scale to performance that is compute-competitive with the SentencePiece encoding while retaining lower latency? Can we project what size it would need to be to achieve this?

P17-19: EqualInfoAC loses semantic structure in tokenization; is it possible to recover some of this by using semantic heuristics rather than purely informatic heuristics for ending a token? E.g. ending a compressed token at 14 bits if the last input consumed was a space, instead of trying to squeeze 1-2 more bits in. This seems similar to doing something like BPE. Is this just equivalent to SentencePiece? Is there a way to land somewhere in the middle between the two? It seems like there is a lot of room to increase compression complexity. If the semantic structure is recovered, could larger bit sizes end up performing better?
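
To pin down the heuristic I'm describing, here is a toy sketch. It does no arithmetic coding at all: each token just carries a made-up bit cost standing in for whatever a real coder would spend on it, and the only thing being illustrated is closing a 16-bit window early when the last token ended at a word boundary.

```python
def pack_windows(token_bits, budget=16, end_on_space=True):
    """Greedily pack tokens into windows of at most `budget` bits.

    `token_bits` is a list of (token, bit_cost) pairs; bit_cost is a stand-in
    for however many bits a real coder would spend on that token. With
    `end_on_space`, a nearly full window is also closed early whenever the
    last token ends at a word boundary, instead of squeezing in 1-2 more bits.
    """
    windows, current, used = [], [], 0
    for tok, bits in token_bits:
        if current and used + bits > budget:   # token would overflow: close window
            windows.append(current)
            current, used = [], 0
        current.append(tok)
        used += bits
        if end_on_space and tok.endswith(" ") and used >= budget - 2:
            windows.append(current)            # close early at a word boundary
            current, used = [], 0
    if current:
        windows.append(current)
    return windows

# made-up tokens and bit costs
example = [("the ", 6), ("quick ", 8), ("b", 1), ("rown ", 7), ("fox", 8)]
print(pack_windows(example, end_on_space=True))
# [['the ', 'quick '], ['b', 'rown ', 'fox']]
print(pack_windows(example, end_on_space=False))
# [['the ', 'quick ', 'b'], ['rown ', 'fox']]
```

In this toy example both settings produce two windows, but only the space heuristic keeps “brown” inside a single window, which is the kind of semantic boundary alignment I'm asking about.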

P20: “...appears to be a challenge.” cop out!

Answers to my own questions now that I’ve read that paper

I jumped into the tokenization/compression space with basically zero background, so a lot of my questions are answered elsewhere in the literature. I'm not even that familiar with the deep learning research space. But here's my second pass through my own comments.

On page 4, I wondered about using some analogy from transfer learning to “embed” M1 into M2; I suspect this is a general question studied in the literature, and [52] is probably the relevant citation, which goes on my to-read list.

On page 7 I asked about exclusive use of ReLU activations. I am probably just missing the context that “Attention Is All You Need” provides: the non-ReLU activations go in an input layer that gets referred to as an “attention” layer and isn't being counted here. [67] also goes on my to-read list, and I guess I get to swallow the fact that this seems like a bad vocabulary choice to me personally, since it's been the term of art for years.

In table 2 I was confused that only v=256 was included, but I was right to infer that they would use v=65k as the best tested model going forward.

In table 3, I was incorrect to think that SentencePiece is not a compression encoding. I may have been correct in guessing it is not purely a compression encoding, and I think not being aware of this much earlier (2018) work in the same space is probably a big gap in my background; that paper could easily have been a better place for me to start as a first pass at answering my initial questions! [38] also goes on my reading list.

In figure 4 on page 13, I'm still not clear on how the pareto frontier was drawn or why it was drawn that way. I still feel uncertain about what exactly is being measured at each point, but this may be more a function of my unfamiliarity with standard benchmarks of either LLMs or compression algorithms, or of the way these metrics are combined in the paper, than an issue with the paper itself. I wonder if there is a standard reference or a good way for me to fill in some intuitions here.

On page 14, I ask about More Dakka. If 16-bit windows are better than 32- and 64-bit windows, why not 12-bit windows? 8-bit windows? 1-bit windows? I do think the paper actually addresses the 1-bit window case. I can also imagine that there's a problem when the window size is smaller than the vocabulary size (where v=“65k”=2^16). I can also see how, since most languages have about 100k words, increasing vocabulary size would have reduced impact. I'm still curious about what happens when you push the parameters a bit harder.

On page 15, I am maybe just objecting to the inclusion of a section on “emergent abilities” at all, since it seems like the authors (justifiably) think this is a fake thing, and they answer my actual questions elsewhere.

On pages 17-19, they discuss how EqualInfoAC tokens lose input-mapping stability and semantic validity. This seems right, but it also seems solvable with a more complex algorithm; I think the authors agree here, and see EqualInfoAC as a first pass, or maybe a proof of concept that deeply semantic tokenization similar to SentencePiece can be surpassed.

Reading List

Having gone through all this, I really liked this paper. I followed it a lot better than most deep learning papers or math papers I’ve read.

But of course, nothing exists in a vacuum, and I've read very few deep learning papers relative to the questions I have. So here are the citation links I followed that I would absolutely read next if I were a grad student. Since I'm not, I'm likely to get distracted (as I got distracted with this from my previous Community Notes process), but we'll see.

Adaptive Computation Time for Recurrent Neural Networks

MANTa: Efficient Gradient-based Tokenization for Robust End-to-End Language Modeling

Attention Is All You Need

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing

