Generative AI models don’t process text the same way humans do. Understanding their internal, token-based environment can help explain some of their strange behaviors and stubborn limitations.
Most models, from small on-device models like Gemma to OpenAI’s industry-leading GPT-4o, are built on an architecture known as the transformer. Because of the way transformers form associations between text and other types of data, they can’t take in or output raw text, at least not without a massive amount of computation.
So, for both pragmatic and technical reasons, today’s transformer models work with text that has been broken down into smaller pieces called tokens, a process known as tokenization.
Tokens can be words, like “fantastic.” Or syllables, like “fan,” “tas,” and “tic.” Depending on the tokenizer (the model that performs the tokenization), they can even be individual characters in words (e.g., “f,” “a,” “n,” “t,” “a,” “s,” “t,” “i,” “c”).
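As a rough illustration of these three granularities, the sketch below splits the same word three ways. The helper names and the hand-written syllable table are assumptions made up for this example; real tokenizers learn their vocabularies from data and do not work this way.

```python
# Toy illustration only: real tokenizers learn their vocabulary from data.

def word_tokens(text: str) -> list[str]:
    """Split on whitespace, one token per word."""
    return text.split()


def char_tokens(text: str) -> list[str]:
    """One token per character, skipping whitespace."""
    return [ch for ch in text if not ch.isspace()]


# Hand-written sub-word split, assumed purely for the example.
SYLLABLES = {"fantastic": ["fan", "tas", "tic"]}


def syllable_tokens(text: str) -> list[str]:
    """Look up each word in the table; fall back to the whole word."""
    tokens = []
    for word in text.split():
        tokens.extend(SYLLABLES.get(word, [word]))
    return tokens


print(word_tokens("fantastic"))      # ['fantastic']
print(syllable_tokens("fantastic"))  # ['fan', 'tas', 'tic']
print(len(char_tokens("fantastic"))) # 9
```

The finer the split, the longer the sequence the model has to process for the same text.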
With this method, transformers can assimilate more information (in the semantic sense) before reaching an upper limit called the context window. But tokenization can also introduce bias.
Some tokens have odd spacing, which can throw a transformer off track. A tokenizer might encode “once upon a time” as “once”, “upon”, “a”, “time”, for example, while encoding “once upon a ” (which has a trailing space) as “once”, “upon”, “a”, “ ”. Depending on how a model is prompted (with “once upon a” or “once upon a ”), the results can be completely different, because the model doesn’t understand, as a person would, that the meaning is the same.
Tokenizers also treat case differently. “Hello” is not necessarily the same as “HELLO” to a model; “hello” is usually one token (depending on the tokenizer), while “HELLO” can be as many as three (“HE”, “LL”, and “O”). This is why many transformers fail the uppercase test.
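Both the spacing and the casing effects can be reproduced with a toy tokenizer. The sketch below uses greedy longest-match segmentation over a tiny vocabulary; the function name and the vocabulary are hypothetical, chosen only so the output mirrors the examples in the text, and the approach is a simplification rather than the algorithm of any particular model.

```python
# Hypothetical vocabulary, chosen only to mirror the examples in the text.
VOCAB = {"once", " upon", " a", " time", " ", "hello", "HE", "LL", "O"}


def greedy_tokenize(text: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match segmentation; unknown characters become single tokens."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a vocabulary entry matches.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No vocabulary entry starts here: emit the character on its own.
            tokens.append(text[i])
            i += 1
    return tokens


print(greedy_tokenize("once upon a time", VOCAB))  # ['once', ' upon', ' a', ' time']
print(greedy_tokenize("once upon a ", VOCAB))      # ['once', ' upon', ' a', ' ']
print(greedy_tokenize("hello", VOCAB))             # ['hello']
print(greedy_tokenize("HELLO", VOCAB))             # ['HE', 'LL', 'O']
```

To the model, the two prompts and the two spellings are simply different sequences of token IDs; nothing in the input marks them as equivalent.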
“It’s pretty hard to get around the question of exactly what a ‘word’ should be for a language model, and even if we could get human experts to agree on a perfect token vocabulary, the models would probably still find it useful to ‘chunk’ things even further,” Sheridan Feucht, a PhD student studying the interpretability of large language models at Northeastern University, told TechCrunch. “I think there’s no such thing as a perfect tokenizer because of this kind of fuzziness.”
This “fuzziness” creates even more problems in languages other than English.
Many tokenization methods assume that a space in a sentence denotes a new word. That’s because they were designed with English in mind. But not all languages use spaces to separate words. Chinese and Japanese don’t, and neither do Korean, Thai, or Khmer.
A 2023 Oxford study found that, because of differences in how non-English languages are tokenized, a transformer can take twice as long to complete a task phrased in a non-English language as the same task phrased in English. The same study, along with another, found that users of less “token-efficient” languages are likely to see worse model performance while paying more for usage, given that many AI vendors charge per token.
Tokenizers often treat each character in logographic writing systems (systems in which printed symbols represent words without relating to pronunciation, such as Chinese) as a distinct token, leading to high token counts. Similarly, tokenizers processing agglutinative languages (languages in which words are built from small, meaningful elements called morphemes, such as Turkish) tend to turn each morpheme into a token, increasing the overall token count. (The equivalent word for “hello” in Thai, สวัสดี, is six tokens.)
In 2023, Yennie Jun, an AI researcher at Google DeepMind, conducted an analysis comparing the tokenization of different languages and its downstream effects. Using a dataset of parallel texts translated into 52 languages, Jun showed that some languages needed 10 times more tokens than English to capture the same meaning.
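One tokenizer-independent way to get a feel for the disparity is to count UTF-8 bytes, which is the raw material a byte-level tokenizer starts from. The sketch below is only a proxy: the helper name is made up for this example, and actual token counts depend on the tokenizer's learned vocabulary.

```python
def utf8_cost(word: str) -> tuple[int, int]:
    """Return (code points, UTF-8 bytes) for a word."""
    return len(word), len(word.encode("utf-8"))


ENGLISH_HELLO = "hello"
THAI_HELLO = "สวัสดี"

print(utf8_cost(ENGLISH_HELLO))  # (5, 5)
print(utf8_cost(THAI_HELLO))     # (6, 18)
```

Each Thai character here takes three bytes in UTF-8, so a greeting of similar length on the page starts out more than three times longer at the byte level.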
Beyond linguistic inequalities, tokenization could explain why current models are bad at mathematics.
Numbers are rarely tokenized consistently. Because tokenizers don’t really know what numbers are, they might treat “380” as one token but represent “381” as a pair (“38” and “1”), effectively destroying the relationships between digits and the results of equations and formulas. The result is transformer confusion; a recent paper showed that models struggle to understand repetitive numerical patterns and context, particularly temporal data. (See: GPT-4 thinks 7,735 is greater than 7,926.)
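The inconsistency is easy to mimic. In the sketch below, a regular expression stands in for a vocabulary in which “380” happens to be a single entry while “381” is not; the pattern and the function name are assumptions for illustration, not taken from any real tokenizer.

```python
import re

# Hypothetical number vocabulary: alternatives are tried left to right,
# so "380" wins when it matches, then "38", then any single digit.
NUMBER_PIECES = re.compile(r"380|38|\d")


def split_number(digits: str) -> list[str]:
    """Split a digit string into the pieces the toy vocabulary allows."""
    return NUMBER_PIECES.findall(digits)


print(split_number("380"))  # ['380']
print(split_number("381"))  # ['38', '1']
```

Two adjacent integers end up with unrelated representations, so the model gets no structural hint that one is the other plus one.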
This is also why models are not very effective at solving anagram problems or reversing words.
So tokenization clearly presents challenges for generative AI. Can they be solved?
Maybe.
Feucht points to “byte-level” state space models like MambaByte, which can ingest far more data than transformers without a performance penalty by doing away with tokenization entirely. MambaByte, which works directly with raw bytes representing text and other data, is competitive with some transformer models on language-analysis tasks while better handling “noise” such as words with swapped characters, irregular spacing, and capitalization.
Models like MambaByte, however, are in the early stages of research.
“It’s probably better to let the models look at the characters directly without imposing tokenization, but right now that’s just computationally impossible for transformers,” Feucht said. “For transformer models in particular, the computation scales quadratically with the length of the sequence, so we really want to use short textual representations.”
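A back-of-the-envelope calculation shows why sequence length matters so much. The sketch below counts pairwise interactions in full self-attention as the square of the sequence length, ignoring constant factors and model width; the whitespace split is a stand-in for a real tokenizer, and the function name is made up for this example.

```python
def attention_pairs(sequence_length: int) -> int:
    """Pairwise interactions in full self-attention: n * n."""
    return sequence_length ** 2


text = "once upon a time"
n_tokens = len(text.split())         # 4 word-level tokens (toy split)
n_bytes = len(text.encode("utf-8"))  # 16 bytes

print(attention_pairs(n_tokens))  # 16
print(attention_pairs(n_bytes))   # 256
```

Feeding the same phrase in byte by byte makes the sequence four times longer and the attention cost sixteen times higher, which is the trade-off the quote describes.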
Barring a breakthrough in tokenization, it appears that new model architectures will be key.