Tokenizers in NLP: Word, Character, and Sub-Word Models
In Natural Language Processing (NLP), the first step in almost every pipeline is tokenization: breaking raw text into smaller units called tokens. These tokens are the units that machine learning models, including Large Language Models (LLMs), read as input and produce as output.
But not all tokenizers work the same way. Over the years, three primary types of tokenization strategies have emerged:
Word-Based Tokenizers
Character-Based Tokenizers
Sub-Word Tokenizers
Let’s explore each of them in detail.
1. Word-Based Tokenizers
How They Work
Word-based tokenizers split text into words using spaces and punctuation as boundaries.
For example:
Input: "Natural Language Processing is amazing."
Tokens: ["Natural", "Language", "Processing", "is", "amazing", "."]
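The split above can be reproduced with a short regular expression. This is a minimal sketch, not a production tokenizer; the function name word_tokenize is chosen here for illustration.

```python
import re

def word_tokenize(text: str) -> list[str]:
    # \w+ matches a run of letters, digits, or underscores (a "word");
    # [^\w\s] matches a single character that is neither word nor whitespace,
    # so each punctuation mark becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = word_tokenize("Natural Language Processing is amazing.")
print(tokens)
# ['Natural', 'Language', 'Processing', 'is', 'amazing', '.']
```

A rule this simple also splits contractions such as "don't" into three tokens, which is one reason practical word tokenizers add language-specific rules.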
Advantages
Intuitive and easy to understand.
Works well for languages with clear word boundaries (e.g., English, French).
Good for simple NLP tasks like bag-of-words models or keyword extraction.
Limitations
Out-of-Vocabulary (OOV) problem: If the tokenizer encounters a word that is not in its vocabulary (e.g., “ChatGPTified”), it typically has to map it to a generic unknown token such as [UNK], which discards the word’s content.
Doesn’t generalize well to morphologically rich languages like Turkish, Finnish, or Hindi, where words can have many forms.
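Large vocabulary size: every distinct word form needs its own entry, which inflates the embedding and output layers and makes the approach inefficient for modern LLMs.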
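The OOV problem can be seen in a toy example. The sketch below assumes a vocabulary built from a tiny training sentence and a reserved "[UNK]" entry; the helper names build_vocab and encode are hypothetical.

```python
def build_vocab(corpus_tokens):
    # Reserve id 0 for unknown words, then assign ids in order of first appearance.
    vocab = {"[UNK]": 0}
    for tok in corpus_tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab

def encode(tokens, vocab):
    # Any token missing from the vocabulary collapses to the [UNK] id.
    return [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]

train_tokens = "the model reads the text".split()
vocab = build_vocab(train_tokens)

ids = encode("the model reads ChatGPTified text".split(), vocab)
print(ids)
# [1, 2, 3, 0, 4]
```

The id 0 in the output is all the model sees of "ChatGPTified": its spelling and any hint of its meaning are lost.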
Example in Practice
Early NLP models like Word2Vec or GloVe often relied on word-level tokenization.
2. Character-Based Tokenizers
How They Work
Character-based tokenizers break text down into individual characters, including punctuation and spaces.
Input: "NLP"
Tokens: ["N", "L", "P"]
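In Python, character-level tokenization is little more than converting the string to a list. The sketch below also builds a small character-to-id mapping to show that the vocabulary stays tiny; the names char_tokenize and char_vocab are illustrative.

```python
def char_tokenize(text: str) -> list[str]:
    # Every character, including spaces and punctuation, becomes a token.
    return list(text)

print(char_tokenize("NLP"))
# ['N', 'L', 'P']

# The vocabulary is just the set of characters seen, so it stays small.
sample = "NLP is fun."
char_vocab = {ch: i for i, ch in enumerate(sorted(set(sample)))}
ids = [char_vocab[ch] for ch in char_tokenize(sample)]
print(len(char_vocab), len(ids))
```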
Advantages
No OOV problem: Any word can be represented because it’s built from characters.
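Useful for noisy text such as social media posts, where misspellings are common (e.g., "goooood" is simply split into its individual letters instead of becoming an unknown word).
Avoids the need for a separate word-segmentation step in languages written without spaces between words, such as Chinese or Japanese.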
Limitations
Sequences become very long (e.g., "Artificial" = 10 tokens).
Harder for models to capture long-range dependencies: related words sit many more positions apart in the sequence, and the model must also learn word structure from scratch.
Less efficient for large-scale language modeling.
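The length penalty is easy to measure. The following sketch compares the number of tokens produced by a simple regex word split and by a character split for the same sentence; the example sentence is made up for illustration.

```python
import re

sentence = "Artificial intelligence is transforming language technology."

# Word-level: words and punctuation marks.
word_tokens = re.findall(r"\w+|[^\w\s]", sentence)
# Character-level: every character, spaces included.
char_tokens = list(sentence)

print("word tokens:", len(word_tokens))
print("char tokens:", len(char_tokens))
print("ratio:", len(char_tokens) / len(word_tokens))
```

The model has to process several times as many positions to read the same text, which raises compute and memory cost per sentence.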
Example in Practice
Some early deep learning models in speech recognition and text classification explored character-level tokenization.
Still used in tasks where robustness to typos and non-standard input is essential.
3. Sub-Word Tokenizers
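Sub-word tokenization strikes a balance between word-based and character-based approaches. Instead of splitting text strictly into words or characters, it keeps common words whole and breaks rarer words into frequently occurring sub-units, which often correspond to meaningful parts such as prefixes, stems, and suffixes.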
For example:
Input: "unbelievable"
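Tokens: ["un", "believ", "able"]
The pieces are substrings of the original word, so joining them gives back "unbelievable" exactly. The exact split depends on the vocabulary the tokenizer has learned. If the tokenizer encounters a new word like "believability", it can still tokenize it from pieces it already knows:
Tokens: ["believ", "ability"]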
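One simple way to apply a sub-word vocabulary is greedy longest-match: at each position, take the longest piece that is in the vocabulary. The sketch below assumes a tiny hand-picked vocabulary purely for illustration; real vocabularies are learned from data, and real tokenizers differ in the details.

```python
def subword_tokenize(word: str, vocab: set[str]) -> list[str]:
    tokens = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        end = len(word)
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No piece matches at this position: give up on the whole word.
            return ["[UNK]"]
        tokens.append(word[start:end])
        start = end
    return tokens

# Hypothetical vocabulary, chosen by hand for this example.
vocab = {"un", "believ", "able", "ability"}

print(subword_tokenize("unbelievable", vocab))
# ['un', 'believ', 'able']
print(subword_tokenize("believability", vocab))
# ['believ', 'ability']
```

Because every piece is a substring of the input, concatenating the tokens always reproduces the original word.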
Advantages
Reduces OOV problems by breaking unknown words into known sub-units.
Keeps vocabulary size manageable.
Allows LLMs to handle rare and compound words effectively.
Common Algorithms
Byte Pair Encoding (BPE)
Starts with characters as the base vocabulary.
Iteratively merges the most frequent pair of adjacent symbols into a new sub-word, repeating until the target vocabulary size is reached.
Used in GPT-2, RoBERTa, and OpenAI’s CLIP.
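The BPE training loop can be sketched in a few lines. This is a simplified illustration, not the implementation used by any particular model: the toy corpus, the function names, and the tie-breaking rule (the first pair seen wins) are all assumptions made for this example.

```python
from collections import Counter

def get_pair_counts(word_freqs):
    # Count adjacent symbol pairs, weighted by how often each word occurs.
    pairs = Counter()
    for symbols, freq in word_freqs.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, word_freqs):
    # Replace every occurrence of the pair with one merged symbol.
    merged = {}
    for symbols, freq in word_freqs.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(corpus_counts, num_merges):
    # Start from single characters as the base vocabulary.
    word_freqs = {tuple(word): freq for word, freq in corpus_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(word_freqs)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties: first pair seen wins
        merges.append(best)
        word_freqs = merge_pair(best, word_freqs)
    return merges, word_freqs

# Toy corpus: word -> count (made up for illustration).
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, segmented = learn_bpe(corpus, num_merges=3)
print(merges)
# [('e', 's'), ('es', 't'), ('l', 'o')]
```

After three merges, "newest" is stored as n, e, w, est: the frequent ending "est" has become a single sub-word.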
WordPiece
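Similar to BPE, but instead of merging the most frequent pair it merges the pair that most increases the likelihood of the training data, which favors pairs whose parts rarely occur apart.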
Used in BERT and its variants.
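The difference from BPE shows up in how candidate merges are ranked. A commonly described approximation of the WordPiece criterion scores a pair as count(pair) / (count(first) × count(second)). The sketch below uses that formula as an assumption, omits the "##" continuation prefix for brevity, and runs on a made-up toy corpus.

```python
from collections import Counter

def score_pairs(word_freqs):
    pair_counts, symbol_counts = Counter(), Counter()
    for symbols, freq in word_freqs.items():
        for s in symbols:
            symbol_counts[s] += freq
        for a, b in zip(symbols, symbols[1:]):
            pair_counts[(a, b)] += freq
    # Assumed WordPiece-style score: pair frequency normalized by the
    # frequencies of its two parts.
    scores = {
        pair: count / (symbol_counts[pair[0]] * symbol_counts[pair[1]])
        for pair, count in pair_counts.items()
    }
    return pair_counts, scores

# Toy corpus: tuple of characters -> word count (made up for illustration).
word_freqs = {
    ("h", "u", "g"): 10,
    ("p", "u", "g"): 5,
    ("p", "u", "n"): 12,
    ("b", "u", "n"): 4,
    ("h", "u", "g", "s"): 5,
}

pair_counts, scores = score_pairs(word_freqs)
print("BPE would merge:      ", max(pair_counts, key=pair_counts.get))
print("WordPiece would merge:", max(scores, key=scores.get))
```

On this corpus the frequency criterion picks ("u", "g"), while the normalized score picks ("g", "s"), because "s" never appears without a preceding "g".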
SentencePiece
Treats text as a raw stream of Unicode characters and handles whitespace as an ordinary symbol, so it does not assume that words are separated by spaces.
Can produce BPE or unigram sub-word models.
Used in T5, XLNet, and ALBERT.
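The whitespace handling can be illustrated without the library itself. SentencePiece marks spaces with the meta symbol "▁" (U+2581), which makes detokenization a plain string join. The sketch below imitates only that escaping step; the piece list is a hypothetical segmentation, not the output of a trained model.

```python
WS = "\u2581"  # "▁", the symbol SentencePiece uses to represent a space

def escape_whitespace(text: str) -> str:
    # Prefix the text with the marker and replace every space by it, so that
    # word boundaries survive inside the pieces themselves.
    return WS + text.replace(" ", WS)

def detokenize(pieces: list[str]) -> str:
    # Joining the pieces and restoring spaces recovers the text; the leading
    # marker added by escape_whitespace is dropped.
    return "".join(pieces).replace(WS, " ")[1:]

print(escape_whitespace("Hello world."))
# ▁Hello▁world.

# Hypothetical segmentation of the escaped string into sub-word pieces.
pieces = ["\u2581Hello", "\u2581wor", "ld", "."]
print(detokenize(pieces))
# Hello world.
```

Because spaces are encoded in the pieces, the same procedure works for languages that do not use spaces at all.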
Limitations
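Still requires careful preprocessing: choices such as casing and Unicode normalization must be the same when the vocabulary is trained and when the tokenizer is used.
Sub-word boundaries are chosen from corpus statistics rather than linguistic rules, so they do not always align with human intuition. A split such as “playing” → "play", "##ing" (where "##" is WordPiece’s marker for a piece that continues a word) matches the morphology, but rarer words can be cut at points that correspond to no meaningful unit.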