Tokenizers in NLP: Word, Character, and Sub-Word Models

In Natural Language Processing (NLP), the first step in almost every pipeline is tokenization: breaking raw text into smaller units called tokens. These tokens are the building blocks that machine learning models, especially Large Language Models (LLMs), use to understand and generate human language.

But not all tokenizers work the same way. Over the years, three primary types of tokenization strategies have emerged:

  1. Word-Based Tokenizers

  2. Character-Based Tokenizers

  3. Sub-Word Tokenizers

Let’s explore each of them in detail.


1. Word-Based Tokenizers

How They Work

Word-based tokenizers split text into words using spaces and punctuation as boundaries.
For example:

Input: "Natural Language Processing is amazing."
Tokens: ["Natural", "Language", "Processing", "is", "amazing", "."]
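
The split above can be reproduced with a short regular expression. This is a minimal sketch, assuming that "word" means a run of letters, digits, or underscores and that each punctuation mark becomes its own token; the function name word_tokenize is illustrative, not taken from any particular library.

```python
import re

def word_tokenize(text):
    """Split text into words and standalone punctuation marks."""
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation character.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = word_tokenize("Natural Language Processing is amazing.")
print(tokens)  # ['Natural', 'Language', 'Processing', 'is', 'amazing', '.']
```

A rule this simple also splits contractions such as "don't" into three tokens, which is one reason practical word tokenizers tend to layer language-specific rules on top.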

Advantages

  • Intuitive and easy to understand.

  • Works well for languages with clear word boundaries (e.g., English, French).

  • Good for simple NLP tasks like bag-of-words models or keyword extraction.

Limitations

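  • Out-of-Vocabulary (OOV) problem: if the tokenizer encounters a word that is not in its vocabulary (e.g., “ChatGPTified”), it typically maps the word to a generic unknown token such as [UNK], and the word’s meaning is lost.

  • Doesn’t generalize well to morphologically rich languages like Turkish, Finnish, or Hindi, where a single word can have many inflected forms and each form needs its own vocabulary entry.

  • Large vocabulary size: covering a language well can require hundreds of thousands of entries, which inflates the embedding and output layers of modern LLMs.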
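
To make the OOV limitation concrete, here is a small sketch of a fixed word vocabulary. The [UNK] token name, the helper names, and the tiny training sentence are all assumptions chosen for illustration.

```python
UNK = "[UNK]"

def build_vocab(corpus_tokens):
    """Map each distinct token to an integer id, reserving id 0 for unknown words."""
    vocab = {UNK: 0}
    for token in corpus_tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab

def encode(tokens, vocab):
    """Look up each token; anything not seen while building the vocabulary becomes UNK."""
    return [vocab.get(token, vocab[UNK]) for token in tokens]

vocab = build_vocab("the model was trained on this text".split())
ids = encode("the model was ChatGPTified".split(), vocab)
print(ids)  # [1, 2, 3, 0]
```

The new word collapses to id 0, so the model receives no information about it beyond "something unknown was here".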

Example in Practice

  • Early NLP models like Word2Vec or GloVe often relied on word-level tokenization.

2. Character-Based Tokenizers

How They Work

Character-based tokenizers break text down into individual characters, including punctuation and spaces.

Input: "NLP"
Tokens: ["N", "L", "P"]
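
In Python, a character-level split is close to a one-liner. The sketch below assumes that one Unicode code point counts as one character; the function name char_tokenize is illustrative.

```python
def char_tokenize(text):
    """Return one token per character, keeping spaces and punctuation."""
    return list(text)

print(char_tokenize("NLP"))              # ['N', 'L', 'P']
print(char_tokenize("Hi, NLP!"))         # ['H', 'i', ',', ' ', 'N', 'L', 'P', '!']
print(len(char_tokenize("Artificial")))  # 10
```

The token count always equals the character count, so sequence length grows quickly with the length of the text.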

Advantages

  • No OOV problem: Any word can be represented because it’s built from characters.

  • Useful for noisy text such as social media posts, where misspellings are common (e.g., "goooood" → g, o, o, o, o, o, d).

  • Works well for languages without clear word boundaries, such as Chinese or Japanese.

Limitations

  • Sequences become very long (e.g., "Artificial" = 10 tokens).

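  • Harder for models to capture long-range dependencies: related words sit many more positions apart, and the model must also learn word structure from scratch.

  • Less efficient for large-scale language modeling, because compute and memory costs grow with sequence length.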

Example in Practice

  • Some early deep learning models in speech recognition and text classification explored character-level tokenization.

  • Still used in tasks where robustness to typos and non-standard input is essential.


3. Sub-Word Tokenizers

Sub-word tokenization strikes a balance between word-based and character-based approaches. Instead of splitting text strictly into words or characters, it breaks words into meaningful sub-units.

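For example (an illustrative split; real tokenizers learn their pieces from data, and the pieces always concatenate back to the original word):

Input: "unbelievable"
Tokens: ["un", "believ", "able"]

If the tokenizer encounters a new word like "believability", it can still tokenize it using pieces it already knows:

Tokens: ["believ", "ability"]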
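
One simple way to apply a sub-word vocabulary is greedy longest-match-first segmentation. The sketch below is an illustration under stated assumptions, not the procedure of any specific tokenizer: the toy vocabulary is hand-picked, and it uses the stem "believ" rather than "believe" so that the pieces concatenate back to the input exactly.

```python
def segment(word, vocab):
    """Greedy longest-match-first segmentation of a single word."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest candidate first and shrink until a known piece is found.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No known piece starts here: fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces

TOY_VOCAB = {"un", "believ", "able", "ability"}  # hypothetical, hand-picked pieces
print(segment("unbelievable", TOY_VOCAB))   # ['un', 'believ', 'able']
print(segment("believability", TOY_VOCAB))  # ['believ', 'ability']
```

The single-character fallback is what keeps a sub-word scheme from ever hitting a true OOV case, at the cost of longer sequences for unfamiliar words.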

Advantages

  • Reduces OOV problems by breaking unknown words into known sub-units.

  • Keeps vocabulary size manageable.

  • Allows LLMs to handle rare and compound words effectively.

Common Algorithms

  1. Byte Pair Encoding (BPE)

    • Starts with characters as the base vocabulary.

    • Iteratively merges the most frequent pair of adjacent symbols into a new sub-word symbol, until the target vocabulary size is reached.

    • Used in GPT-2, RoBERTa, and OpenAI’s CLIP.

  2. WordPiece

    • Similar to BPE, but instead of merging the most frequent pair it picks the merge that most increases the likelihood of the training data.

    • Used in BERT and its variants.

  3. SentencePiece

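    • Treats text as a raw stream of Unicode characters and encodes whitespace as an ordinary symbol (▁), so it needs no language-specific pre-tokenization.

    • A library rather than a single algorithm: it can train either BPE or unigram sub-word models.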

    • Used in T5, XLNet, and ALBERT.
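
The merge loop described under Byte Pair Encoding (item 1 above) can be sketched in a few lines. This is a simplified character-level illustration, assuming a toy word-frequency table and ignoring the byte-level handling and end-of-word markers that production tokenizers add; the function names and the corpus are hypothetical.

```python
from collections import Counter

def merge_symbols(symbols, pair):
    """Replace each adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def learn_bpe(word_counts, num_merges):
    """Learn an ordered list of merges from a {word: frequency} table."""
    # Start from single characters as the base vocabulary.
    words = [(list(word), count) for word, count in word_counts.items()]
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, count in words:
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        words = [(merge_symbols(symbols, best), count) for symbols, count in words]
    return merges

def apply_bpe(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_symbols(symbols, pair)
    return symbols

toy_counts = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}  # made-up frequencies
merges = learn_bpe(toy_counts, num_merges=3)
print(merges)                     # [('u', 'g'), ('u', 'n'), ('h', 'ug')]
print(apply_bpe("hugs", merges))  # ['hug', 's']
print(apply_bpe("bug", merges))   # ['b', 'ug']
```

Note that "bug" never appears in the toy corpus, yet it is still tokenized from pieces learned on other words. A WordPiece-style trainer would keep the same loop but change how the best pair is scored.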

Limitations

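  • Still sensitive to preprocessing choices such as Unicode normalization, casing, and whitespace handling, and the same tokenizer must be used at training and inference time.

  • Sub-word boundaries come from corpus statistics rather than grammar, so they may not align with human intuition, and tokenizer-specific markers show up in the output (e.g., WordPiece writes “playing” as "play", "##ing", where "##" marks a continuation of the previous piece).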
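
The "##" prefix in the example above is the convention BERT-style WordPiece vocabularies use to mark a piece that continues the previous one. A minimal sketch of undoing it, assuming the pieces arrive as a flat list; the helper name join_wordpieces is illustrative:

```python
def join_wordpieces(pieces):
    """Rebuild words from pieces, where a leading '##' marks a continuation."""
    words = []
    for piece in pieces:
        if piece.startswith("##") and words:
            # Glue the continuation onto the previous word, dropping the marker.
            words[-1] += piece[2:]
        else:
            words.append(piece)
    return words

print(join_wordpieces(["play", "##ing", "is", "fun"]))  # ['playing', 'is', 'fun']
```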