What is Tokenization?

Tokenization is the process of breaking down a sequence of text into smaller units, called "tokens". These tokens could be words, subwords, characters, or even phrases, depending on the specific tokenization strategy used. Tokenization is a fundamental step in natural language processing (NLP) tasks, as it converts raw text data into a format that can be processed by machine learning models.

Different types of tokenization:

  • Word Tokenization: This is the most common tokenization strategy, where the text is split into individual words based on whitespace or punctuation. For example, the sentence "The quick brown fox jumps" would be tokenized into ["The", "quick", "brown", "fox", "jumps"].
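
    • A quick illustration in Python (a simple regex stands in for a full word tokenizer; punctuation, if present, is split off as its own token):

        import re

        sentence = "The quick brown fox jumps"
        # \w+ matches runs of word characters; [^\w\s] catches punctuation
        tokens = re.findall(r"\w+|[^\w\s]", sentence)
        print(tokens)  # ['The', 'quick', 'brown', 'fox', 'jumps']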

  • Subword Tokenization: In subword tokenization, words are broken down into smaller units, typically based on their frequency or morphology. This is particularly useful for handling out-of-vocabulary words and languages with complex morphology. Popular subword algorithms include Byte Pair Encoding (BPE) and WordPiece; SentencePiece is a widely used library that implements subword tokenization.

    • Example of BPE:

      • Let's say we have the following corpus of words: ["low", "lower", "newest", "widest", "running", "lows"]

      • Initialization: each word is split into its characters, and every distinct character becomes a subword in the vocabulary.

        Vocabulary: {'l', 'o', 'w', 'e', 'r', 'n', 's', 't', 'i', 'd', 'u', 'g'}

      • Merge Most Frequent Pair: count how often each adjacent pair of subwords occurs across the corpus, merge the most frequent pair, and add the merged symbol to the vocabulary (the original characters remain available).

        • Iteration 1: ('l', 'o') occurs in "low", "lower", and "lows" -> merge into 'lo'. Vocabulary: {'l', 'o', 'w', 'e', 'r', 'n', 's', 't', 'i', 'd', 'u', 'g', 'lo'}

        • Iteration 2: ('lo', 'w') occurs 3 times -> merge into 'low'. Vocabulary: {..., 'lo', 'low'}

        • Iteration 3: ('e', 's') occurs in "newest" and "widest" -> merge into 'es'. Vocabulary: {..., 'lo', 'low', 'es'}

      • Repeat Merging: repeat the merging process for a fixed number of iterations or until the vocabulary size reaches a predefined threshold.

        • Iteration 4: ('es', 't') occurs twice -> merge into 'est'. Vocabulary: {..., 'lo', 'low', 'es', 'est'}

      • Tokenization: tokenize new words by splitting them into characters and applying the learned merges in the order they were learned (a from-scratch sketch of the whole loop follows this example).

        • For example:

          • 'lowest' would be tokenized as ['low', 'est'].

          • 'lower' would be tokenized as ['low', 'e', 'r'], since no ('e', 'r') merge was learned.
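
      • As referenced above, here's a minimal from-scratch sketch of the merge loop (simplified: each word is counted once, and ties are broken by first occurrence):

          from collections import Counter

          corpus = ["low", "lower", "newest", "widest", "running", "lows"]
          words = [list(w) for w in corpus]  # start from individual characters

          def most_frequent_pair(words):
              # Count every adjacent pair of subwords across the corpus
              pairs = Counter()
              for w in words:
                  for a, b in zip(w, w[1:]):
                      pairs[(a, b)] += 1
              return pairs.most_common(1)[0][0] if pairs else None

          def merge_pair(words, pair):
              # Replace each occurrence of the pair with the merged symbol
              merged = []
              for w in words:
                  out, i = [], 0
                  while i < len(w):
                      if i + 1 < len(w) and (w[i], w[i + 1]) == pair:
                          out.append(w[i] + w[i + 1])
                          i += 2
                      else:
                          out.append(w[i])
                          i += 1
                  merged.append(out)
              return merged

          for step in range(4):  # the four merges walked through above
              pair = most_frequent_pair(words)
              words = merge_pair(words, pair)
              print(f"Iteration {step + 1}: merged {pair}")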

      • And here's a sketch of the same idea using Hugging Face's tokenizers library (this assumes a local training file named corpus.txt; the exact tokens you get depend on the merges learned from it):

          from tokenizers import ByteLevelBPETokenizer

          # Train a byte-level BPE tokenizer from scratch on a text file
          tokenizer = ByteLevelBPETokenizer()
          tokenizer.train(files=["corpus.txt"], vocab_size=1000, min_frequency=2)

          encoded = tokenizer.encode("lower")
          print(encoded.tokens)  # e.g. ['low', 'er'], depending on the corpus

  • Character Tokenization: In character tokenization, each character in the text is treated as a separate token. This strategy is useful for tasks where character-level information is important, such as text generation or spelling correction, and for cases where you want to preserve the smallest units of text without relying on word boundaries.
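
    • For example, a minimal sketch in Python:

        text = "fox"
        tokens = list(text)  # each character becomes its own token
        print(tokens)  # ['f', 'o', 'x']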

  • Phrasal Tokenization: Phrasal tokenization involves identifying and tokenizing multi-word phrases or expressions as single units. This can be useful for preserving the meaning of idiomatic expressions or named entities.
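
    • One way to sketch this is with NLTK's MWETokenizer, which re-joins known multi-word expressions after word tokenization (the phrase list here is a toy assumption):

        from nltk.tokenize import MWETokenizer

        # Treat "New York" as a single multi-word expression
        tokenizer = MWETokenizer([("New", "York")], separator=" ")
        tokens = tokenizer.tokenize("I flew to New York yesterday".split())
        print(tokens)  # ['I', 'flew', 'to', 'New York', 'yesterday']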

Tokenization Libraries:

  • NLTK (Natural Language Toolkit)

  • spaCy

  • Hugging Face's Tokenizers library

  • Stanford CoreNLP

Challenges with Tokenization:

Tokenization can be challenging, especially for languages with complex morphology, ambiguous word boundaries, or noisy text data. It requires careful handling of punctuation, special characters, and language-specific rules.
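
For instance, the naive regex splitter from the word-tokenization example above mangles English contractions, which a language-aware tokenizer needs special rules to handle:

    import re

    # "Don't" is split into three tokens, losing the contraction
    print(re.findall(r"\w+|[^\w\s]", "Don't stop"))
    # ['Don', "'", 't', 'stop']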


Each token is then represented by a unique numerical identifier

This step maps the textual tokens (words, subwords, or characters) to numerical values that can be processed by machine learning models.

In NLP tasks, textual data needs to be converted into numerical format because most machine learning algorithms operate on numerical data. Each token in the text is assigned a unique numerical identifier, typically an integer, which allows the model to understand and process the text.

Using the earlier example, "The quick brown fox jumps", after tokenization the tokens might be mapped as follows:

  • "The" → 1

  • "quick" → 2

  • "brown" → 3

  • "fox" → 4

  • "jumps" → 5

Each token in the sentence is mapped to a unique numerical identifier. During the encoding process, the model replaces each token with its corresponding numerical value, creating a sequence of numerical tokens that can be fed into the machine learning model for further processing.
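
A minimal sketch of this mapping in Python (the IDs here are illustrative; real vocabularies assign them differently and typically reserve some IDs for special tokens):

    tokens = ["The", "quick", "brown", "fox", "jumps"]

    # Assign each unique token an integer ID, starting at 1 as in the example above
    vocab = {tok: i for i, tok in enumerate(dict.fromkeys(tokens), start=1)}

    ids = [vocab[tok] for tok in tokens]
    print(vocab)  # {'The': 1, 'quick': 2, 'brown': 3, 'fox': 4, 'jumps': 5}
    print(ids)    # [1, 2, 3, 4, 5]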

This numerical representation allows the model to learn patterns and relationships within the text data and make predictions or perform tasks such as classification, generation, or translation.