# HuggingFace Tokenizers

Fast, production-ready tokenization: Rust-powered, with a Python API.
## When to Use

- High-performance tokenization (under 20 seconds per GB of text)
- Training custom tokenizers from scratch
- Tracking token-to-text alignment (character offsets)
- Production NLP pipelines
- BPE, WordPiece, or Unigram tokenization
## Quick Start

```python
from tokenizers import Tokenizer

# Load a pretrained tokenizer
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

# Encode
output = tokenizer.encode("Hello, how are you?")
print(output.tokens)  # ['hello', ',', 'how', 'are', 'you', '?']
print(output.ids)     # [7592, 1010, 2129, 2024, 2017, 1029]

# Decode
text = tokenizer.decode(output.ids)
```
## Train Custom Tokenizer

### BPE (GPT-2 style)

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import ByteLevel

# Initialize a byte-level BPE tokenizer
tokenizer = Tokenizer(BPE(unk_token="<|endoftext|>"))
tokenizer.pre_tokenizer = ByteLevel()

# Configure the trainer
trainer = BpeTrainer(
    vocab_size=50000,
    special_tokens=["<|endoftext|>", "<|pad|>"],
    min_frequency=2,
)

# Train
tokenizer.train(files=["data.txt"], trainer=trainer)

# Save
tokenizer.save("my-tokenizer.json")
```
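If the corpus is already in memory rather than on disk, `train_from_iterator` accepts any iterable of strings. A minimal sketch; `corpus` here is hypothetical data:

```python
# Train from an in-memory iterable instead of files on disk
corpus = ["first training document", "second training document"]  # hypothetical
tokenizer.train_from_iterator(corpus, trainer=trainer)
```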
### WordPiece (BERT style)

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = WordPieceTrainer(
    vocab_size=30000,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)

tokenizer.train(files=["data.txt"], trainer=trainer)
```
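One detail worth knowing: WordPiece subwords carry a `##` continuation prefix, so decoding only produces readable text if a matching decoder is attached:

```python
from tokenizers import decoders

# Merge "##"-prefixed subwords back into whole words when decoding
tokenizer.decoder = decoders.WordPiece()
```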
## Encoding Options

```python
# Single text
output = tokenizer.encode("Hello world")

# Batch encoding
outputs = tokenizer.encode_batch(["Hello", "World"])

# With padding
tokenizer.enable_padding(pad_id=0, pad_token="[PAD]")
outputs = tokenizer.encode_batch(texts)

# With truncation
tokenizer.enable_truncation(max_length=512)
output = tokenizer.encode(long_text)
```
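With padding enabled, every encoding in a batch is padded to the longest sequence, and the attention mask marks which positions are real tokens. A small illustration, assuming the padding setup above:

```python
outputs = tokenizer.encode_batch(["short", "a somewhat longer sentence"])
print([len(o.ids) for o in outputs])  # all equal after padding
print(outputs[0].attention_mask)      # 1 for real tokens, 0 for padding
```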
## Access Encoding Data

```python
output = tokenizer.encode("Hello world")

output.ids             # Token IDs
output.tokens          # Token strings
output.attention_mask  # Attention mask
output.offsets         # Character offsets (alignment)
output.word_ids        # Word index for each token
```
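The offsets map each token back to the exact span of the original string it came from, which is what token-to-text alignment relies on. A minimal sketch (special tokens, if any, typically map to an empty `(0, 0)` span):

```python
text = "Hello world"
output = tokenizer.encode(text)
for token, (start, end) in zip(output.tokens, output.offsets):
    print(token, "->", repr(text[start:end]))
```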
## Pre-tokenizers

```python
from tokenizers.pre_tokenizers import (
    Whitespace,        # Split on whitespace
    ByteLevel,         # Byte-level (GPT-2)
    BertPreTokenizer,  # BERT style
    Punctuation,       # Split on punctuation
    Sequence,          # Chain multiple
)

# Chain pre-tokenizers
tokenizer.pre_tokenizer = Sequence([Whitespace(), Punctuation()])
```
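A pre-tokenizer can be inspected in isolation with `pre_tokenize_str`, which returns the splits along with their character offsets:

```python
from tokenizers.pre_tokenizers import Whitespace

# Preview how text is split before the model sees it
print(Whitespace().pre_tokenize_str("Hello, world!"))
# [('Hello', (0, 5)), (',', (5, 6)), ('world', (7, 12)), ('!', (12, 13))]
```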
## Post-processing

```python
from tokenizers.processors import TemplateProcessing

# BERT-style: [CLS] ... [SEP]
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)
```
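Once the template is attached, single sentences and sentence pairs get their special tokens and type IDs automatically. For example (output shown for a lowercasing BERT-style tokenizer):

```python
output = tokenizer.encode("Hello", "World")  # sentence pair
print(output.tokens)    # ['[CLS]', 'hello', '[SEP]', 'world', '[SEP]']
print(output.type_ids)  # [0, 0, 0, 1, 1]
```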
## Normalization

```python
from tokenizers.normalizers import (
    NFD, NFKC, Lowercase, StripAccents, Sequence
)

# BERT normalization
tokenizer.normalizer = Sequence([NFD(), Lowercase(), StripAccents()])
```
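A normalizer can be previewed on raw text with `normalize_str`, before any tokenization happens:

```python
# NFD + Lowercase + StripAccents in action
print(tokenizer.normalizer.normalize_str("Héllò, Wörld!"))
# 'hello, world!'
```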
## With Transformers

```python
from transformers import PreTrainedTokenizerFast

# Wrap for transformers compatibility
fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)

# Now works with the transformers API
encoded = fast_tokenizer("Hello world", return_tensors="pt")
```
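The wrapper does not infer special-token roles from the vocabulary, so it is worth passing them explicitly; a sketch for a BERT-style vocabulary:

```python
fast_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
)
```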
## Save and Load

```python
# Save
tokenizer.save("tokenizer.json")

# Load
tokenizer = Tokenizer.from_file("tokenizer.json")

# From HuggingFace Hub
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
```
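A saved `tokenizer.json` can also back a transformers fast tokenizer directly, via the `tokenizer_file` argument:

```python
from transformers import PreTrainedTokenizerFast

# Load the serialized tokenizer straight into transformers
fast_tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")
```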
## Performance Tips

- Use batch encoding (`encode_batch`) for multiple texts; see the sketch after this list
- Enable padding/truncation once up front, not per encode call
- The choice of pre-tokenizer affects speed significantly
- Train on representative data for a better vocabulary
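A rough way to see the first tip in action; a minimal timing sketch, assuming `tokenizer` is any loaded `Tokenizer`:

```python
import time

texts = ["Hello, how are you?"] * 10_000

# Per-text loop: pays Python call overhead on every text
t0 = time.perf_counter()
for t in texts:
    tokenizer.encode(t)
print("loop: ", time.perf_counter() - t0)

# One batch call: the Rust core parallelizes across threads
t0 = time.perf_counter()
tokenizer.encode_batch(texts)
print("batch:", time.perf_counter() - t0)
```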
## vs Alternatives

| Tool | Best For |
|------|----------|
| tokenizers | Speed, custom training, production |
| SentencePiece | T5/ALBERT, language-independent |
| tiktoken | OpenAI models (GPT) |