
Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "huggingface-tokenizers" with this command: npx skills add eyadsibai/ltk/eyadsibai-ltk-huggingface-tokenizers

HuggingFace Tokenizers

Fast, production-ready tokenization - Rust-powered, Python API.

When to Use

  • High-performance tokenization (<20s per GB)

  • Train custom tokenizers from scratch

  • Track token-to-text alignment

  • Production NLP pipelines

  • Need BPE, WordPiece, or Unigram tokenization

Quick Start

from tokenizers import Tokenizer

Load pretrained

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

Encode

output = tokenizer.encode("Hello, how are you?")
print(output.tokens)  # ['hello', ',', 'how', 'are', 'you', '?']
print(output.ids)     # [7592, 1010, 2129, 2024, 2017, 1029]

Decode

text = tokenizer.decode(output.ids)
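
decode also accepts a skip_special_tokens flag (it defaults to True in the tokenizers library); it only matters when the encoding actually contains special tokens:

# Keep any special tokens ([CLS], [SEP], ...) in the decoded string
text_with_specials = tokenizer.decode(output.ids, skip_special_tokens=False)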

Train Custom Tokenizer

BPE (GPT-2 style)

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import ByteLevel

Initialize

tokenizer = Tokenizer(BPE(unk_token="<|endoftext|>"))
tokenizer.pre_tokenizer = ByteLevel()

Configure trainer

trainer = BpeTrainer(
    vocab_size=50000,
    special_tokens=["<|endoftext|>", "<|pad|>"],
    min_frequency=2
)

Train

tokenizer.train(files=["data.txt"], trainer=trainer)
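
If the corpus is not sitting on disk as plain-text files, training can also consume any Python iterator of strings via train_from_iterator; a minimal sketch reusing the trainer above:

def corpus_iterator():
    # Yield raw text strings from any source (list, database, datasets column, ...)
    with open("data.txt", encoding="utf-8") as f:
        for line in f:
            yield line

tokenizer.train_from_iterator(corpus_iterator(), trainer=trainer)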

Save

tokenizer.save("my-tokenizer.json")
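
One thing this walkthrough does not set is a decoder: with byte-level BPE, decode() would otherwise leave the byte-level space markers (Ġ) in the output. A small optional addition:

from tokenizers.decoders import ByteLevel as ByteLevelDecoder

# Convert byte-level tokens back to readable text on decode
tokenizer.decoder = ByteLevelDecoder()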

WordPiece (BERT style)

from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = WordPieceTrainer(
    vocab_size=30000,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
)

tokenizer.train(files=["data.txt"], trainer=trainer)
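
As with BPE above, attaching a matching decoder is optional but makes decode() merge the ## continuation pieces back into whole words:

from tokenizers import decoders

# Rejoin '##'-prefixed subwords when decoding
tokenizer.decoder = decoders.WordPiece(prefix="##")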

Encoding Options

Single text

output = tokenizer.encode("Hello world")

Batch encoding

outputs = tokenizer.encode_batch(["Hello", "World"])

With padding

tokenizer.enable_padding(pad_id=0, pad_token="[PAD]")
outputs = tokenizer.encode_batch(texts)
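
With padding enabled, every encoding in the batch is padded to the longest sequence, and the attention mask marks which positions hold real tokens (texts and values below are illustrative):

outputs = tokenizer.encode_batch(["Hello world", "Hi"])
print(len(outputs[0].ids) == len(outputs[1].ids))  # True: same padded length
print(outputs[1].attention_mask)                   # 1s for real tokens, 0s over padding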

With truncation

tokenizer.enable_truncation(max_length=512)
output = tokenizer.encode(long_text)

Access Encoding Data

output = tokenizer.encode("Hello world")

output.ids             # Token IDs
output.tokens          # Token strings
output.attention_mask  # Attention mask
output.offsets         # Character offsets (alignment)
output.word_ids        # Word indices
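
The offsets are what make token-to-text alignment possible: each token maps to a (start, end) character span of the original string. For example:

text = "Hello world"
output = tokenizer.encode(text)
for token, (start, end) in zip(output.tokens, output.offsets):
    print(token, "->", repr(text[start:end]))  # special tokens map to empty spans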

Pre-tokenizers

from tokenizers.pre_tokenizers import (
    Whitespace,        # Split on whitespace
    ByteLevel,         # Byte-level (GPT-2)
    BertPreTokenizer,  # BERT style
    Punctuation,       # Split on punctuation
    Sequence,          # Chain multiple
)

Chain pre-tokenizers

from tokenizers.pre_tokenizers import Sequence, Whitespace, Punctuation

tokenizer.pre_tokenizer = Sequence([Whitespace(), Punctuation()])
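
A pre-tokenizer can be inspected in isolation with pre_tokenize_str, which returns the raw splits and their character offsets before any model is applied (output shown is indicative):

splits = tokenizer.pre_tokenizer.pre_tokenize_str("Hello, world!")
print(splits)  # e.g. [('Hello', (0, 5)), (',', (5, 6)), ('world', (7, 12)), ('!', (12, 13))]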

Post-processing

from tokenizers.processors import TemplateProcessing

BERT-style: [CLS] ... [SEP]

tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)
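
With this template in place, encoding a sentence pair wraps both segments in the special tokens and assigns a type id per segment (assuming [CLS] and [SEP] exist in the vocabulary):

output = tokenizer.encode("How are you?", "I am fine.")
print(output.tokens)    # ['[CLS]', ..., '[SEP]', ..., '[SEP]']
print(output.type_ids)  # 0s for the first segment, 1s for the second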

Normalization

from tokenizers.normalizers import (
    NFD, NFKC, Lowercase, StripAccents, Sequence
)

BERT normalization

tokenizer.normalizer = Sequence([NFD(), Lowercase(), StripAccents()])
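
A normalizer can be tried on a raw string with normalize_str, which is handy for checking the pipeline before training:

print(tokenizer.normalizer.normalize_str("Héllò, Wörld!"))  # "hello, world!"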

With Transformers

from transformers import PreTrainedTokenizerFast

Wrap for transformers compatibility

fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)

Now works with transformers

encoded = fast_tokenizer("Hello world", return_tensors="pt")
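
Wrapping with tokenizer_object does not tell transformers which tokens are special, so they are usually passed explicitly; the wrapper can then be saved and reloaded like any transformers tokenizer (directory name below is illustrative):

fast_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    unk_token="[UNK]", pad_token="[PAD]", cls_token="[CLS]",
    sep_token="[SEP]", mask_token="[MASK]",
)
fast_tokenizer.save_pretrained("my-tokenizer")  # writes tokenizer.json and config files
reloaded = PreTrainedTokenizerFast.from_pretrained("my-tokenizer")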

Save and Load

Save

tokenizer.save("tokenizer.json")

Load

tokenizer = Tokenizer.from_file("tokenizer.json")

From HuggingFace Hub

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

Performance Tips

  • Use batch encoding for multiple texts (see the timing sketch after this list)

  • Enable padding/truncation once, not per-encode

  • Pre-tokenizer choice affects speed significantly

  • Train on representative data for better vocabulary
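
A rough way to see the batching advantage from the first tip on your own data (workload and timings below are purely illustrative):

import time

texts = ["some example text"] * 10_000      # stand-in corpus

start = time.perf_counter()
_ = [tokenizer.encode(t) for t in texts]    # one call per text
loop_s = time.perf_counter() - start

start = time.perf_counter()
_ = tokenizer.encode_batch(texts)           # single parallel call
batch_s = time.perf_counter() - start

print(f"loop: {loop_s:.2f}s  batch: {batch_s:.2f}s")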

vs Alternatives

Tool           Best For
tokenizers     Speed, custom training, production
SentencePiece  T5/ALBERT, language-independent
tiktoken       OpenAI models (GPT)
