token-approx helps you train small, calibrated models that locally approximate Claude token counts using basic text features (bytes, runes, words, lines).
Note
This repo contains the code, data and data-prep tooling behind the blog post “Counting Claude Tokens Without a Tokenizer”. You can use it to either:
- reproduce the experiments from the post, or
- plug in your own dataset and calibrate a local token approximation model for your own text
When you’re building agents or tools on top of Claude, token counts are your core budget and constraint – they determine:
- how much context you can include
- how often you can call the model
- how much each run costs
For Claude 3+ there’s no official local tokenizer, and the usual workarounds aren’t great:
- provider heuristics (e.g. "3.5 characters ≈ 1 token") can be wildly off on real text and across languages
- alternate-provider tokenizers like `tiktoken` systematically under- or over-estimate
- calling a remote Token Count API adds network latency, rate limits, extra failure modes, and one more external dependency – just to count tokens
token-approx exists to give you a small, evidence-based, local approximation instead:
- you calibrate simple linear models on your own text distribution
- you get fast, pre-send token estimates that are competitive with, or better than, popular off-the-shelf approximations
- you keep control: everything runs locally, and the whole pipeline (data prep → labeling → modeling) is transparent and reproducible.
This matters most when you care about tight control over context windows and cost – e.g. long-running agents, tools that stream lots of user or document text through Claude, or systems where a few percent error in token estimates can mean blowing your budget or silently dropping useful context.
Use token-approx if you:
- Need local, pre-send token estimates to manage Claude context windows and budgets
- Use LLMs from a provider (like Anthropic) that doesn’t ship a modern local tokenizer but does expose token usage counts (via its messages API or a token-count endpoint)
- Want a small, reproducible pipeline to:
- build a labeled dataset for your own text distribution
- fit simple linear models that map local features → token counts
- easily export them for use in your own application
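To make "fit simple linear models that map local features → token counts" concrete, here is a minimal sketch of the kind of fit the notebooks perform, assuming a `dataset.jsonl` with the fields described under CLI commands below (illustrative only, not the notebook code itself):

```python
import json
import numpy as np

# Load the labeled dataset produced by `token-approx measure`.
with open("data/processed/datasets/dataset.jsonl") as f:
    rows = [json.loads(line) for line in f]

feature_names = ["bytes", "runes", "words", "lines"]
X = np.array([[r["features"][k] for k in feature_names] for r in rows], dtype=float)
y = np.array([r["input_tokens"] for r in rows], dtype=float)

# Ordinary least squares with an intercept column.
A = np.hstack([X, np.ones((len(X), 1))])
coefs, *_ = np.linalg.lstsq(A, y, rcond=None)

def estimate_tokens(bytes_, runes, words, lines):
    """Pre-send token estimate from local text features."""
    return float(coefs @ np.array([bytes_, runes, words, lines, 1.0]))
```

The template notebooks add baseline comparisons, single-feature variants, and diagnostics on top of this, but the core model is that small.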
Repo layout:

- `cmd/token-approx/` – Go CLI with subcommands `get-data`, `clean`, `split`, `measure`.
- `data/` – working directories created and used by the CLI: `data/raw/`, `data/interim/`, `data/processed/samples/`, `data/processed/datasets/`.
- `notebooks/` – top-level notebooks + helpers for fitting and comparing models on `data/processed/datasets/dataset.jsonl`.
- `experiments/` – pre-prepared datasets and notebooks for the three Gutenberg books used in the blog post (fully wired, no data prep or API calls needed).
- `scripts/` – helpers to generate predictions from existing off-the-shelf methods (Anthropic's suggested heuristic, `tiktoken`, Anthropic legacy tokenizer).
You have three main options, depending on how much setup you want to do:
- **Fastest path – inspect pre-prepared experiments**
  - Explore already-run experiments for the example datasets (no additional data prep or API calls required) → see Experiment & notebooks

- **From-scratch pipeline – reproduce the Oliver Twist experiment end-to-end**

  Run the full pipeline on the example Gutenberg dataset:
  - Prepare data via the `token-approx` CLI: `get-data` → `clean` → `split` → `measure` to build `dataset.jsonl`; then use the scripts in `scripts/` to generate baseline predictions for the existing off-the-shelf token approximation methods
  - Run the experiment using the template notebooks in `notebooks/` → see From-scratch pipeline

- **Run on your own data**

  Use the same pipeline structure on your own text:
  - Prepare data via a similar CLI pipeline on your own `.txt` samples: `split` → `measure` to get `dataset.jsonl`; use the scripts in `scripts/` to generate off-the-shelf method predictions
  - Run the experiment using the template notebooks in `notebooks/` → see Use your own data
```bash
git clone https://github.com/petasbytes/token-approx.git
cd token-approx
```

Important
- From this point onwards, run all commands from the repo root to ensure correct data I/O file paths
- Go 1.25+ (required for running the `token-approx` CLI tool)
- Python 3.11+ (required for running notebooks)
- Notebook dependencies and virtual environment:

  ```bash
  python -m venv .venv
  source .venv/bin/activate
  pip install -r requirements.txt
  ```

- Optional: `ANTHROPIC_API_KEY` exported in your environment (for ground-truth labeling using `token-approx measure`):

  ```bash
  export ANTHROPIC_API_KEY=...
  ```
- Optional: Node.js 20+ & Anthropic legacy tokenizer (for generating token count predictions using the legacy tokenizer):

  If you plan to run the template notebooks in `notebooks/`, you’ll need Node + `@anthropic-ai/tokenizer` so `scripts/preds_anth-ts.py` can run:

  ```bash
  # 1) Install Node, e.g. using Homebrew (macOS)
  brew install node  # or download from https://nodejs.org

  # 2) Install Anthropic legacy tokenizer
  npm install @anthropic-ai/tokenizer
  ```

Build the CLI:

```bash
go build -o token-approx ./cmd/token-approx
./token-approx --help
# If you see module errors: go mod tidy
```

Run the full end-to-end pipeline on a Project Gutenberg text to produce a ready-to-use dataset for the notebooks.
Before proceeding, complete Setup (incl. Node + @anthropic-ai/tokenizer).
Important
Run all commands from repo root
```bash
# 1) Download raw text
./token-approx get-data

# 2) Clean: strip Gutenberg boilerplate, normalize whitespace
./token-approx clean

# 3) Split into roughly equal-sized samples
./token-approx split

# 4) Label samples with ground-truth token counts
export ANTHROPIC_API_KEY=...
./token-approx measure  # calls Anthropic Token Count API (free, rate-limited)
```

Outputs:
- raw: `data/raw/oliver-twist_gberg_raw.txt`
- cleaned: `data/interim/oliver-twist_gberg_clean.txt`
- samples: `data/processed/samples/oliver-twist_gberg_sample-XXX.txt`
- labeled dataset: `data/processed/datasets/dataset.jsonl`
```bash
# 1) Generate token predictions using the legacy tokenizer
python scripts/preds_anth-ts.py

# 2) Generate token predictions using the 3.5-char-per-token heuristic
python scripts/preds_heuristic.py

# 3) Generate token predictions using tiktoken
python scripts/preds_tiktoken.py
```

Output: three `preds_*.jsonl` files in `data/processed/datasets/`
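For reference, the tiktoken baseline amounts to counting tokens with an OpenAI encoding and treating that as a proxy for Claude's count. A minimal sketch (the encoding choice here is an assumption; `scripts/preds_tiktoken.py` is the reference implementation):

```python
import tiktoken

# cl100k_base is an OpenAI encoding, not Claude's tokenizer --
# which is exactly why this is only a baseline to compare against.
enc = tiktoken.get_encoding("cl100k_base")

def tiktoken_estimate(text: str) -> int:
    return len(enc.encode(text))
```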
Then run the main notebook:

```bash
source .venv/bin/activate
jupyter notebook notebooks/01_eda_and_baselines.ipynb
```

This main notebook will:
- load `data/processed/datasets/dataset.jsonl` and the `preds_*.jsonl` files
- compute existing method baselines (heuristic, legacy tokenizer, `tiktoken`, etc.)
- fit and compare single- and multi-feature linear models
- export coefficients to `models/model_coefs.json`
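Once the coefficients are exported, applying them in your own application is a dot product over the same four features plus an intercept. The exact JSON layout of `models/model_coefs.json` is defined by the notebook's export cell, so treat the field names below as assumptions (a sketch, not a drop-in loader):

```python
import json

# Assumed layout: {"intercept": ..., "coefs": {"bytes": ..., "runes": ..., "words": ..., "lines": ...}}
with open("models/model_coefs.json") as f:
    model = json.load(f)

def estimate_tokens(features: dict) -> float:
    # features: {"bytes": ..., "runes": ..., "words": ..., "lines": ...},
    # computed the same way as `token-approx measure` computes them.
    return model["intercept"] + sum(
        model["coefs"][name] * value for name, value in features.items()
    )
```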
You can also use token-approx as a small dataset builder + token-labeler for your own text.
Before proceeding, ensure you have completed the required setup (incl. installation of Node & the legacy Anthropic tokenizer package).
Tip
If you're starting with a single large text file (i.e. you don't yet have individual text samples) then start at Step 1.
If you already have per-sample `.txt` files then skip to Step 2.
1. Place your cleaned text file at `data/interim/<basename>_clean.txt`, then split it into samples:

   ```bash
   ./token-approx split
   ```

   Output: multiple `data/processed/samples/<basename>_sample-XXX.txt` files
2. Compute features and label with token counts:

   ```bash
   export ANTHROPIC_API_KEY=...
   ./token-approx measure
   ```

   Output: `data/processed/datasets/dataset.jsonl`
3. Generate baseline token-count predictions for the existing off-the-shelf token approximation methods:

   ```bash
   python scripts/preds_anth-ts.py
   python scripts/preds_heuristic.py
   python scripts/preds_tiktoken.py
   ```

   Output: three `preds_*.jsonl` files in `data/processed/datasets/`
4. Run the template notebook:

   ```bash
   source .venv/bin/activate
   jupyter notebook notebooks/01_eda_and_baselines.ipynb
   ```

Tip
Idempotence for `token-approx measure`
- Re-running `measure` will skip already-measured files, identified by `source_path` in `dataset.jsonl`
- To re-measure/re-label a single sample, delete its line from the JSONL file and re-run `token-approx measure`
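For example, dropping a single sample's record so that the next `measure` run re-labels it could look like this (a sketch; the sample path is illustrative):

```python
import json

dataset = "data/processed/datasets/dataset.jsonl"
target = "data/processed/samples/oliver-twist_gberg_sample-001.txt"  # illustrative path

with open(dataset) as f:
    kept = [line for line in f if json.loads(line)["source_path"] != target]

with open(dataset, "w") as f:
    f.writelines(kept)

# Then re-run: ./token-approx measure
```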
For more details, see CLI commands.
Note
The notebooks in experiments/ can be read/re-run without any additional data prep (as all relevant data has been pre-computed).
→ This means you do not need an Anthropic API key, Node, the legacy tokenizer, or to run any CLI commands or prediction scripts.
For each narrative text mentioned in the blog post, you’ll find a directory under experiments/:
- `experiments/oliver-twist_gberg/`
- `experiments/war-n-peace_gberg/`
- `experiments/les-trois-mousq_gberg/`
Each has ready-to-go:
- Data in `experiments/<book-id>/data/processed/datasets/`:
  - `dataset.jsonl`
  - `preds_anth-ts.jsonl`, `preds_heuristic.jsonl`, `preds_tiktoken.jsonl`
- Notebooks (with results) in `experiments/<book-id>/notebooks/`:
  - `01_eda_and_baselines_<book-id>.ipynb`
  - `02_appendix_diagnostics_<book-id>.ipynb`
To explore a specific book’s experiment:
1. Set up notebook dependencies and the virtual environment
2. Launch the notebook you want, for example:

   ```bash
   source .venv/bin/activate
   jupyter notebook experiments/oliver-twist_gberg/notebooks/01_eda_and_baselines_oliver-twist.ipynb
   ```

All `token-approx` CLI commands operate relative to the repo root and the `./data` directory.
`token-approx get-data`
- Downloads Oliver Twist from Project Gutenberg
- Writes: `data/raw/oliver-twist_gberg_raw.txt`

`token-approx clean`
- Expects exactly one `data/raw/*_raw.txt`
- Strips Gutenberg boilerplate, normalizes newlines, trims whitespace
- Writes: `data/interim/<book-id>_clean.txt` (e.g. `data/interim/oliver-twist_gberg_clean.txt`)

`token-approx split`
- Expects exactly one `data/interim/*_clean.txt`
- Splits the text into roughly equal-sized (by character/rune) samples
- Writes: `data/processed/samples/<basename>_sample-XXX.txt`

`token-approx measure`
- Requires `ANTHROPIC_API_KEY` in env
- Processes all regular, non-hidden `*.txt` files in `data/processed/samples/` (no recursion)
- For each sample:
  - computes local features: `bytes`, `runes`, `words`, `lines`
  - gets ground-truth `input_tokens` via Anthropic’s Token Count API (free, rate-limited)
- Appends one JSONL object per sample to `data/processed/datasets/dataset.jsonl` with fields: `id`, `model`, `input_tokens`, `features.{bytes,runes,words,lines}`, `source_path`
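The feature definitions live in the Go CLI; as a rough guide they correspond to something like the following (word and line counting rules here are assumptions – the implementation in `cmd/token-approx` is authoritative):

```python
def local_features(text: str) -> dict:
    return {
        "bytes": len(text.encode("utf-8")),  # UTF-8 byte length
        "runes": len(text),                  # Unicode code points (Go "runes")
        "words": len(text.split()),          # whitespace-delimited words (assumed)
        "lines": len(text.splitlines()),     # newline-delimited lines (assumed)
    }
```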
- Missing API key
  - `measure` exits early with `missing ANTHROPIC_API_KEY` → set it and re-run
- No samples found
  - Ensure `.txt` files exist in `data/processed/samples/` (flat directory, no hidden files, not directories)
- Multiple inputs for `clean`/`split`
  - Keep exactly one matching file in `data/raw/*_raw.txt` (for `clean`) or `data/interim/*_clean.txt` (for `split`)
- Token Count API rate limits
  - If you hit rate limits, wait and re-run `measure` (see Anthropic docs for current limits)
- Blog post
  - Counting Claude Tokens Without a Tokenizer – background, experimental design, results tables, and discussion
- Related project
  - `go-agent` – LLM-backed agent in Go that originally motivated this token-approximation project