Tokemon.
The morphological tokenizer for Hebrew and Aramaic.
Tokemon reads Semitic text the way the language is actually built, around the root. It tokenizes near 1.3 tokens per word, keeps 98% of roots intact, and round trips with zero loss. Built for the AI pipelines that work in Hebrew and Aramaic.
▸ input → בראשית ברא אלהים את השמים
▸ fertility → 1.4 tok/word · ~1.6× tighter than chatgpt's tokenizer
▸ roots → 98.03% preserved · root stays intact
▸ round trip → lossless · final letter forms preserved
Lossless · root aware · ~16K tokens vocabulary
The numbers, on real text, with nothing lost.
Fertility on Tanakh
On the same Torah text: today’s ChatGPT tokenizer (o200k) needs about 2.3 tokens per word, the older cl100k over 5, a Hebrew-trained BPE 1.45, a SentencePiece unigram 1.49. Tokemon needs 1.43, the tightest of them.
Roots kept intact
Measured on Tanakh against gold morphological lemmas. The root survives tokenization.
Nothing lost
Every sequence decodes back to the exact original Hebrew, final letter forms and all.
Dense and efficient
More than 11,000 of the most frequent whole words sit in the vocabulary directly, so common words cost a single token.
Fertility is measured on the full Tanakh, 306,785 words, with a fully lossless round trip, and holds across registers: 1.52 on Mishneh Torah, 1.63 on the Mishnah. Held out Gemara stays tight too, 1.76 tokens per word on a Talmud Bavli tractate the tokenizer never saw. Root preservation is scored against gold morphological lemmas.
BPE shatters a language built on roots.
Byte pair tokenizers were tuned for English. They chop Hebrew and Aramaic into arbitrary fragments that ignore how the language actually works: a three letter root carried through a pattern, wrapped in prefixes and suffixes. The structure that makes the language learnable is exactly the structure BPE destroys.
One root, recognized across its forms
A byte tokenizer never learns these share a root. Tokemon treats the root as one unit, so the relationship is there from the first token.
Root aware. Lossless. Compact.
It sees the root, not the bytes
Hebrew and Aramaic build meaning from a root threaded through a pattern. Tokemon recognizes that shared root across every inflected form, so a model sees one stable unit of meaning where a byte tokenizer sees noise.
Perfect reconstruction, every time
Tokenize and detokenize normal Hebrew with zero drift. The exact text comes back, final letter forms included, where a byte tokenizer can leave a medial letter where a final should stand. No silent corruption sneaks into your pipeline.
Fewer tokens, longer context
A vocabulary of around 16K tokens, with more than 11,000 of the most frequent whole words packed in directly, so common words cost one token. At around 1.3 tokens per word, Tokemon spends far less of your context window on the same text. Faster inference, lower cost, no loss of fidelity.
Built for Semitic languages, starting where it matters most.
Hebrew
Biblical, Mishnaic and Rabbinic, and modern Hebrew too. Normal Hebrew round trips exactly, final letter forms included.
Aramaic
Talmudic and Targumic, the same tokenizer, with 99.85% root preservation by token mass on Talmudic Aramaic.
Arabic
The same root and pattern morphology, next on the Semitic line.
Judeo-Arabic
Hebrew script over an Arabic substrate, a natural extension.
Greek
A separate lexeme and stem aware tokenizer, already in development.
Loanwords
First class handling of foreign and technical words written in Hebrew letters.
Fast to ingest, lossless at scale, reproducible by default.
Tokemon is built to ingest a whole tradition, not just score on samples. Its hot path keeps pace with SentencePiece, the parallel emitter eats entire corpora in minutes, and every token is written two bytes wide and memory mappable for training.
Warm throughput
On par with SentencePiece on the hot path.
Full Talmud Bavli
The entire Bavli tokenized with eight workers.
Whole Sefaria corpus
286M words, 515.8M tokens, 6,205 texts, end to end.
Lossless on 286M words
Round trip verified across all 6,205 texts, not a sample.
Per token (uint16)
Compact on disk and memory mappable for training.
Context free
Same surface form, same analysis, every run, bit for bit.
One call to tokenize, one to bring it back.
Tokemon exposes a clean tokenizer interface: text in, token ids out, and a lossless path back. It slots in wherever you would reach for a BPE or SentencePiece tokenizer today, with no change to the rest of your stack.
POST /v1/tokenize
{ "text": "בְּרֵאשִׁית בָּרָא אֱלֹהִים" }
→ { "tokens": [ ... ], "count": 5, "fertility": 1.30 }from tokemon import Tokenizer
tok = Tokenizer.load("hebrew-aramaic-v1")
ids = tok.encode("בְּרֵאשִׁית בָּרָא אֱלֹהִים")
text = tok.decode(ids) # exact round tripWhat teams ask before integrating.
Tokemon is a tokenizer for Hebrew and Aramaic that is aware of how those languages are built. Where a byte tokenizer chops text into arbitrary fragments, Tokemon recognizes the root carried through a word, keeps it intact, and tokenizes and detokenizes with a fully lossless round trip.
Tokenize Semitic text the way it was written.
Read the technical report for the full benchmarks, or bring Tokemon into your pipeline today.