La Boétie
Semitic Tokenizer · V1 · Hebrew and Aramaic, live

Tokemon.

The morphological tokenizer for Hebrew and Aramaic.

Tokemon reads Semitic text the way the language is actually built, around the root. It averages about 1.3 tokens per word on the Tanakh, keeps 98% of roots intact, and round-trips with zero loss. Built for the AI pipelines that work in Hebrew and Aramaic.

tokemon@laboetie ~ tokenize

input → בראשית ברא אלהים את השמים

fertility → 1.4 tok/word · ~1.6× tighter than ChatGPT's tokenizer

roots → 98.03% preserved · root stays intact

round trip → lossless · final letter forms preserved

Lossless · root aware · ~16K token vocabulary

Measured, not claimed

The numbers, on real text, with nothing lost.

1.30 tokens / word

Fertility on Tanakh

Run on one and the same Torah text, today’s ChatGPT tokenizer (o200k) needs about 2.3 tokens per word, the older cl100k more than 5, a Hebrew-trained BPE 1.45, and a SentencePiece unigram 1.49. Tokemon needs 1.43, the lowest of the five.

98.03% root preservation

Roots kept intact

Measured on Tanakh against gold morphological lemmas. The root survives tokenization.

100% lossless round trip

Nothing lost

Every sequence decodes back to the exact original Hebrew, final letter forms and all.

11,000+ whole words in ~16K vocab

Dense and efficient

More than 11,000 of the most frequent whole words sit in the vocabulary directly, so common words cost a single token.

Fertility is measured on the full Tanakh, 306,785 words, with a fully lossless round trip, and holds across registers: 1.52 on Mishneh Torah, 1.63 on the Mishnah. Held-out Gemara stays tight too, at 1.76 tokens per word on a Talmud Bavli tractate the tokenizer never saw. Root preservation is scored against gold morphological lemmas.
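Fertility is tokens divided by words. A minimal sketch of that measurement, assuming words are whitespace-delimited (the benchmark may count words differently) and using a stand-in character encoder in place of Tokemon itself; all function names here are illustrative:

```python
def fertility(token_count, word_count):
    """Average number of tokens per word."""
    if word_count <= 0:
        raise ValueError("word_count must be positive")
    return token_count / word_count


def corpus_fertility(texts, encode):
    """Fertility over an iterable of strings, for any encode(text) -> list of ids."""
    tokens = 0
    words = 0
    for text in texts:
        tokens += len(encode(text))
        words += len(text.split())  # assumption: a word is a whitespace-delimited run
    return fertility(tokens, words)


def char_encode(text):
    """Stand-in encoder for illustration only: one id per non-space character."""
    return [ord(ch) for ch in text if not ch.isspace()]


# Whole-corpus figures quoted on this page: 515.8M tokens over 286M words.
print(round(fertility(515_800_000, 286_000_000), 2))  # 1.8
```

Any tokenizer that exposes an encode function can be passed to corpus_fertility, which makes it straightforward to compare several tokenizers on the same text.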

Why Semitic text breaks ordinary tokenizers

BPE shatters a language built on roots.

Byte pair tokenizers were tuned for English. They chop Hebrew and Aramaic into arbitrary fragments that ignore how the language actually works: a three-letter root carried through a pattern, wrapped in prefixes and suffixes. The structure that makes the language learnable is exactly the structure BPE destroys.

One root, recognized across its forms

כ · ת · ב ktb · to write
כָּתַב katav · he wrote
כְּתִיבָה ketivah · writing
הִכְתִּיב hikhtiv · he dictated
מִכְתָּב mikhtav · a letter

A byte tokenizer has no built-in notion that these forms share a root. Tokemon treats the root as one unit, so the relationship is there from the first token.
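The shared root in the forms above can be made visible with a few lines of Python. This is only a naive illustration, not Tokemon's analyzer: it strips the vowel points and checks that the three root letters appear in order. The function names are made up for the example, and a check this simple will also match words that merely happen to contain the same letters.

```python
import unicodedata


def strip_points(word):
    """Remove niqqud and other combining marks, leaving the consonantal skeleton."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def contains_root(word, root):
    """True if the root letters occur in order (not necessarily adjacent) in the word.

    Naive by design: final letter forms are not folded, and no pattern is checked.
    """
    letters = iter(strip_points(word))
    # 'in' on an iterator consumes it up to the match, which enforces the ordering.
    return all(radical in letters for radical in root)


ROOT = "כתב"
FORMS = ["כָּתַב", "כְּתִיבָה", "הִכְתִּיב", "מִכְתָּב"]

for form in FORMS:
    print(strip_points(form), contains_root(form, ROOT))
```

Each of the four forms passes the check, while an unrelated word such as שלום does not.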

Three things it gets right

Root aware. Lossless. Compact.

Morphology aware

It sees the root, not the bytes

Hebrew and Aramaic build meaning from a root threaded through a pattern. Tokemon recognizes that shared root across every inflected form, so a model sees one stable unit of meaning where a byte tokenizer sees noise.

Lossless

Perfect reconstruction, every time

Tokenize and detokenize normal Hebrew with zero drift. The exact text comes back, final letter forms included; a byte tokenizer, by contrast, can leave a medial letter where a final should stand. No silent corruption sneaks into your pipeline.
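A round trip check is easy to run against any tokenizer. The sketch below uses two toy codecs, both hypothetical and unrelated to Tokemon's internals: one keeps every code point, the other folds final letter forms into their medial counterparts and so cannot restore the original text.

```python
# Final letter forms mapped to their medial counterparts.
FINAL_TO_MEDIAL = str.maketrans({
    "\u05da": "\u05db",  # final kaf   -> kaf
    "\u05dd": "\u05de",  # final mem   -> mem
    "\u05df": "\u05e0",  # final nun   -> nun
    "\u05e3": "\u05e4",  # final pe    -> pe
    "\u05e5": "\u05e6",  # final tsadi -> tsadi
})


def faithful_encode(text):
    """Toy codec: keeps every code point exactly as written."""
    return [ord(ch) for ch in text]


def lossy_encode(text):
    """Toy codec: folds final forms before encoding, discarding information."""
    return [ord(ch) for ch in text.translate(FINAL_TO_MEDIAL)]


def decode(ids):
    return "".join(chr(i) for i in ids)


def first_mismatch(text, encode, decode):
    """Return None if decode(encode(text)) == text, else the index of the first difference."""
    restored = decode(encode(text))
    if restored == text:
        return None
    for i, (a, b) in enumerate(zip(text, restored)):
        if a != b:
            return i
    return min(len(text), len(restored))


SAMPLE = "בראשית ברא אלהים את השמים"
print(first_mismatch(SAMPLE, faithful_encode, decode))
print(first_mismatch(SAMPLE, lossy_encode, decode))
```

The faithful codec reports no mismatch. The lossy one fails at the first final mem, the last letter of אלהים, which is the kind of silent corruption a round trip check exists to catch.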

Compact

Fewer tokens, longer context

A vocabulary of around 16K tokens, with more than 11,000 of the most frequent whole words packed in directly, so common words cost one token. At around 1.3 tokens per word, Tokemon spends far less of your context window on the same text. Faster inference, lower cost, no loss of fidelity.

One family, by design

Built for Semitic languages, starting where it matters most.

Live now

Hebrew

Biblical, Mishnaic and Rabbinic, and modern Hebrew too. Normal Hebrew round trips exactly, final letter forms included.

Aramaic

Talmudic and Targumic, the same tokenizer, with 99.85% root preservation by token mass on Talmudic Aramaic.

On the roadmap

Arabic

The same root and pattern morphology, next on the Semitic line.

Judeo-Arabic

Hebrew script over an Arabic substrate, a natural extension.

Greek

A separate lexeme and stem aware tokenizer, already in development.

Loanwords

First class handling of foreign and technical words written in Hebrew letters.

Built to run at corpus scale

Fast to ingest, lossless at scale, reproducible by default.

Tokemon is built to ingest a whole tradition, not just score on samples. Its hot path keeps pace with SentencePiece, the parallel emitter eats entire corpora in minutes, and every token is written two bytes wide and memory mappable for training.
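With a vocabulary of around 16K entries, every token id fits in an unsigned 16-bit integer. A minimal sketch of two-byte storage and memory-mapped reads, using only the Python standard library. The file name, function names, and native byte order are assumptions for illustration, not Tokemon's actual on-disk format.

```python
import mmap
import os
import tempfile
from array import array

# "H" is an unsigned short, two bytes wide on mainstream platforms.
assert array("H").itemsize == 2


def write_tokens(path, ids):
    """Write token ids as unsigned 16-bit integers; returns the byte count."""
    buf = array("H", ids)  # raises OverflowError for ids above 65535
    with open(path, "wb") as f:
        buf.tofile(f)
    return len(buf) * buf.itemsize


def read_tokens(path, start, stop):
    """Read ids[start:stop] through a memory map without loading the whole file."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            with memoryview(mm) as raw, raw.cast("H") as view:
                return view[start:stop].tolist()


# Usage: five ids occupy ten bytes on disk.
path = os.path.join(tempfile.mkdtemp(), "tokens.u16")  # hypothetical file name
size = write_tokens(path, [0, 1, 15999, 65535, 42])
window = read_tokens(path, 1, 4)
print(size, window)
```

Because each id has a fixed width, the token at position i always sits at byte offset 2 × i, so a training loader can slice any window directly from the map.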

2.9 MB/s

Warm throughput

On par with SentencePiece on the hot path.

6.4 s

Full Talmud Bavli

The entire Bavli tokenized with eight workers.

~17 min

Whole Sefaria corpus

286M words, 515.8M tokens, 6,205 texts, end to end.

0 failures

Lossless on 286M words

Round trip verified across all 6,205 texts, not a sample.

2 bytes

Per token (uint16)

Compact on disk and memory mappable for training.

Deterministic

Context free

Same surface form, same analysis, every run, bit for bit.

Drop it into your pipeline

One call to tokenize, one to bring it back.

Tokemon exposes a clean tokenizer interface: text in, token ids out, and a lossless path back. It slots in wherever you would reach for a BPE or SentencePiece tokenizer today, with no change to the rest of your stack.

REST
POST /v1/tokenize
{ "text": "בְּרֵאשִׁית בָּרָא אֱלֹהִים" }

→ { "tokens": [ ... ], "count": 5, "fertility": 1.30 }

Python
from tokemon import Tokenizer

tok = Tokenizer.load("hebrew-aramaic-v1")
ids = tok.encode("בְּרֵאשִׁית בָּרָא אֱלֹהִים")
text = tok.decode(ids)   # exact round trip

FAQ

What teams ask before integrating.

Tokemon is a tokenizer for Hebrew and Aramaic that is aware of how those languages are built. Where a byte tokenizer chops text into arbitrary fragments, Tokemon recognizes the root carried through a word, keeps it intact, and tokenizes and detokenizes with a fully lossless round trip.

Tokenize Semitic text the way it was written.

Read the technical report for the full benchmarks, or bring Tokemon into your pipeline today.