Question 1

What is Tokemon?

Accepted Answer

Tokemon is a tokenizer for Hebrew and Aramaic that is aware of how those languages are built. Where a byte-level tokenizer chops text into arbitrary fragments, Tokemon recognizes the root carried through a word and keeps it intact, and it tokenizes and detokenizes with a fully lossless round trip.

Question 2

How does it compare to tiktoken, SentencePiece and BPE?

Accepted Answer

On the same normal Hebrew text (the full Torah) and the same word basis, Tokemon spends 1.43 tokens per word. The byte-level BPE tokenizers behind ChatGPT are far behind: 2.32 for today’s o200k and 5.04 for the older cl100k. A BPE trained on Hebrew lands at 1.45 and a SentencePiece unigram model at 1.49, so Tokemon edges out even the dedicated Hebrew tokenizers here, while being the only one that keeps the root in the stream and round trips losslessly. The caveat: on more diverse later texts, a Hebrew BPE compresses 13 to 35 percent tighter than Tokemon. That is the price Tokemon pays for keeping the morphology and the lossless guarantee.
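Tokens per word is straightforward to measure for any tokenizer. A minimal sketch, assuming whitespace-delimited words as the word basis and a hypothetical encode callable standing in for whichever tokenizer is under test. The UTF-8 byte encoder below is only a placeholder to make the sketch runnable; it is not Tokemon or any of the tokenizers named above.

```python
from typing import Callable, List, Sequence


def tokens_per_word(encode: Callable[[str], Sequence[int]], text: str) -> float:
    """Average number of tokens spent per whitespace-delimited word."""
    words = text.split()
    if not words:
        raise ValueError("text contains no words")
    return len(encode(text)) / len(words)


def utf8_byte_encode(text: str) -> List[int]:
    # Placeholder tokenizer (assumption): one token per UTF-8 byte.
    return list(text.encode("utf-8"))


# Three Hebrew words, written with escapes to avoid bidi display issues.
sample = "\u05d1\u05e8\u05d0\u05e9\u05d9\u05ea \u05d1\u05e8\u05d0 \u05d0\u05dc\u05d4\u05d9\u05dd"
fertility = tokens_per_word(utf8_byte_encode, sample)
print(f"{fertility:.2f} tokens per word")
```

A comparison is only fair when every tokenizer is divided by the same word count, which is why the word split happens once, outside the tokenizer.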

Question 3

Is the round trip really lossless?

Accepted Answer

On normal Hebrew, the way the language is written, yes: every token sequence decodes back to the exact original text, final letter forms included, where naive tokenizers often leave a medial letter where a final should stand. Nothing is approximated and nothing is silently dropped.
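The final letter point can be made concrete with a small check. A minimal sketch, assuming two hypothetical toy tokenizers (neither is Tokemon): one folds final letter forms into medial ones before encoding, the other leaves the text untouched. The round_trips helper is an illustrative name, not part of any published API.

```python
# Final forms (kaf, mem, nun, pe, tsadi) mapped to their medial forms.
FINAL_TO_MEDIAL = str.maketrans(
    "\u05da\u05dd\u05df\u05e3\u05e5",
    "\u05db\u05de\u05e0\u05e4\u05e6",
)


def lossy_encode(text):
    # Hypothetical naive tokenizer: folds finals, then emits code points.
    return [ord(ch) for ch in text.translate(FINAL_TO_MEDIAL)]


def exact_encode(text):
    # Hypothetical lossless tokenizer: emits code points unchanged.
    return [ord(ch) for ch in text]


def decode(ids):
    return "".join(chr(i) for i in ids)


def round_trips(encode, decode, text):
    """True only if decoding the encoded text reproduces it exactly."""
    return decode(encode(text)) == text


word = "\u05e9\u05dc\u05d5\u05dd"  # "shalom", which ends in a final mem
exact_ok = round_trips(exact_encode, decode, word)
lossy_ok = round_trips(lossy_encode, decode, word)
print(exact_ok, lossy_ok)
```

The folding tokenizer fails the check because the decoded word ends in a medial mem where the original had a final one, which is the failure mode described above.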

Question 4

Which languages does it support today?

Accepted Answer

Hebrew and Aramaic are live now, across biblical, mishnaic, rabbinic, talmudic and targumic registers. Arabic, Judeo-Arabic and a separate Greek tokenizer are on the roadmap.

Question 5

Does it handle modern Hebrew?

Accepted Answer

It does, and losslessly. The build labels which lexicon entries are genuine Hebrew and which are loanwords, that is, foreign words written in Hebrew letters. Measured over the full lexicon of 494,208 forms, all of which round trip without loss, genuine Hebrew words tokenize at 3.95 tokens per word and loanwords at 6.40. These are isolated dictionary forms, so both figures sit above those for running text, but the gap shows that the higher modern Hebrew average comes from borrowed vocabulary, not native words. First-class handling of loanwords is on the roadmap.

Question 6

How fast is it, and is it safe to run at scale?

Accepted Answer

The hot path sustains 2.9 MB/s, on par with SentencePiece. With eight workers, the parallel emitter tokenizes the entire Talmud Bavli in 6.4 seconds and the full 286-million-word Sefaria corpus in about 17 minutes. That whole corpus, 6,205 texts, round trips with zero failures, so there is no silent corruption to reconcile. Tokenization is deterministic and context free, so runs are reproducible bit for bit.
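The corpus figures translate into a word rate with simple arithmetic. A rough sketch using only the numbers quoted above; since the run time is given as "about 17 minutes", the results are approximate, and the per-worker figure assumes the work is spread evenly across the eight workers.

```python
CORPUS_WORDS = 286_000_000  # Sefaria corpus size quoted above
ELAPSED_MINUTES = 17        # "about 17 minutes", so results are approximate
WORKERS = 8

elapsed_seconds = ELAPSED_MINUTES * 60
words_per_second = CORPUS_WORDS / elapsed_seconds

# Assumption: the load is divided evenly between workers.
words_per_worker_second = words_per_second / WORKERS

print(f"{words_per_second:,.0f} words/s overall")
print(f"{words_per_worker_second:,.0f} words/s per worker")
```

That works out to roughly 280,000 words per second overall, or about 35,000 per worker under the even-split assumption.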

Question 7

How do I integrate it into my pipeline?

Accepted Answer

Tokemon exposes a clean tokenizer interface, text in and token ids out, with a lossless path back. It slots in wherever you would use a BPE or SentencePiece tokenizer today, with no change to the rest of your stack.
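The shape of such an integration can be sketched as follows. Everything here is an assumption made for illustration: the Tokenizer protocol, the encode and decode method names, and the CodePointTokenizer stand-in are hypothetical and are not Tokemon's published API.

```python
from typing import List, Protocol


class Tokenizer(Protocol):
    """Hypothetical interface: text in, token ids out, and back again."""

    def encode(self, text: str) -> List[int]: ...

    def decode(self, ids: List[int]) -> str: ...


class CodePointTokenizer:
    """Stand-in implementation, used only to make the sketch runnable."""

    def encode(self, text: str) -> List[int]:
        return [ord(ch) for ch in text]

    def decode(self, ids: List[int]) -> str:
        return "".join(chr(i) for i in ids)


def prepare_batch(tokenizer: Tokenizer, texts: List[str]) -> List[List[int]]:
    # The pipeline depends only on the interface, so swapping the
    # tokenizer does not touch any downstream code.
    return [tokenizer.encode(t) for t in texts]


tok = CodePointTokenizer()
texts = ["\u05e9\u05dc\u05d5\u05dd"]
batch = prepare_batch(tok, texts)
restored = [tok.decode(ids) for ids in batch]
```

Because prepare_batch is written against the interface, replacing the stand-in with another tokenizer that offers the same two methods would leave the rest of the pipeline unchanged.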

Question 8

Why do fewer tokens per word matter?

Accepted Answer

At around 1.3 tokens per word instead of 4.6, the same text uses far less of your context window. That means longer effective context, faster inference and lower cost, with no loss of fidelity since the round trip stays lossless.
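A quick calculation shows what the two rates mean for a fixed context budget. This is a sketch only: the 8,192-token window is a hypothetical value chosen for illustration, and the two rates are the ones quoted above.

```python
CONTEXT_TOKENS = 8192  # hypothetical context window, chosen for illustration


def words_that_fit(context_tokens: int, tokens_per_word: float) -> int:
    """Whole words that fit in the window at a given tokens-per-word rate."""
    return int(context_tokens / tokens_per_word)


compact = words_that_fit(CONTEXT_TOKENS, 1.3)
baseline = words_that_fit(CONTEXT_TOKENS, 4.6)
gain = 4.6 / 1.3  # how many times more text fits at the lower rate

print(compact, baseline, f"{gain:.2f}x")
```

At these rates the same window holds about 6,300 words instead of about 1,780, roughly 3.5 times as much text.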

Tokemon.

The numbers, on real text, with nothing lost.

Fertility on Tanakh

Roots kept intact

Nothing lost

Dense and efficient

BPE shatters a language built on roots.

Root aware. Lossless. Compact.

It sees the root, not the bytes

Perfect reconstruction, every time

Fewer tokens, longer context

Built for Semitic languages, starting where it matters most.

Hebrew

Aramaic

Arabic

Judeo-Arabic

Greek

Loanwords

Fast to ingest, lossless at scale, reproducible by default.

Warm throughput

Full Talmud Bavli

Whole Sefaria corpus

Lossless on 286M words

Per token (uint16)

Context free

One call to tokenize, one to bring it back.

What teams ask before integrating.

Tokenize Semitic text the way it was written.