Very interesting... I have a similar idea: to train an LLM in Serbian, create even new...

xodn348 · 2026-03-14T22:43:03

Really interesting approach — attacking token efficiency at the encoding level is more fundamental than what I did.

Even without retraining BPE from scratch, starting with YUTF-8 and measuring how existing tokenizers handle it would already be a worthwhile experiment.
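A rough sketch of what that measurement could look like, in Python. Everything here is an assumption for illustration: `units_per_char`, `compare`, and `utf8_bytes` are hypothetical helper names, and since YUTF-8 is not specified in this thread, the script only shows the plain UTF-8 baseline. UTF-8 byte count is used as a stand-in because a byte-level tokenizer can never need more than one token per byte; to get real numbers you would pass in a function that returns the token count from an actual tokenizer, and another one that re-encodes the text first.

```python
from typing import Callable, Dict

# A "counter" maps text to a number of units (bytes, tokens, ...).
Counter = Callable[[str], int]


def utf8_bytes(text: str) -> int:
    """Number of bytes the text takes in standard UTF-8."""
    return len(text.encode("utf-8"))


def units_per_char(text: str, count: Counter = utf8_bytes) -> float:
    """Average units per character; lower means a denser encoding."""
    if not text:
        raise ValueError("text must be non-empty")
    return count(text) / len(text)


def compare(text: str, counters: Dict[str, Counter]) -> Dict[str, float]:
    """Run several counters over the same text and collect the ratios."""
    return {name: units_per_char(text, c) for name, c in counters.items()}


if __name__ == "__main__":
    cyrillic = "Здраво"  # Serbian Cyrillic: 2 bytes per letter in UTF-8
    latin = "Zdravo"     # same word in Latin script: 1 byte per letter
    for sample in (cyrillic, latin):
        print(sample, compare(sample, {"utf-8 bytes": utf8_bytes}))
```

Adding a second entry to the `counters` dict (for example a hypothetical `yutf8_bytes`, or a wrapper around a tokenizer's encode call) would put the baseline and the candidate side by side on the same corpus.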

Hope you find the time to build it, good luck!

oddmiral · 2026-03-15T06:57:43

You can look at the Ukrainian LLM Lapa for inspiration:

https://huggingface.co/spaces/lapa-llm/lapa

Best tokenizer for the Ukrainian language

Thanks to a SOTA method for tokenizer adaptation developed by Mykola Haltiuk as part of this project, it was possible to replace 80,000 tokens out of 250,000 with Ukrainian ones without loss of model quality, thus making Lapa LLM the fastest model for working with the Ukrainian language. Compared to the original Gemma 3, for working with Ukrainian, the model requires 1.5 times fewer tokens, thus performing three times fewer computations to achieve better results.