LilTii: A 0.6B Bengali Language Model that Outperforms Qwen

Published in Hugging Face Blog, 2026

Abstract

Large multilingual foundation models pretty much run the show in NLP right now, but that dominance has also made language inequality worse: low-resource languages often get the short end of the stick. In this work, we introduce LilTii, a 0.6B-parameter Bengali language model trained completely from scratch to help close that gap. Unlike earlier Bengali models that simply continue training from large, opaque multilingual models, LilTii is built through a fully transparent, reproducible pipeline and is specifically designed to work well even in limited-compute environments. To make this happen, we compiled a high-quality Bengali corpus using both heuristic and learned filtering (LLM-as-a-judge), and supplemented it with carefully curated English data for bilingual augmentation. Using this dataset, we experiment with different training recipes for small-scale Bengali models. Across a wide range of Bengali benchmarks, LilTii consistently outperforms similarly sized multilingual models such as Qwen2.5-0.5B and Qwen3-0.6B. The takeaway? There is still room for pre-training in the small-scale, low-resource language scene.
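To give a flavor of the heuristic side of the corpus filtering mentioned above, here is a minimal sketch of a script-based document filter. The function name, thresholds, and criteria are hypothetical illustrations, not the actual pipeline from the post; a real pipeline would combine many such rules with the learned LLM-as-a-judge stage.

```python
import re

# The Bengali Unicode block spans U+0980 through U+09FF.
BENGALI_CHARS = re.compile(r"[\u0980-\u09FF]")

def keep_document(text: str,
                  min_chars: int = 50,
                  min_bengali_ratio: float = 0.5) -> bool:
    """Hypothetical heuristic filter: keep a document only if it is
    long enough and written predominantly in Bengali script.
    Thresholds are illustrative, not the values used for LilTii."""
    stripped = "".join(text.split())  # ignore whitespace when measuring
    if len(stripped) < min_chars:
        return False  # too short to be useful training text
    bengali = len(BENGALI_CHARS.findall(stripped))
    return bengali / len(stripped) >= min_bengali_ratio
```

A filter like this cheaply discards boilerplate and non-Bengali pages before the more expensive learned judge is applied.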

BibTeX

@misc{fatimah2026liltii,
  title={LilTii: A 0.6B Bengali Language Model that Outperforms Qwen},
  author={Shiza Fatimah and Aniket Sen and Sophia Falk and Florian Mai and Lucie Flek and Nicholas Kluge Corr{\^e}a},
  year={2026},
  howpublished={\url{https://hf.co/blog/Polygl0t/liltii}}
}