Tucano: a series of decoder-transformers natively pre-trained in Portuguese
To stimulate future open development of neural text generation in Portuguese, we present both GigaVerbo, a concatenation of deduplicated Portuguese text corpora amounting to 200 billion tokens, and Tucano, a series of decoder-transformers natively pre-trained in Portuguese. All byproducts of our study, including the source code used for training and evaluation, are openly released on GitHub and Hugging Face.
News 🚀
📣 Tucanos are fun, but we also want to help build tools for other languages! New releases of the Tucano project, as well as new resources for other low-resource languages, will soon be available in our new organization: Polyglot! Polyglot is a research project from the University of Bonn that seeks to aid the development of foundation models for low-resource languages. So, if you like Tucanos, follow Polyglot to stay up to date with our new releases.
- [24/07/2025] Peer-reviewed article “Tucano: Advancing Neural Text Generation for Portuguese” is published in Patterns, with all models and datasets released on Hugging Face.
- [13/01/2025] We release ViTucano, a pair of vision assistants natively pre-trained in Portuguese (ViTucano-1b5-v1, ViTucano-2b8-v1).
- [13/01/2025] We release the datasets used to pre-train and fine-tune the ViTucano models: ViTucano-Pretrain and ViTucano-SFT.
- [29/11/2024] Tucano is mentioned on Deutsche Welle: “Cientistas criam maior banco de dados em português para IA” (“Scientists create the largest database in Portuguese for AI”).
- [27/11/2024] A video presentation of Tucano at C4AI (USP) is available on YouTube.
- [12/11/2024] “Tucano: Advancing Neural Text Generation for Portuguese” is published as a preprint on arXiv, with all models and datasets released on Hugging Face.
Community Contributions 🤝
A demo showing how to create a simple chat UI for Tucano using Gradio 🚀
Tucano OpenVINO is a ported version of Tucano-2b4-Instruct optimized for Intel's OpenVINO inference toolkit.
