Tucano: a series of decoder-transformers natively pre-trained in Portuguese


To advance the open development of neural text generation in Portuguese, we present both GigaVerbo, a concatenation of deduplicated Portuguese text corpora amounting to 200 billion tokens, and Tucano, a series of decoder-transformers natively pre-trained in Portuguese. All byproducts of our study, including the source code used for training and evaluation, are openly released on GitHub and Hugging Face.
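As a quick taste of the released artifacts, the snippet below loads a Tucano checkpoint and generates text with 🤗 Transformers. This is a minimal sketch: the repository ID `TucanoBR/Tucano-2b4` and the sampling settings are assumptions, so adjust them to the checkpoint you want to use.

```python
# Minimal text-generation sketch for a Tucano checkpoint via 🤗 Transformers.
# The repo ID below is an assumption based on the project's Hugging Face
# organization; swap in whichever Tucano size you want to run.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TucanoBR/Tucano-2b4"  # assumed repository ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Prompt in Portuguese, since the model is natively pre-trained in Portuguese.
inputs = tokenizer("A floresta amazônica é", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```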

News 🚀

📣 Tucanos are fun, but we also want to help build tools for other languages! New releases of the Tucano project, as well as new resources for other low-resource languages, will soon be available in our new organization: Polyglot! Polyglot is a research project at the University of Bonn that seeks to aid the development of foundation models for low-resource languages. If you like Tucanos, follow Polyglot to stay up to date with our releases.

Community Contributions 🤝

  • Demo on how to run inference on ViTucano 👉 Open In Colab

  • Demo on how to run inference on Tucano 👉 Open In Colab

  • Demo on how to create a simple Chat UI for Tucano using Gradio 🚀 Open In Colab (see the sketch after this list)

  • Tucano OpenVINO is a port of Tucano-2b4-Instruct optimized for Intel's OpenVINO inference toolkit.
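To complement the Gradio demo linked above, here is a minimal sketch of such a chat UI. It assumes the `TucanoBR/Tucano-2b4-Instruct` checkpoint, a tokenizer that ships a chat template, and Gradio's tuple-style chat history; none of these details come from the official demo, so adapt them as needed.

```python
# Minimal Gradio chat UI sketch for an instruction-tuned Tucano checkpoint.
# Assumptions: the repo ID below, a tokenizer that ships a chat template, and
# Gradio 4.x's tuple-style history (a list of [user, assistant] pairs).
import gradio as gr
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TucanoBR/Tucano-2b4-Instruct"  # assumed repository ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

def respond(message, history):
    # Rebuild the conversation in the format expected by the chat template.
    messages = []
    for user_msg, bot_msg in history:
        messages.append({"role": "user", "content": user_msg})
        messages.append({"role": "assistant", "content": bot_msg})
    messages.append({"role": "user", "content": message})
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    )
    output_ids = model.generate(
        input_ids, max_new_tokens=256, do_sample=True, temperature=0.7
    )
    # Decode only the newly generated tokens, not the echoed prompt.
    return tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True)

gr.ChatInterface(respond).launch()
```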