Tucano: Advancing Neural Text Generation for Portuguese


University of Bonn
1Center for Science and Thought

2High Performance Computing and Analytics Lab

3Helmholtz-Institut für Strahlen- und Kernphysik

4Institute for Science and Ethics

5Institute for Computer Science


Tucano is a series of natively pre-trained open-source Portuguese language models. We release them under the permissive Apache 2.0 license on GitHub and Hugging Face for community use and further development.

Abstract

Significant advances have been made in natural language processing in recent years. However, our current deep learning approach to language modeling requires substantial resources in terms of data and computation. One of the side effects of this data-hungry paradigm is the current schism between languages, separating those considered high-resource, where most of the development happens and resources are available, and the low-resource ones, which struggle to attain the same level of performance and autonomy. This study aims to introduce a new set of resources to stimulate the future development of neural text generation in Portuguese. In this work, we document the development of GigaVerbo, a concatenation of deduplicated Portuguese text corpora amounting to 200 billion tokens. Via this corpus, we trained a series of decoder-transformers named Tucano. Our models perform on par with or better than other Portuguese and multilingual language models of similar size on several Portuguese benchmarks. The evaluation of our models also reveals that model performance on many benchmarks currently used by the Portuguese NLP community has little to no correlation with the scaling of token ingestion during training, highlighting the limitations of such evaluations for assessing Portuguese generative language models. All derivatives of our study are openly released on GitHub and Hugging Face.

Our study brings the following advancements to the Portuguese NLP community:

  1. The concatenation of a larger, higher-quality dataset for Portuguese language modeling (GigaVerbo).
  2. The development of learned filters and datasets to improve text pre-processing for Portuguese.
  3. Pushing self-supervised pretraining beyond the 500-billion-token mark for monolingual Portuguese models.
  4. The development of new, low-resource, efficient, and effective open-source language models for Portuguese (Tucano).
  5. A critical assessment and comparison of currently available benchmarks for Portuguese language models.

An Anthology of Portuguese LLM Development

Our study provides a historical overview of Portuguese LLM development from 2020 to October 2024, allowing readers to better situate our work in relation to past efforts.

This image illustrates several Portuguese language model releases from 2020 to June 2024.

GigaVerbo

GigaVerbo contains over 145 million documents, amounting to 780 GB of text. Much like other large text datasets, GigaVerbo was formed by concatenating several portions of openly available Portuguese datasets and deduplicating the resulting collection.
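
As an illustration of the kind of operation involved (a minimal sketch, not the exact pipeline used to build GigaVerbo, which may also rely on fuzzy methods such as MinHash), exact deduplication can be performed by hashing a normalized version of each document and keeping only the first occurrence of every hash:

```python
# A minimal sketch of exact document-level deduplication, not the exact
# pipeline used to build GigaVerbo. Documents whose normalized text is
# identical are kept only once.
import hashlib

def deduplicate(documents):
    seen = set()
    unique_docs = []
    for doc in documents:
        # Normalize lightly so trivial whitespace/case differences collapse.
        key = hashlib.sha256(" ".join(doc.lower().split()).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique_docs.append(doc)
    return unique_docs

corpus = ["Olá, mundo!", "olá,  mundo!", "Um texto distinto."]
print(deduplicate(corpus))  # ['Olá, mundo!', 'Um texto distinto.']
```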

Our dataset is composed of the following subsets:

Subset Nº of Samples % Description
monoHPLT-PT 58,244,012 40.09% The clean and deduplicated Portuguese portion of the High-Performance Language Technologies resources dataset.
CrawlPT 43,846,974 30.17% A deduplicated Portuguese corpus extracted from various web pages, concatenated from CC-100, Oscar, and BrWaC.
Multilingual-C4 16,092,571 11.07% The Brazilian Portuguese cleaned portion of the m-C4 dataset.
Common Crawl 12,470,998 8.58% A clean and deduplicated snapshot of the Common Crawl dataset (CC-MAIN-2023-23).
BlogSet-BR 4,321,181 2.97% A collection of blog posts written in Brazilian Portuguese.
Instruct-PTBR 2,962,856 2.04% A mix of multiple instruction datasets for various tasks, machine-translated from English to Brazilian Portuguese.
Corpus Carolina 2,075,395 1.43% An open corpus with varied typology in contemporary Brazilian Portuguese.
UltrachatBR 1,255,091 0.86% A Portuguese version (machine-translated) of the Ultrachat dataset.
Wikipedia 1,101,475 0.76% Cleaned Portuguese articles built from the Wikipedia dumps.
CulturaX 999,994 0.69% The Portuguese portion of CulturaX, a multilingual dataset with 167 languages.
LegalPT 925,522 0.64% A concatenation of publicly available legal data in Portuguese, including legislation, jurisprudence, and legal articles.
Gpt4All 808,803 0.56% A Portuguese (machine-translated) version of the Gpt4All dataset.
Bactrian-X 66,994 < 0.1% The Portuguese portion of Bactrian-X, a collection of instruction-response pairs in 52 languages.
XL-Sum 64,577 < 0.1% A Portuguese (machine-translated) version of XL-Sum, a diverse dataset for abstractive summarization.
Dolly 15K 28,401 < 0.1% A Portuguese (machine-translated) version of Dolly 15K, an open-source dataset of instruction-following records generated by human annotators.
CosmosQA 25,260 < 0.1% A Portuguese (machine-translated) version of the CosmosQA dataset for commonsense-based reading comprehension.
ROOTS 10,740 < 0.1% The Portuguese portion of the ROOTS corpus, a dataset spanning 59 languages.

GigaVerbo is currently hosted on Hugging Face. More information can be found in its dataset card.
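
For illustration, the dataset can be streamed with the datasets library; the repository id and the name of the text field below are assumptions, so check the dataset card for the exact values:

```python
# A minimal sketch, assuming the dataset is hosted as "TucanoBR/GigaVerbo"
# and stores its documents in a "text" field (check the dataset card for
# the exact repository id, field names, and available splits).
from datasets import load_dataset

# Streaming avoids downloading the full ~780 GB of text at once.
gigaverbo = load_dataset("TucanoBR/GigaVerbo", split="train", streaming=True)

for i, sample in enumerate(gigaverbo):
    print(sample["text"][:200])
    if i == 2:
        break
```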

GigaVerbo Text-Filter

With the help of GPT-4o, we scored over 100,000 samples of GigaVerbo in terms of their overall quality. With this dataset of scored samples, we trained classifiers to help us filter low-quality samples out of GigaVerbo (our text-quality dataset is also available on Hugging Face).

Classifier Class Precision Recall F1-score
LaBSE + XGBoost Low 0.89 0.81 0.85
LaBSE + XGBoost High 0.92 0.96 0.94
BERTimbau Low 0.99 0.97 0.98
BERTimbau High 0.99 0.99 0.99

The table above shows the evaluation scores for both our LaBSE + XGBoost and BERTimbau-based classifiers. All these models are available on Hugging Face.
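
As a sketch of how such a filter can be applied (the repository id and label names below are assumptions; see the model cards on Hugging Face for the actual values), a fine-tuned classifier can be run through the transformers pipeline API to keep only high-quality samples:

```python
# A minimal sketch, not the authors' exact filtering script. The repository
# id and the "High"/"Low" label names are assumptions; check the classifier
# model cards on Hugging Face for the actual values.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="TucanoBR/BERTimbau-base-text-filter",  # assumed repo id
)

samples = [
    "A capital do Brasil é Brasília, inaugurada em 1960.",
    "clique aqui!!! ganhe dinheiro rápido $$$",
]

# Keep only samples the classifier labels as high quality.
kept = [
    text
    for text, pred in zip(samples, classifier(samples))
    if pred["label"].lower() == "high"
]
print(kept)
```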

Tucano

Like many other studies, we used a decoder-only Transformer based on the Llama architecture as the basis for our models. For convenience, we sometimes refer to these models as Tucano small, medium, large, and XL, corresponding respectively to the 160m, 630m, 1b1, and 2b4 models.

Model \(n_{param}\) \(n_{layers}\) \(d_{model}\) \(d_{mlp}\) \(n_{heads}\) \(n_{KV-heads}\) \(d_{head}\) \(c_{length}\)
Tucano-160m 162,417,408 12 768 3,072 12 12 64 2048
Tucano-630m 630,253,568 14 2,048 4,096 16 4 128 2048
Tucano-1b1 1,100,048,384 22 2,048 5,632 32 4 64 2048
Tucano-2b4 2,444,618,240 24 2,560 10,240 16 4 160 4096

Each model has a vocabulary size of 32,000. Tucano-160m, 630m, and 1b1 were trained with a context window of 2048 tokens, while the largest model (2b4) was trained with sequences of length 4096. All models use the TeenyTinyLlama tokenizer.
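
To make the table above concrete, the sketch below instantiates a Llama-style configuration matching the Tucano-1b1 row with Hugging Face transformers (an illustration of the architecture, not our actual training code):

```python
# A minimal sketch (not the actual training code): a Llama-style configuration
# matching the Tucano-1b1 hyperparameters listed above.
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=32_000,             # TeenyTinyLlama tokenizer vocabulary
    hidden_size=2048,              # d_model
    intermediate_size=5632,        # d_mlp
    num_hidden_layers=22,          # n_layers
    num_attention_heads=32,        # n_heads (d_head = 2048 / 32 = 64)
    num_key_value_heads=4,         # n_KV-heads (grouped-query attention)
    max_position_embeddings=2048,  # context length
)

model = LlamaForCausalLM(config)
print(f"{sum(p.numel() for p in model.parameters()):,}")  # ~1.1 billion parameters
```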

This image illustrates the training loss and perplexity of Tucano models.

For every optimization step, our training logs record the training loss, evaluation loss, current learning rate, and gradient norm. These logs are all available on GitHub.

Benchmarks

During training, we saved several checkpoints for each model at intervals of approximately 10.5 billion tokens. Every checkpoint was evaluated with a comprehensive evaluation harness that contains several native Portuguese evaluations as well as English benchmarks machine-translated into Portuguese.

Benchmark n-shot Origin Type Metric
ENEM 3-shot Native Q&A acc
BLUEX 3-shot Native Q&A acc
OAB Exams 3-shot Native Q&A acc
ASSIN2 RTE 15-shot Native Entailment f1 macro
ASSIN2 STS 10-shot Native Similarity pearson
FAQUAD NLI 15-shot Native Entailment f1 macro
HateBR 25-shot Native Classification f1 macro
PT Hate Speech 25-shot Native Classification f1 macro
TweetSentBR 25-shot Native Classification f1 macro
CALAME-PT 0-shot Native Next Word Prediction acc
ARC-Challenge 25-shot Translated Q&A acc norm
HellaSwag 10-shot Translated Q&A acc norm
TruthfulQA 0-shot Translated Q&A bleurt
LAMBADA 0-shot Translated Next Word Prediction acc

To learn how to replicate our usage of this harness, please visit the evaluation section of our GitHub repository.
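
As an illustration only (not the exact configuration used in this study), the sketch below shows how an evaluation of this kind could be run with the Python API of EleutherAI's lm-evaluation-harness, assuming a fork in which Portuguese tasks are registered; the task name and repository id are placeholders:

```python
# A hedged sketch, not the exact evaluation setup used for Tucano. It assumes
# a harness fork where a Portuguese task (here called "calame_pt") is
# registered; the task name and repository id are placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=TucanoBR/Tucano-1b1,dtype=bfloat16",  # assumed repo id
    tasks=["calame_pt"],  # placeholder task name
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])
```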

Our evaluations revealed that, for several benchmarks, token ingestion (i.e., how many tokens a model has been pre-trained on) does not appear to correlate with benchmark performance. Hence, we hypothesize that good performance on such benchmarks (i.e., above what a random guesser would achieve) might reflect not language modeling capability gained during pretraining but overfitting to the style of evaluation these benchmarks impose.

For example, certain benchmarks show no change in performance regardless of the number of tokens ingested:

Evaluation scores as a function of token ingestion for the BLUEX benchmark.

Other benchmarks, in contrast, prove to be an indicator of model improvement in terms of language modeling capabilities as pretraining progresses:

Evaluation scores as a function of token ingestion for the CALAME-PT benchmark.
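
One simple way to quantify this behavior (a sketch with made-up numbers, not our actual checkpoint scores) is to correlate token ingestion with checkpoint performance for each benchmark:

```python
# A sketch with illustrative, made-up numbers (not actual Tucano checkpoint
# scores): correlate tokens ingested with benchmark accuracy per benchmark.
from scipy.stats import pearsonr

tokens_ingested = [10.5e9, 52.5e9, 105e9, 210e9, 315e9, 420e9]  # hypothetical checkpoints

# Hypothetical scores: one benchmark stays flat, the other improves steadily.
flat_benchmark = [0.26, 0.24, 0.25, 0.27, 0.25, 0.26]
rising_benchmark = [0.31, 0.36, 0.42, 0.47, 0.52, 0.56]

for name, scores in [("flat", flat_benchmark), ("rising", rising_benchmark)]:
    r, p = pearsonr(tokens_ingested, scores)
    print(f"{name}: Pearson r = {r:.2f} (p = {p:.3f})")
```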

Results

Focusing only on the benchmarks that showed a significant correlation between language modeling pretraining and performance, we see that our largest models outperformed several multilingual and natively pre-trained LLMs of similar size across nearly all benchmarks, including the recently released Llama-3.2-1B. Our models also outperformed larger multilingual models, such as Bloom-1b7.

Model Average Calame-PT Lambada-PT ARC-PT HellaSwag-PT
Llama-3.2-3B 52 58.43 49.1 43.25 57.2
Tucano-2b4 43.58 59.06 37.67 30.43 47.17
Llama-3.2-1B 42.95 51.83 41.02 33.5 45.44
Tucano-1b1 41.55 58.24 34.7 30.43 42.84
Gemma-2b 40.38 51.16 39.88 37.95 32.53
Bloom-1b7 40.37 55.64 31.98 30.34 43.52
Tucano-630m 39.5 56.55 33.13 28.89 39.41
Gemma-2-2b 39.21 56.7 47.1 24.19 28.85
Bloom-1b1 38.18 52.94 30.22 29.83 39.74
GlórIA-1b3 36.05 52.79 27.71 26.67 37.04
Tucano-160m 35.14 52.31 28.16 27.01 33.07
Xglm-564m 34.55 50.58 27.42 25.56 34.64
Bloom-560m 34.32 49.95 25.44 24.74 37.15
TTL-460m 33.78 49.42 23.29 29.4 33
mGPT-1b3 31.81 47.14 29.92 23.81 26.37
TTL-160m 30.78 46.72 20.98 26.15 29.29
Lola-v1 30.19 26.4 18.32 30.42 45.61
GPorTuguese 28.92 40.61 22.98 22.48 29.62

Meanwhile, our custom Portuguese evaluation, based on the AlpacaEval methodology, shows that our instruction-tuned models, Tucano-1b1-Instruct and Tucano-2b4-Instruct, produce outputs that are preferred over those of much larger models, such as Sabiá-7b and Gervásio-7b.


Model Avg. Length Wins Base Wins Total Matches Length-Controlled Win Rate (%) LC Std. Error
Llama-3.2-3B-Instruct 1609 257 548 805 21.06 0.075
Tucano-2b4-Instruct 1843 151 654 805 13.00 0.071
Tucano-1b1-Instruct 1667 124 681 805 8.80 0.083
Llama-3.2-1B-Instruct 1429 99 706 805 7.15 0.057
TeenyTinyLlama-460m-Chat 1333 28 777 805 2.84 0.059
Sabiá-7b 5011 1 804 805 0.076 0.0043
Gervásio-7b 5740 1 804 805 0.026 0.0016

All evaluations for all benchmarks that form our custom harness are available on our GitHub repository.

How To Use

All our models are available on Hugging Face and can easily be used by any application in compliance with the Apache 2.0 license.

Example code to use Tucano models.
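
A minimal sketch of such usage with the transformers library (the repository id below is an assumption; check our Hugging Face page for the exact model names):

```python
# A minimal sketch for running a Tucano model with Hugging Face transformers.
# The repository id is assumed; see the Tucano collection on Hugging Face for
# the exact model names.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="TucanoBR/Tucano-2b4",  # assumed repo id
    device_map="auto",
)

prompt = "A floresta amazônica é conhecida por"
output = generator(prompt, max_new_tokens=50, do_sample=True, temperature=0.7)
print(output[0]["generated_text"])
```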

The Tucano series significantly contributes to the Portuguese NLP community in several ways. All models, along with intermediary checkpoints, datasets, code implementations, and logs, are freely accessible through the repositories associated with this study, setting the Tucano series apart from several other works.

Model #models #ckpts
Tucano 6 111
TeenyTinyLlama 3 70
GPorTuguese 1 1
PTT5 6 1
RoBERTaLexPT 2 3
Albertina 8 1
BERTimbau 2 1
DeBERTinha 1 1
Gervásio 2 1
PTT5-v2 4 1
BERTabaporu 2 1
Glória 1 1
Sabiá 1 1
Sabiá-2 2 None
Sabiá-3 1 None

When it comes to open (and reproducible) development, many aspects of past studies remain closed. Given the amount of computing required to practice deep learning at this scale, a lack of reusable code and materials can seriously slow the Portuguese NLP community's progress and hinder its sustainability. With the development of Tucano and GigaVerbo, we hope to make this scenario more open and sustainable.

Acknowledgments

We gratefully acknowledge the access granted to the Marvin cluster hosted by the University of Bonn, along with the support provided by its High Performance Computing & Analytics Lab.

BibTeX


@misc{correa2024tucanoadvancingneuraltext,
      title={{Tucano: Advancing Neural Text Generation for Portuguese}}, 
      author={Corr{\^e}a, Nicholas Kluge and Sen, Aniket and Falk, Sophia and Fatimah, Shiza},
      year={2024},
      eprint={2411.07854},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2411.07854}, 
}