Tucano: Advancing Neural Text Generation for Portuguese


University of Bonn
1Center for Science and Thought

2High Performance Computing and Analytics Lab

3Helmholtz-Institut für Strahlen- und Kernphysik

4Institute for Science and Ethics

5Institute for Computer Science


Tucano is a series of natively pre-trained open-source Portuguese language models. We release them under the permissive Apache 2.0 license on GitHub and Hugging Face for community use and further development.

Abstract

Significant advances have been made in natural language processing in recent years. However, our current deep learning approach to language modeling requires substantial resources in terms of data and computation. One of the side effects of this data-hungry paradigm is the current schism between languages, separating those considered high-resource, where most of the development happens and resources are available, and the low-resource ones, which struggle to attain the same level of performance and autonomy. This study aims to introduce a new set of resources to stimulate the future development of neural text generation in Portuguese. In this work, we document the development of GigaVerbo, a concatenation of deduplicated Portuguese text corpora amounting to 200 billion tokens. Via this corpus, we trained a series of decoder-transformers named Tucano. Our models perform on par with or better than other Portuguese and multilingual language models of similar size on several Portuguese benchmarks. The evaluation of our models also reveals that model performance on many benchmarks currently used by the Portuguese NLP community has little to no correlation with the scaling of token ingestion during training, highlighting the limitations of such evaluations for assessing Portuguese generative language models. All derivatives of our study are openly released on GitHub and Hugging Face.

Our study brings the following advancements to the Portuguese NLP community:

  1. The concatenation of a larger, higher-quality dataset for Portuguese language modeling (GigaVerbo).
  2. The development of learned filters and datasets to improve text pre-processing for Portuguese.
  3. Pushing self-supervised pretraining beyond the 500-billion-token mark for monolingual Portuguese models.
  4. The development of new, low-resource, efficient, and effective open-source language models for Portuguese (Tucano).
  5. A critical assessment and comparison of currently available benchmarks for Portuguese language models.

An Anthology of Portuguese LLM Development

Our study provides a historical overview of Portuguese LLM development from 2020 to October 2024, allowing readers to better situate our work in relation to past efforts.

This image illustrates several Portuguese language model releases from 2020 to June 2024.

GigaVerbo

GigaVerbo contains over 145 million documents, amounting to 780 GB of text. Much like other large text datasets, GigaVerbo was formed by concatenating several portions of openly available Portuguese datasets and deduplicating the resulting collection.
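
As an illustration of the kind of operation involved (a minimal sketch, not the exact pipeline used to build GigaVerbo, which may also rely on fuzzy methods such as MinHash), exact deduplication can be performed by hashing a normalized version of each document and keeping only the first occurrence of every hash:

```python
# A minimal sketch of exact document-level deduplication, not the exact
# pipeline used to build GigaVerbo. Documents whose normalized text is
# identical are kept only once.
import hashlib

def deduplicate(documents):
    seen = set()
    unique_docs = []
    for doc in documents:
        # Normalize lightly so trivial whitespace/case differences collapse.
        key = hashlib.sha256(" ".join(doc.lower().split()).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique_docs.append(doc)
    return unique_docs

corpus = ["Olá, mundo!", "olá,  mundo!", "Um texto distinto."]
print(deduplicate(corpus))  # ['Olá, mundo!', 'Um texto distinto.']
```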

Our dataset is composed of the following subsets:

Subset Nº of Samples % Description
monoHPLT-PT 58,244,012 40.09% The clean and deduplicated Portuguese portion of the High-Performance Language Technologies resources dataset.
CrawlPT 43,846,974 30.17% A deduplicated Portuguese corpus extracted from various web pages, concatenated from CC-100, Oscar, and BrWaC.
Multilingual-C4 16,092,571 11.07% The Brazilian Portuguese cleaned portion of the m-C4 dataset.
Common Crawl 12,470,998 8.58% A clean and deduplicated snapshot of the Common Crawl dataset (CC-MAIN-2023-23).
BlogSet-BR 4,321,181 2.97% A collection of blog posts written in Brazilian Portuguese.
Instruct-PTBR 2,962,856 2.04% A mix of multiple instruction datasets for various tasks, machine-translated from English to Brazilian Portuguese.
Corpus Carolina 2,075,395 1.43% An open corpus with varied typology in contemporary Brazilian Portuguese.
UltrachatBR 1,255,091 0.86% A Portuguese version (machine-translated) of the Ultrachat dataset.
Wikipedia 1,101,475 0.76% Cleaned Portuguese articles built from the Wikipedia dumps.
CulturaX 999,994 0.69% The Portuguese portion of CulturaX, a multilingual dataset with 167 languages.
LegalPT 925,522 0.64% A concatenation of publicly available legal data in Portuguese, including legislation, jurisprudence, and legal articles.
Gpt4All 808,803 0.56% A Portuguese (machine-translated) version of the Gpt4All dataset.
Bactrian-X 66,994 < 0.1% The Portuguese portion of Bactrian-X, a collection of instruction-response pairs in 52 languages.
XL-Sum 64,577 < 0.1% A Portuguese (machine-translated) version of XL-Sum, a diverse dataset for abstractive summarization.
Dolly 15K 28,401 < 0.1% A Portuguese (machine-translated) version of Dolly 15K, an open-source dataset of instruction-following records generated by human annotators.
CosmosQA 25,260 < 0.1% A Portuguese (machine-translated) version of the CosmosQA dataset for commonsense-based reading comprehension.
ROOTS 10,740 < 0.1% The Portuguese portion of the ROOTS corpus, a dataset spanning 59 languages.

GigaVerbo is currently hosted on Hugging Face. More information can be found in its dataset card.
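
For illustration, the dataset can be streamed with the datasets library; the repository id and the name of the text field below are assumptions, so check the dataset card for the exact values:

```python
# A minimal sketch, assuming the dataset is hosted as "TucanoBR/GigaVerbo"
# and stores its documents in a "text" field (check the dataset card for
# the exact repository id, field names, and available splits).
from datasets import load_dataset

# Streaming avoids downloading the full ~780 GB of text at once.
gigaverbo = load_dataset("TucanoBR/GigaVerbo", split="train", streaming=True)

for i, sample in enumerate(gigaverbo):
    print(sample["text"][:200])
    if i == 2:
        break
```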

GigaVerbo Text-Filter

With the help of GPT-4o, we scored over 100,000 samples of GigaVerbo in terms of their overall quality. With this dataset of scored samples, we trained classifiers to help us filter low-quality samples out of GigaVerbo (our text-quality dataset is also available on Hugging Face).

Classifier Class Precision Recall F1-score
LaBSE + XGBoost Low 0.89 0.81 0.85
LaBSE + XGBoost High 0.92 0.96 0.94
BERTimbau Low 0.99 0.97 0.98
BERTimbau High 0.99 0.99 0.99

The table above shows the evaluation scores for both our LaBSE + XGBoost and BERTimbau-based classifiers. All these models are available on Hugging Face.
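
As a sketch of how such a filter can be applied (the repository id and label names below are assumptions; see the model cards on Hugging Face for the actual values), a fine-tuned classifier can be run through the transformers pipeline API to keep only high-quality samples:

```python
# A minimal sketch, not the authors' exact filtering script. The repository
# id and the "High"/"Low" label names are assumptions; check the classifier
# model cards on Hugging Face for the actual values.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="TucanoBR/BERTimbau-base-text-filter",  # assumed repo id
)

samples = [
    "A capital do Brasil é Brasília, inaugurada em 1960.",
    "clique aqui!!! ganhe dinheiro rápido $$$",
]

# Keep only samples the classifier labels as high quality.
kept = [
    text
    for text, pred in zip(samples, classifier(samples))
    if pred["label"].lower() == "high"
]
print(kept)
```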

Tucano

Like many other studies, we used a decoder-only Transformer based on the Llama architecture as the basis for our models. For convenience, we sometimes refer to these models as Tucano small, medium, large, and XL, corresponding respectively to the 160m, 630m, 1b1, and 2b4 models.

Model \(n_{param}\) \(n_{layers}\) \(d_{model}\) \(d_{mlp}\) \(n_{heads}\) \(n_{KV-heads}\) \(d_{head}\) \(c_{length}\)
Tucano-160m 162,417,408 12 768 3,072 12 12 64 2048
Tucano-630m 630,253,568 14 2,048 4,096 16 4 128 2048
Tucano-1b1 1,100,048,384 22 2,048 5,632 32 4 64 2048
Tucano-2b4 2,444,618,240 24 2,560 10,240 16 4 160 4096

Each model has a vocabulary size of 32,000. Tucano-160m, 630m, and 1b1 were trained with a context window of 2048 tokens, while the largest model (2b4) was trained with sequences of length 4096. All models use the TeenyTinyLlama tokenizer.
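
To make the table above concrete, the sketch below instantiates a Llama-style configuration matching the Tucano-1b1 row with Hugging Face transformers (an illustration of the architecture, not our actual training code):

```python
# A minimal sketch (not the actual training code): a Llama-style configuration
# matching the Tucano-1b1 hyperparameters listed above.
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=32_000,             # TeenyTinyLlama tokenizer vocabulary
    hidden_size=2048,              # d_model
    intermediate_size=5632,        # d_mlp
    num_hidden_layers=22,          # n_layers
    num_attention_heads=32,        # n_heads (d_head = 2048 / 32 = 64)
    num_key_value_heads=4,         # n_KV-heads (grouped-query attention)
    max_position_embeddings=2048,  # context length
)

model = LlamaForCausalLM(config)
print(f"{sum(p.numel() for p in model.parameters()):,}")  # ~1.1 billion parameters
```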

This image illustrates the training loss and perplexity of Tucano models.

For every optimization step, our training logs record the training loss, evaluation loss, current learning rate, and gradient norm. These logs are all available on GitHub.

Benchmarks

During training, we saved several checkpoints for each model at intervals of approximately 10.5 billion tokens. Every checkpoint was evaluated with a comprehensive evaluation harness that contains several native Portuguese evaluations as well as English benchmarks machine-translated into Portuguese.

Benchmark n-shot Origin Type Metric
ENEM 3-shot Native Q&A acc
BLUEX 3-shot Native Q&A acc
OAB Exams 3-shot Native Q&A acc
ASSIN2 RTE 15-shot Native Entailment f1 macro
ASSIN2 STS 10-shot Native Similarity pearson
FAQUAD NLI 15-shot Native Entailment f1 macro
HateBR 25-shot Native Classification f1 macro
PT Hate Speech 25-shot Native Classification f1 macro
TweetSentBR 25-shot Native Classification f1 macro
CALAME-PT 0-shot Native Next Word Prediction acc
ARC-Challenge 25-shot Translated Q&A acc norm
HellaSwag 10-shot Translated Q&A acc norm
TruthfulQA 0-shot Translated Q&A bleurt
LAMBADA 0-shot Translated Next Word Prediction acc

To learn how to replicate our usage of this harness, please visit the evaluation section of our GitHub repository.
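
As an illustration only (not the exact configuration used in this study), the sketch below shows how an evaluation of this kind could be run with the Python API of EleutherAI's lm-evaluation-harness, assuming a fork in which Portuguese tasks are registered; the task name and repository id are placeholders:

```python
# A hedged sketch, not the exact evaluation setup used for Tucano. It assumes
# a harness fork where a Portuguese task (here called "calame_pt") is
# registered; the task name and repository id are placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=TucanoBR/Tucano-1b1,dtype=bfloat16",  # assumed repo id
    tasks=["calame_pt"],  # placeholder task name
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])
```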

Our evaluations revealed that, for several benchmarks, token ingestion (i.e., how many tokens a model has been pre-trained on) does not appear to correlate with benchmark performance. Hence, we hypothesize that good performance on such benchmarks (i.e., above what a random guesser would achieve) might reflect not language modeling capability gained during pretraining but overfitting to the style of evaluation these benchmarks impose.

For example, certain benchmarks show no change in performance regardless of the number of tokens ingested:

Evaluation scores as a function of token ingestion for the BLUEX benchmark.

Other benchmarks, in contrast, prove to be an indicator of model improvement in terms of language modeling capabilities as pretraining progresses:

Evaluation scores as a function of token ingestion for the CALAME-PT benchmark.
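
One simple way to quantify this behavior (a sketch with made-up numbers, not our actual checkpoint scores) is to correlate token ingestion with checkpoint performance for each benchmark:

```python
# A sketch with illustrative, made-up numbers (not actual Tucano checkpoint
# scores): correlate tokens ingested with benchmark accuracy per benchmark.
from scipy.stats import pearsonr

tokens_ingested = [10.5e9, 52.5e9, 105e9, 210e9, 315e9, 420e9]  # hypothetical checkpoints

# Hypothetical scores: one benchmark stays flat, the other improves steadily.
flat_benchmark = [0.26, 0.24, 0.25, 0.27, 0.25, 0.26]
rising_benchmark = [0.31, 0.36, 0.42, 0.47, 0.52, 0.56]

for name, scores in [("flat", flat_benchmark), ("rising", rising_benchmark)]:
    r, p = pearsonr(tokens_ingested, scores)
    print(f"{name}: Pearson r = {r:.2f} (p = {p:.3f})")
```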

Results

Focusing only on the benchmarks that showed a significant correlation between language modeling pretraining and performance, we see that our largest models outperformed several multilingual and natively pre-trained LLMs of similar size across nearly all benchmarks, including the recently released Llama-3.2-1B. Our models also outperformed larger multilingual models, such as Bloom-1b7.

Model Average Calame-PT Lambada-PT ARC-PT HellaSwag-PT
Llama-3.2-3B 52 58.43 49.1 43.25 57.2
Tucano-2b4 43.58 59.06 37.67 30.43 47.17
Llama-3.2-1B 42.95 51.83 41.02 33.5 45.44
Tucano-1b1 41.55 58.24 34.7 30.43 42.84
Gemma-2b 40.38 51.16 39.88 37.95 32.53
Bloom-1b7 40.37 55.64 31.98 30.34 43.52
Tucano-630m 39.5 56.55 33.13 28.89 39.41
Gemma-2-2b 39.21 56.7 47.1 24.19 28.85
Bloom-1b1 38.18 52.94 30.22 29.83 39.74
GlórIA-1b3 36.05 52.79 27.71 26.67 37.04
Tucano-160m 35.14 52.31 28.16 27.01 33.07
Xglm-564m 34.55 50.58 27.42 25.56 34.64
Bloom-560m 34.32 49.95 25.44 24.74 37.15
TTL-460m 33.78 49.42 23.29 29.4 33
mGPT-1b3 31.81 47.14 29.92 23.81 26.37
TTL-160m 30.78 46.72 20.98 26.15 29.29
Lola-v1 30.19 26.4 18.32 30.42 45.61
GPorTuguese 28.92 40.61 22.98 22.48 29.62

Meanwhile, our custom Portuguese evaluation, based on the AlpacaEval methodology, shows that our instruction-tuned models, Tucano-1b1-Instruct and Tucano-2b4-Instruct, produce outputs that are preferred over those of much larger models, such as Sabiá-7b and Gervásio-7b.


Model Avg. Length Wins Base Wins Total Matches Length-Controlled Win Rate (%) LC Std. Error
Llama-3.2-3B-Instruct 1609 257 548 805 21.06 0.075
Tucano-2b4-Instruct 1843 151 654 805 13.00 0.071
Tucano-1b1-Instruct 1667 124 681 805 8.80 0.083
Llama-3.2-1B-Instruct 1429 99 706 805 7.15 0.057
TeenyTinyLlama-460m-Chat 1333 28 777 805 2.84 0.059
Sabiá-7b 5011 1 804 805 0.076 0.0043
Gervásio-7b 5740 1 804 805 0.026 0.0016

All evaluations for all benchmarks that form our custom harness are available on our GitHub repository.

How To Use

All our models are available on Hugging Face and can easily be used by any application in compliance with the Apache 2.0 license.

Example code to use Tucano models.
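
A minimal sketch of such usage with the transformers library (the repository id below is an assumption; check our Hugging Face page for the exact model names):

```python
# A minimal sketch for running a Tucano model with Hugging Face transformers.
# The repository id is assumed; see the Tucano collection on Hugging Face for
# the exact model names.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="TucanoBR/Tucano-2b4",  # assumed repo id
    device_map="auto",
)

prompt = "A floresta amazônica é conhecida por"
output = generator(prompt, max_new_tokens=50, do_sample=True, temperature=0.7)
print(output[0]["generated_text"])
```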

The Tucano series significantly contributes to the Portuguese NLP community in several ways. All models, along with intermediary checkpoints, datasets, code implementations, and logs, are freely accessible through the repositories associated with this study, setting the Tucano series apart from several other works.

Model #models #ckpts
Tucano 6 111
TeenyTinyLlama 3 70
GPorTuguese 1 1
PTT5 6 1
RoBERTaLexPT 2 3
Albertina 8 1
BERTimbau 2 1
DeBERTinha 1 1
Gervásio 2 1
PTT5-v2 4 1
BERTabaporu 2 1
Glória 1 1
Sabiá 1 1
Sabiá-2 2 None
Sabiá-3 1 None

When it comes to open (and reproducible) development, many aspects of past studies remain closed. Given the amount of computing required to practice deep learning at this scale, a lack of reusable code and materials can seriously slow the Portuguese NLP community's progress and hinder its sustainability. With the development of Tucano and GigaVerbo, we hope to make this scenario more open and sustainable.

Acknowledgments

We gratefully acknowledge the access granted to the Marvin cluster hosted by the University of Bonn, along with the support provided by its High Performance Computing & Analytics Lab.

BibTeX


@misc{correa2024tucanoadvancingneuraltext,
      title={{Tucano: Advancing Neural Text Generation for Portuguese}}, 
      author={Corr{\^e}a, Nicholas Kluge and Sen, Aniket and Falk, Sophia and Fatimah, Shiza},
      year={2024},
      eprint={2411.07854},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2411.07854}, 
}