Our study brings the following advances to the Portuguese NLP community:
- The concatenation of a larger, higher-quality dataset for Portuguese language modeling (GigaVerbo).
- The development of learned filters and datasets to improve text pre-processing for Portuguese.
- Pushing self-supervised pretraining beyond the 500-billion-token mark for Portuguese monolingual models.
- The development of new, low-resource, efficient, and effective open-source language models for Portuguese (Tucano).
- A critical assessment and comparison of currently available benchmarks for Portuguese language models.