PT7 Web is an annotated Portuguese language Corpus built from samples collected from Sep 2018 to Mar 2020 from seven Portuguese-speaking countries: Angola, Brazil, Portugal, Cape Verde, Guinea-Bissau, Macao e Mozambique. The records were filtered from Common Crawl — a public domain petabyte-scale dataset of webpages in many languages, mixed together in temporal snapshots of the web, monthly available [1]. The Brazilian pages were labeled as the positive class and the others as the negative class (non-Brazillian Portuguese).


We introduce a benchmark of distributed algorithms execution over big data. The datasets are composed of metrics about the computational impact (resource usage) of eleven well-known machine learning techniques on a real computational cluster regarding system resource agnostic indicators: CPU consumption, memory usage, operating system processes load, net traffic, and I/O operations. The metrics were collected every five seconds for each algorithm on five different data volume scales, totaling 275 distinct datasets.