Global News 60K

Citation Author(s):
Luigi Serreli (Università degli Studi di Cagliari, Facoltà di Ingegneria e Architettura)
Claudio Marche (Università degli Studi di Cagliari, Facoltà di Ingegneria e Architettura)
Michele Nitti (Università degli Studi di Cagliari, Facoltà di Ingegneria e Architettura)
Submitted by: Luigi Serreli
Last updated: Sun, 12/22/2024 - 10:18
DOI: 10.1109/OJCS.2024.3519747
Data Format: *.csv
Research Article Link: Reducing Data Volume in News Topic Classification: Deep Learning Framework and …

Categories:

Keywords:

Topic classification

Text classification; Natural language processing; Deep Learning;

Abstract

Text classification systems have become increasingly important in recent years due to the explosion of online documents and the need to sort them for specific services. One of the most critical issues in text classification is the limited availability and diversity of datasets, which can lead to overfitting and poor generalization. In this context, we present a new dataset named Global News 60K (GN60K), which consists of 60,000 news articles drawn from sources in different parts of the world and covering 10 topics. The dataset provides a rich vocabulary, which is intended to reduce overfitting and help models generalize better.

The topics included in the dataset are Politics, Sports, Entertainment, Science and Technology, Business, Health, Environment, Education, Arts and Culture, and Crime. We selected these topics because they cover a wide range of interests and are commonly used in text classification applications. To further increase the dataset's diversity, we considered articles from different parts of the world, including North America, Europe, Asia, Africa, and South America.

The articles were selected based on their publication dates, which range from 2022 to 2023.

We believe that our dataset will be valuable for researchers and practitioners working on text and topic classification tasks. The GN60K dataset provides a diverse and well-labelled set of documents that can be used for training and testing various machine learning models. Additionally, the dataset can be used to develop new algorithms for topic classification and related tasks. We hope that our dataset will contribute to the advancement of the text classification field and foster new research ideas.

The GN60K dataset was introduced in the publication:
L. Serreli, C. Marche, and M. Nitti, "Reducing Data Volume in News Topic Classification: Deep Learning Framework and Dataset," in IEEE Open Journal of the Computer Society.

The full paper can be cited as:

Plain text format:
L. Serreli, C. Marche, and M. Nitti, "Reducing Data Volume in News Topic Classification: Deep Learning Framework and Dataset," in IEEE Open Journal of the Computer Society, doi: 10.1109/OJCS.2024.3519747.

BibTeX format:
@ARTICLE{10806791,
  author={Serreli, Luigi and Marche, Claudio and Nitti, Michele},
  journal={IEEE Open Journal of the Computer Society},
  title={Reducing Data Volume in News Topic Classification: Deep Learning Framework and Dataset},
  year={2024},
  volume={},
  number={},
  pages={1-12},
  keywords={Feature extraction;Text categorization;Classification algorithms;Accuracy;Bidirectional long short term memory;Vectors;Computational modeling;Numerical models;Nearest neighbor methods;Radio frequency;Data volume;Deep learning;Natural language processing;Topic classification},
  doi={10.1109/OJCS.2024.3519747}
}

Instructions:

Data Format

The dataset is provided in CSV format, with one row per news article. Each row contains the following fields:

· TITLE: Title of the news article.

· TEXT: Content of the news article.

· TOPIC: Topic of the news article.
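The layout above can be read with Python's standard csv module. The following is a minimal sketch, not an official loader: it assumes a comma-delimited, UTF-8 file whose header row names the columns TITLE, TEXT, and TOPIC, none of which is confirmed here. The helper name read_articles and the sample row are made up for illustration.

```python
import csv
import io

# Column names taken from the field list above; their presence as a
# header row in the actual file is an assumption.
EXPECTED_FIELDS = ("TITLE", "TEXT", "TOPIC")


def read_articles(handle):
    """Yield one dict per article from an open CSV text stream.

    Hypothetical helper: raises ValueError if an expected column is absent.
    """
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [name for name in EXPECTED_FIELDS if name not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    for row in reader:
        # Keep only the three documented fields.
        yield {name: row[name] for name in EXPECTED_FIELDS}


# Made-up, in-memory stand-in for the real file (not actual dataset content).
SAMPLE = (
    "TITLE,TEXT,TOPIC\n"
    '"Example headline","Example body, with a comma.",Sports\n'
)

articles = list(read_articles(io.StringIO(SAMPLE)))
print(articles[0]["TOPIC"])  # prints: Sports
```

For a file on disk, the same function can be called on a handle opened with open(path, newline="", encoding="utf-8"), where path is wherever the downloaded CSV was saved. Passing newline="" lets the csv module handle line breaks inside quoted article text.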

List of Topics

This dataset contains a collection of news articles labelled with one of 10 topics. The topics, listed in alphabetical order, are: Arts & Culture, Business & Economy, Crime & Security, Entertainment & Celebrity, Health & Education, Politics, Science, Sports, Tech, and Weird News.
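For model training, the topic names are usually mapped to integer class indices. The sketch below builds that mapping from the alphabetical list above and rejects labels that are not in it. It assumes the TOPIC column uses exactly these spellings, which has not been checked against the file; the names TOPICS, TOPIC_TO_ID, and encode_topics are illustrative.

```python
from collections import Counter

# Topic names as listed above, in alphabetical order. Whether the CSV
# uses exactly these strings is an assumption.
TOPICS = [
    "Arts & Culture",
    "Business & Economy",
    "Crime & Security",
    "Entertainment & Celebrity",
    "Health & Education",
    "Politics",
    "Science",
    "Sports",
    "Tech",
    "Weird News",
]
TOPIC_TO_ID = {name: index for index, name in enumerate(TOPICS)}


def encode_topics(labels):
    """Map topic strings to class indices, failing loudly on unknown labels."""
    unknown = sorted(set(labels) - set(TOPIC_TO_ID))
    if unknown:
        raise ValueError(f"unexpected topic labels: {unknown}")
    return [TOPIC_TO_ID[label] for label in labels]


labels = ["Sports", "Tech", "Sports"]
print(encode_topics(labels))  # prints: [7, 8, 7]
print(Counter(labels).most_common(1))  # prints: [('Sports', 2)]
```

Running Counter over the full TOPIC column is a quick way to inspect the class balance before choosing a train/test split.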

Sources

The dataset was constructed using various sources. These sources are listed below, along with their names, countries, and the topics acquired from them.

Source	Country	Topics
Breitbart.com	USA	Politics, Sports, Business & Economy, Tech, Entertainment & Celebrity
Bristolpost.co.uk	UK	Entertainment & Celebrity, Health & Education, Crime & Security
Cnet.com	USA, UK, AUS	Politics, Tech
Csmonitor.com	USA	Science, Arts & Culture
Dailycaller.com	USA	Business & Economy, Entertainment & Celebrity, Health & Education, Sports, Politics
Mirror.co.uk	UK	Crime & Security, Weird News

Funding Agency

This work has been partially funded by the Ministero dell’Istruzione, dell’Università e della Ricerca (MIUR) with the PON “Ricerca e Innovazione” 2014-2020 (PON RI) “Azione IV.5 Dottorati su tematiche green”, assigned with D.M. 1062 on 10.08.2021.