Global News 60K
- Submitted by: Luigi Serreli
- Last updated: Tue, 10/03/2023 - 12:14
- DOI: 10.21227/vek7-e690
Abstract
Text classification systems have become increasingly important in recent years due to the explosion of online documents and the need to sort them for specific services. One of the most critical issues in text classification is the limited availability and diversity of datasets, which can lead to overfitting and poor generalization. In this context, we present a new dataset named Global News 60K (GN60K), which consists of 60,000 news articles drawn from sources in different parts of the world and covering 10 topics. With its rich vocabulary, the dataset is intended to reduce overfitting and help produce models that generalize better.
The topics included in the dataset are Politics, Sports, Entertainment, Science and Technology, Business, Health, Environment, Education, Arts and Culture, and Crime. We selected these topics because they cover a wide range of interests and are commonly used in text classification applications. To further increase the dataset's diversity, we considered articles from different parts of the world, including North America, Europe, Asia, Africa, and South America.
The articles were selected based on their publication dates, which range from 2022 to 2023.
We believe that our dataset will be valuable for researchers and practitioners working on text/topic classification tasks. The GN60K dataset provides a diverse and well-labelled set of documents that can be used for training and testing various machine learning models. Additionally, the dataset can be used to develop new algorithms for topic classification and related tasks. We hope that our dataset will contribute to the advancement of the text classification field and foster new research ideas.
Data Format
The dataset is provided in CSV format, with one row per news article. Each row contains the following fields:
· TITLE: Title of the news article.
· TEXT: Content of the news article.
· TOPIC: Topic of the news article.
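The three-field layout above can be read with Python's standard csv module. The following is a minimal sketch, not the authors' loading code: the function name read_articles and the two sample rows are invented for illustration, and it assumes the file has a header row carrying the field names exactly as listed.

```python
import csv
import io

# The three column names come from the dataset description.
EXPECTED_FIELDS = ["TITLE", "TEXT", "TOPIC"]


def read_articles(handle):
    """Read articles from an open CSV file object and check its header.

    Assumes the first row is a header containing the expected field names.
    """
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [name for name in EXPECTED_FIELDS if name not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    # Keep only the documented fields, one dict per article.
    return [{name: row[name] for name in EXPECTED_FIELDS} for row in reader]


# Invented two-row sample standing in for the real file (hypothetical content).
SAMPLE_CSV = (
    "TITLE,TEXT,TOPIC\n"
    '"Local team wins","The match ended 2-1, after extra time.",Sports\n'
    '"New phone released","The device ships next month.",Tech\n'
)

articles = read_articles(io.StringIO(SAMPLE_CSV))
print(len(articles), articles[0]["TOPIC"])
```

For the real file, the same function can be given a handle from open(path, newline="", encoding="utf-8"), where path is wherever the CSV was saved and UTF-8 is an assumption about its encoding. If an article body is unusually long, csv.field_size_limit may need to be raised before reading.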
List of Topics
This dataset contains a collection of news articles labelled with one of 10 topics. The topics, listed in alphabetical order, are: Arts & Culture, Business & Economy, Crime & Security, Entertainment & Celebrity, Health & Education, Politics, Science, Sports, Tech, and Weird News.
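For classification work, the ten labels are usually mapped to integer ids and checked for unexpected values. The sketch below shows one way to do that. It assumes the TOPIC column uses exactly the label strings listed above (same spelling and ampersands), which should be verified against the actual file. The helper names encode_topics and topic_counts are illustrative only.

```python
from collections import Counter

# Label names as given in the list of topics, in alphabetical order.
TOPICS = [
    "Arts & Culture",
    "Business & Economy",
    "Crime & Security",
    "Entertainment & Celebrity",
    "Health & Education",
    "Politics",
    "Science",
    "Sports",
    "Tech",
    "Weird News",
]
LABEL_TO_ID = {label: index for index, label in enumerate(TOPICS)}


def encode_topics(labels):
    """Map topic strings to integer ids, rejecting labels outside the list."""
    unknown = sorted(set(labels) - set(LABEL_TO_ID))
    if unknown:
        raise ValueError(f"unexpected topic labels: {unknown}")
    return [LABEL_TO_ID[label] for label in labels]


def topic_counts(labels):
    """Count articles per topic, including topics with zero articles."""
    counts = Counter(labels)
    return {label: counts.get(label, 0) for label in TOPICS}


# Invented labels standing in for the TOPIC column (hypothetical content).
sample_labels = ["Tech", "Politics", "Tech"]
print(encode_topics(sample_labels))
print(topic_counts(sample_labels)["Tech"])
```

Reporting zero counts explicitly makes it easy to spot a topic that is missing from a particular split or subset.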
Sources
The dataset was constructed using various sources. These sources are listed below, along with their names, countries, and the topics acquired from them.
· Breitbart.com (USA): Politics, Sports, Business & Economy, Tech, Entertainment & Celebrity
· Bristolpost.co.uk (UK): Entertainment & Celebrity, Health & Education, Crime & Security
· Cnet.com (USA, UK, AUS): Politics, Tech
· Csmonitor.com (USA): Science, Arts & Culture
· Dailycaller.com (USA): Business & Economy, Entertainment & Celebrity, Health & Education, Sports, Politics
· Mirror.co.uk (UK): Crime & Security, Weird