AzerNewsV1: Azerbaijani News Classification Dataset

Citation Author(s):
Samir Rustamov, Fuad Hajiyev, Atabay Ziyaden, Amir Yelenov, Alexandr Pak
Submitted by:
Atabay Ziyaden
Last updated:
Fri, 09/15/2023 - 09:34
DOI:
10.21227/h36a-8w35

Abstract 

The dataset is a collection of Azerbaijani news texts drawn from articles published by the Azertac State Agency (https://azertag.az/).

Azertac, established on March 1, 1920, has been recognized as a pioneer among international information agencies. It has played a pivotal role in establishing and coordinating several associations, including the Association of National Information Agencies of the Commonwealth of Independent States countries, the Association of News Agencies of Turkish-speaking countries, and the Association of National News Agencies of the countries participating in the Black Sea Economic Cooperation Organization. Azertac has also collaborated with several renowned news agencies on global information exchange, which reflects its reach in international news dissemination.

The dataset comprises approximately three million rows, each representing one sentence extracted from diverse Azerbaijani news sources. The sentences cover a wide range of subjects, including politics, the economy, culture, sports, technology, and health. The labeled dataset, which is publicly shared via the link, is organized to support analysis and classification tasks, with metadata provided for each sentence.

Each sentence is accompanied by the following metadata attributes:

  • News Category: Each sentence is linked to a specific news category, such as politics, economy, culture, sports, technology, or health.
  • News Subcategory: To further increase granularity, each sentence is also assigned a subcategory, enabling finer-grained analysis and specialized classification tasks.
  • News Index: A unique identifier for each news article, which maintains dataset integrity and supports cross-referencing.
  • News Sentence Order: The position of the sentence within its article, which preserves sentence context and is essential for text generation and summarization.
  • Link: A hyperlink to the original article, giving researchers direct access to the full context of the sentence.
  • Sentence: The core textual content, which varies in length and complexity and covers a range of linguistic styles and themes.
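
Because each row holds a single sentence, a full article can be recovered by grouping rows on the article identifier and sorting by sentence order. The following is a minimal sketch of that idea, not the authors' tooling. The column names ("News Index", "News Sentence Order", "Sentence") are assumed to match the attribute names listed above and may differ from the actual CSV header; the sample rows are placeholders, not data from the dataset.

```python
from collections import defaultdict

# Assumed column names; check them against the actual CSV header.
INDEX_COL = "News Index"
ORDER_COL = "News Sentence Order"
TEXT_COL = "Sentence"


def rebuild_articles(rows):
    """Group sentence rows by article and join them in sentence order."""
    grouped = defaultdict(list)
    for row in rows:
        # Store (order, text) pairs so each article can be sorted afterwards.
        grouped[row[INDEX_COL]].append((int(row[ORDER_COL]), row[TEXT_COL]))
    return {
        index: " ".join(text for _, text in sorted(pairs, key=lambda p: p[0]))
        for index, pairs in grouped.items()
    }


# Placeholder rows for illustration only; not taken from the dataset.
sample_rows = [
    {"News Index": "1", "News Sentence Order": "2", "Sentence": "Second sentence."},
    {"News Index": "2", "News Sentence Order": "1", "Sentence": "Only sentence."},
    {"News Index": "1", "News Sentence Order": "1", "Sentence": "First sentence."},
]

articles = rebuild_articles(sample_rows)
print(articles["1"])  # First sentence. Second sentence.
```

Grouping on the article identifier is also a reasonable basis for train/test splits, since it keeps all sentences of one article on the same side of the split.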

 

 

Instructions: 

The dataset is provided as a single CSV file.
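
With roughly three million rows, it can be convenient to stream the file row by row rather than load it all at once. The sketch below tallies sentences per category with Python's standard `csv` module. It is only an illustration: the header name "News Category" is assumed from the attribute names described above, the file name in the comment is hypothetical, and the in-memory sample uses placeholder values rather than real records.

```python
import csv
import io
from collections import Counter

CATEGORY_COL = "News Category"  # assumed header name; verify against the file


def count_categories(csv_file):
    """Count rows per news category, reading one row at a time."""
    reader = csv.DictReader(csv_file)
    return Counter(row[CATEGORY_COL] for row in reader)


# Placeholder CSV content for illustration only; not taken from the dataset.
sample_csv = io.StringIO(
    "News Category,News Subcategory,News Index,News Sentence Order,Link,Sentence\n"
    "politics,placeholder,1,1,https://example.org/1,Placeholder sentence one.\n"
    "politics,placeholder,1,2,https://example.org/1,Placeholder sentence two.\n"
    "sports,placeholder,2,1,https://example.org/2,Placeholder sentence three.\n"
)

counts = count_categories(sample_csv)
print(counts.most_common())

# For the real file (hypothetical name, UTF-8 encoding assumed):
# with open("azernews.csv", encoding="utf-8", newline="") as f:
#     counts = count_categories(f)
```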
