Readability-Aware Summarization Dataset for Turkish

Citation Author(s): Mehmet Samet Duran, Tevfik Aytekin
Submitted by: Tevfik Aytekin
Last updated: Wed, 03/26/2025 - 10:08
DOI: 10.21227/anmp-va09

Categories: Artificial Intelligence

Keywords: large language models, summarization

Abstract

This dataset was constructed for a study that addresses the gap between text summarization and content readability for diverse Turkish-speaking audiences. It contains original texts paired with summaries optimized for different readability levels, as measured by the YOD (Yeni Okunabilirlik Düzeyi) formula.

YOD Readability Metric: The Bezirci-Yılmaz readability formula defines YOD, a readability metric designed specifically for Turkish texts. It computes the score from the average number of polysyllabic words (three or more syllables) per sentence: these words are weighted and combined with the average sentence length to give an assessment of text complexity. For more information, see the related paper: https://arxiv.org/abs/2503.10675
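The description above can be made concrete with a minimal sketch. It assumes the form in which the Bezirci-Yılmaz formula is commonly cited, YOD = sqrt(OKS × (0.84·H3 + 1.5·H4 + 3.5·H5 + 26.25·H6)), where OKS is the average number of words per sentence and H3 to H6 are the average numbers of words per sentence with 3, 4, 5, and 6 or more syllables. The weights are not stated on this page and should be checked against the paper. Syllables are counted as vowels, which works because each Turkish syllable contains exactly one vowel; the sentence and word splitting is deliberately simple and the function names are illustrative.

```python
import math
import re

# Turkish vowels, including circumflexed forms; one vowel per syllable.
TURKISH_VOWELS = set("aeıioöuüâîû")

# Weights as commonly cited for the Bezirci-Yılmaz formula (assumption:
# verify against the paper). Key 6 stands for "six or more syllables".
WEIGHTS = {3: 0.84, 4: 1.5, 5: 3.5, 6: 26.25}

# A word is a run of letters, optionally joined by apostrophes (Ankara'ya).
WORD_RE = re.compile(r"[^\W\d_]+(?:['’][^\W\d_]+)*")


def count_syllables(word: str) -> int:
    """Count syllables by counting vowels."""
    # str.lower() is not Turkish-aware, but "I" and "İ" still lower to
    # something containing "i", so the vowel count is unaffected.
    return sum(1 for ch in word.lower() if ch in TURKISH_VOWELS)


def yod_score(text: str) -> float:
    """Return the YOD score of a text (0.0 for empty input)."""
    sentences = [s for s in re.split(r"[.!?…]+", text) if s.strip()]
    if not sentences:
        return 0.0
    n = len(sentences)
    total_words = 0
    buckets = {3: 0, 4: 0, 5: 0, 6: 0}
    for sentence in sentences:
        words = WORD_RE.findall(sentence)
        total_words += len(words)
        for word in words:
            k = count_syllables(word)
            if k >= 3:
                buckets[min(k, 6)] += 1
    oks = total_words / n  # average words per sentence
    weighted = sum(WEIGHTS[k] * buckets[k] / n for k in buckets)
    return math.sqrt(oks * weighted)
```

For example, "Ali okula gitti." has one sentence of three words, one of which ("okula") has three syllables, so the sketch computes sqrt(3 × 0.84).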

Dataset Creation Logic: To create the dataset, the VBART-Large-Paraphrasing model was used to augment the existing datasets by generating paraphrased variations at both the sentence and full-text levels. This made it possible to derive content with a wider range of YOD values, both higher and lower, from the same source material. To maintain semantic integrity, each paraphrase was compared to the original summary using BERTScore, verifying that the synthetic data achieved the intended readability adjustment while remaining faithful to the source. ChatGPT's API was also used for synthetic data generation, enriching the dataset with diverse, high-quality rewritten summaries.
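The semantic check described above might look roughly like the following sketch. The similarity scorer is passed in as a function so the filtering logic can be read and tested on its own. The 0.85 threshold and the function names are hypothetical, since this page does not state the cutoff the authors used. `bertscore_f1` shows one way to plug in the third-party `bert_score` package, which is an assumption about tooling and not a description of the authors' pipeline.

```python
from typing import Callable, List, Sequence, Tuple

Scorer = Callable[[Sequence[str], Sequence[str]], List[float]]


def filter_paraphrases(
    originals: Sequence[str],
    paraphrases: Sequence[str],
    scorer: Scorer,
    min_f1: float = 0.85,  # hypothetical threshold, not from the source
) -> List[Tuple[str, str, float]]:
    """Keep (original, paraphrase, score) triples whose score >= min_f1."""
    if len(originals) != len(paraphrases):
        raise ValueError("originals and paraphrases must align one-to-one")
    scores = scorer(paraphrases, originals)
    return [
        (orig, para, s)
        for orig, para, s in zip(originals, paraphrases, scores)
        if s >= min_f1
    ]


def bertscore_f1(candidates: Sequence[str], references: Sequence[str]) -> List[float]:
    """Scorer backed by the bert_score package (downloads a model on first use)."""
    from bert_score import score  # third-party; imported lazily

    _, _, f1 = score(list(candidates), list(references), lang="tr")
    return f1.tolist()
```

Calling `filter_paraphrases(originals, paraphrases, bertscore_f1)` would then keep only the paraphrases that stay close to their source summaries under the chosen threshold.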

Dataset Creation: The dataset is compiled from multiple sources: XLSUM (970 entries), TRNews (5,000 entries), MLSUM (1,033 entries), LR-SUM (1,107 entries), and Wikipedia-trsummarization (3,024 entries). Sampling proceeds hierarchically, from the longest texts that fit into the tokenizer (vngrs-ai/VBART-Large-Paraphrasing) without truncation down to the shortest. After synthetic data generation, the dataset expands to 76,759 summaries. To support a thorough evaluation, 200 samples per YOD level are allocated to each of the test and validation sets, resulting in a total of 3,200 examples for both the test and validation sets.
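The per-level allocation of test and validation samples can be sketched as a stratified split. This is an illustration under assumptions, not the authors' script: each record is taken to be a dict carrying an integer level label under the hypothetical key `yod_level`, and the shuffling seed is arbitrary.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Record = Dict[str, object]


def split_by_level(
    records: List[Record],
    per_level: int = 200,
    seed: int = 0,
    level_key: str = "yod_level",  # hypothetical field name
) -> Tuple[List[Record], List[Record], List[Record]]:
    """Return (train, validation, test) with per_level items per level in each held-out set."""
    rng = random.Random(seed)
    by_level: Dict[object, List[Record]] = defaultdict(list)
    for rec in records:
        by_level[rec[level_key]].append(rec)

    train: List[Record] = []
    val: List[Record] = []
    test: List[Record] = []
    for level in sorted(by_level):
        items = by_level[level][:]  # copy so the input order is untouched
        if len(items) < 2 * per_level:
            raise ValueError(f"level {level} has too few samples")
        rng.shuffle(items)
        test.extend(items[:per_level])
        val.extend(items[per_level:2 * per_level])
        train.extend(items[2 * per_level:])
    return train, val, test
```

With `per_level=200`, every level contributes 200 records to the test set and 200 to the validation set, and the remainder goes to training.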