Age dataset: A structured general-purpose dataset on life, work, and death of 1.22 million distinguished people

Citation Author(s): Issa Annamoradnejad, Rahimberdi Annamoradnejad
Submitted by: Issa Annamoradnejad
Last updated: Tue, 09/13/2022 - 10:14
DOI: 10.21227/h1hz-wy90
Data Format: CSV
Research Article Link: Age dataset: A structured general-purpose dataset on life, work, and death of 1…
Links: Kaggle dataset and related notebooks, GitHub repository

Keywords: Automatic gender detection, Gender Classification, famous people dataset

Abstract

Several fields of study can benefit from a large, structured, and accurate dataset of historical figures. Because no such dataset was available, in this paper we use machine learning and text mining models to collect, predict, and cleanse online data, with a focus on age and gender. We developed a five-step method and inferred birth and death years, binary gender, and occupation from data that the community submitted to all language versions of the Wikipedia project. The dataset is the largest on notable deceased people and includes individuals from a variety of social groups, including but not limited to 107k females, 124 non-binary people, and 90k researchers, spread across more than 300 contemporary or historical regions. The final product provides new insights into the demographics of mortality in relation to gender and profession throughout history. The technical method demonstrates that the latest text mining approaches can accurately clean historical data and reduce missing values.
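
As a rough illustration of the kind of analysis the abstract mentions (mortality in relation to gender and profession), the sketch below computes mean age at death per group from a CSV file. The column names (`name`, `gender`, `occupation`, `birth_year`, `death_year`) and the helper `mean_age_at_death` are assumptions made for this example and may not match the published file. The sample rows are made-up placeholders, not records from the dataset.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Made-up placeholder rows, NOT taken from the dataset.
# The column names are assumptions; check the real CSV header before use.
SAMPLE_CSV = """name,gender,occupation,birth_year,death_year
Person A,Female,Researcher,1850,1920
Person B,Male,Artist,1900,1960
Person C,Female,Artist,1800,
Person D,Male,Researcher,1700,1760
"""


def mean_age_at_death(csv_file, group_column):
    """Return {group: mean age at death}, skipping rows with a missing year."""
    ages = defaultdict(list)
    for row in csv.DictReader(csv_file):
        birth = row["birth_year"].strip()
        death = row["death_year"].strip()
        if not birth or not death:
            continue  # a missing birth or death year cannot yield an age
        ages[row[group_column]].append(int(death) - int(birth))
    return {group: mean(values) for group, values in ages.items()}


result = mean_age_at_death(io.StringIO(SAMPLE_CSV), "gender")
print(result)
```

To run this against the downloaded file, one would replace `io.StringIO(SAMPLE_CSV)` with an opened file handle and adjust the column names to the actual header. Passing `"occupation"` as `group_column` groups by profession instead of gender.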