Standard Dataset
SHIBR—The Swedish Historical Birth Records: a semi-annotated dataset
- Submitted by: Abbas Cheddad
- Last updated: Tue, 11/22/2022 - 08:03
- DOI: 10.21227/0dsh-8x30
Abstract
This paper presents a digital image dataset of historical handwritten birth records stored in the archives of several parishes
across Sweden, together with the corresponding metadata that supports the evaluation of document analysis algorithms’
performance. The dataset is called SHIBR (the Swedish Historical Birth Records). The contribution of this paper is
twofold. First, we believe it is the first and largest Swedish dataset of its kind provided as open access (15,000
high-resolution colour images covering the period 1800 to 1840). We also mine the dataset to uncover
statistics and facts that may be of interest and use to genealogists. Second, we provide a comprehensive survey of
contemporary datasets in the field that are open to the public along with a compact review of word spotting techniques. The
word transcription file contains 17 columns of information pertaining to each image (e.g., child’s first name, birth date, date
of baptism, father’s first/last name, mother’s first/last name, death records, town, job title of the father/mother, etc.).
Moreover, we evaluate some deep learning models, pre-trained on two other renowned datasets, for word spotting in
SHIBR. However, our dataset proved challenging due to the unique handwriting style. Therefore, the dataset could also be
used for competitions dedicated to a large set of document analysis problems, including word spotting.
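
As a rough illustration of how the per-image transcription columns might be mined for the kind of statistics mentioned above, the following minimal Python sketch counts how often each value occurs in one column. The file layout is an assumption: the column names (`image`, `child_first_name`, `birth_date`, `town`), the comma delimiter, and the sample rows are all invented for this example and are not taken from the SHIBR transcription file, and `count_values` is a hypothetical helper.

```python
import csv
import io
from collections import Counter

# Invented sample rows standing in for a few of the 17 transcription columns.
# Column names, delimiter, and values are assumptions, not the real SHIBR layout.
SAMPLE = """image,child_first_name,birth_date,town
0001.jpg,Anna,1801-03-12,TownA
0002.jpg,Erik,1801-04-02,TownA
0003.jpg,Anna,1802-01-20,TownB
"""


def count_values(csv_text, column):
    """Count how often each non-empty value occurs in one column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter()
    for row in reader:
        value = (row.get(column) or "").strip()
        if value:  # skip blank cells, e.g. records with no transcription
            counts[value] += 1
    return counts


name_counts = count_values(SAMPLE, "child_first_name")
town_counts = count_values(SAMPLE, "town")
print(name_counts.most_common())
print(town_counts.most_common())
```

To run this against the actual transcription file, the sample string would be replaced by the file's contents and the column names adjusted to match its real header.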