SHIBR—The Swedish Historical Birth Records: a semi-annotated dataset
- Submitted by: Abbas Cheddad
- Last updated: Tue, 11/22/2022 - 08:03
- DOI: 10.21227/0dsh-8x30
Abstract
This paper presents a digital image dataset of historical handwritten birth records stored in the archives of several parishes
across Sweden, together with the corresponding metadata that supports the evaluation of document analysis algorithms’
performance. The dataset is called SHIBR (the Swedish Historical Birth Records). The contribution of this paper is
twofold. First, we believe it is the first and the largest Swedish dataset of its kind provided as open access (15,000
high-resolution colour images of the era between 1800 and 1840). We also perform some data mining of the dataset to uncover
some statistics and facts that might be of interest and use to genealogists. Second, we provide a comprehensive survey of
contemporary datasets in the field that are open to the public along with a compact review of word spotting techniques. The
word transcription file contains 17 columns of information pertaining to each image (e.g., child’s first name, birth date, date
of baptism, father’s first/last name, mother’s first/last name, death records, town, job title of the father/mother, etc.).
Moreover, we evaluate some deep learning models, pre-trained on two other renowned datasets, for word spotting in
SHIBR. However, our dataset proved challenging due to the unique handwriting style. Therefore, the dataset could also be
used for competitions dedicated to a large set of document analysis problems, including word spotting.
I. Description of the Data Set
This dataset is taken from the Arkiv Digital AD AB image and index database. When a child was born, the priest registered him or her in a church record book called Birth and Christening records. The priest recorded the name of the child, when the child was born and baptized, where the child was living, and information about the child's father and mother. The index is based on manual annotation of images from several books covering the years 1800 to 1840.
The dataset consists of 191,301 index rows and 15,000 images and has been divided into:
train: 133,941 index rows and 10,500 images
eval: 28,303 index rows and 2,250 images
test: 29,057 index rows and 2,250 images
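As a quick sanity check on the split above, the following minimal Python sketch confirms that the three subsets add up to the stated totals and computes each subset's share. The counts are copied from this description; the names `SPLITS`, `TOTAL`, and `split_shares` are illustrative and not part of the dataset.

```python
# Counts copied from the split description; the variable names are hypothetical.
SPLITS = {
    "train": {"index_rows": 133_941, "images": 10_500},
    "eval": {"index_rows": 28_303, "images": 2_250},
    "test": {"index_rows": 29_057, "images": 2_250},
}
TOTAL = {"index_rows": 191_301, "images": 15_000}


def split_shares(splits, key):
    """Return each split's fraction of the summed counts for the given key."""
    total = sum(counts[key] for counts in splits.values())
    return {name: counts[key] / total for name, counts in splits.items()}


# The per-split counts should add up to the stated dataset totals.
for key in ("index_rows", "images"):
    assert sum(counts[key] for counts in SPLITS.values()) == TOTAL[key]

image_shares = split_shares(SPLITS, "images")
row_shares = split_shares(SPLITS, "index_rows")

for name in SPLITS:
    print(f"{name}: {image_shares[name]:.1%} of images, {row_shares[name]:.1%} of index rows")
```

The image shares come out as a 70/15/15 split. The index-row shares differ slightly from that because the number of index rows varies from image to image.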
Swedish county (län)
--------------------
Gävleborgs län - 23 982 index rows
Gotlands län - 9 925 index rows
Norrbottens län - 12 198 index rows
Västerbottens län - 16 118 index rows
Västernorrlands län - 21 014 index rows
Västmanlands län - 21 141 index rows
Älvsborgs län - 52 988 index rows
Örebro län - 33 935 index rows
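The county counts use a space as the thousands separator. The sketch below shows one possible way to parse the listing and confirm that the counties together account for all 191,301 index rows. The regular expression and the helper name `parse_county_line` are assumptions made for illustration.

```python
import re

# County listing copied from this description (space as thousands separator).
COUNTY_LISTING = """\
Gävleborgs län - 23 982 index rows
Gotlands län - 9 925 index rows
Norrbottens län - 12 198 index rows
Västerbottens län - 16 118 index rows
Västernorrlands län - 21 014 index rows
Västmanlands län - 21 141 index rows
Älvsborgs län - 52 988 index rows
Örebro län - 33 935 index rows
"""

# Hypothetical pattern: "<county> - <count with spaces> index rows".
LINE_PATTERN = re.compile(r"^(?P<county>.+?) - (?P<count>\d[\d ]*) index rows$")


def parse_county_line(line):
    """Return (county, row_count) for one listing line, or None if it does not match."""
    match = LINE_PATTERN.match(line.strip())
    if match is None:
        return None
    return match.group("county"), int(match.group("count").replace(" ", ""))


county_rows = dict(
    parsed
    for parsed in map(parse_county_line, COUNTY_LISTING.splitlines())
    if parsed is not None
)

total_rows = sum(county_rows.values())
largest_county = max(county_rows, key=county_rows.get)
print(total_rows, largest_county)
```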
Description of the index columns
--------------------------------
id - Arkiv Digital AD AB ID in database
index_aid - Index AID (Arkiv Digital AD AB external ID)
county - County where the child was born or registered (usually not in the image)
parish - Parish where the child was born or registered (can be written at the top of the page or entirely missing from the image)
child_first_name - Given name of the child
birth_date - Date of birth, format YYYYMMDD (on the image it is usually written DD/MM with the year at the top of the page)
baptism_date - Date of baptism, format YYYYMMDD (on the image it is usually written DD/MM with the year at the top of the page)
birth_place - Place of birth
father_title - Title or occupation of the father
father_first_name - Given name of the father
father_last_name - Surname of the father
father_age - Age of the father when the child was born (available only in the master dataset SHIBRm)
mother_title - Title or occupation of the mother
mother_first_name - Given name of the mother
mother_last_name - Surname of the mother
mother_age - Age of the mother when the child was born
image_aid - Image AID (Arkiv Digital AD AB external ID)
image_path - Relative path to the image (images/<image_path>)
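To illustrate how the columns above might be consumed, here is a minimal Python sketch that parses the YYYYMMDD date fields and builds the relative image location (images/<image_path>). It rests on several assumptions that this description does not confirm: that the index is a comma-separated file with a header row using the column names above, and that missing or incomplete dates should be treated as absent. The sample row is invented purely for illustration and is not taken from the dataset, and the helper names `parse_yyyymmdd` and `image_location` are hypothetical.

```python
import csv
import io
from datetime import date, datetime
from pathlib import PurePosixPath
from typing import Optional


def parse_yyyymmdd(value: str) -> Optional[date]:
    """Parse a YYYYMMDD string; return None if it is empty or not a valid date."""
    value = value.strip()
    if len(value) != 8 or not value.isdigit():
        return None
    try:
        return datetime.strptime(value, "%Y%m%d").date()
    except ValueError:
        return None


def image_location(image_path: str) -> PurePosixPath:
    """Build the relative location images/<image_path> described for image_path."""
    return PurePosixPath("images") / image_path


# Invented sample with a subset of the columns; the comma delimiter is an assumption.
SAMPLE = (
    "id,child_first_name,birth_date,baptism_date,image_path\n"
    "1,Anna,18050312,18050314,example/page_0001.jpg\n"
)

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
record = rows[0]

born = parse_yyyymmdd(record["birth_date"])
baptised = parse_yyyymmdd(record["baptism_date"])
days_to_baptism = (baptised - born).days
location = image_location(record["image_path"])

print(record["child_first_name"], born, days_to_baptism, location)
```

Returning `None` for unparseable dates, rather than raising, keeps a single malformed entry from aborting a pass over the whole index; whether such entries actually occur in SHIBR is not stated here.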
II. Use of the Materials
The users of the SHIBR Data Set must agree that:
- The use of the data set is restricted to research purposes only
- No redistribution of the dataset is allowed
- In any resulting publications of research that uses the dataset, due credit will be given to:
Abbas Cheddad, Hüseyin Kusetogullari, Agrin Hilmkil, Lena Sundin, Amir Yavariabdi, Mustapha Aouache, Johan Hall; "SHIBR-The Swedish Historical Birth Records: A Semi-Annotated Dataset," Neural Computing & Applications, 33:15863–15875, Springer, 2021.