Standard Dataset
B-NER
- Submitted by: Md. Zahidul Haque
- Last updated: Fri, 02/24/2023 - 14:56
- DOI: 10.21227/1vw8-ap69
Abstract
Within the Natural Language Processing (NLP) framework, Named Entity Recognition (NER) is regarded as the basis for extracting key information to understand texts in any language. As Bangla is a highly inflectional, morphologically rich, and resource-scarce language, building a balanced NER corpus with large and diverse entities is a demanding task. Moreover, previously developed Bangla NER systems are limited to recognizing only three familiar entities: person, location, and organization. To address this significant limitation, we introduce a novel Bangla NER dataset, B-NER, which was created using 22,144 manually annotated Bangla sentences collected from Bangla newspapers and Bangla Wikipedia. The dataset includes a total of 9,895 unique words, which were manually categorized into eight entity types: person, organization, event, artifact, time indicator, natural phenomenon, geopolitical entity, and geographical location. Inter-annotator agreement experiments were conducted to validate the quality of the annotations performed by three annotators, resulting in a Kappa score of 0.82. In this paper, we provide an outline of the annotation guideline illustrated with examples, discuss the properties of the B-NER dataset, and present benchmark evaluations of the dataset. To establish that B-NER is more comprehensive and balanced than other publicly accessible datasets, we conducted cross-dataset modeling and validation, that is, we trained an NER model on one dataset and tested it on another, and found that the model trained on B-NER performed best in that setting. Furthermore, we performed exhaustive benchmark evaluations based on Bidirectional LSTM with fastText embeddings and sentence transformer models. Among these models, fine-tuned IndicBERT achieved notable results, with a Macro-F1 of 86%. This dataset and the baseline results will be publicly available under a CC-BY 4.0 license in the CoNLL-2002 format to facilitate further research on Bangla NER.
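For readers unfamiliar with the Macro-F1 metric reported above, the following is a minimal sketch of how it can be computed: the F1 score is calculated separately for each label and the per-label scores are averaged with equal weight. The function name `macro_f1` and the toy labels are illustrative assumptions. The sketch scores individual tokens, whereas NER benchmarks are often scored at the entity level, and the abstract does not say which variant the authors used.

```python
from typing import List


def macro_f1(gold: List[str], pred: List[str]) -> float:
    """Unweighted mean of per-label F1 scores (token-level, illustrative only)."""
    labels = sorted(set(gold) | set(pred))
    f1_scores = []
    for label in labels:
        # Count true positives, false positives and false negatives for this label.
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy data with hypothetical label names; not taken from B-NER.
gold = ["PER", "O", "LOC", "O"]
pred = ["PER", "O", "O", "O"]
score = macro_f1(gold, pred)
print(f"Macro-F1: {score:.2f}")
```

Because every label contributes equally, a rare entity type that the model misses entirely lowers the score as much as a frequent one would, which is why the metric suits a dataset that aims to be balanced across entity types.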
The B-NER annotation process employs the BIO tagging technique. A word that is not a proper noun receives the "O" (outside) tag. Otherwise, the first word of the named entity is tagged "B" (beginning), and when the entity consists of two or more words or tokens, the remaining words are tagged "I" (inside).
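The BIO scheme described above can be sketched as follows. This is a minimal illustration, not the authors' annotation tooling: the helper `bio_tags`, the span representation, the label strings `PER` and `GPE`, and the English placeholder tokens are all assumptions made for the example.

```python
from typing import List, Tuple


def bio_tags(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Assign BIO tags from entity spans given as (start, end_exclusive, type)."""
    tags = ["O"] * len(tokens)          # default: outside any named entity
    for start, end, entity_type in spans:
        tags[start] = f"B-{entity_type}"        # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{entity_type}"        # remaining tokens of the entity
    return tags


# Placeholder sentence; the label names are hypothetical, not B-NER's own.
tokens = ["Kazi", "Nazrul", "Islam", "visited", "Dhaka", "yesterday"]
spans = [(0, 3, "PER"), (4, 5, "GPE")]
tags = bio_tags(tokens, spans)

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

A single-word entity such as "Dhaka" receives only a "B" tag, while a three-word name receives one "B" tag followed by two "I" tags.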
Dataset Files
- b-ner.csv (6.95 MB)
- all datasets data.rar (19.24 MB)
- git repo B-NER-main.zip (114.64 kB)
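The abstract states that the data is released in the CoNLL-2002 format. As a hedged sketch of how such data is typically read, the snippet below parses text with one token per line, the tag in the last whitespace-separated column, and a blank line between sentences. Whether `b-ner.csv` follows exactly this layout is not stated on this page, so the function `parse_conll`, the sample text, and its labels are assumptions for illustration only.

```python
from typing import List, Tuple

Sentence = List[Tuple[str, str]]


def parse_conll(text: str) -> List[Sentence]:
    """Parse CoNLL-style text: 'token ... tag' per line, blank line between sentences."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split()
        # First column is the token, last column is the NER tag.
        current.append((parts[0], parts[-1]))
    if current:
        sentences.append(current)
    return sentences


# Hypothetical sample; tokens and labels are placeholders, not B-NER data.
sample = "Rahim\tB-PER\nlives\tO\nin\tO\nDhaka\tB-GPE\n\nRain\tB-NAT\nfell\tO\n"
parsed = parse_conll(sample)
print(len(parsed), "sentences")
```

Inspecting the first few lines of the downloaded file before parsing is advisable, since a CSV export may use a different delimiter or carry a header row.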