HUFA PROJECT NUT ALLERGY CORPUS

Citation Author(s):
Ana González-Moreno, Allergy Unit, Hospital Universitario Fundación Alcorcón
Alberto Ramos-González, Computer Science and Engineering Department, Universidad Carlos III de Madrid
Israel Gonzalez-Carrasco, Computer Science and Engineering Department, Universidad Carlos III de Madrid
M. Dolores Alonso Diaz de Durana, Allergy Unit, Hospital Universitario Fundación Alcorcón
Beatriz Sellers Gutierrez Argumosa, Allergy Unit, Hospital Universitario Fundación Alcorcón
Alicia Moncada Salinero, Allergy Unit, Hospital Universitario Fundación Alcorcón
Ana B. Pastor-Magro, Allergy Unit, Hospital Universitario Fundación Alcorcón
Beatriz González-Piñeiro, Allergy Unit, Hospital Universitario Fundación Alcorcón
Miguel A. Tejedor-Alonso, Allergy Unit, Hospital Universitario Fundación Alcorcón
Paloma Martínez, Computer Science and Engineering Department, Universidad Carlos III de Madrid
Submitted by:
Israel Gonzalez...
Last updated:
Thu, 12/07/2023 - 04:36
DOI:
10.21227/15z7-gt07
License:
0

Abstract 

This is the first corpus of clinical notes on allergies in Spanish: a collection of 828 texts drawn from the clinical notes of 197 patients who visited the Allergy Unit and Emergency Department of Hospital Universitario Fundación Alcorcón.

The collection contains 70,272 words and 3,938 sentences in total, an average of 85 words and five sentences per note. The maximum per text is 533 words and 50 sentences. The notes contain medical terms that are difficult for non-medical readers to understand. The structure of each clinical note depends on the template used to collect the patient information. The template types are anamnesis, personal and family history, physical examination, medical evolution, diagnostic tests, summary of the situation, diagnosis, medical treatment, and recommendations.
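
As a quick check of the averages quoted above, the per-note figures follow directly from the reported totals. This is only arithmetic on the numbers stated in this description; the variable names are illustrative.

```python
# Totals as reported in this description.
n_notes = 828
n_words = 70_272
n_sentences = 3_938

# Per-note averages, before rounding.
avg_words = n_words / n_notes
avg_sentences = n_sentences / n_notes

print(f"{avg_words:.2f} words per note, rounds to {round(avg_words)}")
print(f"{avg_sentences:.2f} sentences per note, rounds to {round(avg_sentences)}")
```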

The texts are written in an informal clinical style, with typos, abbreviations, and incomplete sentences. Some clinical notes include the results of analyses or skin tests performed on the patient. The corpus also contains spelling errors, tokenization errors, and words that should not have been anonymised.

This corpus was built for research and educational purposes.

Instructions: 

This repository contains the corpus as a JSONL file, in which each line is a JSON object with the following structure:

{
  "id": "numeric identifier of the clinical note",
  "text": "text of the clinical note",
  "label": "List of lists, each giving the start and end character positions of an annotated mention and its entity type. For example: [[start, end, 'entity type'], [start, end, 'entity type']]"
}
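
A minimal sketch of how such a file might be read and how the annotated mentions could be recovered from the character offsets. It rests on several assumptions that this description does not confirm: that "label" is stored as a real JSON list (not a string), that the end offset is exclusive (as in Python slicing), and that the file is UTF-8 encoded. The function names, the sample record, and the entity type "ALLERGEN" are hypothetical and used only for illustration.

```python
import json


def read_notes(path):
    """Yield one dict per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


def entity_mentions(note):
    """Return (surface text, entity type) pairs for one note.

    Assumes each label is [start, end, entity_type] with an
    exclusive end offset; adjust the slice if the corpus uses
    inclusive ends.
    """
    text = note["text"]
    return [(text[start:end], etype) for start, end, etype in note["label"]]


# Made-up record, not taken from the corpus.
example = {"id": 1, "text": "Alergia a nuez.", "label": [[10, 14, "ALLERGEN"]]}
print(entity_mentions(example))
```

With the real file, the same helper could be applied to each record returned by `read_notes`, passing the path of the downloaded JSONL file. Printing a few extracted mentions next to their offsets is a simple way to verify whether the end offset is in fact exclusive.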
Funding Agency: 
MCIN
Grant Number: 
AEI/10.13039/501100011033/