Nasal Mucosa Cell Dataset (NMCD)

Citation Author(s):
Mauro Giuseppe Camporeale, Università degli Studi di Bari Aldo Moro
Giovanni Dimauro, Università degli Studi di Bari Aldo Moro
Matteo Gelardi, Policlinico Universitario di Foggia
Giorgia Iacobellis, Università di Torino
Mattia Sebastiano Ladisa, Università di Torino
Sergio Latrofa, Università di Pisa
Nunzia Lomonte, Università degli Studi di Bari Aldo Moro
Submitted by: Giovanni Dimauro
Last updated: Fri, 05/10/2024 - 09:28
DOI: 10.21227/0erx-zn98

Abstract 

Nasal Cytology, or Rhinology, is the subfield of otolaryngology focused on the microscopic observation of nasal mucosa samples, with the aim of recognizing cells of different types in order to spot and diagnose ongoing pathologies. The method achieves good accuracy in diagnosing rhinitis and infections, and it is cheap and accessible, requiring no instrument more complex than a microscope, even an optical one. Mucosa samples are taken non-invasively with a simple swab, smeared onto a glass slide (fixation) and stained (for the NMCD dataset, with May-Grunwald-Giemsa, MGG) before being observed under the microscope.

The NMCD dataset is the result of intense collaboration between otolaryngologists and computer scientists who, convinced of the contribution that artificial intelligence can make to this branch of medicine, decided to make this material available to the scientific community so that researchers can test and compare their methods in this new application field.

 

The dataset identifies 10 different entities, each distinguishable by specific characteristics:

  • Epithelial Cells: main components of the nasal mucosa, constituting 80% of the observed cytotypes in healthy patients. Their presence is not associated with ongoing pathologies.
  • Ciliated cells: belonging to the epithelial cell family, these cells are characterized by their "tail-like" shape.
  • Metaplastic cells: also belonging to the epithelial cell family, metaplastic cells are characterized by their round shape. Their presence is usually associated with an ongoing inflammatory reaction.
  • Muciparous: caliciform mucus-secreting cells characterized by a bilobed shape with a chromatin-reinforced membrane. An increase in muciparous cells leads to increased mucus production, a symptom of chronic nasal pathologies such as dust mite allergy.
  • Neutrophils: granulocytes with several nucleoli and a round shape. Their main function is the phagocytosis of germs. An increase in their number should always be monitored as an indicator of an immune response.
  • Eosinophils: polynuclear granulocytes, slightly larger than neutrophils. MGG staining tends to highlight the eosinophilic granules within them in an orange color. Allergic diseases are associated with an increase in their population.
  • Lymphocytes: white blood cells responsible for the immune response. Their large nucleus is surrounded by a thin "light blue" cytoplasmic rim.
  • Mast-cells: large oval cells whose nuclei are covered with intensely colored basophil granules. Their presence in the nasal mucosa is caused by ongoing allergies.
  • Ematia (Erythrocyte): red blood cells whose occurrence in rhinological specimens may be due to pathologies or previous wounds inside the nose, or even to small blood losses during the smear process.
  • Artifacts: this class covers all objects whose morphology resembles that of a cell but which are not cells. Examples of artifacts are pollen fragments or dirt spots on the slide.

Data were sampled from 14 rhinological slides collected at the Rhinology Clinic of the Otolaryngology Department of the University of Bari. Samples were collected by direct smear and stained with MGG. A ProWay XSZPW208T optical microscope at 1000x magnification, equipped with a 3MP DCE-PW300 camera, was used to acquire 50 images (microscope fields) from each slide; this number was chosen because it is the one prescribed by the rhinocytology protocol.

This yielded 700 images of size 1024×768. The image annotations were created by experts on the Roboflow platform, who analyzed each image individually, annotating and labeling each cell. During this phase a dropping policy was followed, discarding images that showed any of the following:

  1. sampling noise (e.g. dirt on the slide or blurred photos)
  2. duplication of large cytoplasmic areas already present in other images
  3. clusters of cells too dense and confused to interpret, which are typically discarded by nasal cytologists.

A total of 200 cytological fields were pruned, leaving 500 images. A Bounding Box (BB) was manually drawn around each cell in the images, and a label was attached to it to specify the class the cell belongs to. Since cells are generally round, the smallest rectangular area that encloses each cell was marked as its bounding box.

BBs may therefore overlap in an image, owing to the proximity of the cells and the rectangular shape of the boxes. Labeling produced more than 10,000 BBs corresponding to cells. Thanks to Roboflow, the annotations were made available in the standard annotation formats required by computer vision algorithms, such as Pascal VOC, COCO, TensorFlow and YOLO. The 500 microscopic field images were divided into training, validation and test sets (80%-10%-10%) using the stratified holdout strategy, to maintain the same class distribution across the three sets.
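As a minimal sketch of how overlapping BBs can be quantified, the snippet below computes the intersection area and the Intersection over Union (IoU) of two axis-aligned boxes given as (xmin, ymin, xmax, ymax). The two boxes are hypothetical values chosen for illustration and are not taken from the dataset; IoU is a common overlap measure in object detection, not something the dataset itself prescribes.

```python
from typing import Tuple

# A box is (xmin, ymin, xmax, ymax) in pixels.
Box = Tuple[float, float, float, float]


def intersection_area(a: Box, b: Box) -> float:
    """Area shared by two axis-aligned boxes (0 if they do not overlap)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0.0) * max(height, 0.0)


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two boxes."""
    inter = intersection_area(a, b)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Hypothetical boxes for two neighbouring cells whose corners overlap.
cell_1 = (100.0, 100.0, 160.0, 160.0)
cell_2 = (140.0, 140.0, 200.0, 200.0)

print(intersection_area(cell_1, cell_2))  # 400.0 (a 20x20 pixel overlap)
print(round(iou(cell_1, cell_2), 4))      # 0.0588
```

A low IoU such as this one is typical of two distinct round cells that merely touch; it comes from the rectangular corners of the boxes rather than from the cells themselves overlapping.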

 

Instructions: 

The NMCD dataset contains 500 images of cytological fields, each of size 1024×768. In each image a Bounding Box (BB) was manually drawn around every cell, and each BB carries a label that specifies the class the cell belongs to.

The criterion for drawing a BB was to identify the smallest rectangular area that encloses the cell. As a result, BBs may overlap in images where cells lie very close to one another. The BBs were created by experts on the Roboflow platform, who analyzed each image individually, annotating and labeling each cell.

Data were sampled from 14 rhinological slides collected at the Rhinology Clinic of the Otolaryngology Department of the University of Bari.

Samples were collected by direct smear and stained with MGG.

A ProWay XSZPW208T optical microscope at 1000x magnification, equipped with a 3MP DCE-PW300 camera, was used to acquire 50 images (microscope fields) from each of the 14 slides; this number was chosen because it is the one prescribed by the rhinocytology protocol.

A total of 700 images of size 1024×768 were obtained, of which 200 were discarded, giving the final number of 500 images. Images were discarded for one or more of the following reasons:

  1. sampling noise (e.g. dirt on the slide or blurred photos)
  2. duplication of large cytoplasmic areas already present in other images
  3. clusters of cells too dense and confused to interpret, which are typically discarded by nasal cytologists.

The dataset was split into training, validation and test sets using the stratified hold-out technique: 10% of the images were held out for the validation set and another 10% for the test set, leaving the remaining 80% for the training set. The split was made so as to preserve, as far as possible, the original distribution of the classes in each of the 3 sets.

 

Across the 500 images (400/50/50 for training/validation/test), the dataset contains 10847 annotations (BBs), split 8708/1082/1057 and partitioned into 10 classes as follows:

  • Epithelial Cells: 5058 (4061/503/494)
  • Ciliated cells: 115 (93/11/11)
  • Metaplastic cells: 228 (184/23/21)
  • Muciparous: 504 (403/51/50)
  • Neutrophils: 3212 (2581/323/308)
  • Eosinophils: 528 (420/54/54)
  • Lymphocytes: 117 (94/11/12)
  • Mast-cells: 19 (15/2/2)
  • Ematia: 48 (37/6/5)
  • Artifacts: 1018 (820/98/100)
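The per-class figures above can be cross-checked with a few lines of Python. The sketch below copies the (train, validation, test) counts from the list, sums them per split, and prints each class's share of every split, which gives a quick view of how closely the stratified split preserves the class distribution. The variable names are illustrative only.

```python
# Per-class annotation counts copied from the list above: (train, valid, test).
counts = {
    "Epithelial Cells": (4061, 503, 494),
    "Ciliated cells": (93, 11, 11),
    "Metaplastic cells": (184, 23, 21),
    "Muciparous": (403, 51, 50),
    "Neutrophils": (2581, 323, 308),
    "Eosinophils": (420, 54, 54),
    "Lymphocytes": (94, 11, 12),
    "Mast-cells": (15, 2, 2),
    "Ematia": (37, 6, 5),
    "Artifacts": (820, 98, 100),
}

splits = ("train", "valid", "test")

# Total number of BBs in each split, and overall.
split_totals = [sum(per_class[i] for per_class in counts.values()) for i in range(3)]
grand_total = sum(split_totals)
print(dict(zip(splits, split_totals)), grand_total)

# Share of each class within each split, as a percentage.
for name, per_split in counts.items():
    shares = [100 * n / total for n, total in zip(per_split, split_totals)]
    print(f"{name:18s}" + "  ".join(f"{share:5.2f}%" for share in shares))
```

The split totals computed this way match the 8708/1082/1057 figures stated above, and the shares make the strong class imbalance visible: Epithelial Cells and Neutrophils together account for most annotations, while Mast-cells are very rare.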

The dataset is provided in 4 widely used annotation formats: COCO, YOLOv8, TensorFlow and Pascal VOC. A zip archive is provided for each of them; the directory structure and annotation format of each archive are listed below.

  • NMCD.tensorflow.zip
    • test
      • annotations.csv (list of filename,width,height,class,xmin,ymin,xmax,ymax for each BB in each image of the folder)
      • img0001.jpg
      • ...
      • img9999.jpg
    • train
    • valid

 

  • NMCD.yolov8.zip
    • test
      • images
        • img0001.jpg 
        • ...
        • img9999.jpg
      • labels
        • img0001.txt (list of label x_center y_center width height for each BB in the image with the same filename)
        • ...
        • img9999.txt
    • train
    • valid
    • data.yaml

 

  • NMCD.voc.zip
    • test
      • img0001.jpg
      • img0001.xml (the xml file has many tags, including a list of the BBs in the image with the same filename; for each BB the label, xmin, ymin, xmax and ymax are specified)
      • ...
      • img9999.jpg
      • img9999.xml
    • train
    • valid

 

  • NMCD.coco.zip
    • test
      • _annotations.coco.json (json tags specifying various information, including a list of label, xmin, ymin, width and height for each BB in each image of the folder)
      • img0001.jpg
      • ...
      • img9999.jpg
    • train
    • valid
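The four archives describe the same boxes with different conventions: corner coordinates (TensorFlow CSV, Pascal VOC), top-left corner plus size (COCO), and centre plus size (YOLO). The sketch below converts the last two into corner form. It assumes, as is conventional for YOLO label files but not stated above, that the YOLO values are normalised to [0, 1] by the image size and that the label is an integer class index; the function names and the sample label line are hypothetical.

```python
from typing import Tuple

# Corner form used by the TensorFlow CSV and Pascal VOC files.
Corners = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax) in pixels

IMG_W, IMG_H = 1024, 768  # image size stated for this dataset


def yolo_to_corners(xc: float, yc: float, w: float, h: float,
                    img_w: int = IMG_W, img_h: int = IMG_H) -> Corners:
    """Convert a YOLO box (assumed normalised to [0, 1]) to pixel corners."""
    box_w, box_h = w * img_w, h * img_h
    xmin = xc * img_w - box_w / 2
    ymin = yc * img_h - box_h / 2
    return (xmin, ymin, xmin + box_w, ymin + box_h)


def coco_to_corners(x: float, y: float, w: float, h: float) -> Corners:
    """Convert a COCO box (top-left corner, width, height) to pixel corners."""
    return (x, y, x + w, y + h)


def parse_yolo_line(line: str) -> Tuple[int, Corners]:
    """Split a 'label x_center y_center width height' line into its parts."""
    label, xc, yc, w, h = line.split()
    return int(label), yolo_to_corners(float(xc), float(yc), float(w), float(h))


# Hypothetical label line: class 4, box centred in the image,
# a quarter of the image wide and a quarter of the image high.
print(parse_yolo_line("4 0.5 0.5 0.25 0.25"))  # (4, (384.0, 288.0, 640.0, 480.0))
print(coco_to_corners(384, 288, 256, 192))     # (384, 288, 640, 480)
```

Both calls describe the same rectangle, which is a convenient sanity check when moving between the archives.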

The acquisition and annotation of this dataset required a great deal of work without any remuneration. We provide it free of charge, but we ask those who intend to use it to kindly cite the following papers (thanks in advance):

  • Dimauro, G., Barbaro, N., Camporeale, M. G., Fiore, V., Gelardi, M., & Scalera, M. (2024). DeepCilia: Automated, deep-learning based engine for precise ciliary beat frequency estimation. Biomedical Signal Processing and Control, 90, 105808. https://doi.org/10.1016/j.bspc.2023.105808
  • Camporeale, M., Dimauro, G., Gelardi, M., Iacobellis, G., Ladisa, M. S., Latrofa, S., & Lomonte, N. (2024). A Nasal Cytology Dataset for Object Detection and Deep Learning (arXiv:2404.13745). arXiv. https://doi.org/10.48550/arXiv.2404.13745