Paddy Doctor: A Visual Image Dataset for Automated Paddy Disease Classification and Benchmarking

Citation Author(s):
Petchiammal A (Manonmaniam Sundaranar University, Tirunelveli, India)
Briskline Kiruba S (Manonmaniam Sundaranar University, Tirunelveli, India)
Murugan D (Manonmaniam Sundaranar University, Tirunelveli, India)
Pandarasamy Arjunan (Singapore)
Submitted by: Pandarasamy Arjunan
Last updated: Thu, 02/16/2023 - 13:54
DOI: 10.21227/hz4v-af08
Data Format: Image files
Research Article Link: Paddy Doctor: A Visual Image Dataset for Automated Paddy Disease Classification…
Links: Paddy Doctor Dataset website; GitHub code repository; Code Ocean Capsule for reproducibility

Keywords: Paddy crop; paddy diseases; rice diseases; Paddy Disease Classification; computer vision; deep learning; transfer learning

Abstract

The Paddy Doctor dataset contains 16,225 labeled paddy leaf images across 13 classes (12 different paddy diseases and healthy leaves). It is the largest expert-annotated visual image dataset of paddy diseases for experimenting with and benchmarking computer vision algorithms. The paddy leaf images were collected from real paddy fields using a high-resolution (1,080 x 1,440 pixels) smartphone camera. The collected images were carefully cleaned and annotated with the help of an agronomist. Visit the Paddy Doctor project website https://paddydoc.github.io for more information.

Instructions:

Background

Paddy is a ubiquitous crop in most Asian countries. Paddy farming is a complex process affected by many diseases and pests. Early and accurate identification of these diseases is essential to prevent significant yield loss, but it is a daunting task for farmers. Traditionally, farmers rely on manual techniques based on their experience and visual inspection, which are inefficient, time-consuming, and error-prone. There is therefore an increasing need for automated solutions that support accurate diagnosis of diseases, which would reduce pesticide usage and, in turn, minimize yield loss. However, the lack of public datasets with annotated disease names has been a major bottleneck to benchmarking recent deep learning models and to wider adoption of such solutions. We therefore developed and open-sourced our Paddy Doctor dataset to enable the development of efficient and robust paddy disease diagnosis systems.

How were the images collected and annotated?

We collected RGB images of paddy leaves from real paddy fields in a village near the Tirunelveli district of Tamil Nadu, India. The data were collected from February to April 2021, when the paddy crop was between 40 and 80 days old. We used the built-in camera of a CAT S62 Pro smartphone to capture high-resolution RGB images. Our initial dataset contained approximately 30,000 images in JPEG format with a pixel resolution of 1,080 x 1,440. Next, we carefully examined each sample and removed the bad and duplicate images. After cleaning, 16,225 images remained. Then, with the help of an agronomist, we manually annotated each image based on the presence of disease symptoms and assigned a corresponding label, i.e., the paddy disease name or normal. After annotation, the final dataset had 13 classes, corresponding to 12 diseases and healthy leaves. In addition to the RGB images, we manually recorded metadata for each leaf image, namely the variety and age of the paddy crop when the image was collected.

The following figure shows the data collection and annotation workflow. The names of 12 paddy diseases and the number of images are also shown.

Here is the data collection summary.

Crop name: Paddy
Total number of images: 16,225
Total number of classes: 13 (12 paddy diseases and normal leaf)
Image type: Visual (RGB)
Image file type: JPEG
Image resolution: 1,080 x 1,440 pixels
Smartphone device used: CAT S62 Pro
Data collection period: February to April 2021
Data collection location: Pallamadai, Tamil Nadu, India - 627357
Additional metadata: paddy age and variety for each image

Here are 12 sample paddy disease images from our Paddy Doctor dataset.

The Paddy Doctor Dataset Files

We release four variants of our original Paddy Doctor dataset (see the dataset files section) to increase its usability. They are:

paddy-doctor-diseases.zip: This file (~4.7 GB) contains 16,225 original high-resolution (1080x1440) images collected from the paddy fields.
paddy-doctor-diseases-medium.zip: This file (~1.24 GB) contains 16,225 resized images (converted from the original version) with a resolution of 480x640 pixels.
paddy-doctor-diseases-small.zip: This file (~323 MB) contains 16,225 resized images (converted from the original version) with a resolution of 256x256 pixels.
paddy-doctor-diseases-small-split.zip: This file (~323 MB) contains 16,225 small images (256x256) split into train and test sets. The train set has 12,980 images (80%), and the test set has the remaining 3,245 images (20%). The train and test sets were stratified by class label and paddy variety (see the metadata section). This version is suitable for quick experiments with existing deep-learning models, which usually require split datasets. See the scripts section for example code using this version.
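
The stratified 80/20 split described for the last file can be illustrated with a short standard-library sketch. This is not the authors' split.py: the function name stratified_split, the seed, and the per-group rounding rule are assumptions made for illustration, and the records are toy data with placeholder variety names.

```python
import random
from collections import defaultdict


def stratified_split(records, keys=("label", "variety"), test_frac=0.2, seed=0):
    """Split records so every (label, variety) group sends ~test_frac to test.

    Hypothetical helper for illustration; not the authors' split.py.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[k] for k in keys)].append(rec)

    train, test = [], []
    for key in sorted(groups):
        members = list(groups[key])
        rng.shuffle(members)
        n_test = round(len(members) * test_frac)  # assumed rounding rule
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


# Toy records: labels reuse class folder names, varieties are placeholders.
records = (
    [{"image_id": f"a{i}.jpg", "label": "blast", "variety": "v1"} for i in range(100)]
    + [{"image_id": f"b{i}.jpg", "label": "normal", "variety": "v2"} for i in range(50)]
)
train, test = stratified_split(records)
print(len(train), len(test))
```

Splitting within each (label, variety) group, rather than over the whole list, keeps the class and variety proportions roughly equal in both sets.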

Balanced and augmented datasets

As the data collection figure shows, the original Paddy Doctor dataset is unbalanced. We therefore created a balanced dataset by randomly selecting 400 images from each class and splitting them into train (80%) and test (20%) sets for easy use.

paddy-doctor-diseases-small-400-split.zip: This file (~105 MB) contains 5,200 (400 x 13 classes) small images split into the train (4,160 images) and test (1,040) sets.
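
As a rough illustration of how such a balanced subset could be drawn, the sketch below samples a fixed number of records per class without replacement. The helper name sample_balanced, the seed, and the toy class sizes are assumptions; the final lines only re-check the file sizes stated above.

```python
import random
from collections import Counter, defaultdict


def sample_balanced(records, per_class=400, seed=0):
    """Randomly pick per_class records from every label (hypothetical helper)."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for rec in records:
        by_label[rec["label"]].append(rec)

    balanced = []
    for label in sorted(by_label):
        members = by_label[label]
        if len(members) < per_class:
            raise ValueError(f"class {label!r} has only {len(members)} images")
        balanced.extend(rng.sample(members, per_class))  # without replacement
    return balanced


# Toy, unbalanced input: these class sizes are made up for the example.
sizes = {"blast": 12, "hispa": 7, "normal": 9}
toy = [
    {"image_id": f"{label}_{i}.jpg", "label": label}
    for label, n in sizes.items()
    for i in range(n)
]
subset = sample_balanced(toy, per_class=5)
print(Counter(r["label"] for r in subset))

# Size check for the released file: 400 images x 13 classes, split 80/20.
total = 400 * 13
n_train = total * 80 // 100
print(total, n_train, total - n_train)
```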

In addition to the original dataset, we also provide four augmented versions of our Paddy Doctor dataset. They are:

paddy-doctor-diseases-small-augmented-26k.zip: This file (~506 MB) contains 26,000 (2,000 x 13) augmented images with an equal number of samples from each of the 13 classes. This is also a balanced version of our original dataset.
paddy-doctor-diseases-small-augmented-26k-split.zip: This file (~506 MB) is the split version of the above set, with the same 26,000 (2,000 x 13) augmented images. The train set has 20,800 images (80%), and the test set has the remaining 5,200 images (20%). This is also a balanced version of our original dataset.
paddy-doctor-diseases-small-augmented-65k.zip: This file (~1.24 GB) contains 65,000 (5,000 x 13) augmented images with an equal number of samples (5,000) from each of the 13 classes. This is also a balanced version of our original dataset.
paddy-doctor-diseases-small-augmented-5x.zip: This file (~1.54 GB) contains 81,125 augmented images. This version is five times larger than the original version (5 x 16,225 = 81,125).

These augmented Paddy Doctor datasets are generated by applying different image transformation operations, such as random rotation (5 degrees), shear (0.2), zoom (20%), and horizontal flip, to the low-resolution version (paddy-doctor-diseases-small.zip). The code used to generate these images is given in the scripts section.
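
To make the listed transformations concrete, here is a minimal sketch that samples one set of random parameters and composes them into a single 2x2 linear map. It is not the authors' augmentation.py. Several points are assumptions: that 0.2 is a shear factor, that zoom means a uniform scale between 0.8 and 1.2, that flips happen with probability 0.5, and the order in which the operations are composed. Applying the matrix to pixels would need an imaging library, which is left out here.

```python
import math
import random


def sample_params(rng):
    """Draw one random set of transformation parameters."""
    return {
        "angle_deg": rng.uniform(-5.0, 5.0),  # rotation within +/-5 degrees
        "shear": rng.uniform(-0.2, 0.2),      # shear factor (assumed meaning of 0.2)
        "zoom": rng.uniform(0.8, 1.2),        # uniform scale within +/-20% (assumed)
        "flip": rng.random() < 0.5,           # horizontal flip half of the time (assumed)
    }


def matmul2(a, b):
    """Multiply two 2x2 matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
        for i in range(2)
    ]


def linear_matrix(p):
    """Compose rotation, shear, zoom, and flip into a single 2x2 matrix."""
    t = math.radians(p["angle_deg"])
    rot = [[math.cos(t), -math.sin(t)], [math.sin(t), math.cos(t)]]
    shear = [[1.0, p["shear"]], [0.0, 1.0]]
    zoom = [[p["zoom"], 0.0], [0.0, p["zoom"]]]
    flip = [[-1.0, 0.0], [0.0, 1.0]] if p["flip"] else [[1.0, 0.0], [0.0, 1.0]]
    # For column vectors the product acts right to left:
    # rotate, then shear, then zoom, then flip.
    return matmul2(flip, matmul2(zoom, matmul2(shear, rot)))


rng = random.Random(0)
params = sample_params(rng)
matrix = linear_matrix(params)
print(params)
print(matrix)
```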

Directory structure

Each Paddy Doctor dataset file (e.g., paddy-doctor-diseases-small.zip) contains 13 folders corresponding to 13 classes (12 paddy diseases and normal) and a metadata file named metadata.csv. The directory structure of the dataset is as follows:

paddy-doctor-diseases-small/
├─metadata.csv
├─bacterial_leaf_blight/
│ ├─PDD00001.jpg
│ ├─PDD00002.jpg
│ └─PDD00003.jpg
│ ...
├─bacterial_leaf_streak/
├─bacterial_panicle_blight/
├─black_stem_borer/
├─blast/
├─brown_spot/
├─downy_mildew/
├─hispa/
├─leaf_roller/
├─normal/
├─tungro/
└─white_stem_borer/
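
Given this layout, the per-class image counts can be recovered by walking the class folders. The sketch below uses only pathlib; the helper names count_images_per_class and report are hypothetical, and it assumes the archive has already been extracted and that images use the .jpg extension shown in the tree.

```python
from pathlib import Path


def count_images_per_class(root):
    """Return {class_folder_name: number of .jpg files} for a dataset directory."""
    counts = {}
    for class_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        counts[class_dir.name] = sum(
            1
            for f in class_dir.iterdir()
            if f.is_file() and f.suffix.lower() == ".jpg"
        )
    return counts


def report(root):
    """Print one line per class folder with its image count."""
    for name, n in count_images_per_class(root).items():
        print(f"{name:28s} {n:6d}")
```

For example, calling report("paddy-doctor-diseases-small") from the directory where the archive was extracted prints one line per class folder; metadata.csv is skipped because it is not a directory.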

The directory structure of the split datasets is as follows:

paddy-doctor-diseases-small-split/
├─metadata-test.csv
├─metadata-train.csv
├─metadata.csv
├─train/
│ └─13 folders
└─test/
  └─13 folders

Metadata files

All dataset files contain a metadata file named metadata.csv. The metadata file has four columns:

image_id - Unique image identifier, which corresponds to the image file name (e.g., PDD00001.jpg) found within the individual class directories.
label - Type of paddy disease, which is also the target class. There are 13 classes, including the normal leaf.
variety - The name of the paddy variety in this image.
age - Age of the paddy crop in days when the image was collected.

Here are a few records from the metadata file.

image_id,label,variety,age
PDD00001.jpg,bacterial_leaf_blight,45,65
PDD00002.jpg,bacterial_leaf_blight,45,60
PDD00003.jpg,bacterial_leaf_blight,45,55
...

Note that the metadata file for split datasets (e.g., paddy-doctor-diseases-small-split.zip) contains an additional column called split that denotes whether that image belongs to the train or test set.
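
A metadata file with these columns can be read with Python's csv module. The following is a minimal sketch: the helper names load_metadata and rows_in_split are hypothetical, the sample rows are the records shown above, and the literal split values "train" and "test" are an assumption about how the split column is encoded.

```python
import csv
import io
from collections import Counter


def load_metadata(fileobj):
    """Read metadata rows into dictionaries and convert age to an integer."""
    rows = list(csv.DictReader(fileobj))
    for row in rows:
        row["age"] = int(row["age"])  # age is given in days
    return rows


def rows_in_split(rows, split_name):
    """Keep rows whose split column equals split_name (split datasets only)."""
    return [row for row in rows if row.get("split") == split_name]


# The sample records from the metadata section, used here as in-memory input.
SAMPLE = """image_id,label,variety,age
PDD00001.jpg,bacterial_leaf_blight,45,65
PDD00002.jpg,bacterial_leaf_blight,45,60
PDD00003.jpg,bacterial_leaf_blight,45,55
"""

rows = load_metadata(io.StringIO(SAMPLE))
print(Counter(row["label"] for row in rows))
print([row["age"] for row in rows])
```

To read a real file, pass a handle opened with open("metadata.csv", newline="") in place of the io.StringIO object.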

Scripts and Resources

Several Python scripts are provided to analyze the Paddy Doctor dataset and develop paddy disease classification models. Here is a list of files available:

split.py - A Python script to split the final dataset into train/test sets.
augmentation.py - A Python script to create augmented datasets.
exploratory_data_analysis.py - A Python script for exploratory data analysis, such as analyzing metadata and plotting data distributions.
deep_cnn.py - A Python script to develop a simple deep CNN model using TensorFlow. This model achieved an F1-score of 0.83462.
resnet34.py - A Python script to develop and fine-tune the ResNet34 model using FastAI. This model achieved an F1-score of 0.94615.

Additional code and resources

Code Ocean Capsule (https://doi.org/10.24433/CO.0659329.v1) - This capsule contains reproducibility code for exploratory data analysis and a deep CNN model for paddy disease classification.
Benchmark study (https://paddydoc.github.io/code/) - This site contains code and results from our benchmarking study comparing the performance of five deep-learning models on the Paddy Doctor dataset.
GitHub repository (https://github.com/paddydoc/paddy-doctor-dataset/tree/main/notebooks) - This GitHub repository contains all of the open-source code and additional information.
Kaggle Competition (https://www.kaggle.com/c/paddy-disease-classification/) - This Kaggle competition has many public Kaggle kernels with deep learning models for paddy leaf disease classification.

We believe this starter code will help anyone explore our Paddy Doctor dataset and develop advanced models for paddy disease classification.

Future work

In addition to the RGB images, we collected infrared images of paddy leaves with diseases and pests. We are currently processing those images and will be releasing them here soon. Moreover, plans are underway to expand our Paddy Doctor dataset by collecting fine-grained disease information (using hyper-spectral imagers) about paddy diseases and pests for early diagnosis.

References

Petchiammal A, Briskline Kiruba S, Murugan D, and Pandarasamy A. 2023. Paddy Doctor: A Visual Image Dataset for Automated Paddy Disease Classification and Benchmarking. In 6th Joint International Conference on Data Science & Management of Data (10th ACM IKDD CODS and 28th COMAD) (CODS-COMAD 2023), January 4–7, 2023, Mumbai, India. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3570991.3570994

Petchiammal A, Briskline Kiruba S, Murugan D, and Pandarasamy A. 2022. Paddy Doctor: A Visual Image Dataset for Automated Paddy Disease Classification and Benchmarking. arXiv preprint arXiv:2205.11108. https://doi.org/10.48550/arXiv.2205.11108
