Paddy Doctor: A Visual Image Dataset for Automated Paddy Disease Classification and Benchmarking
- Submitted by: Pandarasamy Arjunan
- Last updated: Thu, 02/16/2023 - 08:54
- DOI: 10.21227/hz4v-af08
Abstract
The Paddy Doctor dataset contains 16,225 labeled paddy leaf images across 13 classes (12 different paddy diseases and healthy leaves). It is the largest expert-annotated visual image dataset for experimenting with and benchmarking computer vision algorithms. The paddy leaf images were collected from real paddy fields using a high-resolution (1,080 x 1,440 pixels) smartphone camera. The collected images were carefully cleaned and annotated with the help of an agronomist. Visit the Paddy Doctor project website https://paddydoc.github.io for more information.
Background
How were the images collected and annotated?
Here is the data collection summary.
- Crop name: Paddy
- Total number of images: 16,225
- Total number of classes: 13 (12 paddy diseases and normal leaf)
- Image type: Visual (RGB)
- Image file type: JPEG
- Image resolution: 1,080 x 1,440 pixels
- Smartphone device used: CAT S62 Pro
- Data collection period: February to April 2021
- Data collection location: Pallamadai, Tamil Nadu, India - 627357
- Additional metadata: paddy age and variety for each image
Here are 12 sample paddy disease images from our Paddy Doctor dataset.
The Paddy Doctor Dataset Files
We release four variants of our original Paddy Doctor dataset (see the dataset files section) to increase its usability. They are:
- paddy-doctor-diseases.zip: This file (~4.7 GB) contains 16,225 original high-resolution (1080x1440) images collected from the paddy fields.
- paddy-doctor-diseases-medium.zip: This file (~1.24 GB) contains 16,225 resized images (converted from the original version) with a resolution of 480x640 pixels.
- paddy-doctor-diseases-small.zip: This file (~323 MB) contains 16,225 resized images (converted from the original version) with a resolution of 256x256 pixels.
- paddy-doctor-diseases-small-split.zip: This file (~323 MB) contains 16,225 small images (256x256) split into train and test sets. The train set has 12,980 (80%), and the test set has the remaining 3,245 (20%) images. The train and test sets were stratified based on class labels and paddy variety (See the metadata section). This version may be suitable for easily experimenting with existing deep-learning models that usually require split datasets. See the scripts section for example code using this version.
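The 80/20 split stratified on class label and paddy variety can be illustrated with a minimal sketch. This is not the authors' split.py; it is a standard-library approximation, and the function name `stratified_split`, the seed, and the toy records (including the variety names "A" and "B") are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(records, keys=("label", "variety"), test_fraction=0.2, seed=42):
    """Split records so each (label, variety) group keeps roughly the
    same train/test proportion."""
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[k] for k in keys)].append(rec)

    rng = random.Random(seed)
    train, test = [], []
    for group_key in sorted(groups):
        members = groups[group_key]
        rng.shuffle(members)
        # Number of test samples drawn from this group.
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


# Toy records standing in for rows of metadata.csv (varieties are made up).
groups_spec = (
    [("bacterial_leaf_blight", "A")] * 50
    + [("bacterial_leaf_blight", "B")] * 30
    + [("normal", "A")] * 20
)
toy = [
    {"image_id": f"img{i:03d}.jpg", "label": label, "variety": variety}
    for i, (label, variety) in enumerate(groups_spec)
]

train, test = stratified_split(toy)
print(len(train), len(test))
```

Splitting each group separately keeps rare label and variety combinations represented in both sets, which a plain random split does not guarantee.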
Balanced and augmented datasets
- paddy-doctor-diseases-small-400-split.zip: This file (~105 MB) contains 5,200 (400 x 13 classes) small images split into the train (4,160 images) and test (1,040) sets.
In addition to the original dataset, we also provide four augmented versions of our Paddy Doctor dataset. They are:
- paddy-doctor-diseases-small-augmented-26k.zip: This file (~506 MB) contains 26,000 (2,000 x 13) augmented images with an equal number of samples from each of the 13 classes. This is also a balanced version of our original dataset.
- paddy-doctor-diseases-small-augmented-26k-split.zip: This file (~506 MB) contains 26,000 (2,000 x 13) augmented images with an equal number of samples (2000) from each of the 13 classes. This is a split version of the above set. The train set has 20,800 (80%), and the test set has the remaining 5,200 (20%) images. This is also a balanced version of our original dataset.
- paddy-doctor-diseases-small-augmented-65k.zip: This file (~1.24 GB) contains 65,000 (5,000 x 13) augmented images with an equal number of samples (5000) from each of the 13 classes. This is also a balanced version of our original dataset.
- paddy-doctor-diseases-small-augmented-5x.zip: This file (~1.54 GB) contains 81,125 augmented images. This version is five times larger than the original version (5 x 16,225 = 81,125).
These augmented Paddy Doctor datasets are generated by applying different image transformation operations, such as random rotation (5 degrees), shear (0.2), zoom (20%), and horizontal flip, to the low-resolution version (paddy-doctor-diseases-small.zip). The code used to generate these images is given in the scripts section.
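As a rough illustration of how such transformations combine, the sketch below samples one random geometric transform per image. It is not the authors' augmentation.py. Several details are assumptions: rotation is drawn from plus or minus 5 degrees, the shear value 0.2 is treated as a shear factor (the source does not give its unit), zoom is a scale in [0.8, 1.2], and the flip probability is 0.5. The function names are hypothetical.

```python
import math
import random


def matmul2(a, b):
    """Multiply two 2x2 matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
        for i in range(2)
    ]


def sample_transform(rng, max_rotation_deg=5.0, max_shear=0.2, max_zoom=0.2, flip_prob=0.5):
    """Sample a 2x2 transform that applies flip and zoom first, then shear, then rotation."""
    theta = math.radians(rng.uniform(-max_rotation_deg, max_rotation_deg))
    shear = rng.uniform(-max_shear, max_shear)
    zoom = rng.uniform(1.0 - max_zoom, 1.0 + max_zoom)
    flip = -1.0 if rng.random() < flip_prob else 1.0  # -1 mirrors the x axis

    rotation = [[math.cos(theta), -math.sin(theta)],
                [math.sin(theta), math.cos(theta)]]
    shear_m = [[1.0, shear], [0.0, 1.0]]
    zoom_flip = [[zoom * flip, 0.0], [0.0, zoom]]
    return matmul2(rotation, matmul2(shear_m, zoom_flip))


def apply_about_center(matrix, x, y, size=256):
    """Map pixel (x, y) through the transform, keeping the image center fixed."""
    c = (size - 1) / 2.0
    dx, dy = x - c, y - c
    return (
        matrix[0][0] * dx + matrix[0][1] * dy + c,
        matrix[1][0] * dx + matrix[1][1] * dy + c,
    )


rng = random.Random(0)
m = sample_transform(rng)
print(apply_about_center(m, 0, 0))  # where the top-left pixel lands
```

An image library would use the same matrix to resample pixels. The sketch stops at the coordinate mapping so that it runs with the standard library alone.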
Directory structure
Each Paddy Doctor dataset file (e.g., paddy-doctor-diseases-small.zip) contains 13 folders corresponding to 13 classes (12 paddy diseases and normal) and a metadata file named metadata.csv. The metadata file has the following columns:
- image_id - Unique image identifier that corresponds to the image file names (e.g., PDD00001.jpg) found within the individual class directories.
- label - Type of paddy disease, which is also the target class. There are 13 classes, including the normal leaf.
- variety - The name of the paddy variety in this image.
- age - Age of the paddy in days when the image was collected.
Here are a few records from the metadata file.
image_id,label,variety,age
PDD00001.jpg,bacterial_leaf_blight,45,65
PDD00002.jpg,bacterial_leaf_blight,45,60
PDD00003.jpg,bacterial_leaf_blight,45,55
...
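A minimal sketch of reading these records and locating the matching image files follows. It assumes each image is stored under a folder named after its label, as described above. The root folder name passed to `image_path` is a guess at what the extracted archive is called, and both function names are hypothetical.

```python
import csv
import io
from collections import Counter
from pathlib import Path

# The sample records shown above, one per line.
SAMPLE = """image_id,label,variety,age
PDD00001.jpg,bacterial_leaf_blight,45,65
PDD00002.jpg,bacterial_leaf_blight,45,60
PDD00003.jpg,bacterial_leaf_blight,45,55
"""


def load_metadata(handle):
    """Read metadata rows into dicts and convert age (in days) to int."""
    rows = list(csv.DictReader(handle))
    for row in rows:
        row["age"] = int(row["age"])
    return rows


def image_path(root, row):
    """Build the expected path <root>/<label>/<image_id> for one record."""
    return Path(root) / row["label"] / row["image_id"]


rows = load_metadata(io.StringIO(SAMPLE))
print(Counter(row["label"] for row in rows))
print(image_path("paddy-doctor-diseases-small", rows[0]).as_posix())
```

For the real file, replace the `io.StringIO(SAMPLE)` argument with `open("metadata.csv", newline="")`.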
Scripts
- split.py - A Python script to split the final dataset into train/test sets.
- augmentation.py - A Python script to create augmented datasets.
- exploratory_data_analysis.py - A Python script for exploratory data analysis, such as analyzing metadata and plotting data distributions.
- deep_cnn.py - A Python script to develop a simple deep CNN model using TensorFlow. This model achieved an F1-score of 0.83462.
- resnet34.py - A Python script to develop and fine-tune the ResNet34 model using FastAI. This model achieved an F1-score of 0.94615.
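For readers unfamiliar with the metric, the sketch below computes a multi-class F1-score from true and predicted labels. The source does not say which averaging was used for the reported scores, so the macro average (the unweighted mean of per-class F1) is an assumption here, and `macro_f1` is a hypothetical helper rather than part of the released scripts.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 = 2*TP / (2*TP + FP + FN)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels: one "normal" leaf is misclassified as "blast".
y_true = ["normal", "normal", "blast", "blast"]
y_pred = ["normal", "blast", "blast", "blast"]
print(round(macro_f1(y_true, y_pred), 4))
```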
Additional code and resources
- Code Ocean Capsule (https://doi.org/10.24433/CO.0659329.v1) - This capsule has reproducibility code for exploratory data analysis and a deep CNN model for paddy disease classification.
- Benchmark study (https://paddydoc.github.io/code/) - This site contains code and results from our benchmarking study comparing the performance of five deep-learning models on the Paddy Doctor dataset.
- GitHub repository (https://github.com/paddydoc/paddy-doctor-dataset/tree/main/notebooks) - This GitHub repository contains all open-source code and additional information.
- Kaggle Competition (https://www.kaggle.com/c/paddy-disease-classification/) - This Kaggle competition has many public Kaggle kernels with deep learning models for paddy leaf disease classification.
Dataset Files
- Original - 16,225 large (1080x1440) images: paddy-doctor-diseases.zip (4.64 GB)
- Medium - 16,225 medium-size (480x640) images: paddy-doctor-diseases-medium.zip (1.23 GB)
- Small - 16,225 small (256x256) images: paddy-doctor-diseases-small.zip (322.15 MB)
- Small, balanced and split - 5,200 small images (400 random images from each of the 13 classes) split into train (80%) and test (20%) sets: paddy-doctor-diseases-small-400-split.zip (104.09 MB)
- Small and split - 16,225 small images split into train (80%) and test (20%) sets: paddy-doctor-diseases-small-split.zip (322.40 MB)
- Augmented - 26,000 augmented images (2,000 images from each of the 13 classes): paddy-doctor-diseases-small-augmented-26k.zip (505.87 MB)
- Augmented and split - 26,000 augmented images split into train (80%) and test (20%) sets: paddy-doctor-diseases-small-augmented-26k-split.zip (506.27 MB)
- Augmented - 65,000 augmented images (5,000 images from each of the 13 classes): paddy-doctor-diseases-small-augmented-65k.zip (1.24 GB)
- Augmented - 81,125 augmented images, 5 times larger than the original version (5 x 16,225): paddy-doctor-diseases-small-augmented-5x.zip (1.54 GB)
- Metadata file: metadata.csv (526.09 kB)
- split.py - Python script to split the final dataset into train/test sets: split.py.txt (3.81 kB)
- augmentation.py - Python script to create augmented datasets: augmentation.txt (8.46 kB)
- Python script for exploratory data analysis: exploratory_data_analysis.py (2.47 kB)
- Python script to develop a deep CNN model for paddy disease classification using TensorFlow: cnn_diseases_small_400_split_epoch100.py (13.17 kB)
- Python script to develop a fine-tuned ResNet34 model for paddy disease classification using FastAI: resnet34_diseases_small_400_split.py (3.62 kB)
Documentation
Attachment | Size |
---|---|
Readme file | 2.61 MB |
Research article | 1.02 MB |