The Paddy Doctor dataset contains 16,225 labeled paddy leaf images across 13 classes (12 different paddy diseases and healthy leaves). It is the largest expert-annotated visual image dataset for experimenting with and benchmarking computer vision algorithms. The paddy leaf images were collected from real paddy fields using a high-resolution (1,080 x 1,440 pixels) smartphone camera. The collected images were carefully cleaned and annotated with the help of an agronomist. Visit the Paddy Doctor project website https://paddydoc.github.io for more information.



Paddy is a ubiquitous crop in most Asian countries. Paddy farming is a complex process affected by many diseases and pests. Identifying these diseases early and accurately is essential to prevent significant yield loss, but it is a daunting task for farmers. Traditionally, farmers rely on their experience and visual inspection to identify paddy diseases, which is inefficient, time-consuming, and error-prone. There is therefore an increasing need for automated solutions that support accurate disease diagnosis, which in turn reduces pesticide usage and minimizes yield loss. However, the lack of public datasets with annotated disease names has been a major bottleneck, both for benchmarking recent deep learning-based models and for wider adoption of such solutions. We therefore developed and open-sourced the Paddy Doctor dataset to enable the development of efficient and robust paddy disease diagnosis systems.

How were the images collected and annotated?

We collected RGB images of paddy leaves from real paddy fields in a village near the Tirunelveli district of Tamil Nadu, India. The data collection took place from February to April 2021, when the paddy crop was between 40 and 80 days old. We used the built-in camera of a CAT S62 Pro smartphone to capture high-resolution RGB images. Our initial dataset contained approximately 30,000 images in JPEG format with a pixel resolution of 1,080 x 1,440. Next, we carefully examined each sample and removed the bad and duplicate images. After image cleaning, we were left with 16,225 images. Then, with the help of an agronomist, we manually annotated each image based on the presence of disease symptoms and assigned a corresponding label, i.e., the paddy disease name or normal. After annotation, the final dataset had 13 classes, corresponding to 12 diseases and healthy leaves. In addition to the RGB images, we manually recorded metadata for each leaf image, namely the variety and age of the paddy crop at the time the image was taken.
The following figure shows the data collection and annotation workflow. The names of the 12 paddy diseases and the number of images for each are also shown.
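
The text does not say how duplicate images were detected, beyond a careful examination of each sample. As a purely illustrative sketch (not the authors' procedure), byte-identical duplicates in a folder of JPEG files could be flagged by hashing file contents. The function names and the folder layout here are assumptions, and this approach would not catch near-duplicates such as two slightly different shots of the same leaf.

```python
import hashlib
from pathlib import Path
from typing import Dict, List

JPEG_SUFFIXES = {".jpg", ".jpeg"}


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    hasher = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def find_exact_duplicates(image_dir: Path) -> Dict[str, List[Path]]:
    """Group JPEG files under image_dir by content hash.

    Only groups with two or more files (byte-identical copies) are returned.
    """
    groups: Dict[str, List[Path]] = {}
    for path in sorted(image_dir.rglob("*")):
        if path.is_file() and path.suffix.lower() in JPEG_SUFFIXES:
            groups.setdefault(file_digest(path), []).append(path)
    return {digest: paths for digest, paths in groups.items() if len(paths) > 1}
```

Calling `find_exact_duplicates(Path("raw_images"))`, where `raw_images` is a hypothetical folder name, returns a mapping from content hash to the files sharing it, so that all but one file in each group can be reviewed for removal.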

Here is the data collection summary.

  1. Crop name: Paddy
  2. Total number of images: 16,225
  3. Total number of classes: 13 (12 paddy diseases and normal leaf)
  4. Image type: Visual (RGB)
  5. Image file type: JPEG
  6. Image resolution: 1,080 x 1,440 pixels
  7. Smartphone device used: CAT S62 Pro
  8. Data collection period: February to April 2021
  9. Data collection location: Pallamadai, Tamil Nadu, India - 627357
  10. Additional metadata: paddy age and variety for each image
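
The figures in this summary can be used as a quick sanity check on a local copy of the dataset. The sketch below assumes a layout with one subfolder per class containing the JPEG files. That layout, and the function names, are assumptions made for illustration, so adjust them to match how your copy is organized.

```python
from collections import Counter
from pathlib import Path
from typing import Dict

# Values taken from the data collection summary above.
EXPECTED_IMAGES = 16_225
EXPECTED_CLASSES = 13  # 12 paddy diseases plus normal leaves

JPEG_SUFFIXES = {".jpg", ".jpeg"}


def count_images_per_class(root: Path) -> Dict[str, int]:
    """Count JPEG files in each immediate subfolder of root.

    Assumes (hypothetically) that each subfolder name is a class label.
    """
    counts: Counter = Counter()
    for path in root.glob("*/*"):
        if path.is_file() and path.suffix.lower() in JPEG_SUFFIXES:
            counts[path.parent.name] += 1
    return dict(counts)


def matches_summary(counts: Dict[str, int]) -> bool:
    """Check the class count and total image count against the summary."""
    return (
        len(counts) == EXPECTED_CLASSES
        and sum(counts.values()) == EXPECTED_IMAGES
    )
```

For example, `matches_summary(count_images_per_class(Path("paddy_doctor")))`, with `paddy_doctor` as a hypothetical folder name, should return `True` only if the local copy has 13 class folders holding 16,225 JPEG images in total.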

Here are 12 sample paddy disease images from our Paddy Doctor dataset.