Dataset for Assessing Water Quality for Drinking and Irrigation Purposes using Machine Learning Models
Access to potable water is a critical requirement for human survival. Beyond drinking, water is also necessary for animal consumption, irrigation, and domestic and commercial uses. Laboratory assessment of water samples to determine their fitness for use is a vital step in water quality assurance. However, laboratory assessments require adherence to stringent procedures that can be difficult to comply with. Machine learning (ML) has emerged in recent years as a viable and cheaper way to complement (or replace) lab-based assessments, provided that sufficient data are available to train the ML models. Unfortunately, such data are often unavailable or only sparsely available, especially in less developed countries. This work attempts to fill that gap by creating amply sized datasets that can be used to train and test ML models. Two datasets are curated in this work, one for drinking water and the other for irrigation water. The datasets were curated by aggregating data from smaller datasets on related concepts, which were then processed and labelled to make them suitable for supervised ML models. To demonstrate the applicability of the curated datasets, they were used to train ML models in a related work and yielded good results.
The datasets are in CSV format and contain physico-chemical parameters of water from different sources. Each data entry is labelled 0 or 1, representing usable or not usable, respectively. This label field is the last column in each dataset. Two datasets are uploaded: the first is for drinking (potable) water, and the second is for water usable for irrigation. The algorithm for calculating the label value is included in the documentation, along with the Python scripts used to calculate the values.
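As an illustration of the layout described above, the following minimal sketch separates the feature columns from the label in the last column. It is not one of the scripts shipped with the datasets. The function name load_labelled_csv, the column names, and the sample values are hypothetical, and the sketch assumes a header row and numeric feature columns.

```python
import csv
import io


def load_labelled_csv(source):
    """Read a labelled CSV from a file-like object.

    Returns (feature_names, X, y). Assumes a header row, numeric
    feature columns, and the label (0 or 1) in the last column.
    """
    reader = csv.reader(source)
    header = next(reader)
    feature_names = header[:-1]  # every column except the label
    X, y = [], []
    for row in reader:
        if not row:  # skip blank lines
            continue
        X.append([float(value) for value in row[:-1]])
        y.append(int(float(row[-1])))  # 0 = usable, 1 = not usable
    return feature_names, X, y


# Hypothetical two-row sample; real column names and values will differ.
sample = io.StringIO(
    "pH,TDS,label\n"
    "7.2,310,0\n"
    "9.8,1450,1\n"
)
feature_names, X, y = load_labelled_csv(sample)
```

To read one of the uploaded files instead, pass an open file handle, for example open(path, newline=""), where path points to the downloaded CSV.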