The performance of most classification models depends on the data used for training. The data must be reliable, robust and meticulously labelled. To build such a data set, a systematic approach was designed. The data were collected from a well-known source, the Center for Language Engineering, available at http://www.cle.org.pk. The corpus available on the website, used here for prediction, contains Urdu Naskh text with 4,325 lines and 122,284 words, stored in three text files. This corpus was converted into the Jameel Noori Urdu Nastalique font style, keeping the same 4,325 lines and 122,284 words. Urdu Nastalique is context sensitive, which poses several challenges. The corpus text was then converted into images, because ligature segmentation and line segmentation of images are themselves challenging tasks in OCR systems.
1. Extract Urdu Nastalique (All Images) into a folder.
2. Extract Urdu Nastalique (AllSets) into the same folder. This folder will contain seven different sets of ligature classes. Each set contains samples of different ligature classes.
3. Click on Select Image, then select any image from the Urdu Nastalique (All Images) folder. These images can be used for training and testing.
4. Each ligature class contains 15 samples. We did this for uniformity, for better recognition, and in order to distinguish one ligature from another.
5. You can use these ligature classes as output classes for prediction.
6. Sets 1 to 6 each contain 161 ligature classes. The last set, Set 7, is the exception.
7. Class 161 contains samples of other ligature classes. It contains 4 samples of other ligature classes.
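
The per-class sample counts described in the steps above can be checked after extraction. The following is a minimal sketch, not part of the data set's tooling. It assumes a layout of one directory per set, one subdirectory per ligature class, and image files inside each class directory; this layout, the list of image file extensions, and the function names are all assumptions made for illustration.

```python
from collections import Counter
from pathlib import Path

# Assumed image extensions; adjust to whatever the extracted files actually use.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".bmp", ".tif", ".tiff"}


def count_samples(sets_root):
    """Return {(set_name, class_name): number_of_image_files}.

    Assumes the hypothetical layout sets_root/<set>/<class>/<image>.
    """
    counts = Counter()
    for image_path in Path(sets_root).glob("*/*/*"):
        if image_path.is_file() and image_path.suffix.lower() in IMAGE_SUFFIXES:
            class_dir = image_path.parent
            counts[(class_dir.parent.name, class_dir.name)] += 1
    return dict(counts)


def classes_with_unexpected_size(counts, expected=15):
    """Return the classes whose sample count differs from `expected`."""
    return {key: n for key, n in counts.items() if n != expected}
```

Calling `classes_with_unexpected_size(count_samples("path/to/AllSets"))` lists every class that does not hold 15 samples. Under the description above, class 161 would be expected to appear in that list.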