Kavach Shah
Wed, 01/03/2024 - 15:33
We downloaded a dataset of Hindi poems from the website; it contains around 2500 poems. The downloaded dataset link is: link

In the initial phase of our data preprocessing pipeline, we collected text from a set of HTML files, totaling 2500 documents. We then merged all the extracted text into a single consolidated text file to prepare the data for further processing.

The first refinement step was to remove extraneous characters that do not belong to the Devanagari script, so that later analysis focuses only on the relevant linguistic content.

To improve readability and keep the formatting consistent across the dataset, we replaced multiple consecutive newline characters with a single newline. Following this, we stripped any leading and trailing spaces, giving the text a more uniform format.

To refine the dataset further, we filtered out numerals written in Hindi script. This step removes non-linguistic elements that would otherwise add noise to later analyses.

Together, these preprocessing steps streamline the dataset, handle the large number of source files in a uniform way, and leave a cleaner Devanagari corpus for subsequent research and analysis.
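
The preprocessing steps described above can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the exact code we ran: the function names `html_to_text` and `clean_text` are hypothetical, the Devanagari script is taken to be the Unicode block U+0900 to U+097F, spaces and newlines are assumed to be kept, and the Hindi numerals are removed before whitespace is normalized so that lines containing only digits do not leave blank lines behind.

```python
import re
from html.parser import HTMLParser

# Assumption: "Devanagari script" means the Unicode block U+0900..U+097F.
# Spaces and newlines are kept so that words and lines stay separated.
NON_DEVANAGARI = re.compile(r"[^\u0900-\u097F \n]")
# Devanagari digits run from U+0966 (zero) to U+096F (nine).
DEVANAGARI_DIGITS = re.compile(r"[\u0966-\u096F]")
MULTI_NEWLINE = re.compile(r"\n{2,}")


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html: str) -> str:
    """Return the text content of an HTML string, one text node per line.

    Note: this simple version also keeps text inside <script>/<style> tags;
    for this corpus that text is non-Devanagari and is dropped by clean_text.
    """
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def clean_text(raw: str) -> str:
    """Apply the cleaning steps to the merged text."""
    # 1. Drop every character outside the Devanagari block.
    text = NON_DEVANAGARI.sub("", raw)
    # 2. Drop numerals written in Devanagari digits.
    text = DEVANAGARI_DIGITS.sub("", text)
    # 3. Strip leading and trailing spaces on each line.
    text = "\n".join(line.strip() for line in text.split("\n"))
    # 4. Collapse runs of newlines into a single newline.
    text = MULTI_NEWLINE.sub("\n", text)
    return text.strip()


if __name__ == "__main__":
    sample = "<p>Poem 1: नमस्ते</p>\n\n<p>दुनिया १२</p>"
    print(clean_text(html_to_text(sample)))
```

For the full corpus, `html_to_text` would be called on each of the HTML files and the results joined with newlines before a single call to `clean_text`. Punctuation such as the danda (।) lies inside the Devanagari block and is therefore kept by this sketch.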


