Dataset for Inclusive Fintech Software Development
- Submitted by: Belinda Kobusingye
- Last updated: Mon, 09/23/2024 - 00:35
- DOI: 10.21227/jp32-an80
Abstract
This study presents an English-Luganda parallel corpus of over 2,000 sentence pairs focused on financial decision-making and financial products. The dataset draws on diverse sources, including social media platforms (TikTok comments and Twitter posts from authoritative accounts such as Bank of Uganda and Capital Markets Uganda) and fintech blogs (Chipper Cash and Xeno). The corpus covers a range of financial topics, including bonds, loans, and unit trust funds, providing a resource for financial language processing in both English and Luganda.
Instructions:
- Load the dataset using pandas.
- Inspect the data to understand its structure and identify potential issues.
- Handle missing values by filling the 'source' column with 'Unknown' and dropping rows with missing values in 'english' or 'luganda' columns.
- Normalize text in both 'english' and 'luganda' columns by converting to lowercase, removing extra whitespace, and removing special characters.
- Adjust these steps as needed based on your specific dataset characteristics and project requirements.
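The steps above can be sketched in pandas roughly as follows. This is a minimal illustration, not the authors' own preprocessing script: it assumes the corpus is distributed as a CSV file (the file name in the comment is hypothetical) with the 'english', 'luganda', and 'source' columns named in the instructions. The sample rows are placeholders invented for the demonstration and are not taken from the corpus.

```python
import re
import pandas as pd


def load_corpus(path: str) -> pd.DataFrame:
    """Load the corpus. Assumes a CSV file; adjust the reader if the format differs."""
    return pd.read_csv(path)


def normalize_text(text: str) -> str:
    """Lowercase, replace special characters with spaces, collapse whitespace."""
    text = str(text).lower()
    # \w is Unicode-aware in Python 3, so letters such as 'ŋ' are preserved.
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def clean_corpus(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the missing-value and normalization steps from the instructions."""
    df = df.copy()
    # Fill missing sources, then drop pairs that lack either side.
    df["source"] = df["source"].fillna("Unknown")
    df = df.dropna(subset=["english", "luganda"])
    for col in ("english", "luganda"):
        df[col] = df[col].map(normalize_text)
    return df.reset_index(drop=True)


# With the real file (hypothetical name): df = load_corpus("english_luganda_corpus.csv")
# Placeholder rows stand in for the corpus here.
sample = pd.DataFrame({
    "english": ["  What is a  Treasury Bond? ", None, "Unit trust funds!", "Loans & interest"],
    "luganda": ["Placeholder Luganda text ONE.", "placeholder two", None, "Placeholder three"],
    "source": ["Twitter", "TikTok", "Blog", None],
})

# Inspect structure and count missing values per column before cleaning.
sample.info()
print(sample.isna().sum())

cleaned = clean_corpus(sample)
print(cleaned)
```

One design choice to revisit: this sketch strips every punctuation mark, including apostrophes. Luganda orthography uses the apostrophe (for example in ng'), so for downstream tasks such as machine translation it may be preferable to keep it by adding it to the allowed characters in the regular expression.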
Funding Agency:
Makerere University Research and Innovations Fund