Datasets
Standard Dataset
Financial dataset
- Citation Author(s):
- Submitted by:
- Oluseyi Olaleye
- Last updated:
- Mon, 07/08/2024 - 15:58
- DOI:
- 10.21227/mhdg-f074
- License:
- Categories:
- Keywords:
Abstract
The process of dataset generation comprises three integral components: "Account Profiles," responsible for creating detailed account representations; "Transaction Generation," which simulates a diverse range of financial transactions; and the "Generation of Fraud Scenarios," which introduces predefined templates for identifying potential fraudulent transactions based on various criteria. Together, these components collaboratively construct a dynamic and realistic dataset, mirroring real-world financial systems.
The account profile generation process involves using randomization techniques to create a pandas DataFrame representing individual customer account profiles. Each profile includes attributes such as ID, mean transaction amount, standard deviation of transaction amounts, and mean daily transaction frequency. The generation function accepts parameters for the number of profiles and a "random_state" for reproducibility.
The transaction generation process systematically simulates the creation of a comprehensive dataset with 5000 unique account IDs over 183 days. Each transaction record includes a timestamp, account ID, transaction type (ATM Withdrawal, Teller Withdrawal, or Online transaction), and amount. Daily transaction frequency is determined probabilistically, with transaction timing and amount generated through random processes based on account profiles. Realism is maintained, ensuring all amounts are positive.
The generation of fraud scenarios comprises five distinct scenarios targeting specific types of fraudulent behavior. These scenarios consider transaction type, amount, timing, and historical account behavior to detect anomalies indicating fraud. Examples include identifying unusual transaction types, comparing transaction amounts to historical behavior, examining temporal aspects of transactions, and analyzing transaction velocity within specific time windows. These scenarios collectively form a robust framework for identifying potential fraudulent transactions within the generated dataset, enhancing authenticity through the use of probabilistic distributions and specific rules mirroring real-world financial patterns.
Dataset is in its prepocessed state
Comments
Hello, I am interested in using this dataset for academic purposes, specifically in fraud detection research. I noticed that the dataset is preprocessed. Is there any possibility to access the raw, unprocessed data, if available? Thank you!