Enhanced Cardiovascular Disease Dataset with Data Augmentation

Citation Author(s): José L. López-Saynes, Elías N. Escobar-Gómez, Héctor R. Hernández-de-León, Néstor A. Morales-Navarro
Submitted by: Jose Luis Lopez-Saynes
Last updated: Mon, 03/03/2025 - 20:18
DOI: 10.21227/v8bh-y702
Data Format: *.csv
Research Article Link: Analysis of Physiological Parameters for Assessing the Risk Level of Cardiovasc…
Links: Cardiovascular Disease Dataset

Keywords: Cardiovascular disease, data augmentation, synthetic data, Balanced Data set

Abstract

This dataset comprises 2 million synthetic samples generated using the Variational Autoencoder-Generative Adversarial Network (VAE-GAN) technique. It is designed to support cardiovascular disease prediction from demographic, physical, and health-related attributes, and it includes the physiological and behavioral indicators that contribute to cardiovascular health.

Dataset Description

The dataset consists of the following features:

Age (int, days): The age of the individual in days.
Height (int, cm): The height of the individual in centimeters.
Weight (float, kg): The weight of the individual in kilograms.
Body Mass Index (BMI) (float): Calculated as weight (kg) / height (m)², with height converted from centimeters to meters, providing an indicator of body fat.
Gender (categorical code): Encoded as 1 for female and 2 for male.
Systolic Blood Pressure (ap_hi) (int): The maximum arterial pressure during heartbeats.
Diastolic Blood Pressure (ap_lo) (int): The minimum arterial pressure between heartbeats.
Cholesterol (categorical): 1 for normal, 2 for above normal, and 3 for well above normal levels.
Glucose (categorical): 1 for normal, 2 for above normal, and 3 for well above normal levels.
Smoking (binary): 1 if the individual smokes, 0 otherwise.
Alcohol Intake (binary): 1 if the individual consumes alcohol, 0 otherwise.
Physical Activity (binary): 1 if the individual engages in regular physical activity, 0 otherwise.
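
The units above imply two small conversions: age is stored in days, and BMI needs height in meters rather than centimeters. The following is a minimal sketch of both, using the standard BMI definition. The function names and the 365.25-day year are assumptions made for illustration, and the input values are made up rather than taken from the dataset.

```python
def bmi(weight_kg: float, height_cm: float) -> float:
    """Standard body mass index: weight in kg over squared height in meters."""
    height_m = height_cm / 100.0  # the dataset stores height in centimeters
    return weight_kg / (height_m ** 2)


def age_in_years(age_days: int) -> float:
    """Convert an age in days to years (assumes 365.25 days per year)."""
    return age_days / 365.25


# Made-up example values, not rows from the dataset.
print(round(bmi(70.0, 175), 2))        # 22.86
print(round(age_in_years(18250), 1))   # 50.0
```

Recomputing BMI this way and comparing it with the stored BMI column is one way to check how the stored values were derived.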

Target Variable

Cardiovascular Disease (cardio) (binary): The presence (1) or absence (0) of cardiovascular disease.

This dataset provides a comprehensive set of features that can be used for machine learning models in cardiovascular disease prediction, enabling research and analysis on health-related risk factors and prevention strategies.

Instructions for Using the Dataset

Download the Dataset
- Download the dataset file from IEEE DataPort.
Install Required Libraries
- Ensure that you have the necessary libraries installed, such as pandas and numpy.
Load the Dataset in Python
- Once the file is downloaded, load it into your Python environment using an appropriate tool (e.g., pandas for .csv files).
Explore the Dataset
- Review the dataset to understand the columns and the types of data it contains.
Handle Missing Data
- If there are missing values in the dataset, you can choose to remove rows with missing data or fill them with appropriate values.
Select Relevant Features
- Select the columns or features that are important for your analysis or modeling.
Prepare for Analysis
- Prepare the data, ensuring that the variables are in the correct format for analysis or modeling.
Save the Processed Dataset
- After making modifications or cleaning the data, save the processed dataset into a new file for future use.
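
The steps above can be sketched with pandas roughly as follows. This is an illustration under stated assumptions, not the authors' pipeline: the column names `age`, `height`, and `weight` are hypothetical (only `ap_hi`, `ap_lo`, and `cardio` appear in the description), the rows are made up, and an in-memory CSV stands in for the downloaded file so that the example runs on its own.

```python
import io
import os
import tempfile

import pandas as pd

# Stand-in for the downloaded file: a few made-up rows in an in-memory CSV.
# With the real file, pass its path to pd.read_csv instead.
# Column names other than ap_hi, ap_lo and cardio are assumptions.
sample_csv = io.StringIO(
    "age,height,weight,ap_hi,ap_lo,cardio\n"
    "18250,175,70.0,120,80,0\n"
    "20100,160,,140,90,1\n"
    "16500,168,82.5,130,85,1\n"
)

# Load the dataset.
df = pd.read_csv(sample_csv)

# Explore: column types and number of missing values per column.
print(df.dtypes)
print(df.isna().sum())

# Handle missing data, option 1: drop rows that contain any missing value.
cleaned = df.dropna().copy()

# Handle missing data, option 2: fill gaps, here with the column median.
filled = df.fillna({"weight": df["weight"].median()})

# Prepare for analysis: store the binary target as a compact integer type.
cleaned["cardio"] = cleaned["cardio"].astype("int8")

# Select relevant features and the target variable.
feature_cols = ["age", "height", "weight", "ap_hi", "ap_lo"]
X = cleaned[feature_cols]
y = cleaned["cardio"]

# Save the processed dataset to a new file (hypothetical file name).
out_path = os.path.join(tempfile.gettempdir(), "cardio_processed.csv")
cleaned.to_csv(out_path, index=False)
```

Whether to drop or fill missing values depends on the analysis. With 2 million rows, dropping a small number of incomplete rows is often the simpler choice.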