Enhanced Cardiovascular Disease Dataset with Data Augmentation

Citation Author(s):
José L. López-Saynes
Elías N. Escobar-Gómez
Héctor R. Hernández-de-León
Néstor A. Morales-Navarro
Submitted by:
Jose Lopez Saynes
Last updated:
Mon, 03/03/2025 - 15:18
DOI:
10.21227/v8bh-y702

Abstract 

This dataset comprises 2 million synthetic samples generated using the Variational Autoencoder-Generative Adversarial Network (VAE-GAN) technique. The dataset is designed to facilitate cardiovascular disease prediction through various demographic, physical, and health-related attributes. It contains essential physiological and behavioral indicators that contribute to cardiovascular health.

Dataset Description

The dataset consists of the following features:

  • Age (int, days): The age of the individual in days.

  • Height (int, cm): The height of the individual in centimeters.

  • Weight (float, kg): The weight of the individual in kilograms.

  • Body Mass Index (BMI) (float): Calculated as weight in kilograms divided by the square of height in meters (kg/m²), providing an indicator of body fat.

  • Gender (categorical code): Encoded as 1 for female and 2 for male.

  • Systolic Blood Pressure (ap_hi) (int): The maximum arterial pressure during heartbeats.

  • Diastolic Blood Pressure (ap_lo) (int): The minimum arterial pressure between heartbeats.

  • Cholesterol (categorical): 1 for normal, 2 for above normal, and 3 for well above normal levels.

  • Glucose (categorical): 1 for normal, 2 for above normal, and 3 for well above normal levels.

  • Smoking (binary): 1 if the individual smokes, 0 otherwise.

  • Alcohol Intake (binary): 1 if the individual consumes alcohol, 0 otherwise.

  • Physical Activity (binary): 1 if the individual engages in regular physical activity, 0 otherwise.

Target Variable

  • Cardiovascular Disease (cardio) (binary): The presence (1) or absence (0) of cardiovascular disease.

This dataset provides a comprehensive set of features that can be used for machine learning models in cardiovascular disease prediction, enabling research and analysis on health-related risk factors and prevention strategies.
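Two of the features above are easier to work with after a small conversion: age is stored in days, and BMI depends on height, which is stored in centimeters. The sketch below shows both conversions using the standard BMI definition. The function names `bmi` and `age_in_years` are hypothetical helpers introduced here for illustration, and the 365.25 days-per-year factor is an assumption, since the page does not say how the authors convert age.

```python
def bmi(weight_kg, height_cm):
    """Body mass index: weight in kg divided by squared height in m."""
    height_m = height_cm / 100.0  # the dataset stores height in centimeters
    return weight_kg / (height_m ** 2)


def age_in_years(age_days):
    """Convert an age in days to years (assumes 365.25 days per year)."""
    return age_days / 365.25


# Illustrative values only; these are not rows from the dataset.
print(round(bmi(70.0, 170), 2))   # 24.22
print(age_in_years(21915))        # 60.0
```

Because both helpers use only arithmetic operators, they should also accept pandas Series, for example `bmi(df["weight"], df["height"])`, provided the file uses those column names.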

Instructions: 

Instructions for Using the Dataset

  1. Download the Dataset

    • Download the dataset file from IEEE DataPort.
  2. Install Required Libraries

    • Ensure that you have the necessary libraries installed, such as pandas and numpy.
  3. Load the Dataset in Python

    • Once the file is downloaded, load it into your Python environment using an appropriate tool (e.g., pandas for .csv files).
  4. Explore the Dataset

    • Review the dataset to understand the columns and the types of data it contains.
  5. Handle Missing Data

    • If there are missing values in the dataset, you can choose to remove rows with missing data or fill them with appropriate values.
  6. Select Relevant Features

    • Select the columns or features that are important for your analysis or modeling.
  7. Prepare for Analysis

    • Prepare the data, ensuring that the variables are in the correct format for analysis or modeling.
  8. Save the Processed Dataset

    • After making modifications or cleaning the data, save the processed dataset into a new file for future use.
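Steps 3 through 8 can be sketched with pandas as follows. This is a minimal illustration, not the authors' pipeline. The in-memory CSV stands in for the downloaded file, whose name and header row are not given on this page. The column names `ap_hi`, `ap_lo`, and `cardio` come from the feature list, while `age`, `height`, and `weight` are assumptions. The `prepare` function is a hypothetical helper, and the rows are made-up values.

```python
import io

import pandas as pd

# Stand-in for the downloaded file (assumed layout, made-up values).
SAMPLE_CSV = """age,height,weight,ap_hi,ap_lo,cardio
18000,170,70.0,120,80,0
20000,160,82.5,145,95,1
19000,165,,130,85,1
"""


def prepare(source, feature_cols, target_col="cardio"):
    """Load a CSV, drop incomplete rows, and keep the selected columns."""
    df = pd.read_csv(source)              # step 3: load the dataset
    print(df.dtypes)                      # step 4: inspect column types
    print(df.isna().sum())                # step 5: count missing values
    wanted = feature_cols + [target_col]
    df = df.dropna(subset=wanted)         # step 5: drop incomplete rows
    df = df[wanted].copy()                # step 6: select features
    df[target_col] = df[target_col].astype(int)  # step 7: fix the dtype
    return df


features = ["age", "height", "weight", "ap_hi", "ap_lo"]
clean = prepare(io.StringIO(SAMPLE_CSV), features)

# Step 8: save. A buffer is used here; pass a file path to write to disk.
buffer = io.StringIO()
clean.to_csv(buffer, index=False)
```

Dropping rows is only one option for step 5. Filling missing values with `DataFrame.fillna` is the alternative mentioned above, and the better choice depends on how many rows are affected.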

Documentation

Attachment: readme.txt (1.65 KB)