Stroke Prediction Dataset

Citation Author(s):
Ahmad Hassan
Submitted by:
Ahmad Hassan
Last updated:
Tue, 11/21/2023 - 16:19
DOI:
10.21227/mxfb-sc71

Abstract 

This dataset is used to predict whether a patient is likely to have a stroke, based on input parameters such as gender, age, various diseases, and smoking status. Each row in the data provides relevant information about one patient.

Attribute Information

1) id: unique identifier
2) gender: "Male", "Female" or "Other"
3) age: age of the patient
4) hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
5) heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
6) ever_married: "No" or "Yes"
7) work_type: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
8) Residence_type: "Rural" or "Urban"
9) avg_glucose_level: average glucose level in blood
10) bmi: body mass index
11) smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown"*
12) stroke: 1 if the patient had a stroke or 0 if not
*Note: "Unknown" in smoking_status means that the information is unavailable for this patient
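
As a quick sanity check on the attribute list above, the documented categories can be turned into a small validation routine. This is a minimal sketch, not part of the original processing: the `validate` helper and the sample rows are hypothetical, and "Govt_job" is assumed to be the intended spelling of that work_type category.

```python
import pandas as pd

# Allowed values for the categorical attributes, as documented above.
# Assumption: "Govt_job" is the intended spelling of the work_type category.
ALLOWED = {
    "gender": {"Male", "Female", "Other"},
    "ever_married": {"No", "Yes"},
    "work_type": {"children", "Govt_job", "Never_worked", "Private", "Self-employed"},
    "Residence_type": {"Rural", "Urban"},
    "smoking_status": {"formerly smoked", "never smoked", "smokes", "Unknown"},
}
BINARY = ["hypertension", "heart_disease", "stroke"]


def validate(df):
    """Return a list of human-readable problems found in df (hypothetical helper)."""
    problems = []
    for col, allowed in ALLOWED.items():
        bad = set(df[col].dropna().unique()) - allowed
        if bad:
            problems.append(f"{col}: unexpected values {sorted(bad)}")
    for col in BINARY:
        if not df[col].isin([0, 1]).all():
            problems.append(f"{col}: expected only 0/1")
    return problems


# Made-up rows for illustration only; they are not taken from the dataset.
sample = pd.DataFrame({
    "id": [1, 2],
    "gender": ["Male", "Female"],
    "age": [54.0, 41.0],
    "hypertension": [0, 1],
    "heart_disease": [0, 0],
    "ever_married": ["Yes", "No"],
    "work_type": ["Private", "Self-employed"],
    "Residence_type": ["Urban", "Rural"],
    "avg_glucose_level": [95.2, 110.8],
    "bmi": [27.3, None],
    "smoking_status": ["never smoked", "sometimes"],  # second value is invalid on purpose
    "stroke": [0, 1],
})

problems = validate(sample)
print(problems)
```

Missing values (such as the empty bmi in the second row) are skipped by the category check, since they are handled separately during preprocessing.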

Acknowledgements

(Confidential Source) - Use only for educational purposes
If you use this dataset in your research, please credit the author.

Instructions: 

Libraries used for dataset processing

  • Numpy
  • Pandas

Libraries used for graphical representation

  • Matplotlib
  • Seaborn

Libraries used for Scaling and Oversampling

  • Sklearn.preprocessing
  • Imblearn

PREPROCESSING

  • Removed the id column: it is only an identifier, adds no insight to the data analysis, and dropping it reduces the dimensionality.
df = df.drop(columns=['id'])

 

  • Checked the count of NULL values in each attribute of the dataset.
print(df.isna().sum())

 

  • Only the bmi attribute had NULL values.
  • The distribution of BMI values looked skewed when plotted, so the missing values were imputed with the median.
  • The records with missing BMI were not dropped: the dataset is highly imbalanced on the target attribute (stroke), and a good portion of the records with missing BMI were stroke-positive.
  • The imbalance arises because only a few records have a positive value for the stroke target attribute.
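
The BMI handling described above can be sketched as follows. This is a minimal illustration under assumptions: the data frame holds made-up values rather than rows from the dataset, and `lost_positives` is a hypothetical name for the check on how many stroke-positive records would be lost by dropping rows.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only; the real data has many more rows.
df = pd.DataFrame({
    "bmi": [22.0, 27.5, np.nan, 31.0, 48.0, np.nan],
    "stroke": [0, 0, 1, 0, 0, 1],
})

# How many stroke-positive records would be lost by dropping rows with missing BMI?
lost_positives = int(df.loc[df["bmi"].isna(), "stroke"].sum())

# The median is less sensitive than the mean to the long tail of a skewed distribution.
bmi_median = df["bmi"].median()
df["bmi"] = df["bmi"].fillna(bmi_median)

print(lost_positives, bmi_median)
```

In this toy example both records with missing BMI are stroke-positive, which is the situation that motivates imputing rather than dropping.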

  • The gender attribute had 3 values: Male, Female and Other. Only 1 record had the value "Other", so it was converted to the majority value to reduce the dimensionality.
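
One way to perform this conversion is shown below. It is a sketch with made-up values, and it assumes the majority value is determined among the remaining categories; the original code may have done this differently.

```python
import pandas as pd

# Made-up values for illustration only.
df = pd.DataFrame({"gender": ["Female", "Male", "Female", "Other", "Female", "Male"]})

# Most frequent category among the values other than "Other".
majority = df.loc[df["gender"] != "Other", "gender"].mode()[0]

# Fold the single "Other" record into the majority category.
df["gender"] = df["gender"].replace("Other", majority)

print(df["gender"].value_counts())
```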

  • Most of the attributes in the dataset were binary, so the numeric binary values were converted into string values for dummy encoding.

    • Dummy encoding is similar to one-hot encoding: the values in the encoded columns are 1/0, and additional attributes/columns are created.
  • Random oversampling was applied to the dataset to balance the skew in the target attribute.

    • This boosts the number of records in the minority class.
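
The encoding and oversampling steps can be sketched as below. Several details are assumptions rather than the original code: the data is made up, `drop_first=True` is one common way to get dummy (rather than full one-hot) encoding, and the oversampling is written in plain pandas to keep the example dependency-light. The Imblearn library listed above provides `RandomOverSampler` for the same purpose.

```python
import pandas as pd

# Made-up values for illustration only.
df = pd.DataFrame({
    "hypertension": [0, 1, 0, 0, 1, 0],
    "ever_married": ["Yes", "No", "Yes", "Yes", "No", "Yes"],
    "age": [34.0, 61.0, 45.0, 29.0, 78.0, 52.0],
    "stroke": [0, 0, 0, 0, 1, 0],
})

# Map the numeric binary flag to strings so it is treated as a category.
df["hypertension"] = df["hypertension"].map({0: "No", 1: "Yes"})

# Dummy encoding: k-1 indicator columns (values 1/0) for k categories.
# drop_first=True is an assumption about how the encoding was done.
encoded = pd.get_dummies(
    df, columns=["hypertension", "ever_married"], drop_first=True, dtype=int
)

# Random oversampling: resample the minority class with replacement
# until both classes have the same number of records.
counts = encoded["stroke"].value_counts()
minority_label = counts.idxmin()
n_extra = int(counts.max() - counts.min())
extra = encoded[encoded["stroke"] == minority_label].sample(
    n=n_extra, replace=True, random_state=0
)
balanced = pd.concat([encoded, extra], ignore_index=True)

print(balanced["stroke"].value_counts())
```

Oversampling duplicates minority records, so when a model is evaluated it is usually safer to oversample only the training split.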

EDA - Exploratory Data Analysis

  • Plotted each attribute (pie charts, histograms) to analyse trends, if any.
  • Plotted the relation of the target attribute to the other attributes to look for any correlation.
  • Plotted a heatmap (correlation plot) of the attributes.
    • The heatmap showed very little correlation between the attribute values.
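
A correlation heatmap of this kind can be produced roughly as follows. This is a sketch on made-up values; `plot_heatmap` is a hypothetical helper, and the plotting options (annotations, colour map) are assumptions rather than the settings used for the original figures.

```python
import pandas as pd

# Made-up values for illustration only.
df = pd.DataFrame({
    "age": [34.0, 61.0, 45.0, 29.0, 78.0, 52.0],
    "avg_glucose_level": [85.1, 140.3, 99.7, 78.4, 210.9, 105.2],
    "bmi": [22.0, 27.5, 29.3, 31.0, 26.4, 24.8],
    "stroke": [0, 0, 0, 0, 1, 0],
})

# Pairwise Pearson correlation between the numeric attributes.
corr = df.corr()


def plot_heatmap(corr_matrix):
    """Draw the correlation matrix as an annotated heatmap (hypothetical helper)."""
    # Imported here so the correlation step above runs without the plotting libraries.
    import matplotlib.pyplot as plt
    import seaborn as sns

    ax = sns.heatmap(corr_matrix, annot=True, vmin=-1, vmax=1, cmap="coolwarm")
    ax.set_title("Correlation between attributes")
    plt.tight_layout()
    plt.show()


print(corr.round(2))
```

Calling `plot_heatmap(corr)` displays the figure; values near 0 in the matrix correspond to the weak correlations noted above.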

Comments

Need dataset

Submitted by Putra Wanda on Fri, 11/24/2023 - 10:11

Did you get the Dataset? If yes kindly share

Submitted by Anuradha Taluja on Thu, 05/02/2024 - 02:33

Can You share the Dataset as DOI mentioned is not working

Submitted by Anuradha Taluja on Thu, 05/02/2024 - 02:33

Documentation

Attachment    Size
README.md     2.82 KB