Diego Hernández Jiménez

Welcome to my personal website! Here I share some of my little projects.

Predicting stroke (I): Trying different classification algorithms

Assignment for “Knowledge technology”, a course in the Master in Methodology for Behavioral Sciences. This post corresponds to the first part of the “series” Predicting stroke, a set of independent works in which I try to solve an applied machine learning problem: accurately classifying whether or not a person has suffered a stroke, based on different predictors.

Description of data

  • stroke2.csv. The data used to be on Kaggle, but I had to download it from a GitHub repository of someone who had analyzed the data set before. Here you can see the data before preprocessing:
  gender age hypertension heart_disease ever_married work_type
1 Male     3 No           No            No           children
2 Male    58 Yes          No            Yes          Private
3 Female   8 No           No            No           Private

  Residence_type avg_glucose_level  bmi smoking_status stroke
1 Rural                      95.12 18   never smoked   No
2 Urban                      87.96 39.2 never smoked   No
3 Urban                     110.89 17.6 never smoked   No

… (43000 x 10)

The variables are self-explanatory, but I had no codebook, so whenever I encountered inconsistencies or “strange” things I had to interpret the data myself. For example, the third observation corresponds to an 8-year-old kid who works in the private sector. Something must be wrong, but is it the age or the work type? It’s more plausible that the work type is wrong, but technically we don’t know which value is wrongly coded.

One important thing to note about the data set is the frequency distribution of the outcome variable: 98% of the sample are people who haven’t suffered a stroke. This severe imbalance can be very troublesome (the accuracy “paradox”: you get 98% overall accuracy but 0% sensitivity, because every instance is classified as a member of the majority class), so it’s necessary to deal with it.
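To see the “paradox” in numbers, here is a minimal MATLAB sketch (the file and column names are the ones shown above, but the snippet is illustrative, not taken from my scripts): a trivial classifier that always predicts “No” already reaches about 98% accuracy with 0% sensitivity.

% Illustrative sketch of the accuracy "paradox" on this data set
data = readtable('stroke2.csv');
tabulate(data.stroke)                        % roughly 98% "No", 2% "Yes"

% Trivial classifier: always predict the majority class ("No")
pred = repmat({'No'}, height(data), 1);

accuracy    = mean(strcmp(pred, data.stroke));               % ~0.98
sensitivity = sum(strcmp(pred, 'Yes') & strcmp(data.stroke, 'Yes')) ...
              / sum(strcmp(data.stroke, 'Yes'));             % 0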

For the first model I opted for the downsampling strategy: if the minority class has $n$ observations, you select a sample of $n$ instances from the majority class. Now you have a new sample of $2n$ examples and no asymmetry in the variable of interest. The problems are that you reduce the total sample size, you lose information, and you change the natural distribution of the variable just for convenience (the prevalence of stroke is not 50%). For those reasons I changed the strategy: in the decision tree and SVM models I used cost-sensitive learning, which involves imposing a cost matrix that weights false negatives as more serious errors. Intuitively, the classifier needs to be much more confident in order to classify someone as not belonging to the stroke category (more details in the model evaluation report).
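For reference, a minimal sketch of the downsampling step described above (variable names are illustrative):

% Downsampling sketch: shrink the majority class to the minority's size
isStroke = strcmp(data.stroke, 'Yes');
minority = data(isStroke, :);                         % all n stroke cases
majRows  = find(~isStroke);
pick     = majRows(randperm(numel(majRows), height(minority)));
balanced = [minority; data(pick, :)];                 % 2n rows, 50/50 outcome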

Classification models

Linear Discriminant Analysis

The final model (really a quadratic discriminant function, not linear) includes only age as a predictor:

qda=fitcdiscr(featQDA,group, ...
    'DiscrimType','quadratic', ...   % quadratic, not linear, boundary
    'CrossVal','on', ...             % 10-fold cross-validation (the default)
    'Cost',[0 1;70 0]);              % false negatives weighted 70x more
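With 'CrossVal','on', fitcdiscr returns a cross-validated (partitioned) model, so the out-of-fold predictions can be inspected directly. A sketch of how sensitivity could be checked (the class ordering in the confusion matrix is an assumption; the same idea applies to the tree and SVM below):

% Out-of-fold predictions from the 10-fold cross-validated model
pred = kfoldPredict(qda);
cm   = confusionmat(group, pred);       % rows = true class, cols = predicted
sens = cm(2,2) / sum(cm(2,:));          % assuming class 2 is the stroke class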

Link to code.

Link to the report.

Decision tree

The final model includes age, bmi (both discretized), hypertension, and heart_disease as predictors:

tree=fitctree(feattree,group, ...
    'CategoricalPredictors','all', ...          % every predictor is discrete
    'SplitCriterion','deviance', ...            % maximum deviance reduction
    'MinLeafSize',200, 'MinParentSize',230, ... % pre-pruning to limit tree size
    'CrossVal','on', ...                        % 10-fold cross-validation
    'Cost',[0 1;70 0]);                         % false negatives weighted 70x more

Link to code.

Link to the report.

Support Vector Machine

The final model includes age, avg_glucose_level, hypertension, smoking_status, and heart_disease as features:

SVM=fitcsvm(featSVM,group, ...
    'CategoricalPredictors',{'hypertension','smoking_status','heart_disease'}, ...
    'BoxConstraint',0.24, ...      % small C: softer margin, more regularization
    'KernelFunction','linear', ...
    'CrossVal','on', ...           % 10-fold cross-validation
    'Cost',[0 1;55 0]);            % false negatives weighted 55x more
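Since all three models were trained with 'CrossVal','on', their out-of-fold error rates can be compared directly. A rough sketch (with these cost matrices, a cost-aware metric such as sensitivity is more informative than raw error):

% Rough comparison of the three cross-validated models
errQDA  = kfoldLoss(qda);
errTree = kfoldLoss(tree);
errSVM  = kfoldLoss(SVM);
fprintf('CV error  QDA: %.3f  tree: %.3f  SVM: %.3f\n', errQDA, errTree, errSVM);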

Link to code.

Link to the report.

Comments

  • This was my first time applying linear discriminant analysis, decision trees, and SVMs, and also my first time using MATLAB. I experimented a lot during this project and I’ve probably made a lot of mistakes. The hyperparameter tuning process was a bit of a mess, and I don’t expect the final hyperparameters to be optimal.

  • The class imbalance problem gave me a lot of headaches. I tried different approaches to overcome it (downsampling and cost-sensitive learning) and finally stuck with one, but I’m not sure that was the best decision.