Titanic: Machine Learning from Disaster
Famous Kaggle project for beginners in machine learning.
Description
Even though nearly a year has passed since I started studying data analysis seriously, and two since I learned Python, I had never tried the famous Kaggle Titanic project. I can still be considered a beginner, so this project suits me.
There are good descriptions of the competition (data sets used, attributes…) on the web, so here I’ll only write briefly about my approach to the project.
Workflow

I tried to follow the typical machine learning workflow, but since my main objective was to practice with the Python library scikit-learn (sklearn), I focused more on the model selection phase. I was aware that doing more feature engineering would have improved my final model, but I accepted that trade-off. In future projects I’ll spend more time on that part of the process.
Most of the model building was data-driven, meaning that choices such as the imputation method and the scaling method were made based on cross-validation scores. I ended up with three models that performed similarly: a logistic regression model, a K-nearest neighbors classifier and a decision tree classifier. I “combined” them using a simple voting classifier to form the final model.
The accuracy obtained on the test set was approximately 0.78.
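The model selection approach described above can be sketched roughly as follows. This is only a minimal illustration, not the code from the notebook: it uses a small synthetic data set instead of the Titanic data, and the parameter grid, the hyperparameters (`max_iter`, `max_depth`), the helper name `build_pipeline` and the choice of hard voting are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (NOT the Titanic data set), with ~5% of values
# blanked out so that the imputation step has something to do.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan


def build_pipeline(model):
    """Hypothetical helper: impute -> scale -> classify."""
    return Pipeline([
        ("impute", SimpleImputer()),
        ("scale", StandardScaler()),
        ("model", model),
    ])


# Preprocessing choices to compare by cross-validation (assumed grid).
param_grid = {
    "impute__strategy": ["mean", "median"],
    "scale": [StandardScaler(), MinMaxScaler()],
}

candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(),
    "tree": DecisionTreeClassifier(max_depth=4, random_state=0),
}

# For each model, keep the preprocessing combination with the best
# 5-fold cross-validated accuracy.
tuned = []
for name, model in candidates.items():
    search = GridSearchCV(build_pipeline(model), param_grid,
                          cv=5, scoring="accuracy")
    search.fit(X, y)
    tuned.append((name, search.best_estimator_))

# Combine the three tuned pipelines by majority (hard) vote.
voter = VotingClassifier(estimators=tuned, voting="hard")
scores = cross_val_score(voter, X, y, cv=5, scoring="accuracy")
print(f"Voting classifier CV accuracy: {scores.mean():.3f}")
```

Putting the imputer and scaler inside the pipeline means they are refit on each training fold, so the cross-validation scores are not influenced by the held-out data.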
The Jupyter notebook with a more detailed description of the steps can be found here.