Kaggle are running a regression machine learning competition with Mercedes-Benz right now, it closes in a week and runs for about 6 weeks overall. I’ve managed to squeeze in 5 days to have a play (I managed about 10 days on the previous Quora competition). My goal this time was to focus on new tools that make it faster to get to ‘pretty good’ ML solutions. Specifically I wanted to play with:
- TPOT “auto scikit-learn” (but not the auto-sklearn package which is related)
- The YellowBrick sklearn visualiser
Most of the 5 days were spent either learning the above tools or making some suggestions for YellowBrick, I didn’t get as far as creative feature engineering.
Currently I’m in the top 50th percentile Now the competition has finished I’m at rank 1497 (top 37th percentile) on the leaderboard using raw features, some dimensionality reduction and various estimators, with 5 days of effort.
TPOT is rather interesting – it uses a genetic algorithm approach to evolve the hyperparameters of one or more (Stacked) estimators. One interesting outcome is that TPOT was presenting good models that I’d never have used – e.g. an AdaBoostRegressor & LassoLars or GradientBoostingRegressor & ElasticNet.
TPOT works with all sklearn-compatible classifiers including XGBoost (examples) but recently there’s been a bug with n_jobs and multiple processes. Due to this the current version had XGBoost disabled, it looks now like that bug has been fixed. As a result I didn’t get to use XGBoost inside TPOT, I did play with it separately but the stacked estimators from TPOT were superior. Getting up and running with TPOT took all of 30 minutes, after that I’d leave it to run overnight on my laptop. It definitely wants lots of CPU time. It is worth noting that auto-sklearn has a similar n_jobs bug and the issue is known in sklearn.
It does occur to me that almost all of the models developed by TPOT are subsequently discarded (you can get a list of configurations and scores). There’s almost certainly value to be had in building averaged models of combinations of these, I didn’t get to experiment with this.
Having developed several different stacks of estimators my final combination involved averaging these predictions with the trustable-model provided by another Kaggler. The mean of these three pushed me up to 0.55508. My only feature engineering involved various FeatureUnions with the FunctionTransformer based on dimensionality reduction.
YellowBrick was presented at our PyDataLondon 2017 conference (write-up) this year by Rebecca (we also did a book signing). I was able to make some suggestions for improvements on the RegressionPlot and PredictionError along with sharing some notes on visualising tree-based feature importances (along with noting a demo bug in sklearn). Having more visualisation tools can only help, I hope to develop some intuition about model failures from these sorts of diagrams.
Here’s a ResidualPlot with my added inset prediction errors distribution, I think that this should be useful when comparing plots between classifiers to see how they’re failing:
Ian is a Chief Interim Data Scientist via his Mor Consulting. Sign-up for Data Science tutorials in London and to hear about his data science thoughts and jobs. He lives in London, is walked by his high energy Springer Spaniel and is a consumer of fine coffees.