Jan Hendrik Metzen

My personal blog on Python and machine learning

Naive Bees Classifier for the Metis Challenge

This post documents my submission to the Naive Bees classification challenge, in which I finished in second place (username frisbee). The task was to classify whether an image shows a honey bee (Apis) or a bumble bee (Bombus). According to the organizers, being able to identify bee species from images would ultimately allow researchers to collect field data more quickly and effectively. Pollinating bees play critical roles in both ecology and agriculture, and diseases like colony collapse disorder threaten these species. You can learn more about the challenge at http://www.drivendata.org/competitions/8/

Additional Kernels for sklearn's new Gaussian Processes

Starting with version 0.18 (already available in the post-0.17 master branch), scikit-learn will ship a completely revised Gaussian process module that supports, among other things, kernel engineering. While scikit-learn only ships the most common kernels, the gp_extra project contains some more advanced, non-standard kernels that can be used seamlessly with scikit-learn's GaussianProcessRegressor. This post summarizes the current contents of the package and provides some examples.
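As a rough illustration of what kernel engineering means, here is a minimal pure-Python sketch that builds a "locally periodic" kernel as the product of an RBF kernel and a periodic kernel. The helper names (rbf, periodic, add, multiply) are made up for this sketch and are not the API of scikit-learn or gp_extra; the point is only that sums and products of valid kernels are again valid kernels.

```python
import math


def rbf(length_scale):
    """Squared-exponential kernel on scalars (hypothetical helper)."""
    def k(x, y):
        return math.exp(-0.5 * ((x - y) / length_scale) ** 2)
    return k


def periodic(period, length_scale):
    """Periodic (exp-sine-squared) kernel on scalars (hypothetical helper)."""
    def k(x, y):
        s = math.sin(math.pi * abs(x - y) / period)
        return math.exp(-2.0 * (s / length_scale) ** 2)
    return k


def add(k1, k2):
    """Sum of two kernels, which is again a valid kernel."""
    return lambda x, y: k1(x, y) + k2(x, y)


def multiply(k1, k2):
    """Product of two kernels, which is again a valid kernel."""
    return lambda x, y: k1(x, y) * k2(x, y)


# Periodic structure whose correlation decays with distance.
locally_periodic = multiply(rbf(length_scale=2.0),
                            periodic(period=1.0, length_scale=1.0))

# Points exactly one period apart are correlated only through the RBF factor.
print(locally_periodic(0.0, 1.0))
```

The same composition idea is what a kernel-engineering interface exposes at the library level; the scalar inputs and closures here are just the smallest setup that shows it.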

Probability calibration

This post summarizes the new feature for calibrating the predicted probabilities of binary and multi-class classifiers, which was added in the 0.16 release of scikit-learn. It gives several examples that illustrate both the different properties of under-confident and over-confident classifiers and how to calibrate them so that their predicted probabilities become well-calibrated.
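To make "over-confident" concrete, here is a small pure-Python sketch of a reliability curve: predictions are grouped into bins, and for each bin the mean predicted probability is compared with the observed fraction of positives. The function name reliability_curve and the toy data are assumptions made for this illustration; it is not scikit-learn's implementation.

```python
def reliability_curve(y_true, y_prob, n_bins=5):
    """Return (mean predicted probability, fraction of positives) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for label, prob in zip(y_true, y_prob):
        # Probability 1.0 is clamped into the last bin.
        idx = min(int(prob * n_bins), n_bins - 1)
        bins[idx].append((label, prob))

    curve = []
    for members in bins:
        if not members:
            continue
        mean_pred = sum(p for _, p in members) / len(members)
        frac_pos = sum(l for l, _ in members) / len(members)
        curve.append((mean_pred, frac_pos))
    return curve


# Toy data (made up): the classifier says 0.9 but only 6 of 10 are positive,
# and says 0.1 although 4 of 10 are positive, i.e. it is over-confident.
y_prob = [0.9] * 10 + [0.1] * 10
y_true = [1] * 6 + [0] * 4 + [1] * 4 + [0] * 6

for mean_pred, frac_pos in reliability_curve(y_true, y_prob):
    print(f"predicted {mean_pred:.2f}, observed {frac_pos:.2f}")
```

For a well-calibrated classifier the two numbers in each bin would roughly agree; calibration methods aim to map the raw scores so that they do.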

Advice for applying Machine Learning

This post summarizes some recommendations on how to get started with machine learning on a new problem. These include ways of visualizing your data, choosing a machine learning method suitable for the problem at hand, identifying and dealing with over- and underfitting, handling large (read: not very small) datasets, and the pros and cons of different loss functions.