What's special about Kaggle?
All the code is open. This makes it easy to analyze which algorithms win most frequently, and how exactly they differ from what we've implemented in CS 66.
The problems are real-world and carry significant prize money, which makes the winning solutions representative of the state of the art in real-world applications.
Amazingly, winning solutions converge to highly similar architectures, according to Kaggle co-founder Ben Hamner. This lends confidence to the assertion that winning solutions are near the state of the art.
Neural nets and GBMs (Gradient Boosted Machines, the generic name for gradient boosting) have surged in popularity. Random forests are still in use, often as a feature engineering layer for ensemble classification.
Note: percentages sum to >100% because many solutions make use of multiple algorithms.
Which algorithms win?
Kaggle co-founder Ben Hamner asserts that which algorithm wins depends entirely on the dataset, in particular on whether the data is structured or unstructured.
- Structured data
- Structured data includes CSV-type formats with multiple features, each holding categorical and/or continuous values. Gradient boosted decision trees (implemented in XGBoost) are most effective here.
- With structured datasets and decision trees, the majority of a programmer's effort (often as much as 80% of their time, according to anecdotal evidence across hundreds of Kaggle submissions) goes towards exploring the data and engineering features, described below, to make them more predictive.
- Unstructured data
- Unstructured data includes images, sound waves, EEG and fMRI data, and even passages of text. Such data has few discrete features and no predefined structure. Neural networks (of many different varieties) work best for this.
- On unstructured datasets, most of the effort goes into architecting the machine learning model and tuning the hyperparameters; virtually no time is spent on feature engineering.
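To make the gradient boosting idea for structured data concrete, here is a minimal from-scratch sketch. It is not XGBoost and not any particular winning solution: it assumes squared-error loss, depth-1 trees ("stumps"), and a tiny made-up table whose columns (age, first-class flag) and labels are purely illustrative. Each round fits a stump to the current residuals and adds a shrunken copy of it to the ensemble.

```python
from statistics import mean

def fit_stump(rows, residuals):
    """Return the (feature, threshold, left_value, right_value) split with the lowest squared error."""
    best = None
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [r for row, r in zip(rows, residuals) if row[feature] <= threshold]
            right = [r for row, r in zip(rows, residuals) if row[feature] > threshold]
            if not left or not right:
                continue  # a split must put rows on both sides
            left_value, right_value = mean(left), mean(right)
            error = (sum((r - left_value) ** 2 for r in left)
                     + sum((r - right_value) ** 2 for r in right))
            if best is None or error < best[0]:
                best = (error, feature, threshold, left_value, right_value)
    return None if best is None else best[1:]

def predict_stump(stump, row):
    feature, threshold, left_value, right_value = stump
    return left_value if row[feature] <= threshold else right_value

def fit_gbm(rows, targets, n_rounds=50, learning_rate=0.3):
    """Boost stumps on the residuals (the negative gradient of squared loss)."""
    base = mean(targets)
    predictions = [base] * len(rows)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(targets, predictions)]
        stump = fit_stump(rows, residuals)
        if stump is None:  # no valid split exists, nothing left to learn
            break
        stumps.append(stump)
        predictions = [p + learning_rate * predict_stump(stump, row)
                       for p, row in zip(predictions, rows)]
    return base, stumps, learning_rate

def predict_gbm(model, row):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, row) for s in stumps)

def mse(model, rows, targets):
    return mean((y - predict_gbm(model, row)) ** 2 for row, y in zip(rows, targets))

# Made-up toy table: (age, is_first_class) -> survived. Illustrative only.
passengers = [(22, 1), (38, 1), (26, 0), (35, 0), (54, 1), (2, 0), (27, 0), (14, 0)]
survived = [1, 1, 0, 0, 1, 1, 0, 1]

baseline_mse = mean((y - mean(survived)) ** 2 for y in survived)
model = fit_gbm(passengers, survived)
model_mse = mse(model, passengers, survived)
print(f"baseline MSE {baseline_mse:.4f}, boosted MSE {model_mse:.4f}")
```

In practice one would reach for a library such as XGBoost rather than code like this; the sketch only shows why the approach suits tabular data: every split is a simple threshold on one column.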
Feature engineering is the art of exploring the data and using human intuition to combine, remove and change the features of the dataset so that they are maximally predictive before the model is fit.
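As a rough illustration of "combine, remove and change", the sketch below transforms one made-up passenger record. The column names (`siblings_spouses`, `parents_children`, `fare`, `ticket_id`) and the age cutoff of 16 are assumptions chosen for the example, not taken from any specific Kaggle solution.

```python
import math

def engineer_features(passenger):
    """Turn a raw record into a smaller, hopefully more predictive one."""
    return {
        # Combine: two sparse counts become one family-size feature.
        "family_size": passenger["siblings_spouses"] + passenger["parents_children"] + 1,
        # Change: compress a heavily skewed value with log(1 + x).
        "log_fare": math.log1p(passenger["fare"]),
        # Change: bucket a continuous value using human intuition.
        "is_child": int(passenger["age"] < 16),
        # Remove: ticket_id is deliberately left out as an assumed noise column.
    }

# Hypothetical raw record, for illustration only.
raw_passenger = {
    "age": 8,
    "siblings_spouses": 1,
    "parents_children": 2,
    "fare": 31.5,
    "ticket_id": "A/5 21171",
}

engineered = engineer_features(raw_passenger)
print(engineered)
```

Five raw columns shrink to three engineered ones, which is the dimensionality reduction the next paragraph describes.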
The goal: reduce the dimensionality of the data while creating the most information-rich features possible. The methods below are among the most widely used for feature engineering, and should help demystify the practice.
Eliminate useless and noisy features
Reducing the dimensionality of the data is always a plus, as long as you aren't throwing out unique and informative data.
- Remove useless and noisy features to reduce dimensionality.
- Extra features can decrease performance because they may “confuse” the model by giving it irrelevant data that prevents it from learning the actual relationships.
- Random forests do some amount of implicit feature selection (each split picks the feature with the maximum information gain), but many other models do not.
- Improve model runtime
- Even consider removing helpful features if the decrease in model quality is tolerable, in order to gain dramatic improvements in runtime. In the Titanic example, feature reduction lowered the ROC score by 0.13% but reduced runtime by 35%.
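One simple way to find candidates for removal is to rank features by information gain, the same criterion mentioned above for random forests, and drop those below a cutoff. The sketch below assumes categorical features, a made-up dataset, and an arbitrary threshold `min_gain`; it covers only the ranking step, not the runtime measurement.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Entropy of the labels minus the weighted entropy after splitting on one feature."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def select_features(rows, labels, min_gain=0.05):
    """Keep only features whose information gain reaches the (assumed) cutoff."""
    return [f for f in rows[0] if information_gain(rows, labels, f) >= min_gain]

# Made-up data: "sex" carries signal, "deck_side" is pure noise by construction.
rows = [
    {"sex": "f", "deck_side": "port"},
    {"sex": "f", "deck_side": "port"},
    {"sex": "f", "deck_side": "starboard"},
    {"sex": "f", "deck_side": "port"},
    {"sex": "m", "deck_side": "port"},
    {"sex": "m", "deck_side": "starboard"},
    {"sex": "m", "deck_side": "starboard"},
    {"sex": "m", "deck_side": "starboard"},
]
labels = [1, 1, 1, 0, 0, 0, 0, 1]

kept = select_features(rows, labels)
print(kept)
```

One caveat with this criterion: a feature with a unique value per row (an ID column, say) scores a high information gain while being useless for prediction, so a ranking like this is a starting point for human judgment rather than a replacement for it.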