What's special about Kaggle?


All the code is open. This makes it easy to analyze which algorithms win most frequently, and how exactly they differ from what we've implemented in CS 66.

The problems are real-world and carry significant prize money, making them representative of the state of the art in real-world applications.

Amazingly, winning solutions converge on highly similar architectures, according to Kaggle founder Ben Hamner. This lends confidence to the assertion that winning solutions are near the state of the art.

Neural nets and GBMs (Gradient Boosting Machines, the generic term for gradient boosting models) have surged in popularity. Random forests are still in use, often as a feature engineering layer for ensemble classification.
Note: percentages sum to >100% because many solutions make use of multiple algorithms.

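One common pattern behind that "feature engineering layer" idea is to feed a tree ensemble's leaf indices into a second model. Below is a minimal sketch of that pattern using scikit-learn; the synthetic dataset and the specific model choices are assumptions for illustration, not taken from any particular winning solution.

```python
# Sketch: use a random forest as a feature engineering layer.
# Each tree's leaf index becomes a categorical feature that a downstream
# linear model consumes. Dataset and hyperparameters are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit the forest, then map every sample to the leaf it lands in per tree.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
leaves_train = forest.apply(X_train)   # shape: (n_samples, n_trees)
leaves_test = forest.apply(X_test)

# One-hot encode the leaf indices and train a simple linear classifier on them.
encoder = OneHotEncoder(handle_unknown="ignore")
clf = LogisticRegression(max_iter=1000)
clf.fit(encoder.fit_transform(leaves_train), y_train)
print("accuracy:", clf.score(encoder.transform(leaves_test), y_test))
```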

Which algorithms win?

Kaggle founder Ben Hamner asserts that which algorithms win depends entirely on the dataset, in particular whether the data is structured or unstructured.

Feature Engineering

Feature engineering is the art of exploring the data and using human intuition to combine, remove, and transform the features of the dataset so that they are maximally predictive before fitting the model.
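
As a concrete illustration, the sketch below shows that combine/transform/remove cycle on a tiny table with pandas. The columns and derived features are hypothetical, made up purely for the example.

```python
# Sketch: hand-crafted feature engineering with pandas.
# Column names (price, area, signup_date, noisy_id) are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "price": [250000, 320000, 180000],
    "area": [1200, 1600, 900],
    "signup_date": pd.to_datetime(["2020-01-05", "2020-03-17", "2020-07-30"]),
    "noisy_id": [10483, 99122, 3021],
})

# Combine two features into a more informative ratio.
df["price_per_sqft"] = df["price"] / df["area"]

# Transform a raw feature into something a model can use directly.
df["signup_month"] = df["signup_date"].dt.month

# Remove features that carry no predictive signal.
df = df.drop(columns=["signup_date", "noisy_id"])
print(df.head())
```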

The goal: reduce the dimensionality of the data while creating the most information-rich features possible. The following are the most widely used feature engineering methods, which should help demystify the practice.

Eliminate useless and noisy features

Reducing the dimensionality of the data is always a plus, as long as you aren't throwing out unique and informative data.
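
A minimal sketch of what this can look like in practice: two simple, common filters from scikit-learn (a variance threshold and a model-importance threshold) applied to synthetic data. The specific thresholds are assumptions; the right cutoff always depends on the dataset.

```python
# Sketch: drop uninformative features in two passes --
# remove (near-)constant columns, then remove features a tree ensemble
# ranks as unimportant. Thresholds and data are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel, VarianceThreshold

# 25 features, only 5 of which are actually informative.
X, y = make_classification(n_samples=1000, n_features=25, n_informative=5, random_state=0)

# Drop features with zero variance -- they can't discriminate anything.
X_reduced = VarianceThreshold(threshold=0.0).fit_transform(X)

# Drop features a random forest considers below-average in importance.
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=200, random_state=0), threshold="mean"
)
X_selected = selector.fit_transform(X_reduced, y)
print(X.shape, "->", X_selected.shape)
```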