Feature Engineering + Scaling

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/3a371bb0-968f-4872-9af4-d92348560ff1/feature-scaling-andrewng.mp4

Feature Scaling

Feature scaling transforms your data so that all features sit on a comparable scale. There are two common scalings:

Standardizing
Normalizing

Standardizing

Standardizing takes each value in a column, subtracts the column mean, and then divides by the column's standard deviation. In Python, suppose you have a column in df called height. You could create a standardized height as:

df["height_standard"] = (df["height"] - df["height"].mean()) / df["height"].std()

If you apply this to every variable in the data set, they all end up with the same mean (0) and the same standard deviation (1), although their ranges still differ.

In the new standardized column, each value is expressed relative to the column mean: a standardized value can be interpreted as the number of standard deviations the original height was from the mean. This type of feature scaling is by far the most common of all techniques (for the reasons discussed here, but also likely because of precedent).
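To make the arithmetic concrete, here is a minimal sketch that standardizes a small list of heights using only the standard library. The height values are hypothetical, chosen only for illustration. `statistics.stdev` computes the sample standard deviation (dividing by n - 1), which matches the default behaviour of pandas `.std()`.

```python
import statistics

# Hypothetical heights in centimetres, chosen only for illustration.
heights = [150.0, 160.0, 170.0, 180.0, 190.0]

mean = statistics.mean(heights)
std = statistics.stdev(heights)  # sample standard deviation (n - 1)

# Each standardized value is the number of standard deviations
# the original height lies above or below the mean.
height_standard = [(h - mean) / std for h in heights]

print([round(z, 3) for z in height_standard])
print("mean:", round(statistics.mean(height_standard), 6))
print("std: ", round(statistics.stdev(height_standard), 6))
```

The printed mean and standard deviation of the standardized values should come out as 0 and 1 (up to floating-point rounding), which is a quick sanity check worth running on real data too.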

Normalizing

A second popular type of feature scaling is known as normalizing. With normalizing, data are rescaled to lie between 0 and 1. Using the same example as above, we could perform normalizing in Python in the following way:

df["height_normal"] = (df["height"] - df["height"].min()) / (
    df["height"].max() - df["height"].min()
)
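The same formula can be sketched on a plain list to show what happens at the endpoints. The function name `min_max_scale` and the sample heights are hypothetical. One detail the formula leaves open is a column whose values are all identical: the range is then zero and the division is undefined. Returning zeros in that case is just one possible convention, assumed here for illustration.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: the range is zero, so the formula is undefined.
        # Returning zeros is an assumed convention, not the only option.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical heights in centimetres.
heights = [150.0, 160.0, 175.0, 190.0]
height_normal = min_max_scale(heights)
print(height_normal)  # [0.0, 0.25, 0.625, 1.0]
```

The smallest height maps to 0 and the largest to 1, and every other value lands in between in proportion to its distance from the minimum.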

Other Methods

The appropriate standardization method depends on your data set and the conventions of your particular field of study. Examples of papers that discuss standardization include Gower (1985), Johnson and Wichern (1992), Everitt (1993), and van Tongeren (1995). In addition, Milligan and Cooper (1988) present an in-depth examination of standardization of variables when using Euclidean Distance as the dissimilarity metric.

Remember, if you choose to use the Steinhaus Coefficient of Similarity (recommended for count data, such as the number of trees of different species at sampled locations), this measure is self-normalizing and data should not be standardized.

0-1 scaling: each variable in the data set is recalculated as (V - min V)/(max V - min V), where V represents the value of the variable in the original data set. This method allows variables to have differing means and standard deviations but equal ranges. In this case, there is at least one observed value at each of the 0 and 1 endpoints.
Dividing each value by the range: each variable is recalculated as V/(max V - min V). In this case, the means and variances of the variables are still different, but every variable now has a range of exactly 1. Unlike 0-1 scaling, the values are not shifted, so they do not necessarily start at 0.
Z-score scaling: variables recalculated as (V - mean of V)/s, where "s" is the standard deviation. As a result, all variables in the data set have equal means (0) and standard deviations (1) but different ranges.
Dividing each value by the standard deviation: each variable is recalculated as V/s. This method produces a set of transformed variables with variances of 1, but different means and ranges.
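The four methods above can be compared side by side with a short sketch. The function names and the two sample variables (`height_cm`, `weight_kg`) are hypothetical and exist only for illustration. The sample standard deviation from `statistics.stdev` is assumed for "s".

```python
import statistics


def zero_one(v):
    """(V - min V) / (max V - min V)"""
    lo, hi = min(v), max(v)
    return [(x - lo) / (hi - lo) for x in v]


def divide_by_range(v):
    """V / (max V - min V)"""
    r = max(v) - min(v)
    return [x / r for x in v]


def z_score(v):
    """(V - mean of V) / s"""
    m, s = statistics.mean(v), statistics.stdev(v)
    return [(x - m) / s for x in v]


def divide_by_std(v):
    """V / s"""
    s = statistics.stdev(v)
    return [x / s for x in v]


def summarize(v):
    """Return (mean, sample standard deviation, range) of a list."""
    return statistics.mean(v), statistics.stdev(v), max(v) - min(v)


# Hypothetical variables measured on very different scales.
variables = {
    "height_cm": [150.0, 160.0, 175.0, 190.0],
    "weight_kg": [50.0, 65.0, 70.0, 95.0],
}

for method in (zero_one, divide_by_range, z_score, divide_by_std):
    for name, values in variables.items():
        mean, std, rng = summarize(method(values))
        print(f"{method.__name__:16s} {name:10s} "
              f"mean={mean:7.3f} std={std:6.3f} range={rng:6.3f}")
```

Printing the mean, standard deviation, and range after each transformation makes it easy to check which of those quantities a given method equalizes across variables and which it leaves different.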