https://s3-us-west-2.amazonaws.com/secure.notion-static.com/3a371bb0-968f-4872-9af4-d92348560ff1/feature-scaling-andrewng.mp4

Feature Scaling

Feature scaling is a way of transforming your data into a common range of values. There are two common scalings:

  1. Standardizing
  2. Normalizing

Standardizing

Standardizing takes each value in a column, subtracts the mean of the column, and then divides by the standard deviation of the column. In Python, let's say you have a column in df called height. You could create a standardized height as:

df["height_standard"] = (df["height"] - df["height"].mean()) / df["height"].std()

After standardizing, every variable in the data set has the same mean (= 0) and the same standard deviation (= 1), but the ranges still differ from one variable to another.

In the new standardized column, each value is a comparison to the mean of the column: a standardized value can be interpreted as the number of standard deviations the original height was from the mean. This type of feature scaling is by far the most common of all techniques (for the reasons discussed here, but also likely because of precedent).
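As a quick check of the claims above, here is a minimal sketch that standardizes a small column and confirms the result has mean 0 and standard deviation 1. The data values are made up for illustration; only the column name height comes from the text. Note that pandas' std() uses the sample standard deviation (ddof=1) by default.

```python
import pandas as pd

# Hypothetical sample data; only the column name "height" comes from the notes.
df = pd.DataFrame({"height": [150.0, 160.0, 170.0, 180.0, 190.0]})

# Series.std() defaults to the sample standard deviation (ddof=1).
df["height_standard"] = (df["height"] - df["height"].mean()) / df["height"].std()

# After standardizing, the column should have mean ~0 and standard deviation ~1.
mean_after = df["height_standard"].mean()
std_after = df["height_standard"].std()

print(df)
print("mean:", mean_after, "std:", std_after)
```

Each value in height_standard reads as "how many standard deviations this height is from the mean", so the middle row (equal to the mean) becomes 0.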

Normalizing

A second type of feature scaling that is very popular is known as normalizing. With normalizing, data are rescaled to lie between 0 and 1, with the smallest value in the column mapped to 0 and the largest mapped to 1. Using the same example as above, we could perform normalizing in Python in the following way:

df["height_normal"] = (df["height"] - df["height"].min()) / \
                      (df["height"].max() - df["height"].min())
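The same calculation can be wrapped in a small helper, sketched below. The function name min_max_scale and the sample data are hypothetical, and returning zeros for a constant column (where max equals min) is an assumed convention rather than something the notes specify.

```python
import pandas as pd

def min_max_scale(column: pd.Series) -> pd.Series:
    """Rescale a numeric column to the range [0, 1] (hypothetical helper)."""
    value_range = column.max() - column.min()
    if value_range == 0:
        # Constant column: avoid dividing by zero. Returning zeros is an
        # assumed convention, not part of the original notes.
        return pd.Series(0.0, index=column.index)
    return (column - column.min()) / value_range

# Hypothetical sample data; only the column name "height" comes from the notes.
df = pd.DataFrame({"height": [150.0, 160.0, 170.0, 180.0, 190.0]})
df["height_normal"] = min_max_scale(df["height"])

print(df)
```

The smallest height maps to 0, the largest to 1, and everything else falls proportionally in between.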

Other Methods

The appropriate standardization method depends on your data set and the conventions of your particular field of study. Examples of papers that discuss standardization include Gower (1985), Johnson and Wichern (1992), Everitt (1993), and van Tongeren (1995). In addition, Milligan and Cooper (1988) present an in-depth examination of standardization of variables when using Euclidean distance as the dissimilarity metric.
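To see why standardization matters for Euclidean distance, the sketch below measures how much of the squared distance between two observations comes from a single feature, before and after standardizing. The data (height in cm, income in dollars) and the helper names are hypothetical, chosen only so the two features sit on very different scales.

```python
import statistics

# Hypothetical observations: (height in cm, income in dollars).
people = [(150.0, 30000.0), (190.0, 30500.0), (152.0, 32000.0)]

def standardize_columns(rows):
    """Standardize each column to mean 0 and sample standard deviation 1."""
    columns = list(zip(*rows))
    means = [statistics.mean(c) for c in columns]
    stds = [statistics.stdev(c) for c in columns]  # sample std (n - 1)
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds))
            for row in rows]

def income_share(a, b):
    """Fraction of the squared Euclidean distance due to the income feature."""
    squared_diffs = [(x - y) ** 2 for x, y in zip(a, b)]
    return squared_diffs[1] / sum(squared_diffs)

scaled = standardize_columns(people)

raw_share = income_share(people[0], people[1])
scaled_share = income_share(scaled[0], scaled[1])

print("income share of squared distance, raw:   ", raw_share)
print("income share of squared distance, scaled:", scaled_share)
```

On the raw data, the distance between the first two people is driven almost entirely by income, simply because dollars are numerically larger than centimeters. After standardizing, their 40 cm height difference accounts for most of the distance.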

Remember, if you choose to use the Steinhaus Coefficient of Similarity (recommended for count data, such as the number of trees of different species at sampled locations), this measure is self-normalizing and data should not be standardized.
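For illustration, here is a minimal sketch of the Steinhaus coefficient in the form it is commonly given: twice the sum of the species-wise minimum counts, divided by the total count across both sites. The function name and the counts are hypothetical. Because the denominator already accounts for the total abundance at each site, the result falls between 0 and 1 without any prior scaling.

```python
def steinhaus_similarity(site_a, site_b):
    """Steinhaus similarity for two vectors of non-negative counts.

    Assumes the commonly given form 2 * sum(min(a_i, b_i)) / (sum(a) + sum(b)).
    """
    if len(site_a) != len(site_b):
        raise ValueError("both sites must list the same species")
    total = sum(site_a) + sum(site_b)
    if total == 0:
        raise ValueError("similarity is undefined when both sites are empty")
    shared = sum(min(a, b) for a, b in zip(site_a, site_b))
    return 2 * shared / total

# Hypothetical tree counts for three species at two sampled locations.
site_a = [10, 0, 5]
site_b = [6, 4, 5]

similarity = steinhaus_similarity(site_a, site_b)
print(similarity)
```

Two sites with identical counts score 1, and two sites with no species in common score 0.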