2.1 - What is Statistical Learning?

2.2 - Assessing Model Accuracy

2.4 - Exercises

  1. Inflexible or flexible method?
    1. sample size n is large, number of predictors p is small - flexible method will be better because we have a lot of training data.
    2. p is extremely large, n is small - inflexible method is better because we do not want high variance in the model from the small number of data points
    3. relationship is highly non-linear - flexible methods will reduce bias and give an accurate model, but we must also have a lot of training samples to reduce variance.
    4. variance of error terms is extremely high - inflexible methods are better because they have low variance and will not chase fluctuations in the training data that are caused by the noise.
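The reasoning in 1(a)-1(d) can be checked numerically. A minimal sketch (assuming NumPy; the sine truth, noise level, and polynomial degrees are illustrative choices, not from the exercise): fit a rigid degree-1 and a flexible degree-9 polynomial on many small training sets and compare how much their predictions at one point vary.

```python
# Sketch: variance of an inflexible vs. a flexible fit across training sets.
# The true function, noise sd, and degrees below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
x_test = 0.5  # a fixed query point

def predictions(degree, n_train=20, n_sims=200):
    """Fit a polynomial of the given degree on many training sets drawn
    from y = sin(2*pi*x) + noise; return the predictions at x_test."""
    preds = []
    for _ in range(n_sims):
        x = rng.uniform(0, 1, n_train)
        y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n_train)
        coefs = np.polyfit(x, y, degree)
        preds.append(np.polyval(coefs, x_test))
    return np.array(preds)

linear = predictions(degree=1)  # inflexible
wiggly = predictions(degree=9)  # flexible

print(f"variance, degree 1: {linear.var():.4f}")
print(f"variance, degree 9: {wiggly.var():.4f}")
```

With only n = 20 training points, the flexible fit's predictions should swing far more across training sets, which is why small-n (or high-noise) settings favor inflexible methods.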
  2. Classification or Regression? Inference or prediction? n and p?
    1. collect data on the top 500 firms - profit, number of employees, industry, and CEO salary. Trying to understand which factors affect CEO salary. Regression; inference; n = 500, p = 3.
    2. new product either success or failure. collect data on 20 similar products - success or fail, price, marketing budget, competition price, and ten other variables. Classification; prediction; n = 20, p = 13.
    3. predicting the % change in the US dollar in relation to weekly changes in world stock markets. Collect weekly data for all of 2012 - % change in dollar, % change in US market, % change in British market, % change in German market. Regression; prediction; n = 52, p = 3.
  3. bias-variance decomposition
    1. draw out curves of Bias^2, Variance, Irreducible error, Training error, Testing error

      (figure: hand-drawn curves of Bias^2, Variance, Irreducible error, Training error, and Test error vs. model flexibility)

    2. explain

      1. Bias^2 decreases as model flexibility increases, because a more flexible model makes fewer simplifying assumptions and can approximate the real-world relationship more closely
      2. Variance increases as model flexibility increases because a more flexible model can produce different results on different training sets
      3. Irreducible error (Var(ε); the analogue of the Bayes error rate in classification) is just a horizontal line that marks the lower bound on the expected test error
      4. Test error = Bias^2 + Variance + Irreducible error
      5. Training error approaches 0 as the model gets more flexible and overfits to the training data.
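The decomposition in point 4 can be verified by Monte Carlo simulation. A sketch (NumPy assumed; the sine truth, σ = 0.5, and the linear fit are illustrative assumptions): estimate bias², variance, and test MSE of a fit at one point and compare the two sides of the identity.

```python
# Sketch: check E[(y0 - f_hat)^2] ≈ Bias^2 + Variance + sigma^2 at a point,
# using a deliberately inflexible linear fit to a non-linear truth.
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.5                           # sd of the irreducible noise
f = lambda x: np.sin(2 * np.pi * x)   # true regression function (assumed)
x0 = 0.9                              # fixed test point

preds, sq_errs = [], []
for _ in range(5000):
    x = rng.uniform(0, 1, 30)
    y = f(x) + rng.normal(0, sigma, 30)
    a, b = np.polyfit(x, y, 1)            # linear (inflexible) fit
    f_hat = a * x0 + b
    y0 = f(x0) + rng.normal(0, sigma)     # fresh test response at x0
    preds.append(f_hat)
    sq_errs.append((y0 - f_hat) ** 2)

preds = np.array(preds)
bias_sq = (preds.mean() - f(x0)) ** 2   # (E[f_hat] - f(x0))^2
variance = preds.var()                  # Var(f_hat)
test_mse = np.mean(sq_errs)             # E[(y0 - f_hat)^2]

print(f"Bias^2 + Var + sigma^2 = {bias_sq + variance + sigma**2:.3f}")
print(f"Test MSE               = {test_mse:.3f}")
```

The two printed numbers should agree up to Monte Carlo noise, which is the content of the decomposition: the test error can never drop below sigma^2 no matter how the bias/variance balance shifts.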
  4. real-life examples
    1. classification - cancer or not based on blood test results, product is a success or failure based on other similar products, type of bread based on picture; email spam or not spam
    2. regression - home prices based on location, etc.; how much weight somebody can lift based on height and weight; number of people at a concert based on prior concert ticket sales and location
    3. cluster analysis - grouping movies into genres based on sales data; identifying fashion trends based on pricing data
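The kind of unsupervised grouping these cluster-analysis examples describe can be sketched with a minimal k-means (Lloyd's algorithm) in NumPy. The two-blob toy data, blob centers, and all names below are illustrative assumptions, not from the text.

```python
# Sketch: minimal k-means clustering on synthetic 2-D data (two blobs).
import numpy as np

rng = np.random.default_rng(2)
# two well-separated blobs standing in for, e.g., two groups of products
blob_a = rng.normal([0, 0], 0.5, (50, 2))
blob_b = rng.normal([5, 5], 0.5, (50, 2))
X = np.vstack([blob_a, blob_b])

def kmeans(X, k, iters=20):
    """Lloyd's algorithm: alternate nearest-center assignment and
    center recomputation; init centers from random data points."""
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest center
        d = np.linalg.norm(X[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        # move each center to the mean of its assigned points
        # (keep the old center if a cluster ends up empty)
        centers = np.array([X[labels == j].mean(axis=0)
                            if np.any(labels == j) else centers[j]
                            for j in range(k)])
    return labels, centers

labels, centers = kmeans(X, k=2)
print(np.round(centers, 2))
```

No response variable is used anywhere, which is what separates these examples from the classification and regression ones above: the algorithm discovers the two groups from the feature geometry alone.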