What is the general relationship between the quantitative response Y and p different predictors X1, X2, ..., Xp? What do the different parts mean?
What are the two main problems statistical learning solves?
Parametric vs non-parametric models?
What is the tradeoff between prediction accuracy and model interpretability?
2.2 - Assessing Model Accuracy
How do we measure quality of fit for regression problems?
What is the Bias-Variance tradeoff?
What is the goal when trying to find a model in terms of bias and variance?
How do we measure model accuracy in a classification problem?
What is a Bayes Classifier and why do we care about it?
Why can't we use the Bayes Classifier in all settings?
How does the value of K in KNN affect flexibility?
2.4 - Exercises
Inflexible or flexible method?
sample size n is large, number of predictors p is small - a flexible method will be better because the large amount of training data keeps its variance low, so it can fit the data closely without overfitting.
p is extremely large, n is small - an inflexible method is better because a flexible method would overfit the small number of data points and have high variance.
relationship is highly non-linear - flexible methods will reduce bias and give an accurate model, but we must also have a lot of training samples to reduce variance.
variance of error terms is extremely high - inflexible methods are better because they have low variance: the fit does not change much when individual training points fluctuate because of the noise, whereas a flexible method would fit that noise.
Classification or Regression? Inference or prediction? n and p?
collect data on top 500 firms - profit, number of employees, industry and CEO salary. Goal is to understand which factors affect CEO salary.
Regression problem - CEO salary is a quantitative output variable
Inference - trying to understand how the different predictors affect output
n = 500
p = 3 (CEO Salary is output variable and not a predictor)
new product either success or failure. collect data on 20 products - success or fail, price, marketing budget, competition price, and ten other variables
Classification problem - trying to classify the new product
Prediction problem - based on previous data, predict whether new product will succeed
n = 20
p = 13
predicting the % change in US dollar in relation to weekly changes in world stock markets. Collect weekly data for all of 2012 - % change in dollar, % change in US market, % change in British market, % change in German market
Regression problem
Combination of prediction and inference - want to figure out the result as well as what causes it
n = 52 weeks
p = 3 (% change in dollar is the output variable)
bias-variance decomposition
draw out curves of Bias^2, Variance, Irreducible error, Training error, Testing error
explain
Bias^2 decreases as model flexibility increases, because bias is the error from approximating a complicated real-world relationship with a simpler model, and a more flexible model can follow the true relationship more closely
Variance increases as model flexibility increases because a more flexible model can produce different results on different training sets
Irreducible error (the variance of the error term; the Bayes error rate plays the same role in classification) is just a horizontal line that denotes the lower bound on the expected test error
Expected test MSE = Bias^2 + Variance + Irreducible error
Training error approaches 0 as the model gets more flexible and overfits to the training data.
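The decomposition can be estimated numerically at a single test point. This is a rough sketch under assumed settings, not a derivation: the true function sin(2x), the test point x0 = 1.0, the noise level sigma, and the helper names (true_f, knn_predict, decompose) are hypothetical choices. KNN regression is used because K gives a simple flexibility dial: K = 1 is the most flexible, K = n the least.

```python
import math
import random
import statistics

def true_f(x):
    # Assumed true relationship; unknown in a real problem.
    return math.sin(2 * x)

def knn_predict(train, x0, k):
    # Average the responses of the k training points closest to x0.
    nearest = sorted(train, key=lambda pt: abs(pt[0] - x0))[:k]
    return sum(y for _, y in nearest) / k

def decompose(k, x0=1.0, n=50, sigma=0.5, reps=500, seed=0):
    # Refit on many independent training sets and record the prediction at x0.
    rng = random.Random(seed)
    preds = []
    for _ in range(reps):
        train = []
        for _ in range(n):
            x = rng.uniform(0, 3)
            train.append((x, true_f(x) + rng.gauss(0, sigma)))
        preds.append(knn_predict(train, x0, k))
    # Bias^2: squared gap between the average prediction and the truth.
    bias_sq = (statistics.fmean(preds) - true_f(x0)) ** 2
    # Variance: how much the prediction moves between training sets.
    variance = statistics.pvariance(preds)
    # Irreducible error: variance of the noise, the same for every model.
    return bias_sq, variance, sigma ** 2

flexible = decompose(k=1)   # most flexible KNN fit
rigid = decompose(k=50)     # k = n: predicts the overall mean everywhere

for name, parts in (("k=1", flexible), ("k=50", rigid)):
    print(name, "bias^2, variance, irreducible:", parts, "sum:", sum(parts))
```

The sum of the three parts estimates the expected test MSE at x0. Under these assumptions the flexible fit should show lower Bias^2 and higher Variance than the rigid one, while the irreducible term stays the same.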
real-life examples
classification - cancer or not based on blood test results, product is a success or failure based on other similar products, type of bread based on picture; email spam or not spam
regression - home prices based on location, etc.; how much weight somebody can lift based on height and weight; number of people at a concert based on prior concert ticket sales and location
cluster analysis - genre based on movie sales data; fashion trends based on price and similar features