Homework 2: Key concepts/ideas to keep in mind 🧑‍🏭

<aside> 📚

My notebook on GitHub first:

</aside>

<aside> 📚

The questions are framed in white font.

</aside>

<aside> 💡

The key ideas to keep in mind are in yellow.

</aside>

<aside> 📚

There's one column with missing values. What is it?

</aside>

<aside> 💡

EDA matters: look at the distribution of fuel_efficiency, our target
Search for missing values
Check the measures of central tendency (mode, median, mean) and of dispersion (standard deviation, range, quartiles) of the features
Spot duplicates if necessary </aside>
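
The EDA checks above can be sketched with pandas as follows. This is a minimal sketch, not the notebook's actual code: the DataFrame is a made-up toy stand-in for the homework data, and `vehicle_weight` is a hypothetical column name used only for illustration (`horsepower` and `fuel_efficiency` come from the questions).

```python
import numpy as np
import pandas as pd

# Toy stand-in for the homework data; all values are invented for illustration.
df = pd.DataFrame({
    "horsepower": [150.0, np.nan, 120.0, 150.0, 98.0],
    "vehicle_weight": [3400.0, 2900.0, 3100.0, 3400.0, 2500.0],  # hypothetical column
    "fuel_efficiency": [12.5, 15.1, 14.0, 12.5, 17.3],
})

# Columns that contain missing values.
missing = df.isnull().sum()
cols_with_missing = missing[missing > 0].index.tolist()

# Central tendency and dispersion of every numeric feature.
summary = df.describe()        # count, mean, std, min, quartiles, max
medians = df.median()          # NaN values are skipped by default
modes = df.mode().iloc[0]      # first mode of each column

# A plot-free hint about the shape of the target distribution.
target_skew = df["fuel_efficiency"].skew()

# Number of rows that exactly repeat an earlier row.
n_duplicates = int(df.duplicated().sum())

print(cols_with_missing, n_duplicates)
print(summary)
```

A histogram of the target is usually the next step, but the numeric summary alone already shows whether the mean and the median disagree.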

<aside> 📚

What's the median (50% percentile) for variable 'horsepower'?

</aside>

<aside> 💡

The median is only one of the measures used to understand a feature and its distribution: always keep the others in mind (mode, mean, standard deviation, range of values and quartiles)

Shuffle the dataset with a fixed seed so the shuffled order is reproducible, and keep the shuffled index
Split your data into train/val/test sets with a 60%/20%/20% distribution for the first part of the process.
Never forget to delete the target column from Xtrain, Xval and Xtest
Write data-preparation functions that rely on fixed values (columns, categories) taken from the final training dataset
Do not fit any transformation on the entire dataset: beware of data leakage </aside>
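
The shuffle and split steps above can be sketched like this. It is an illustration under assumptions, not the code from the lessons: `split_dataset` is a hypothetical helper, the seed value 42 is arbitrary, the toy data is invented, and `np.random.default_rng` is used here, so the resulting order will differ from code that seeds with `np.random.seed`.

```python
import numpy as np
import pandas as pd

def split_dataset(df, target, seed=42, val_frac=0.2, test_frac=0.2):
    """Shuffle with a fixed seed, split 60/20/20, and separate the target."""
    n = len(df)
    n_val = int(n * val_frac)
    n_test = int(n * test_frac)
    n_train = n - n_val - n_test

    # Reproducible shuffled index: the same seed always gives the same order.
    idx = np.random.default_rng(seed).permutation(n)
    shuffled = df.iloc[idx]

    parts = {
        "train": shuffled.iloc[:n_train],
        "val": shuffled.iloc[n_train:n_train + n_val],
        "test": shuffled.iloc[n_train + n_val:],
    }
    out = {}
    for name, part in parts.items():
        part = part.reset_index(drop=True)
        y = part[target].to_numpy()
        X = part.drop(columns=[target])  # the target must not stay in the features
        out[name] = (X, y)
    return out

# Toy data, invented for illustration.
df = pd.DataFrame({
    "horsepower": np.arange(10) * 10 + 80.0,
    "fuel_efficiency": np.linspace(10.0, 19.0, 10),
})

splits = split_dataset(df, "fuel_efficiency", seed=42)
X_train, y_train = splits["train"]
X_val, y_val = splits["val"]
X_test, y_test = splits["test"]
print(len(X_train), len(X_val), len(X_test))
```

Any statistic used to prepare the data (a mean for imputation, a list of categories) would then be computed from `X_train` only and reused unchanged on the validation and test sets.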

<aside> 📚

We need to deal with missing values for the column from Q1.
We have two options: fill it with 0 or with the mean of this variable.
Try both options. For each, train a linear regression model without regularization using the code from the lessons.
For computing the mean, use the training set only!
Use the validation dataset to evaluate the models and compare the RMSE of each option.
Round the RMSE scores to 2 decimal digits using round(score, 2)

Which option gives better RMSE?

</aside>
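
The comparison asked for in this question can be sketched as below. This is a sketch under assumptions: the data is synthetic, the helper names (`train_linear_regression`, `rmse`, `evaluate_fill`) are hypothetical, and the fit uses `np.linalg.lstsq` rather than whatever implementation the lessons provide, so it shows the procedure and not the homework answer.

```python
import numpy as np

def train_linear_regression(X, y):
    """Ordinary least squares with a bias term and no regularization."""
    X = np.column_stack([np.ones(len(X)), X])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w[0], w[1:]

def rmse(y, y_pred):
    return float(np.sqrt(np.mean((y - y_pred) ** 2)))

def evaluate_fill(X_train, y_train, X_val, y_val, fill_values):
    """Fill NaNs with the given per-column values, fit on train, score on val."""
    X_tr = np.where(np.isnan(X_train), fill_values, X_train)
    X_va = np.where(np.isnan(X_val), fill_values, X_val)
    w0, w = train_linear_regression(X_tr, y_train)
    return round(rmse(y_val, w0 + X_va @ w), 2)

# Synthetic data, invented for illustration only.
rng = np.random.default_rng(1)
X = rng.normal(loc=100.0, scale=20.0, size=(200, 2))
y = 5.0 + 0.1 * X[:, 0] - 0.05 * X[:, 1] + rng.normal(scale=1.0, size=200)
X[rng.random(200) < 0.1, 0] = np.nan  # knock out roughly 10% of the first column

X_train, y_train = X[:120], y[:120]
X_val, y_val = X[120:160], y[120:160]

# The mean comes from the training rows only, then is reused on validation.
train_mean = np.nanmean(X_train, axis=0)

score_zero = evaluate_fill(X_train, y_train, X_val, y_val, np.zeros(2))
score_mean = evaluate_fill(X_train, y_train, X_val, y_val, train_mean)
print("fill with 0:", score_zero, "| fill with mean:", score_mean)
```

Because the scores are rounded to 2 decimal digits, the two options can end up tied, which is worth checking before declaring one of them better.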