<aside>
First, my notebook on GitHub:
https://github.com/DataScienceMyLove/homeworks-ml-zoomcamp-2025/blob/main/02-linear_regression/homework-2.ipynb
</aside>
<aside>
The questions are framed in white.
</aside>
<aside>
The key ideas to keep in mind are in yellow.
</aside>
<aside>
Question 1
There's one column with missing values. What is it?
</aside>
<aside>
- EDA matters: start by looking at the distribution of fuel_efficiency, our target
- Search for missing values
- Check the measures of central tendency (mode, median, mean) and of dispersion (standard deviation, range) of the features
- Check for duplicates if necessary
</aside>
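The checks listed above can be sketched with pandas as follows. This is only a minimal illustration on a tiny made-up frame: `some_feature` is a hypothetical column name, the values are invented, and which column holds the missing value here is arbitrary (it says nothing about the real homework data). Only `fuel_efficiency` and `horsepower` are names taken from the questions.

```python
import numpy as np
import pandas as pd

# Tiny synthetic stand-in for the homework data (all values are made up).
df = pd.DataFrame({
    "horsepower": [150.0, 130.0, 120.0, 180.0, 150.0],
    "some_feature": [1.0, np.nan, 3.0, 4.0, 1.0],  # hypothetical column
    "fuel_efficiency": [14.0, 16.5, 18.0, 12.0, 14.0],
})

# Count missing values per column and keep the columns that have any.
missing_counts = df.isnull().sum()
columns_with_missing = missing_counts[missing_counts > 0].index.tolist()

# Central tendency and dispersion of the numeric features
# (count, mean, std, min, quartiles, max).
summary = df.describe()

# Median of the target, to compare against its mean when judging skew.
target_median = df["fuel_efficiency"].median()

# Number of rows that exactly repeat an earlier row.
n_duplicates = int(df.duplicated().sum())

print(columns_with_missing, target_median, n_duplicates)
```

On the real dataset, a histogram of `fuel_efficiency` would complement these numbers, since the shape of the target distribution is not visible from summary statistics alone.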
<aside>
Question 2
What's the median (50% percentile) for variable 'horsepower'?
</aside>
<aside>
- The median is one measure for understanding a feature and its distribution: always keep the others in mind (mode, mean, standard deviation, range of values and quartiles)
Prepare and split the dataset
- Shuffle the dataset with a fixed seed so that the shuffled order is reproducible, and keep the shuffled index
- Split your data into train/val/test sets with a 60%/20%/20% distribution for the first part of the process
- Never forget to remove the target column from Xtrain, Xval and Xtest
- Write the data-preparation functions around values fixed from the final training dataset (columns, categories)
- Do not fit any transformation on the entire dataset: beware of data leakage
</aside>
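The shuffle and 60%/20%/20% split described above could look roughly like this. It is a sketch under assumptions: the data is synthetic, the seed value `42` is an arbitrary choice for illustration, and the variable names (`df_train`, `y_train`, and so on) are my own rather than anything fixed by the homework.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; only the column names come from the questions.
df = pd.DataFrame({
    "horsepower": np.arange(10, dtype=float) * 10 + 100,
    "fuel_efficiency": np.linspace(10.0, 19.0, 10),
})

# Sizes of the three parts: 20% validation, 20% test, the rest for training.
n = len(df)
n_val = int(n * 0.2)
n_test = int(n * 0.2)
n_train = n - n_val - n_test

# Shuffle the row positions with a fixed seed so the split is reproducible.
idx = np.arange(n)
np.random.seed(42)  # arbitrary seed, assumed for this example
np.random.shuffle(idx)

# Slice the shuffled index into train / val / test.
df_train = df.iloc[idx[:n_train]].reset_index(drop=True)
df_val = df.iloc[idx[n_train:n_train + n_val]].reset_index(drop=True)
df_test = df.iloc[idx[n_train + n_val:]].reset_index(drop=True)

# pop() returns the target and removes it from the feature frame in one step,
# so the target cannot leak into the features by accident.
y_train = df_train.pop("fuel_efficiency").to_numpy()
y_val = df_val.pop("fuel_efficiency").to_numpy()
y_test = df_test.pop("fuel_efficiency").to_numpy()
```

Computing `n_train` as the remainder guarantees the three sizes always add up to `n`, even when 20% of `n` is not a whole number.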
<aside>
Question 3
- We need to deal with missing values for the column from Q1.
- We have two options: fill it with 0 or with the mean of this variable.
- Try both options. For each, train a linear regression model without regularization using the code from the lessons.
- For computing the mean, use the training set only!
- Use the validation dataset to evaluate the models and compare the RMSE of each option.
- Round the RMSE scores to 2 decimal digits using round(score, 2)
Which option gives better RMSE?
</aside>
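The comparison asked for in Question 3 can be sketched as below. This is not the course's code: it is a minimal version of unregularized linear regression via the normal equation, run on synthetic data. The column names `some_feature` and `other_feature`, the helper `make_frame`, and the signature of `prepare_X` are all assumptions made for the example, so the printed scores say nothing about the real answer.

```python
import numpy as np
import pandas as pd


def prepare_X(df, column, fill_value):
    """Fill missing values of one column and return the features as a matrix."""
    df = df.copy()
    df[column] = df[column].fillna(fill_value)
    return df.to_numpy()


def train_linear_regression(X, y):
    """Fit linear regression without regularization using the normal equation."""
    X = np.column_stack([np.ones(X.shape[0]), X])  # prepend the bias column
    w = np.linalg.inv(X.T @ X) @ X.T @ y
    return w[0], w[1:]


def rmse(y, y_pred):
    """Root mean squared error."""
    return np.sqrt(np.mean((y - y_pred) ** 2))


rng = np.random.default_rng(0)


def make_frame(n):
    """Hypothetical helper: build synthetic features and a target."""
    some_feature = rng.uniform(80, 200, n)
    other_feature = rng.uniform(2000, 4000, n)
    y = 40 - 0.05 * some_feature - 0.005 * other_feature + rng.normal(0, 1, n)
    some_feature[::7] = np.nan  # inject missing values after computing y
    X = pd.DataFrame({"some_feature": some_feature, "other_feature": other_feature})
    return X, y


df_train, y_train = make_frame(60)
df_val, y_val = make_frame(20)

# The mean comes from the training set only, then is reused for validation.
train_mean = df_train["some_feature"].mean()

scores = {}
for name, fill_value in [("zero", 0.0), ("mean", train_mean)]:
    X_train = prepare_X(df_train, "some_feature", fill_value)
    w0, w = train_linear_regression(X_train, y_train)
    X_val = prepare_X(df_val, "some_feature", fill_value)
    scores[name] = round(float(rmse(y_val, w0 + X_val @ w)), 2)

print(scores)
```

Passing the fill value into `prepare_X` keeps the training-set mean fixed when the validation data is prepared, which is the point of the "training only" rule: the validation set never contributes to the statistic used to fill it.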