In this chapter we’re going to focus on a car price prediction project and its implementation.
For the data, we are going to use a dataset from Kaggle:
We will work in a notebook to load and explore the data.
We can use local Jupyter notebooks, Kaggle notebooks or Google Colab.
The last two require a little experience with Bash to take full advantage of the underlying instance (a Linux machine).
To download the data, you can run shell commands directly from a Jupyter notebook cell:
data = "linkwithdata.csv"  # URL of the CSV file to download
!wget $data
# "!" runs the rest of the line as a shell command from the notebook
# wget is a Linux command that downloads an HTTP resource
# $data expands to the value of the Python variable "data"
To manipulate the data, we can import Pandas and NumPy right away:
import pandas as pd
import numpy as np
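Since the downloaded file will be used as a DataFrame named df in the next steps, we load it with Pandas. This is a minimal sketch: the file name below is a placeholder for the CSV actually downloaded by wget.
# "data.csv" is a placeholder: use the name of the file downloaded above
df = pd.read_csv("data.csv")

# Quick sanity check on the shape and the first rows
print(df.shape)
df.head()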
To get a clean dataset that is easy to manipulate:
- normalize the column names and the values of the string (object) columns, as shown below
# Replace spaces with "_" in the column names for convenience
df.columns = df.columns.str.lower().str.replace(" ", "_")

# Build a list of all object (string) columns of the dataset
str_col = list(df.dtypes[df.dtypes == object].index)

# Lowercase the string values and replace spaces with underscores
for col in str_col:
    df[col] = df[col].str.lower().str.replace(" ", "_")
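A quick, optional sanity check to confirm that the cleaning was applied:
# Column names and string values should now be lowercase, with underscores instead of spaces
print(list(df.columns)[:5])
df.head()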
There are a million things we can do in this part; it depends on the nature of the project and the depth we want to reach.
To understand the dataset and act accordingly, we can start by taking a look at each column: the number of categories or values and the most frequent ones, the distribution, some bivariate analysis, the correlation between variables…
In our case:
# Iterate over the columns
for col in df.columns:
    print(col)
    print()
    # Most frequent categories or values
    print(df[col].value_counts().head(5))
    # Number of unique values
    print(df[col].nunique())
    print()
    print()
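For the correlation between variables mentioned above, a minimal sketch could look like this (assuming the numerical columns, including the target msrp, are already in the DataFrame):
# Correlation matrix restricted to the numerical columns
numerical = df.select_dtypes(include=np.number)
print(numerical.corr())

# Correlation of each numerical feature with the target
print(numerical.corr()["msrp"].sort_values(ascending=False))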
When we want to plot a distribution, we generally use Seaborn and/or Matplotlib.
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Histogram of the target (msrp), with the x-axis limited to 0–200,000 for readability
sns.histplot(df.msrp, kde=True)
plt.xlim(0, 200_000)
plt.show()
Distribution of the msrp feature
We see what we call a right-skewed distribution: most instances are concentrated on the left, with a long tail stretching to the right. Models don't handle this kind of distribution well; they generally learn better from something closer to a normal distribution.
To handle this, we can apply a logarithmic transformation to the entire column, which can "normalize" the distribution.
# log1p computes log(1 + x), so zero prices do not cause errors
price_logs = np.log1p(df.msrp)
price_logs
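To see the effect of the transformation, we can plot the distribution of price_logs the same way as before. Note that np.expm1 is the inverse of np.log1p, which will be useful later to convert predictions back to the original price scale.
# Distribution after the log transformation: much closer to a bell shape
sns.histplot(price_logs, kde=True)
plt.show()

# np.expm1 reverses np.log1p (it computes exp(x) - 1)
original_prices = np.expm1(price_logs)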