In this chapter we’re going to focus on a car price prediction project and its implementation.
For the data, we are going to use a dataset from Kaggle:
We will work in a notebook to load and explore the data.
We can use local Jupyter notebooks, Kaggle notebooks or Google Colab.
The last two require a little experience with Bash to take full advantage of the underlying instance (a Linux machine).
To download the data, you can run shell commands directly from a Jupyter notebook cell:
data = "linkwithdata.csv"  # URL of the CSV file to download
!wget $data
# "!" runs the rest of the line as a shell command from the notebook
# wget is a Linux command that downloads an HTTP resource
# $data expands to the value of the Python variable "data"
To manipulate the data, we can import Pandas and NumPy right away:
import pandas as pd
import numpy as np
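Since the downloaded file will be used as a DataFrame named df in the next steps, we load it with Pandas. This is a minimal sketch: the file name below is a placeholder for the CSV actually downloaded by wget.
# "data.csv" is a placeholder: use the name of the file downloaded above
df = pd.read_csv("data.csv")

# Quick sanity check on the shape and the first rows
print(df.shape)
df.head()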
To get a clean dataset that is easy to manipulate:
- normalize the column names and the values of the string (object) columns, as shown below
# Replace spaces with "_" in the column names for convenience
df.columns = df.columns.str.lower().str.replace(" ", "_")

# Build a list of all object (string) columns of the dataset
str_col = list(df.dtypes[df.dtypes == object].index)

# Lowercase the string values and replace spaces with underscores
for col in str_col:
    df[col] = df[col].str.lower().str.replace(" ", "_")
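A quick, optional sanity check to confirm that the cleaning was applied:
# Column names and string values should now be lowercase, with underscores instead of spaces
print(list(df.columns)[:5])
df.head()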
There are a million things we can do in this part; it depends on the nature of the project and the depth we want to reach.
To understand the dataset and act accordingly, we can start by taking a look at each column: the number of categories or values and the most frequent ones, the distribution, some bivariate analysis, the correlation between variables…
In our case:
# Iterate over the columns
for col in df.columns:
    print(col)
    print()
    # Most frequent categories or values
    print(df[col].value_counts().head(5))
    # Number of unique values
    print(df[col].nunique())
    print()
    print()
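For the correlation between variables mentioned above, a minimal sketch could look like this (assuming the numerical columns, including the target msrp, are already in the DataFrame):
# Correlation matrix restricted to the numerical columns
numerical = df.select_dtypes(include=np.number)
print(numerical.corr())

# Correlation of each numerical feature with the target
print(numerical.corr()["msrp"].sort_values(ascending=False))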
When we want to plot a distribution, we generally use Seaborn and/or Matplotlib.
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Histogram of the target (msrp), with the x-axis limited to 0–200,000 for readability
sns.histplot(df.msrp, kde=True)
plt.xlim(0, 200_000)
plt.show()
Distribution of the msrp feature
We see what we call a right-skewed distribution: most instances are concentrated on the left, with a long tail stretching to the right. Models don't handle this kind of distribution well; they generally learn better from something closer to a normal distribution.
To handle this, we can apply a logarithmic transformation to the entire column, which can "normalize" the distribution.
# log1p computes log(1 + x), so zero prices do not cause errors
price_logs = np.log1p(df.msrp)
price_logs
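To see the effect of the transformation, we can plot the distribution of price_logs the same way as before. Note that np.expm1 is the inverse of np.log1p, which will be useful later to convert predictions back to the original price scale.
# Distribution after the log transformation: much closer to a bell shape
sns.histplot(price_logs, kde=True)
plt.show()

# np.expm1 reverses np.log1p (it computes exp(x) - 1)
original_prices = np.expm1(price_logs)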