The goal of this project is to analyze and predict California housing prices using the California Housing Dataset. The objective is to understand the key factors affecting house prices and build a predictive model with solid performance.

Target variable: `median_house_value`

| # | Column | Non-Null Count | Dtype | Note |
|---|---|---|---|---|
| 0 | longitude | 20,640 | float64 | |
| 1 | latitude | 20,640 | float64 | |
| 2 | housing_median_age | 20,640 | float64 | |
| 3 | total_rooms | 20,640 | float64 | |
| 4 | total_bedrooms | 20,433 | float64 | 207 missing values (20,640 - 20,433) |
| 5 | population | 20,640 | float64 | |
| 6 | households | 20,640 | float64 | |
| 7 | median_income | 20,640 | float64 | |
| 8 | median_house_value | 20,640 | float64 | |
| 9 | ocean_proximity | 20,640 | object | Categorical feature |

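
The missing-value count for a column is the gap between the total row count and that column's non-null count. A minimal sketch of how this check might be automated with pandas follows; the `missing_summary` helper, the toy values, and the `housing.csv` filename in the comment are illustrative assumptions, not part of the original project.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return missing counts and percentages for columns that have gaps."""
    missing = df.isna().sum()
    summary = pd.DataFrame({
        "missing": missing,
        "percent": (missing / len(df) * 100).round(2),
    })
    # Keep only columns with at least one missing value, largest first.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy stand-in for the real data (hypothetical values). With the actual file
# you would load it first, e.g. df = pd.read_csv("housing.csv") (assumed name).
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, np.nan],
})

summary = missing_summary(toy)
print(summary)
```

Run against the full dataset, a helper like this should list only `total_bedrooms`, since every other column in the table is fully populated.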

- `total_rooms`, `total_bedrooms`, `population`, and `households` are right-skewed, suggesting the presence of outliers and a potential need for log transformation.
- `median_income` is also right-skewed, with most values clustered between 1.5 and 6.
- `housing_median_age` shows a uniform-like distribution with a cap at 52, possibly due to data limitations.
- `median_house_value` (target) is capped at 500,000, indicating a price ceiling that may affect model performance.
- `longitude` and `latitude` are evenly distributed, representing geographic spread without significant skewness.

To ensure the model is trained and tested on data that reflects the income distribution of the entire dataset, we applied stratified sampling based on `median_income`.
This keeps the proportion of each income category roughly the same in the training and test sets, so the test-set evaluation is more representative of the full dataset.
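
Stratifying on a continuous feature such as `median_income` requires binning it into categories first. The sketch below shows one way this could be done with `pd.cut` and scikit-learn's `train_test_split`. The bin edges, the 80/20 split, the `income_cat` column name, and the synthetic income values are all assumptions made for illustration; the original project's exact settings are not stated.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, right-skewed stand-in for median_income (assumed values).
rng = np.random.default_rng(42)
df = pd.DataFrame({"median_income": rng.lognormal(mean=1.2, sigma=0.5, size=2000)})

# Bin the continuous income into discrete categories (assumed bin edges).
df["income_cat"] = pd.cut(
    df["median_income"],
    bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
    labels=[1, 2, 3, 4, 5],
)

# Stratified split: each income category keeps its share in both subsets.
train_set, test_set = train_test_split(
    df, test_size=0.2, stratify=df["income_cat"], random_state=42
)

# Compare category proportions across the full data and both subsets.
comparison = pd.DataFrame({
    "overall": df["income_cat"].value_counts(normalize=True).sort_index(),
    "train": train_set["income_cat"].value_counts(normalize=True).sort_index(),
    "test": test_set["income_cat"].value_counts(normalize=True).sort_index(),
})
print(comparison.round(3))
```

The three columns of `comparison` should agree closely. A purely random split would typically show larger deviations, especially in the sparsely populated low- and high-income categories. The helper column `income_cat` would normally be dropped from both subsets before training.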