California Housing Price Prediction Report

📌 Project Objective

The goal of this project is to analyze and predict California housing prices using the California Housing Dataset. The objective is to understand the key factors affecting house prices and build a predictive model that performs well on held-out data.

📊 Exploratory Data Analysis (EDA)

Dataset Summary

Total samples: 20,640
Total columns: 10 (9 features plus the target)
Target: median_house_value
Dataset Info:

#	Column	Non-Null Count	Dtype	Note
0	longitude	20,640	float64
1	latitude	20,640	float64
2	housing_median_age	20,640	float64
3	total_rooms	20,640	float64
4	total_bedrooms	20,433	float64	207 missing values (20,640 - 20,433)
5	population	20,640	float64
6	households	20,640	float64
7	median_income	20,640	float64
8	median_house_value	20,640	float64
9	ocean_proximity	20,640	object	Categorical Feature

Histogram Analysis

Most numerical features like total_rooms, total_bedrooms, population, and households are right-skewed, suggesting the presence of outliers and potential need for log transformation.
median_income is also right-skewed, with most values clustered between 1.5 and 6.
housing_median_age shows a roughly uniform distribution with a cap at 52, which suggests that older ages were top-coded in the source data.
median_house_value (target) is capped at 500,000, indicating a price ceiling that may affect model performance.
longitude and latitude are evenly distributed, representing geographic spread without significant skewness.

Histogram Plots.png
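
The effect of a log transformation on a right-skewed feature can be illustrated with a minimal sketch. It does not use the real dataset: `synthetic_rooms` is a hypothetical log-normal sample standing in for a column such as total_rooms, and `skewness` is a small helper written for this example, so the printed values say nothing about the actual data.

```python
import math
import random


def skewness(values):
    """Population skewness: the third standardized moment."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Assumption: a log-normal sample stands in for a right-skewed
# count feature such as total_rooms (NOT the real data).
rng = random.Random(42)
synthetic_rooms = [rng.lognormvariate(7.5, 0.8) for _ in range(5000)]

# log1p(x) = log(1 + x); it stays defined when a count is zero.
log_rooms = [math.log1p(v) for v in synthetic_rooms]

raw_skew = skewness(synthetic_rooms)
log_skew = skewness(log_rooms)
print(f"skewness before log1p: {raw_skew:.2f}")
print(f"skewness after log1p:  {log_skew:.2f}")
```

A skewness near zero after the transform indicates a more symmetric distribution. Whether the transform actually helps a given model on this dataset would still need to be checked against validation results.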

Train-Test Splitting Strategy

To ensure the model is trained and tested on data that reflects the income distribution of the entire dataset, we applied stratified sampling based on median_income. Because median_income is continuous, it is first grouped into income categories, and the split is stratified on those categories.

This keeps the proportion of each income category nearly the same in the training and testing sets, so that test-set performance is a more reliable estimate of how well the model generalizes.
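
The stratified split described above can be sketched without external libraries. The report does not state which income bins were used, so the edges in `INCOME_BIN_EDGES` are an assumption, as are the 20% test ratio, the helper names, and the synthetic `rows` that stand in for the dataset.

```python
import random
from collections import defaultdict

# Assumption: illustrative bin edges for median_income; the report
# does not say which edges were actually used.
INCOME_BIN_EDGES = [1.5, 3.0, 4.5, 6.0]


def income_category(median_income):
    """Map an income value to a category 1..5 (bins are right-inclusive)."""
    return 1 + sum(median_income > edge for edge in INCOME_BIN_EDGES)


def stratified_split(rows, test_ratio=0.2, seed=42):
    """Split rows so each income category keeps about the same share."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[income_category(row["median_income"])].append(row)

    train, test = [], []
    for category in sorted(strata):
        members = strata[category]
        rng.shuffle(members)
        n_test = round(len(members) * test_ratio)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


def category_shares(rows):
    """Fraction of rows that fall into each income category."""
    counts = defaultdict(int)
    for row in rows:
        counts[income_category(row["median_income"])] += 1
    return {c: counts[c] / len(rows) for c in sorted(counts)}


# Synthetic stand-in for the dataset (NOT the real incomes).
data_rng = random.Random(0)
rows = [
    {"median_income": min(15.0, data_rng.lognormvariate(1.25, 0.45))}
    for _ in range(10000)
]

train_rows, test_rows = stratified_split(rows)
overall_shares = category_shares(rows)
test_shares = category_shares(test_rows)
for category, share in overall_shares.items():
    print(category, round(share, 3), round(test_shares.get(category, 0.0), 3))
```

In a pandas and scikit-learn workflow, the same idea is commonly expressed by binning with `pandas.cut` and passing the resulting category column to the `stratify` argument of `train_test_split`.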