In a data-driven marketing environment, identifying which leads are most likely to convert is crucial for optimizing sales and resource allocation. This is the essence of lead scoring: assigning each potential customer a probability of taking a desired action (e.g., signing up, buying, or subscribing).
In this project, we’ll walk through a lead scoring system based on an online course.
You can find the dataset at:
https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv
The goal is to predict whether a lead will convert based on demographic and behavioral features (such as number of courses viewed, interaction count, location and annual income).
We’ll use a public dataset and develop a reproducible ML pipeline to:
The dataset is loaded directly from GitHub and inspected for missing values.
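As a rough sketch of this step, the loading and missing-value check might look like the following. The helper names `load_leads` and `missing_report` are hypothetical, and the in-memory sample rows are made up purely so the snippet runs offline; its column names are assumptions and may not match the real dataset.

```python
import io

import pandas as pd

# URL taken from the article
DATA_URL = "https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv"


def load_leads(source=DATA_URL):
    """Read the CSV from a URL, local path, or file-like object (hypothetical helper)."""
    return pd.read_csv(source)


def missing_report(df):
    """Count missing values per column, keeping only columns that have at least one."""
    counts = df.isnull().sum()
    return counts[counts > 0].sort_values(ascending=False)


# Made-up sample rows (not from the real dataset) so the sketch runs without network access
sample_csv = io.StringIO(
    "location,interaction_count,annual_income\n"
    "europe,3,50000\n"
    ",5,\n"
    "asia,,62000\n"
)
sample = load_leads(sample_csv)
report = missing_report(sample)
print(report)
```

Calling `load_leads()` with no argument reads the real file from the URL above; passing the result to `missing_report` then lists the columns that need attention before modeling.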
Good ML engineering practice starts with clear data handling rules:
Split the dataset's columns into categorical and numerical groups based on their data type:
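# Columns stored as strings (dtype object) are treated as categorical
categorical = df.dtypes[df.dtypes == object].index.tolist()
# Everything else is numerical; a list comprehension keeps the original column order, unlike a set difference
numerical = [col for col in df.columns if col not in categorical]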
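To see the split in action, here is a small self-contained sketch on a toy DataFrame. It uses a slight variant of the approach above: instead of comparing against `object`, it checks for numeric dtypes with `pd.api.types.is_numeric_dtype`. The helper `split_columns` and the toy rows are hypothetical and are not taken from the dataset.

```python
import pandas as pd


def split_columns(df):
    """Return (categorical, numerical) column-name lists, preserving column order."""
    # Note: is_numeric_dtype also treats boolean columns as numeric
    numerical = [c for c in df.columns if pd.api.types.is_numeric_dtype(df[c])]
    categorical = [c for c in df.columns if c not in numerical]
    return categorical, numerical


# Toy data with made-up values, including missing entries in both kinds of column
toy = pd.DataFrame({
    "location": ["europe", "asia", None],
    "interaction_count": [3, 5, 2],
    "annual_income": [50000.0, None, 62000.0],
})

categorical, numerical = split_columns(toy)
print("categorical:", categorical)
print("numerical:", numerical)
```

Note that any target column stored as a number would land in the numerical list with either approach, so it should be removed from the feature lists before training.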