In this project, I explore customer churn prediction using a dataset with more than 20 diverse features. I compare multiple machine learning models to evaluate which algorithm most effectively captures the complex patterns within the data. I find that an XGBoost model delivers the strongest predictive performance. The full GitHub repository can be found here: Predicting Customer Credit Card Churn Notebook

Motivation & Objectives


Credit cards contribute significantly to a bank's profitability through interest payments, merchant fees, and annual fees. Given the substantial revenue generated from credit card services, it is crucial for banks to retain their credit card users and minimize customer attrition.

This project analyzed customer data from an undisclosed bank that is facing challenges in retaining its credit card users. Using customer demographic and spending attributes, three predictive models were developed to identify and understand customer churn and the factors that drive it.

By identifying the drivers of attrition, the bank will gain valuable insights into its customer base and be able to design targeted, proactive strategies aimed at preventing customers from closing their credit card accounts. These insights will enable the bank to better retain its credit card users and thus maintain a key source of revenue.

Exploratory Data Analysis

The data used for this analysis is secondary data from an undisclosed bank, sourced from Kaggle. The dataset contains 10,127 customer records and 21 relevant variables, including a unique customer identification number, customer demographic variables, and various banking attributes. Table 1 presents the summary statistics of the continuous variables in the dataset.

Table 1 - Summary Statistics

A series of boxplots reveals multiple outliers within the dataset. Upon closer examination, these outliers are deemed valid, as they fall within the acceptable ranges for their respective variables. The outliers in Customer_Age include two observations with ages of 70 and 73, which are still plausible representations of the bank's customer base. Similarly, the values 0, 5, and 6 are flagged as outliers for Months Inactive and Contacts with the Bank in the Last 12 Months, yet they remain valid and consistent with those variables' expected scale of 0 to 6. The heavily right-skewed variables, including Credit Limit, Total Transaction Amount, Amount Change from Q4 to Q1, and Transaction Count Change from Q4 to Q1, display a substantial number of large-value outliers; after examining their ranges, these are also deemed valid. The skewness of these variables will be addressed during the preprocessing step to ensure that they do not unduly affect model performance. No outliers in the dataset appear to be the result of error, so all are kept intact to reflect real-world occurrences.

Boxplots
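For reference, the snippet below shows one way these boxplots can be produced with pandas and matplotlib. It is a minimal sketch: the file name and column names are assumptions based on the Kaggle source (the public BankChurners file), so adjust them to match your copy of the data.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the raw data (file name assumed from the Kaggle source).
df = pd.read_csv("BankChurners.csv")

# Continuous variables flagged above as containing outliers or heavy skew
# (column names assumed to match the Kaggle file).
cols = [
    "Customer_Age", "Months_Inactive_12_mon", "Contacts_Count_12_mon",
    "Credit_Limit", "Total_Trans_Amt",
    "Total_Amt_Chng_Q4_Q1", "Total_Ct_Chng_Q4_Q1",
]

# One boxplot per variable to inspect outliers and skewness visually.
fig, axes = plt.subplots(1, len(cols), figsize=(3 * len(cols), 4))
for ax, col in zip(axes, cols):
    ax.boxplot(df[col].dropna())
    ax.set_title(col, fontsize=8)
plt.tight_layout()
plt.show()
```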

A correlation analysis was conducted using the Pearson correlation coefficient to identify any potential risks of multicollinearity among the predictor variables. Customer Age and Months on Book are highly correlated, with a Pearson correlation coefficient of 0.79, indicating a strong positive relationship: older customers tend to have longer relationships with the bank. Average Open to Buy and Credit Limit have a perfectly linear relationship, with a correlation coefficient of 1.0; this is expected, as a customer's Open to Buy is directly dependent on their available Credit Limit. Average Utilization Ratio and Total Revolving Balance show a moderate positive correlation with a coefficient of 0.62, and Total Transaction Count and Total Transaction Amount are highly correlated, with a coefficient of 0.81. These high correlations suggest redundancy in the information provided by these variables, which will inform the feature selection process.

Correlation Matrix
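A minimal sketch of the correlation check, continuing from the snippet above. The 0.7 cutoff is an illustrative threshold for flagging redundant pairs, not a value taken from the original analysis, and the column names are again assumed from the Kaggle file.

```python
import numpy as np

num_cols = [
    "Customer_Age", "Months_on_book", "Credit_Limit", "Avg_Open_To_Buy",
    "Total_Revolving_Bal", "Avg_Utilization_Ratio",
    "Total_Trans_Amt", "Total_Trans_Ct",
]

# Pearson correlation matrix over the continuous predictors.
corr = df[num_cols].corr(method="pearson")

# Keep the upper triangle only (each pair once, diagonal excluded), then
# list the pairs above the illustrative cutoff.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
strong = corr.where(mask).stack()
print(strong[strong.abs() >= 0.7].sort_values(ascending=False))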

Chi-square tests were conducted to further explore the relationships between customer demographics and attrition. The results indicate the following:

  - Attrition and gender are significantly associated (p = 0.0002), suggesting a strong relationship between the two variables.
  - Attrition and income category are also significantly associated (p = 0.025), indicating that income level may be an important factor in predicting attrition.
  - Education level sits just on the border of significance (p = 0.051), suggesting a weak association with attrition.
  - Marital status and the specific credit card a customer holds were not significantly associated with attrition (p > 0.10).

Only variables with significant chi-square tests will be considered in feature selection.
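These tests can be reproduced with scipy.stats.chi2_contingency, sketched below under the same assumed column names.

```python
import pandas as pd
from scipy.stats import chi2_contingency

demographics = ["Gender", "Income_Category", "Education_Level",
                "Marital_Status", "Card_Category"]

# Chi-square test of independence between each demographic variable and
# the attrition flag.
for col in demographics:
    table = pd.crosstab(df[col], df["Attrition_Flag"])
    chi2, p, dof, _ = chi2_contingency(table)
    print(f"{col}: chi2 = {chi2:.2f}, dof = {dof}, p = {p:.4f}")
```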

Data Preprocessing

Several steps were taken to transform the original data into a more structured and usable format for analysis. First, the dataset was split into a training set and a test set, with 80% of the original data allocated to the training set and 20% to the test set. Both the training and test sets were stratified based on the attrition flag variable to ensure that the distribution of the target variable was preserved in both sets.

The dataset comprises 8,500 existing customers (labeled as 0) and 1,627 churned customers (labeled as 1), with the Attrition_Flag column transformed into a binary target variable for modeling purposes.
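A minimal sketch of the target encoding and stratified split follows. The label string "Attrited Customer" is assumed from the Kaggle file, and the random seed is arbitrary.

```python
from sklearn.model_selection import train_test_split

# Binary target: 1 = churned ("Attrited Customer"), 0 = existing customer.
df["Churn"] = (df["Attrition_Flag"] == "Attrited Customer").astype(int)

# 80/20 split, stratified on the target so both sets keep the ~16% churn rate.
X = df.drop(columns=["CLIENTNUM", "Attrition_Flag", "Churn"])
y = df["Churn"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)
```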

The raw dataset included attributes such as CLIENTNUM, Attrition_Flag, demographic variables, and account activity metrics. A key preprocessing step involved identifying and encoding the categorical features: Gender, Education_Level, Marital_Status, Income_Category, and Card_Category. These variables were converted into numerical representations using the pd.get_dummies() function in Python. To mitigate multicollinearity, the first category of each variable was dropped.

Subsequently, the dataset was verified to ensure all features were fully numeric, with no remaining categorical string values.
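Continuing the sketch, the encoding and verification steps might look like this. The align call is a defensive addition of mine, not part of the original write-up, to handle the case where a rare category appears in only one split.

```python
import pandas as pd

cat_cols = ["Gender", "Education_Level", "Marital_Status",
            "Income_Category", "Card_Category"]

# One-hot encode, dropping the first level of each variable to mitigate
# multicollinearity (the dummy-variable trap).
X_train = pd.get_dummies(X_train, columns=cat_cols, drop_first=True)
X_test = pd.get_dummies(X_test, columns=cat_cols, drop_first=True)

# Align test columns to the training columns, then verify that no string
# (object-dtype) columns remain.
X_train, X_test = X_train.align(X_test, join="left", axis=1, fill_value=0)
assert X_train.select_dtypes(include="object").empty
```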

As previously mentioned, several variables in the dataset exhibit a highly right-skewed distribution with a large number of outliers. In regression models, skewed data can lead to poor generalization performance and inaccurate predictions. Moreover, the presence of outliers can negatively impact logistic regression models, increasing the likelihood of misclassifications. To mitigate the effects of both skewness and outliers, a log transformation was applied to variables exhibiting severe skewness, including Credit_Limit, Total_Trans_Amt, Total_Ct_Chng_Q4_Q1, and Total_Amt_Chng_Q4_Q1. Log transformations help reduce skewness by compressing the scale of larger values, making the distribution more symmetrical (Frost, n.d., a).
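The write-up does not specify the exact transform, so the sketch below uses log1p (the log of one plus x), a common variant that remains defined when a value is zero.

```python
import numpy as np

skewed = ["Credit_Limit", "Total_Trans_Amt",
          "Total_Ct_Chng_Q4_Q1", "Total_Amt_Chng_Q4_Q1"]

# log1p compresses large values, pulling in the right tail and making the
# distributions more symmetrical; applied identically to train and test.
for col in skewed:
    X_train[col] = np.log1p(X_train[col])
    X_test[col] = np.log1p(X_test[col])
```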

The dataset contains a large number of continuous variables, each with varying scales. Maintaining the original scales within the regression analysis may lead to misleading results, as the regression would place more weight on variables with larger scales (e.g., Credit Limit) than on variables with smaller scales (e.g., Customer Age). Standardization helps to equalize the contribution of each variable to the model (Frost, n.d., b). While standardization is not strictly required for logistic regression, it can improve the optimization process, leading to faster convergence. The following variables were standardized using the StandardScaler tool in Python: Customer_Age, Credit_Limit, Total_Trans_Amt, Total_Amt_Chng_Q4_Q1, Total_Ct_Chng_Q4_Q1, and Total_Revolving_Bal. A scaled version of each variable, with a mean of zero and a standard deviation of one, was added to the data frame.
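A sketch of the scaling step appears below. Fitting the scaler on the training set only and reusing its parameters on the test set avoids leaking test-set statistics into training; the _scaled suffix is an assumed naming convention.

```python
from sklearn.preprocessing import StandardScaler

to_scale = ["Customer_Age", "Credit_Limit", "Total_Trans_Amt",
            "Total_Amt_Chng_Q4_Q1", "Total_Ct_Chng_Q4_Q1",
            "Total_Revolving_Bal"]
scaled_names = [c + "_scaled" for c in to_scale]

# Fit on the training data, then apply the same mean/std to the test data.
scaler = StandardScaler()
X_train[scaled_names] = scaler.fit_transform(X_train[to_scale])
X_test[scaled_names] = scaler.transform(X_test[to_scale])
```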

Modeling Phase

Three models were developed to predict customer credit card churn.

  1. Logistic Regression: This baseline model is well-suited for binary classification problems. It applies the sigmoid function to estimate the probability of churn, mapping outputs to a range between 0 and 1, and provides interpretable coefficients for feature impact (Tan et al., 2019).
  2. Random Forests: An ensemble learning method that combines multiple decision trees to improve predictive accuracy and robustness. Each tree is trained on a random subset of data and features, reducing overfitting and enhancing generalization. Final predictions are based on majority voting across trees (Tan et al., 2019).
  3. XGBoost: An advanced gradient boosting algorithm that builds trees sequentially, with each tree correcting the errors of the previous one. XGBoost optimizes performance using regularization, shrinkage, and feature subsampling, making it highly effective for handling imbalanced data and capturing complex feature interactions (Tan et al., 2019).
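As a rough sketch, the three models can be fit and compared as below; the hyperparameters shown are illustrative defaults, not the tuned settings from the original notebook.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from xgboost import XGBClassifier

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=300, random_state=42),
    "XGBoost": XGBClassifier(n_estimators=300, learning_rate=0.1,
                             eval_metric="logloss", random_state=42),
}

# Train each model on the preprocessed training set and report test metrics;
# recall on the churn class is the figure of merit for a retention use case.
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name)
    print(classification_report(y_test, model.predict(X_test), digits=3))
```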