Predicting Credit Card Churn

Keywords: data science pipeline, dynamic visualisation, one hot encoding, XGBoost, Random Forest, logistic regression, KNN

This project is hosted on GitHub. I worked on it with my group members Zhao Zhun and Oh Bing Quan as part of the requirements for the module CE9010 Introduction to Data Analysis at NTU, collaborating through GitHub.

Credit card churning is the problem of customers rapidly opening and closing credit card accounts to exploit the rewards. The dataset we used was taken from Kaggle and contained information on credit card customers and their attrition status.

Our project involved going through the entire data science pipeline: from acquiring the data, cleaning, visualising and pre-processing it to running different models, tuning their hyperparameters, making predictions and evaluating them based on various metrics. I mainly worked on data exploration.

First, I examined the 21 features: one was the target feature, 19 had predictive power, and the remaining one, the client ID, had no predictive power and was dropped. After constructing a table with the range of values taken by each feature and their descriptions, I moved on to visualising the data.
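This step can be sketched in pandas. The miniature dataframe below is a stand-in for the Kaggle dataset, and the column names (`CLIENTNUM`, `Attrition_Flag`, `Customer_Age`) are assumptions about its schema:

```python
import pandas as pd

# Miniature stand-in for the Kaggle dataset; column names are assumptions.
df = pd.DataFrame({
    "CLIENTNUM": [101, 102, 103],            # client ID: no predictive power
    "Attrition_Flag": ["Existing Customer",  # target feature
                       "Attrited Customer",
                       "Existing Customer"],
    "Customer_Age": [45, 38, 52],            # one of the 19 predictive features
})

# Drop the ID column, then tabulate the range of values each feature takes
df = df.drop(columns=["CLIENTNUM"])
summary = df.describe(include="all").T
```

`summary` then has one row per remaining feature, which is a convenient starting point for the feature-description table.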

For univariate visualisations, I plotted bar plots for the categorical data and histograms for the numerical data. The dataset was found to be skewed towards customers who didn’t churn, and age, as expected, was normally distributed.
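A minimal sketch of these univariate plots, again with assumed column names and toy data in place of the real dataset:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
import pandas as pd

# Toy data skewed towards non-churned customers, as in the real dataset
df = pd.DataFrame({
    "Attrition_Flag": ["Existing Customer"] * 8 + ["Attrited Customer"] * 2,
    "Customer_Age": [45, 38, 52, 41, 47, 50, 44, 39, 36, 55],
})

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Bar plot for a categorical feature
df["Attrition_Flag"].value_counts().plot.bar(ax=ax1, title="Attrition status")

# Histogram for a numerical feature
df["Customer_Age"].plot.hist(ax=ax2, bins=5, title="Customer age")
```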

I looked at relative frequency distributions of pairs of features for bivariate visualisations. Some of the inferences were that the distribution of age was similar for both existing and churned customers, and that customers who made transactions of smaller amounts were more likely to churn.
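One way to compute such relative frequencies is a normalised cross-tabulation; the column names and the transaction-amount banding below are illustrative assumptions:

```python
import pandas as pd

# Toy data; column names and bands are assumptions about the Kaggle schema
df = pd.DataFrame({
    "Attrition_Flag": ["Existing", "Existing", "Existing", "Attrited", "Attrited"],
    "Total_Trans_Amt_Band": ["High", "High", "Low", "Low", "Low"],
})

# Relative frequency of each transaction-amount band within each customer group
rel = pd.crosstab(df["Total_Trans_Amt_Band"], df["Attrition_Flag"],
                  normalize="columns")
```

Each column of `rel` sums to 1, so the two customer groups can be compared directly even though the dataset is skewed towards existing customers.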

For multivariate visualisation, I made interactive 3-dimensional scatterplots. I found that customers with a lower credit limit tend to have a higher utilization ratio, and this trend is uniform across churned and existing customers.

My group members did most of the remaining work, which I assisted with. The next step was data pre-processing: categorical variables were converted to their one-hot encoded versions, and numerical variables were normalised using their means and standard deviations.
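Both transformations can be combined in a single scikit-learn `ColumnTransformer`; this is a minimal sketch with assumed column names, not the group's actual preprocessing code:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data; column names are assumptions
df = pd.DataFrame({
    "Gender": ["M", "F", "F", "M"],
    "Customer_Age": [45.0, 38.0, 52.0, 41.0],
})

pre = ColumnTransformer(
    [
        ("cat", OneHotEncoder(), ["Gender"]),         # one-hot encode categoricals
        ("num", StandardScaler(), ["Customer_Age"]),  # centre and scale numericals
    ],
    sparse_threshold=0.0,  # always return a dense array
)
X = pre.fit_transform(df)  # shape (4, 3): two one-hot columns + one scaled column
```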

Four machine learning models were compared: logistic regression, random forest, K-nearest neighbors, and XGBoost. Grid search and random search were used to tune the hyperparameters of the models based on the area under the ROC curve for the validation set, and the best models were trained on the training set to make predictions. These models were evaluated based on their F1 score, AUC, and accuracy.
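The tuning-and-evaluation loop can be sketched as below. For brevity only two of the four models are shown (XGBoost and KNN follow the same pattern), the parameter grids are illustrative, and synthetic class-imbalanced data stands in for the real dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic imbalanced data standing in for the churn dataset
X, y = make_classification(n_samples=400, weights=[0.84], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Illustrative (not the project's actual) hyperparameter grids
models = {
    "logreg": (LogisticRegression(max_iter=1000), {"C": [0.1, 1, 10]}),
    "rf": (RandomForestClassifier(random_state=0), {"n_estimators": [50, 100]}),
}

results = {}
for name, (est, grid) in models.items():
    # Tune on cross-validated ROC AUC, then refit the best model
    search = GridSearchCV(est, grid, scoring="roc_auc", cv=3).fit(X_tr, y_tr)
    proba = search.predict_proba(X_te)[:, 1]
    pred = search.predict(X_te)
    results[name] = {
        "auc": roc_auc_score(y_te, proba),
        "f1": f1_score(y_te, pred),
        "acc": accuracy_score(y_te, pred),
    }
```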

The XGBoost and random forest models were the best performers in predicting credit card churn.
