Description

This project implements a machine learning system that detects fraudulent credit card transactions with supervised learning. It analyzes transaction patterns in a highly imbalanced dataset and compares three classification models: Logistic Regression, Random Forest, and XGBoost. SMOTE is used to handle the class imbalance, and the models are evaluated with metrics suited to imbalanced data, such as F1-Score and ROC-AUC.
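To illustrate why F1-Score and ROC-AUC are more informative than accuracy on data this imbalanced, here is a minimal sketch. The labels are made up (10,000 transactions, 17 of them fraudulent) and the "model" is a hypothetical baseline that flags nothing; neither comes from the project notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Hypothetical labels: 10,000 transactions, 17 of them fraudulent (0.17%).
y_true = np.zeros(10_000, dtype=int)
y_true[:17] = 1

# A trivial baseline that labels every transaction as legitimate.
y_pred = np.zeros_like(y_true)
# It gives every transaction the same fraud score, so it cannot rank them.
y_score = np.zeros(len(y_true))

accuracy = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, zero_division=0)  # no predicted positives
roc_auc = roc_auc_score(y_true, y_score)

print(f"accuracy: {accuracy:.4f}")  # 0.9983
print(f"f1:       {f1:.4f}")        # 0.0000
print(f"roc_auc:  {roc_auc:.4f}")   # 0.5000
```

The baseline catches no fraud at all yet scores 99.83% accuracy, while F1 and ROC-AUC expose it as useless.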

Requirements

Python 3.x
pandas
scikit-learn
imbalanced-learn (SMOTE)
xgboost
matplotlib & seaborn

Structure

project/
│   fraud_detection_code.ipynb
│   requirements.txt
│   creditcard_ds.csv
│   README.md

Configuration

First, I imported the libraries needed for data manipulation, visualization, and machine learning: pandas for data handling, matplotlib and seaborn for visualization, scikit-learn for models, preprocessing, and metrics, imbalanced-learn for handling class imbalance with SMOTE, and XGBoost for gradient boosting.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import xgboost as xgb
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, confusion_matrix, roc_curve

Development

1. Data Loading and Exploration

The dataset was loaded from an S3 bucket using pandas. It contains credit card transactions with 30 features: Time (seconds elapsed since the first transaction in the dataset), Amount (transaction value), and 28 anonymized features (V1-V28) produced by a PCA transformation that protects user privacy. The target variable 'Class' marks each transaction as fraudulent (1) or legitimate (0).

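# Location of the dataset in S3
bucket = "fmuruchi-credit-card-fraud-detection"
key = "creditcard_ds.csv"
s3_path = f"s3://{bucket}/{key}"
# pandas reads s3:// paths through the s3fs package, which must be installed.
# anon=False makes the request with the configured AWS credentials.
df = pd.read_csv(s3_path, storage_options={"anon": False})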
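After loading, it can help to confirm that the frame matches the structure described above. The following is a minimal sketch, not code from the notebook: `check_dataset` and `EXPECTED_COLUMNS` are hypothetical names, and a tiny made-up frame stands in for the real data so the example runs without S3 access.

```python
import pandas as pd

# Columns described above: Time, V1..V28, Amount, plus the Class target.
EXPECTED_COLUMNS = ["Time"] + [f"V{i}" for i in range(1, 29)] + ["Amount", "Class"]


def check_dataset(df: pd.DataFrame) -> dict:
    """Validate the expected schema and return a few basic facts (hypothetical helper)."""
    missing = [col for col in EXPECTED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    labels = set(df["Class"].unique())
    if not labels <= {0, 1}:
        raise ValueError(f"unexpected Class values: {sorted(labels)}")
    return {
        "rows": len(df),
        "features": df.shape[1] - 1,  # every column except the target
        "null_values": int(df.isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
    }


# Tiny made-up frame standing in for the real data, so the sketch runs offline.
demo = pd.DataFrame({col: [0.0, 1.0, 2.0] for col in EXPECTED_COLUMNS[:-1]})
demo["Class"] = [0, 0, 1]
summary = check_dataset(demo)
print(summary)
```

On the real data the call would be `check_dataset(df)`; a feature count other than 30 or a non-zero null count would be worth investigating before any preprocessing.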

2. Data Visualization

The dataset is severely imbalanced: fraudulent transactions account for only 0.17% of all transactions. Plotting the class distribution made this clear, and the imbalance had to be addressed before model training. The distribution of Amount also differs between the two classes, with most fraudulent transactions involving relatively small amounts.
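The figures behind these observations can be computed directly with pandas before any chart is drawn. This is a minimal sketch under stated assumptions: `class_balance` and `amount_by_class` are hypothetical helper names, and the demo rows are made up for illustration rather than taken from the real dataset.

```python
import pandas as pd


def class_balance(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of rows per Class value."""
    counts = df["Class"].value_counts().sort_index()
    return pd.DataFrame({"count": counts, "percent": 100 * counts / len(df)})


def amount_by_class(df: pd.DataFrame) -> pd.DataFrame:
    """Median, mean, and maximum Amount for each class."""
    return df.groupby("Class")["Amount"].agg(["median", "mean", "max"])


# Made-up rows for illustration only; these are not values from the real dataset.
demo = pd.DataFrame(
    {
        "Amount": [20.0, 35.5, 120.0, 9.99, 250.0, 60.0, 15.0, 80.0, 1.0, 5.0],
        "Class": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
    }
)
balance = class_balance(demo)
amounts = amount_by_class(demo)
print(balance)
print(amounts)
```

Applied to the loaded frame, `class_balance(df)` gives the numbers behind a class distribution chart (for example one drawn with `sns.countplot(data=df, x="Class")`), and `amount_by_class(df)` shows how the two classes differ in transaction amount.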