Hotel Review Classification Project: The SVM Model
This document outlines the purpose, mechanics, and implementation plan for using a Support Vector Machine (SVM) to classify hotel review sentiment based purely on text content.
1. Understanding the Support Vector Machine (SVM)
The SVM is a powerful, supervised machine learning algorithm used for classification. Its primary goal is to find the optimal boundary separating the different classes of data.
Key Concepts
- The Hyperplane (Decision Boundary): In simple terms, this is the line (or plane in higher dimensions) that best separates the positive reviews from the negative reviews in the dataset.
- Support Vectors: These are the data points (reviews) that lie closest to the hyperplane. They are the most crucial and difficult-to-classify examples, and the SVM uses only these points to define the decision boundary, making the model efficient.
- The Margin: The distance between the hyperplane and the nearest support vectors. The SVM's objective is to maximize this margin to ensure the best possible separation between the classes.
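The concepts above can be sketched on a toy two-dimensional dataset with scikit-learn's `SVC` (the points and labels here are hypothetical stand-ins for vectorized reviews, not the project's data):

```python
import numpy as np
from sklearn.svm import SVC

# Two tiny clusters standing in for negative (0) and positive (1) reviews.
X = np.array([[1.0, 1.0], [1.5, 1.2], [4.0, 4.2], [4.5, 4.0]])
y = np.array([0, 0, 1, 1])

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

# The support vectors are the points lying closest to the hyperplane;
# only they define the decision boundary.
print(clf.support_vectors_)

# For a linear SVM, the margin width is 2 / ||w||, so maximizing the
# margin is equivalent to minimizing the norm of the weight vector w.
w = clf.coef_[0]
print(2.0 / np.linalg.norm(w))
```

In higher dimensions (as with TF-IDF vectors), the same machinery applies; the hyperplane simply lives in a space with one dimension per vocabulary term.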
2. SVM Applied to Our Dataset
The SVM cannot process raw text; it requires numerical input. The following steps prepare the text for the SVM classifier.
A. Feature Extraction (TF-IDF Vectorization)
- Text Preprocessing: The raw review text is cleaned by removing punctuation and irrelevant words (stop words like "the," "is," "a").
- Vectorization: We use the TF-IDF (Term Frequency-Inverse Document Frequency) technique. This converts the clean text into numerical vectors (lists of numbers).
- The Input (X): Each review is now a vector, where the values represent how important each word is in distinguishing that review from all other reviews.
B. Defining the Clean Labels (Y)
We are using a Supervised Learning approach, meaning the SVM learns from pre-labeled examples. Based on our data analysis and the need to avoid noisy star ratings, we define our Binary Classification labels:
| Raw Rating | New Classification | Status in Project |
| --- | --- | --- |
| 4 & 5 | Positive (Label 1) | Used as the target positive class. |
| 1 & 2 | Negative (Label 0) | Used as the target negative class. |
| 3 or > 5 | Filtered Out | Removed from the initial analysis to focus on clean, decisive sentiment signals. |
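The labeling rule in the table can be applied with a few lines of pandas. This is a sketch only; the `rating` and `review` column names are assumptions, and the sample rows are invented (the rating of 7 stands in for an out-of-range data error):

```python
import pandas as pd

df = pd.DataFrame({
    "review": ["Great stay", "Awful", "It was okay", "Data error"],
    "rating": [5, 1, 3, 7],
})

# Keep only decisive ratings; 3s and out-of-range values are filtered out.
df = df[df["rating"].isin([1, 2, 4, 5])].copy()

# 4 & 5 -> Positive (1); 1 & 2 -> Negative (0).
df["label"] = (df["rating"] >= 4).astype(int)
print(df[["review", "label"]])
```

After filtering, only the decisive reviews remain, each carrying a clean binary target for the supervised SVM.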
3. The Project Plan and Strategy
Our plan addresses the core challenge of our data: the extreme imbalance between the positive and negative classes (roughly 100:1).
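One standard lever for an imbalance of this scale is class weighting, which scikit-learn's `LinearSVC` supports directly via `class_weight="balanced"`. The snippet below is a sketch on synthetic imbalanced data, not the project's chosen strategy:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
# 200 majority-class points vs 2 minority-class points: a crude
# stand-in for the ~100:1 positive-to-negative review imbalance.
X = np.vstack([
    rng.normal(loc=2.0, size=(200, 2)),   # class 1 (abundant)
    rng.normal(loc=-2.0, size=(2, 2)),    # class 0 (rare)
])
y = np.array([1] * 200 + [0] * 2)

# "balanced" reweights each class inversely to its frequency, so the
# rare class contributes as much to the loss as the abundant one.
clf = LinearSVC(class_weight="balanced")
clf.fit(X, y)
print(clf.predict([[-2.0, -2.0]]))
```

Without the weighting, a classifier can reach ~99% accuracy on 100:1 data by always predicting the majority class, which is why accuracy alone is a poor metric here.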
Phase 1: Data Preparation for Binary Model (Baseline)