Hotel Review Classification Project: The SVM Model

This document outlines the purpose, mechanics, and implementation plan for using a Support Vector Machine (SVM) to classify hotel review sentiment based purely on text content.

1. Understanding the Support Vector Machine (SVM)

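The SVM is a supervised machine learning algorithm used for classification. Its goal is to find the decision boundary (a hyperplane) that separates the classes with the largest possible margin, meaning the greatest distance between the boundary and the nearest training examples of each class.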
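
The idea of a best separating boundary can be made precise with the standard soft-margin formulation, given here as a reference sketch rather than as the exact variant this project will use. The symbols are assumptions of this sketch: x_i is the feature vector of review i, y_i is its label recoded as -1 or +1 (the project's labels 0 and 1 would be mapped accordingly), w and b define the boundary, the slack variables \xi_i allow some training points to violate the margin, and C controls how heavily those violations are penalized.

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i
\quad\text{subject to}\quad
y_i\,(w^{\top}x_i + b)\ \ge\ 1-\xi_i,\qquad \xi_i \ge 0,\quad i=1,\dots,n
```

The margin has width 2/\lVert w\rVert, so minimizing \lVert w\rVert^2 widens the margin, while the second term penalizes training points that fall inside it or on the wrong side.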

Key Concepts

2. SVM Applied to Our Dataset

The SVM cannot process raw text; it requires numerical input. The following steps prepare the text for the SVM classifier.

A. Feature Extraction (TF-IDF Vectorization)

  1. Text Preprocessing: The raw review text is cleaned by removing punctuation and irrelevant words (stop words like "the," "is," "a").
  2. Vectorization: We use the TF-IDF (Term Frequency-Inverse Document Frequency) technique. This converts the clean text into numerical vectors (lists of numbers).
  3. The Input (X): Each review is now a vector with one entry per vocabulary word. An entry is high when the word occurs often in that review but rarely in the other reviews, so it reflects how strongly the word distinguishes that review from the rest.
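
The three steps above can be sketched in plain Python. This is a minimal illustration using the textbook weighting tf * ln(N / df); the stop-word list, the function names, and the sample reviews are all hypothetical, and library implementations such as scikit-learn's TfidfVectorizer use a smoothed and normalized variant, so their numbers will differ.

```python
import math
import string
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the project's list).
STOP_WORDS = {"the", "is", "a", "was", "and"}

def preprocess(text):
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in cleaned.split() if tok not in STOP_WORDS]

def tfidf_vectors(docs):
    """Return (vocabulary, vectors) using tf * ln(N / df) weighting."""
    tokenized = [preprocess(d) for d in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: number of reviews that contain each word.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1  # guard against reviews that are empty after cleaning
        vectors.append(
            [(counts[t] / total) * math.log(n_docs / df[t]) for t in vocab]
        )
    return vocab, vectors

# Hypothetical sample reviews.
reviews = [
    "The room was clean.",
    "The room was dirty and the staff was rude.",
    "Clean room, friendly staff!",
]
vocab, X = tfidf_vectors(reviews)
for word, weight in zip(vocab, X[1]):
    print(f"{word:10s} {weight:.3f}")
```

In this sample, "room" appears in every review, so its weight is zero everywhere, while "dirty" and "rude" appear only in the second review and receive its largest weights.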

B. Defining the Clean Labels (Y)

We are using a supervised learning approach, meaning the SVM learns from pre-labeled examples. Based on our data analysis and the need to avoid noisy star ratings, we define our binary classification labels as follows:

Raw Rating | New Classification | Status in Project
4 & 5      | Positive (Label 1) | Used as the target positive class.
1 & 2      | Negative (Label 0) | Used as the target negative class.
3 or > 5   | Filtered Out       | Removed from the initial analysis to focus on clean, decisive sentiment signals.
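
The labeling rule in the table can be sketched as a small mapping function. The function name and the sample rows are hypothetical, and treating any other unexpected value (such as 0) as filtered out is an assumption of this sketch, since the table only names 3 and values above 5.

```python
from typing import Optional

def rating_to_label(rating: int) -> Optional[int]:
    """Map a raw star rating to a binary label, or None if the row is filtered out."""
    if rating in (4, 5):
        return 1  # Positive
    if rating in (1, 2):
        return 0  # Negative
    # 3, values above 5, and (by assumption) any other unexpected value.
    return None

# Hypothetical (text, rating) rows.
rows = [
    ("Lovely stay, great staff", 5),
    ("Dirty room, rude reception", 1),
    ("It was okay", 3),
    ("Would return", 4),
    ("Malformed rating row", 7),
]

# Keep only rows with a decisive label.
labeled = [
    (text, rating_to_label(rating))
    for text, rating in rows
    if rating_to_label(rating) is not None
]
print(labeled)
```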

3. The Project Plan and Strategy

Our plan addresses the core challenge of our data: the extreme imbalance between the positive and negative classes (roughly 100:1).
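
To see why a ratio of roughly 100:1 is a core challenge, consider a trivial model that labels every review as positive. The short sketch below uses made-up counts that match that ratio (they are not the project's actual counts) and shows that accuracy looks excellent even though no negative review is ever detected.

```python
# Illustrative counts matching the roughly 100:1 ratio (not real project data).
n_pos, n_neg = 100, 1

y_true = [1] * n_pos + [0] * n_neg
y_pred = [1] * len(y_true)  # trivial model: always predict "positive"

# Fraction of all reviews labeled correctly.
accuracy = sum(p == t for p, t in zip(y_pred, y_true)) / len(y_true)

# Fraction of truly negative reviews that the model identifies as negative.
negative_recall = (
    sum(1 for p, t in zip(y_pred, y_true) if t == 0 and p == 0) / n_neg
)

print(f"accuracy:        {accuracy:.3f}")         # 0.990
print(f"negative recall: {negative_recall:.3f}")  # 0.000
```

Accuracy alone is therefore a misleading measure on this dataset; per-class metrics such as recall on the negative class give a clearer picture of whether the classifier has learned anything about negative reviews.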

Phase 1: Data Preparation for Binary Model (Baseline)