Regarding the information provided by the competition

To start tackling this challenge, the first step I took was Exploratory Data Analysis (EDA). To do that, I loaded all the datasets: the competition provides a total of 10 files in .csv format.

[Figure: overview of the competition files]

Regarding the datasets I used

The sample_submission.csv file gives us the format for submitting our results: the submission file must contain two columns, row_id and quantity. Therefore, our target is the quantity of items sold for each store.
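As a quick illustration, here is a minimal sketch of assembling a file in that format (the constant placeholder value is hypothetical and would be replaced by model predictions):

```python
import pandas as pd

# Reuse the row_id column and column order from the provided template.
submission = pd.read_csv("sample_submission.csv")

# Placeholder value; in practice this is replaced with the model's predictions.
submission["quantity"] = 0

submission.to_csv("submission.csv", index=False)
```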

For training my model, I used the following datasets: sales.csv, catalog.csv, and stores.csv. These datasets provide information about the stores and the items sold from August 28, 2022, to September 26, 2024, and I needed to join them into a single dataset to include all the available data.
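Below is a minimal sketch of that join; `item_id` and `store_id` are assumed key names used here for illustration and may differ from the actual columns:

```python
import pandas as pd

sales = pd.read_csv("sales.csv")
catalog = pd.read_csv("catalog.csv")
stores = pd.read_csv("stores.csv")

# Enrich each sales record with item metadata and store metadata.
# "item_id" and "store_id" are assumed join keys; adjust to the real column names.
data = (
    sales
    .merge(catalog, on="item_id", how="left")
    .merge(stores, on="store_id", how="left")
)

print(data.shape)
```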

All datasets were explored, and I fixed all the problems I found, such as null values and outliers. Furthermore, I created an algorithm to translate the catalog.csv dataset since it contains data in Russian. This is not necessary for successfully training the model but is useful for data visualization.
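As an example of the kind of cleaning applied, here is a minimal sketch that assumes the merged dataframe `data` from the previous step and uses an IQR rule for outliers; the thresholds are illustrative, not the exact ones used:

```python
# Fill missing numeric values with the column median.
num_cols = data.select_dtypes("number").columns
data[num_cols] = data[num_cols].fillna(data[num_cols].median())

# Clip extreme outliers in the target using a 1.5 * IQR rule (illustrative threshold).
q1, q3 = data["quantity"].quantile([0.25, 0.75])
iqr = q3 - q1
data["quantity"] = data["quantity"].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)
```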

The total dataset contained 7,640,752 records, so I decided to split it into two subsets: the training and validation sets, where the validation set contains 5% of the total records. All categorical columns were one-hot encoded, and all numerical columns, except boolean columns, were standardized.
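A minimal sketch of that preprocessing with scikit-learn; holding out the last 5% of rows chronologically is my own assumption here, since only the 5% proportion is stated above:

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# "data" is the cleaned, merged dataframe and "quantity" the target column.
X = data.drop(columns=["quantity"])
y = data["quantity"]

# 5% validation holdout; taking the last rows keeps the split time-consistent.
cutoff = int(len(X) * 0.95)
X_train, X_val = X.iloc[:cutoff], X.iloc[cutoff:]
y_train, y_val = y.iloc[:cutoff], y.iloc[cutoff:]

cat_cols = X.select_dtypes(include=["object", "category"]).columns
# select_dtypes("number") excludes boolean columns, which are left unscaled.
num_cols = X.select_dtypes(include=["number"]).columns

# One-hot encode categoricals, standardize numericals, pass everything else through.
preprocess = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), list(cat_cols)),
        ("num", StandardScaler(), list(num_cols)),
    ],
    remainder="passthrough",
)

X_train_enc = preprocess.fit_transform(X_train)
X_val_enc = preprocess.transform(X_val)
```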

Regarding the visualizations

I depicted multiple aspects of the dataset in relation to the target, since this helps me gain insight into which features best explain the target variable.

[Figure: item demand by day of the week]

For example, in this graph, we can see that on Sundays, the demand for items is lower than on other days. Thanks to this graph, I developed a variable that identifies the items sold on Sundays.
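A minimal sketch of that feature, assuming the merged dataframe has a date column named `date` (the actual column name may differ):

```python
import pandas as pd

# "date" is an assumed column name; adjust to the dataset's actual date column.
data["date"] = pd.to_datetime(data["date"])

# Binary flag marking sales that occurred on a Sunday (Monday = 0, ..., Sunday = 6).
data["is_sunday"] = (data["date"].dt.dayofweek == 6).astype(int)
```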

Overall, data visualization allowed me to understand the data better and to engineer more informative features for the model.

Regarding the strategy for model training

As far as I know, there are two types of strategies for training a model on temporally sorted data.