Regarding the information provided by the competition

To start tackling this challenge, the first step I took was Exploratory Data Analysis (EDA). To do that, I loaded all the datasets: the competition provides a total of 10 files in .csv format.

[Figure: overview of the competition files]

Regarding the datasets I used

The sample_submission.csv file gives us the format for submitting our results: the submission file must contain two columns, row_id and quantity. Therefore, our target is the quantity of items sold for each store.
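As a quick illustration, here is a minimal sketch of assembling a file in that format (the constant placeholder value is hypothetical and would be replaced by model predictions):

```python
import pandas as pd

# Reuse the row_id column and column order from the provided template.
submission = pd.read_csv("sample_submission.csv")

# Placeholder value; in practice this is replaced with the model's predictions.
submission["quantity"] = 0

submission.to_csv("submission.csv", index=False)
```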

For training my model, I used the following datasets: sales.csv, catalog.csv, and stores.csv. These datasets provide information about the stores and the items sold from August 28, 2022, to September 26, 2024, and I needed to join them into a single dataset to include all the available data.
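Below is a minimal sketch of that join; `item_id` and `store_id` are assumed key names used here for illustration and may differ from the actual columns:

```python
import pandas as pd

sales = pd.read_csv("sales.csv")
catalog = pd.read_csv("catalog.csv")
stores = pd.read_csv("stores.csv")

# Enrich each sales record with item metadata and store metadata.
# "item_id" and "store_id" are assumed join keys; adjust to the real column names.
data = (
    sales
    .merge(catalog, on="item_id", how="left")
    .merge(stores, on="store_id", how="left")
)

print(data.shape)
```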

All datasets were explored, and I fixed all the problems I found, such as null values and outliers. Furthermore, I created an algorithm to translate the catalog.csv dataset since it contains data in Russian. This is not necessary for successfully training the model but is useful for data visualization.
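As an example of the kind of cleaning applied, here is a minimal sketch that assumes the merged dataframe `data` from the previous step and uses an IQR rule for outliers; the thresholds are illustrative, not the exact ones used:

```python
# Fill missing numeric values with the column median.
num_cols = data.select_dtypes("number").columns
data[num_cols] = data[num_cols].fillna(data[num_cols].median())

# Clip extreme outliers in the target using a 1.5 * IQR rule (illustrative threshold).
q1, q3 = data["quantity"].quantile([0.25, 0.75])
iqr = q3 - q1
data["quantity"] = data["quantity"].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)
```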

The total dataset contained 7,640,752 records, so I decided to split it into two subsets: the training and validation sets, where the validation set contains 5% of the total records. All categorical columns were one-hot encoded, and all numerical columns, except boolean columns, were standardized.
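A minimal sketch of that preprocessing with scikit-learn; holding out the last 5% of rows chronologically is my own assumption here, since only the 5% proportion is stated above:

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# "data" is the cleaned, merged dataframe and "quantity" the target column.
X = data.drop(columns=["quantity"])
y = data["quantity"]

# 5% validation holdout; taking the last rows keeps the split time-consistent.
cutoff = int(len(X) * 0.95)
X_train, X_val = X.iloc[:cutoff], X.iloc[cutoff:]
y_train, y_val = y.iloc[:cutoff], y.iloc[cutoff:]

cat_cols = X.select_dtypes(include=["object", "category"]).columns
# select_dtypes("number") excludes boolean columns, which are left unscaled.
num_cols = X.select_dtypes(include=["number"]).columns

# One-hot encode categoricals, standardize numericals, pass everything else through.
preprocess = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), list(cat_cols)),
        ("num", StandardScaler(), list(num_cols)),
    ],
    remainder="passthrough",
)

X_train_enc = preprocess.fit_transform(X_train)
X_val_enc = preprocess.transform(X_val)
```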

Regarding the visualizations

I depicted multiple aspects of the dataset in relation to the target, since this helps me gain insight into which features best explain the target variable.

[Figure: item demand by day of the week]

For example, in this graph, we can see that on Sundays, the demand for items is lower than on other days. Thanks to this graph, I developed a variable that identifies the items sold on Sundays.
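A minimal sketch of that feature, assuming the merged dataframe has a date column named `date` (the actual column name may differ):

```python
import pandas as pd

# "date" is an assumed column name; adjust to the dataset's actual date column.
data["date"] = pd.to_datetime(data["date"])

# Binary flag marking sales that occurred on a Sunday (Monday = 0, ..., Sunday = 6).
data["is_sunday"] = (data["date"].dt.dayofweek == 6).astype(int)
```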

Overall, data visualization allowed me to understand the data better and to engineer more informative features for the model.

Regarding the strategy for model training

As far as I know, there are two types of strategies for training a model on temporally sorted data.