There are 2 types of time series data that we store: actual data and predicted data.

For time series data, the fields can be divided into the following categories:

For predicted time series data, additional categories of fields are added:

Let's say we have the cargo flow data of 10 customers for 10 origins going to 10 destinations, and they are shipping 10 kinds of products. Each row stores the quantity of one product shipped from one origin to one destination for one customer on a particular date. There would be around 10*10*10*10*30 = 300k rows for 30 days' worth of data (the actual number of rows may vary, because not all dates have shipments for all the combinations). On top of that, let's assume we make predictions at the beginning of every week for the next 3 months. So it will take (the arithmetic is spelled out in the sketch after this list):

  1. Around 300k * 12 = 3.6 million rows to store one year's worth of actual cargo flow data
  2. Around 300k * 3 * 52 = 46.8 million rows to store one year's worth of predicted cargo flow data (each weekly batch covers a 3-month horizon, roughly 900k rows)
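
For concreteness, here is that arithmetic as a short Python sketch. The dimension sizes and the weekly, 3-month prediction cadence are just the assumptions from the example above, not production figures:

```python
# Back-of-the-envelope row counts for the example above.
# All dimension sizes and the prediction cadence are the illustrative
# assumptions from the text, not real numbers.

customers, origins, destinations, products = 10, 10, 10, 10
combinations = customers * origins * destinations * products      # 10,000

rows_per_month = combinations * 30                   # ~300k rows for 30 days
actual_rows_per_year = rows_per_month * 12           # ~3.6 million rows

rows_per_prediction_batch = rows_per_month * 3       # 3-month horizon ≈ 900k rows
predicted_rows_per_year = rows_per_prediction_batch * 52   # weekly batches ≈ 46.8 million rows

print(f"actual rows/year:    {actual_rows_per_year:,}")      # 3,600,000
print(f"predicted rows/year: {predicted_rows_per_year:,}")   # 46,800,000
```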

Normally, for the time series we store, we have several years' worth of actual data to begin with. The volume of this data alone is enough to hurt query performance and data ingestion speed. On top of that, we need to delete large amounts of data fairly frequently:

  1. Historical data needs to be removed and re-indexed regularly. The accounting team may take some time to confirm the cargo flow volumes/revenue, so we have to regularly re-index the past few months of historical data with the latest confirmed numbers.
  2. Predictions may also get updated: if something goes wrong with the predicted volumes in a particular week, we need to be able to remove that whole batch of predictions and re-index the predicted volumes.
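
As an illustration of the second point, if the data lived in an Elasticsearch index (the store itself isn't named above, and the index and field names here are purely hypothetical), removing a bad prediction batch before re-indexing it could look roughly like this:

```python
# Purely illustrative sketch: assumes an Elasticsearch index holding prediction
# rows, each tagged with the date its prediction batch was generated.
# "cargo_flow_predictions" and "prediction_date" are made-up names.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Drop the whole batch generated in a given week so it can be re-indexed
# with the corrected volumes.
es.delete_by_query(
    index="cargo_flow_predictions",
    body={"query": {"term": {"prediction_date": "2023-01-02"}}},
)
```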

Obstacles