Multi-column support for Python Architect

tl;dr we are releasing our version of Architect with support for multi-column partition key.

In our previous blog post, we talked about how we partition our Postgres database in order to scale it better. However, we did not mention the limitations and we want to address those in this post and share our solutions.

We mentioned that a proper time series database like timescale supports the concept of time-space partitioning. This is especially important in a multi-tenant SaaS platform. The more customers you have, the more data you will be collecting for a fixed time interval and at one point that will hit a scaling bottleneck. In the context of a SaaS platform, since each customer will only be accessing its own data, it intuitively makes sense to partition the time series data based on both timestamp and customer ID. The partition based on customer ID is what we call a space partition.

Problem statement

Architect allows us to partition a database table based on a partition key, this partition key could be either a time partition(based on timestamp) or a space partition(based on integer range/string) but not both. This creates a critical limitation. Putting the multi-tenant case aside, in our case, since we are generating rolling forecasts, the time series data size of each partition will grow linearly with respect to the rolling window.

Using the graph below to illustrate this, where we are generating rolling forecasts every week. In the graph, we have a rolling window of 15 weeks. Prediction creation week indicates when the prediction is generated and week indicates which week we are predicting for. We could either partition based on "Prediction Creation Week" or "Week". Either way, as illustrated in the graph, the amount of data in a single partition would reach 15 weeks' worth of data. This means we will be limited by the rolling window and we will hit a bottleneck as we start to generate longer term forecasts.

Therefore, we need a way of partitioning based on both prediction creation week and week.

Multi-column partition key

Multi-column partitioning allows us to specify more than one column as a partitioning key. In our case, each batch of prediction generated is identified by a unique id called source_file_id. We will try to partition our prediction based on both date and source_file_id.

PostgreSQL’s partitioning implementation in Architect is done purely at the database level. That means that Architect creates several triggers and functions and inserts them directly into the database.

We modified Architect's PostgreSQL triggers to facilitate multiple partition keys. The implementation is still done purely at the database level so even if you create new entries in the database table directly using sql, it will handle all the partition related business automatically including creating new partitions if necessary. You could see the difference between the sample triggers generated for single-column and multi-column here.