<aside> ✏️ 21.06.2020 by Darjan Salaj [ web | GoogleScholar | LinkedIn | GitHub ] Update 24.06.2020: Thanks to BadInformatics for pointing me to ODE based models and other remarks. The content was extended accordingly. Update 22.08.2020: Extended the details on missing values; Replaced the first figure with example and code; Added tip for Traces library. Update 01.11.2021: Added Multi-Time Attention Nets and Neural Rough Differential Equations method
How to train neural networks on time-series that are non-uniformly sampled / irregularly sampled / have non-equidistant timesteps / have missing or corrupt values? In the following post, I try to summarize and point to effective methods for dealing with such data.
A time series in which the time between individual steps/measurements is not constant is called non-uniformly or irregularly sampled. Irregularly sampled data occurs in many fields:
This is also common when the data is multi-modal. Multi-modal input comes from multiple sources that most likely take measurements without being synchronized with each other, which results in non-uniformly sampled input data.
Another source of irregular sampling is missing or corrupt values. This is very common in both automotive and EHR (electronic health record) data, where sensors malfunction or procedures are interrupted or rescheduled.
The first idea you might have is to interpolate and resample the time series at hand. This produces a uniformly sampled grid and allows standard methods to be applied. However, this approach rests on the strong assumption that the data behaves monotonically between the measurements. This assumption often does not hold, which leads to undesired artifacts during feature transformation and, in turn, to sub-par performance (sometimes worse than training on the raw data).
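To make the caveat concrete, here is a minimal sketch of naive resampling with linear interpolation, assuming NumPy. The helper name `resample_uniform` and the toy sine signal are my own illustration, not taken from any particular library or dataset.

```python
import numpy as np

def resample_uniform(t, x, step):
    """Linearly interpolate samples (t, x) onto a uniform grid with spacing `step`."""
    t = np.asarray(t, dtype=float)
    x = np.asarray(x, dtype=float)
    order = np.argsort(t)  # np.interp expects increasing sample times
    t, x = t[order], x[order]
    n = int(np.floor((t[-1] - t[0]) / step)) + 1
    grid = t[0] + step * np.arange(n)
    return grid, np.interp(grid, t, x)

# Hypothetical example: a sine wave observed at irregular times with one large gap.
t_obs = np.array([0.0, 0.4, 1.1, 1.5, 6.0, 6.3, 7.0])
x_obs = np.sin(t_obs)
grid, x_grid = resample_uniform(t_obs, x_obs, step=0.5)

# Inside the gap (1.5 < t < 6.0) the interpolant is a straight line,
# so it cannot follow the trough of the sine in that interval.
max_err = np.max(np.abs(x_grid - np.sin(grid)))
print(f"grid points: {len(grid)}, worst interpolation error: {max_err:.2f}")
```

In the gap the signal is not monotonic, so the interpolated values are far from the true ones. A model trained on the resampled grid would see those artifacts as if they were real measurements.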
<aside> 💡 If you are looking for help transforming unevenly spaced time series into evenly spaced ones, check out the Traces Python library.
Another factor to consider before interpolating is the distribution of timestep sizes. The easiest way to inspect it is to plot a histogram of the dt values (time differences) between consecutive data points. If the values of dt are large relative to the timescale on which the signal changes, it is unlikely that the signal behaves monotonically between the points. Furthermore, you should check the variance of dt and decide whether the sampling is "sufficiently" uniform to apply the classical methods, as sometimes the data turns out to be "uniform enough" (see the next section).
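The check described above could look roughly like the sketch below, assuming NumPy. The helper `dt_summary` and the two synthetic timestamp arrays are hypothetical. I use the coefficient of variation (std / mean) of dt as one possible measure of uniformity, and what counts as "sufficiently" uniform remains a judgment call for your data.

```python
import numpy as np

def dt_summary(timestamps, bins=20):
    """Histogram and simple spread statistics of the time differences."""
    t = np.sort(np.asarray(timestamps, dtype=float))
    dt = np.diff(t)  # time differences between consecutive points
    counts, edges = np.histogram(dt, bins=bins)
    mean = dt.mean()
    return {
        "dt": dt,
        "counts": counts,        # histogram bin counts
        "edges": edges,          # histogram bin edges
        "mean": mean,
        "max": dt.max(),
        "cv": dt.std() / mean,   # coefficient of variation of dt
    }

rng = np.random.default_rng(0)
# Hypothetical data: nearly uniform sampling with small jitter ...
t_regular = np.cumsum(1.0 + 0.01 * rng.standard_normal(1000))
# ... versus exponentially distributed gaps between measurements.
t_irregular = np.cumsum(rng.exponential(scale=1.0, size=1000))

regular = dt_summary(t_regular)
irregular = dt_summary(t_irregular)
print(f"regular:   cv = {regular['cv']:.3f}, max dt = {regular['max']:.2f}")
print(f"irregular: cv = {irregular['cv']:.3f}, max dt = {irregular['max']:.2f}")
```

To get the actual plot, the `dt` array can be passed to `matplotlib.pyplot.hist`. A coefficient of variation close to zero suggests the series may be "uniform enough", while a value around one or above means the gaps vary as much as their own mean.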
An example of the dt distribution of a real-world dataset with irregular time series (Gaia European space mission, Data Release 2) is shown in the plot below. The plot was generated using this Python notebook.
Another factor to consider is the computational and memory cost of interpolation, which depends on both the size of your dataset (if you plan to do it offline) and the target granularity of the interpolation grid.
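A quick back-of-the-envelope estimate helps here. The sketch below uses a hypothetical helper `grid_cost` and made-up numbers (one series spanning 30 days, stored as 4-byte floats) to show how the grid size grows as the step shrinks.

```python
import math

def grid_cost(t_start, t_end, step, n_channels=1, bytes_per_value=4):
    """Number of grid points and raw memory needed for one resampled series."""
    n_points = math.floor((t_end - t_start) / step) + 1
    return n_points, n_points * n_channels * bytes_per_value

# Hypothetical series spanning 30 days, with time measured in seconds.
span = 30 * 24 * 3600
costs = {}
for step in (3600.0, 60.0, 1.0):  # hourly, per-minute, per-second grids
    costs[step] = grid_cost(0.0, span, step)
    n_points, n_bytes = costs[step]
    print(f"step = {step:>6.0f} s -> {n_points:>9d} points, {n_bytes / 1e6:.2f} MB")
```

The cost scales inversely with the grid step and linearly with the number of series and channels, regardless of how few raw measurements you actually have. A sparse series resampled onto a fine grid can therefore take far more memory than the raw data.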
However, if you have enough data to learn the underlying distributions, it might be worth looking into the methods for data recovery and super-resolution, which I describe in the section below.