The data is available in two states: as raw data and as pre-processed data. Additionally, there are three reference tables for variable lookup.
The raw data was processed only where necessary for patient de-identification and is otherwise unchanged from the original source. It contains the complete set of available variables (681 variables) and consists of the following tables:
The pre-processed data consists of intermediary pipeline stages from our original Nature Medicine publication. Source variables representing the same clinical concept were merged into one meta-variable per concept. The data contains only the 18 most predictive meta-variables, as defined in our publication. Two stages of the pipeline are available:
Merged stage: source variables are merged into meta-variables by clinical concept, e.g. non-opioid analgesics. The time grid is left unchanged and is sparse.
Imputed stage: the data from the merged stage is downsampled to a five-minute time grid, and the grid is filled with imputed values. The imputation strategy is complex and is discussed in the original publication.
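To illustrate what the imputed stage does, the sketch below downsamples a sparse observation series to a regular five-minute grid with pandas. The variable name, values, and the forward-fill step are illustrative stand-ins only; the actual imputation strategy is more sophisticated and is described in the original publication.

```python
import pandas as pd

# Hypothetical sparse observations of one meta-variable
# (names and values are illustrative, not the dataset's actual schema).
obs = pd.DataFrame(
    {
        "datetime": pd.to_datetime(
            ["2020-01-01 00:02", "2020-01-01 00:09", "2020-01-01 00:21"]
        ),
        "heart_rate": [72.0, 75.0, 71.0],
    }
).set_index("datetime")

# Downsample to a regular five-minute grid (mean within each bucket),
# then forward-fill empty buckets as a simple stand-in for imputation.
grid = obs.resample("5min").mean().ffill()
print(grid)
```

The resulting frame has one row per five-minute bucket, with gaps between observations carried forward from the last measured value.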
The code used to generate these stages can be found in this GitHub repo under the preprocessing folder.
The pre-processed data is intended mainly as a quick way to jump-start a project or for use in a proof of concept. For regular projects, we recommend using the raw data whenever possible: it is the most flexible form and contains the complete set of variables at the original time resolution.
Data is available in two formats:
CSV for wide compatibility and
Apache Parquet for convenience and performance. Parquet is a strongly typed binary format supported by many major data-processing tools, such as