- there is example data in the repository (2_Data) that contains fake data that Tyla provided for us to show what the data looks like
- different Trackers for different years, differences between the years and sometimes even within a year between clinics
- DC_V2 is the new format?
- we only have pseudoanonymized data for 2017 - 2019 for some countries
- in the bucket in GCP Storage!
- Q: how to integrate data within R code → libraries?
- A: everybody used an encrypted USB stick and used the data locally
- our task will be to adapt these scripts to 2022 data
- current pipeline works with the real data in the bucket, not with the dummy data
- important: Patient AN Data (Tyla is working on this anonymization, Patient ID)
- for each month of the year we have one tab (worksheet) in each tracker
- sometimes there is clinic information at the top
- product information about medical supplies handed out to the clinics
- patient data section
- data is copied over from month to month so not all measurements changed in the current month
- first step: extract all information for each month and each patient and put it together in one big format
- 10_Run_a4d_tracker_extraction: main file that contains the main pipeline and outputs the cleaned data frame
- always open the RProject so that relative paths are correct
- codebook is the mapping file
- working with the column synonyms of the codebook
- reading the synonyms of the data
- df_raw contains the raw data as in the trackers → always the same format but with empty values if a variable is not available
- reading_a4d_tracker function:
- go through each month sheet
- parse year data
- extract clinic data
- extract patient data → parse everything that starts with the patient recruitment summary table
- extract tracker columns
- df_cleaned is the cleaned version of this raw data → just a wrapper for each column
- e.g. identifies unrealistic hba1c values
- Codebook
- different versions of the data/tracker we had
- document as best as possible what are the different inputs to the product and patient table
- Version_TrackerMerged: is the description for the data we want to get out after each version → format extracted data frame with the correct names
- synonyms for patient and product data contains all synonyms of the variables
- if new synonyms pop up, we could add them to the excel
- output:

- structured data in database
- in the end we have the processed data file, but our code ends at the step raw, cleaned data (this is were Teresa’s data model come into play)
- we want to have a list of patients, clinics, products, visits, and given products
- so both ends exist but not the full pipeline
- Tracker Template:
- medical supplies: different products, monitor how many products have been handed out
- balance: total amount that they still have, but has been introduced only in later tracker → so we had to include this for earlier trackers ourselves
- code:
- two R scripts: main R scripts is Preprocessing_MSD_refactored
- takes as input each of the tracker files
- process everything and produce one data frame with the product data for this year and this clinic for all the months and products
- helper_product_data is the most relevant helper scripts
- makes use of Codebook to harmonize the different column names
- same logic regarding the parsing, go over the month sheets, work with synonyms etc
- see synonyms_ProductData (similar to PatientData)
- make sure to only extract product data!
- a lot of recoding of the columns to standardize the format
- empty dates: assumption that if balance stays the same nothing happend, but sometimes things changes and we copy the last given date down (forward fill)
- struggled most with column received and units received
- recalculated the balance in some trackers, mostly in the earlier trackers
- some trackers contained several products in the same cell (separated by “;” for example) → tried to divide this in different cells (see extract_product_multiple)
- need to be careful with balance of products between months —> not checked at this moment in the script! (e.g. product balance end of January might not match balance in the next sheet for February)
- where do additional products would come from? maybe A4d, …
- Codebook is adjusted to the real data! But we didn’t have any data for 2021/2022 so the features/names for these years are still only based on the example data
- data cleaning:
- one part can only be done by Tyla because only she has access to the real unanonymized data
- she will hand-over the pseudoanonymized and encoded data
- our cleaning is basic cleaning
- Q: future technologies?
- A: talked about using basic dashboard tools (R shiny rhino?), did not go into it further
- A: basic insides needed; aggregations over time
- have we to deal with everything? No, we can ask A4D to standardize the data so that it is ready/more easy to process it BUT A4D has not much resources and at some point it will just not be feasable for Tyle to do it