A4D Code hand-over

there is example data in the repository (2_Data) that contains fake data that Tyla provided for us to show what the data looks like
- different Trackers for different years, differences between the years and sometimes even within a year between clinics
- DC_V2 is the new format?
- we only have pseudoanonymized data for 2017 - 2019 for some countries
  - in the bucket in GCP Storage!
  - Q: how to integrate data within R code → libraries?
    - A: everybody used an encrypted USB stick and used the data locally
- our task will be to adapt these scripts to 2022 data
- current pipeline works with the real data in the bucket, not with the dummy data
important: Patient AN Data (Tyla is working on this anonymization, Patient ID)
- for each month of the year we have one tab (worksheet) in each tracker
  - sometimes there is clinic information at the top
  - product information about medical supplies handed out to the clinics
  - patient data section
    - data is copied over from month to month so not all measurements changed in the current month
    - first step: extract all information for each month and each patient and put it together in one big format
10_Run_a4d_tracker_extraction: main file that contains the main pipeline and outputs the cleaned data frame
- always open the RProject so that relative paths are correct
- codebook is the mapping file
- working with the column synonyms of the codebook
  - reading the synonyms of the data
- df_raw contains the raw data as in the trackers → always the same format but with empty values if a variable is not available
  - reading_a4d_tracker function:
    - go through each month sheet
    - parse year data
    - extract clinic data
    - extract patient data → parse everything that starts with the patient recruitment summary table
    - extract tracker columns
- df_cleaned is the cleaned version of this raw data → just a wrapper for each column
  - e.g. identifies unrealistic hba1c values
Codebook
- different versions of the data/tracker we had
- document as best as possible what are the different inputs to the product and patient table
- Version_TrackerMerged: is the description for the data we want to get out after each version → format extracted data frame with the correct names
- synonyms for patient and product data contains all synonyms of the variables
  - if new synonyms pop up, we could add them to the excel
output:

Untitled

structured data in database
- in the end we have the processed data file, but our code ends at the step raw, cleaned data (this is were Teresa’s data model come into play)
  - we want to have a list of patients, clinics, products, visits, and given products
- so both ends exist but not the full pipeline
Tracker Template:
- medical supplies: different products, monitor how many products have been handed out
- balance: total amount that they still have, but has been introduced only in later tracker → so we had to include this for earlier trackers ourselves
- code:
  - two R scripts: main R scripts is Preprocessing_MSD_refactored
    - takes as input each of the tracker files
    - process everything and produce one data frame with the product data for this year and this clinic for all the months and products
    - helper_product_data is the most relevant helper scripts
    - makes use of Codebook to harmonize the different column names
  - same logic regarding the parsing, go over the month sheets, work with synonyms etc
    - see synonyms_ProductData (similar to PatientData)
  - make sure to only extract product data!
  - a lot of recoding of the columns to standardize the format
  - empty dates: assumption that if balance stays the same nothing happend, but sometimes things changes and we copy the last given date down (forward fill)
  - struggled most with column received and units received
  - recalculated the balance in some trackers, mostly in the earlier trackers
  - some trackers contained several products in the same cell (separated by “;” for example) → tried to divide this in different cells (see extract_product_multiple)
  - need to be careful with balance of products between months —> not checked at this moment in the script! (e.g. product balance end of January might not match balance in the next sheet for February)
    - where do additional products would come from? maybe A4d, …
Codebook is adjusted to the real data! But we didn’t have any data for 2021/2022 so the features/names for these years are still only based on the example data
data cleaning:
- one part can only be done by Tyla because only she has access to the real unanonymized data
- she will hand-over the pseudoanonymized and encoded data
- our cleaning is basic cleaning
Q: future technologies?
- A: talked about using basic dashboard tools (R shiny rhino?), did not go into it further
- A: basic insides needed; aggregations over time
- have we to deal with everything? No, we can ask A4D to standardize the data so that it is ready/more easy to process it BUT A4D has not much resources and at some point it will just not be feasable for Tyle to do it