TECH ASSESSMENT

Imagine that YOU are a Machine Learning Research Scientist in our data science team who is collaborating with our data engineering team.

Overview

The data engineering team has done some research and found that Google Trends data is potentially beneficial to the data science team. YOU, as a data scientist, want a consistent time series of Google Trends data from 2017 till the present at an hourly interval. YOU informed the engineering team of this requirement, but they said they could not fetch the hourly data directly; the reason is explained in the Deep Dive section. They can, however, fetch the raw data described in the Deep Dive section and provided in the Raw Data section.

What do YOU need to do?

Deep Dive

The data engineering team fetched the raw data from Google Trends by way of web scraping from its website (as linked here).

As you can see, the Google Trends website offers a drop-down box for choosing a custom time range (e.g. “From 2004 to present”).

The engineering team found that, by choosing a time range of 2017-present, they could only provide a consistent time series of Google Trends data at a monthly interval (downloadable as monthly_data.csv in the Raw Data section).

In order to get a time series at an hourly interval, they were forced to work within a more constrained time range (i.e. a week). They were able to build an hourly time series from 2017 up to the present (downloadable as hourly_data.csv in the Raw Data section) by retrieving week-range data and concatenating it week by week.

However, these hourly data are not what YOU want, since they are not consistent!

Google scales the trends data within the time range you choose. For example, a value_hour of ‘50’ during the week from 2022-07-03 to 2022-07-09 does not represent the same level of interest as a value_hour of ‘50’ during the week from 2022-07-17 to 2022-07-23.

Only the value_hour numbers that sit within the same week are consistent.
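
To make the scaling concrete, here is a minimal sketch of window-wise scaling. It assumes that each window is scaled so that its own peak becomes 100 and the result is rounded to an integer; scale_window and all numbers are hypothetical and serve only as an illustration.

```python
def scale_window(values):
    """Scale one window so that its own peak becomes 100 (assumed behaviour)."""
    peak = max(values)
    if peak == 0:
        return [0 for _ in values]
    return [round(100 * v / peak) for v in values]


# Hypothetical "true" search interest in two different weeks (made-up numbers).
week_a = [10, 20, 40]
week_b = [100, 200, 400]

scaled_a = scale_window(week_a)  # [25, 50, 100]
scaled_b = scale_window(week_b)  # [25, 50, 100]
```

Both weeks show the same scaled values, although the underlying interest in week_b is ten times higher. A '50' therefore only carries meaning relative to the other values in its own window.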

Similarly, to get a time series at a weekly interval (downloadable as weekly_data.csv in the Raw Data section), the engineering team used a time range of a month. They fetched month-range data and concatenated it month by month from 2017 till the present.

By the same token, only the value_week numbers within the same month are consistent.
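
One possible way to make weekly values comparable across months is to rescale each month's weeks so that their average matches that month's value in the consistent monthly series. The sketch below works through the arithmetic for two months. It assumes that a month's value is proportional to the average interest of its weeks, which the assessment does not state; the dictionary layout and all numbers are made up for illustration and are not taken from the CSV files.

```python
from statistics import mean

# Made-up numbers for illustration only.
monthly = {"2022-06": 40, "2022-07": 100}   # consistent across months
weekly = {                                  # consistent only within a month
    "2022-06": [100, 60, 80, 80],
    "2022-07": [80, 100, 60, 80],
}

consistent_weekly = {}
for month, values in weekly.items():
    # One factor per month: it maps the month's weekly average onto the
    # month's value in the consistent monthly series.
    factor = monthly[month] / mean(values)
    consistent_weekly[month] = [v * factor for v in values]

# consistent_weekly["2022-06"] == [50.0, 30.0, 40.0, 40.0]
# consistent_weekly["2022-07"] == [100.0, 125.0, 75.0, 100.0]
```

After rescaling, values can exceed 100. If a 0-100 range is required, the whole series can be divided by its overall maximum and multiplied by 100 at the end.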

Problem

With the monthly_data.csv, weekly_data.csv and hourly_data.csv data files given to you by the engineering team, how do you use them to output a consistent time series of Google Trends data from 2017 till the present at an hourly interval?

Write a Python script to solve this problem using the time series files downloadable from the Raw Data section below.
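
One candidate approach, not necessarily the intended answer, is to chain two rescaling steps: put the weekly values on the monthly scale, then put the hourly values on that rescaled weekly scale. The sketch below checks the idea on synthetic data rather than on the real files. It assumes an idealized setting: weeks nest exactly inside months, a coarser value is the average of the finer ones, each window is scaled to a peak of 100, and no rounding occurs. All function names and the tuple keys (month, week, hour) are hypothetical.

```python
from collections import defaultdict
from math import isclose
from statistics import mean


def windows(series):
    """Group a {key: value} series by parent key (the key minus its last element)."""
    groups = defaultdict(dict)
    for key, value in series.items():
        groups[key[:-1]][key] = value
    return groups


def aggregate(series):
    """Average a series up one level (hours -> weeks, weeks -> months)."""
    return {p: mean(list(g.values())) for p, g in windows(series).items()}


def scale_within_windows(series):
    """Simulate window-wise scaling: each window's peak becomes 100."""
    scaled = {}
    for group in windows(series).values():
        peak = max(group.values())
        for key, value in group.items():
            scaled[key] = 100 * value / peak if peak else 0.0
    return scaled


def rescale_to_parent(child, parent):
    """Put window-scaled child values on the scale of the parent series."""
    rescaled = {}
    for parent_key, group in windows(child).items():
        window_mean = mean(list(group.values()))
        factor = parent[parent_key] / window_mean if window_mean else 0.0
        for key, value in group.items():
            rescaled[key] = value * factor
    return rescaled


# Synthetic "true" interest: 2 months x 2 weeks x 3 hours (made-up numbers).
truth = {
    (0, 0, 0): 2.0, (0, 0, 1): 4.0, (0, 0, 2): 6.0,
    (0, 1, 0): 1.0, (0, 1, 1): 1.0, (0, 1, 2): 4.0,
    (1, 0, 0): 8.0, (1, 0, 1): 2.0, (1, 0, 2): 5.0,
    (1, 1, 0): 3.0, (1, 1, 1): 9.0, (1, 1, 2): 6.0,
}
week_truth = aggregate(truth)         # keys: (month, week)
month_truth = aggregate(week_truth)   # keys: (month,)

# Simulated stand-ins for hourly_data.csv, weekly_data.csv, monthly_data.csv.
hourly = scale_within_windows(truth)         # one window per week
weekly = scale_within_windows(week_truth)    # one window per month
monthly = scale_within_windows(month_truth)  # a single window overall

# Chain the two rescaling steps: weeks onto months, then hours onto weeks.
weekly_consistent = rescale_to_parent(weekly, monthly)
hourly_consistent = rescale_to_parent(hourly, weekly_consistent)

# If the chain works, every hour has the same ratio to the true interest.
ratios = [hourly_consistent[k] / truth[k] for k in truth]
is_consistent = all(isclose(r, ratios[0]) for r in ratios)
```

On real data the idealized assumptions do not hold exactly: values are rounded to integers, windows can contain zeros, and calendar weeks may straddle month boundaries, so the result would be an approximation that needs to be validated against the files.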

Raw Data

hourly_data.csv

weekly_data.csv

monthly_data.csv

SUBMIT YOUR ANSWER

Upload program code (or pseudo code) file(s) along with the README file for this TTA to GitHub, and send the repository link to careers+data_representation_tta@eonlabs.com

You are always welcome to ask any questions you may have about this TTA by sending an email to careers+data_representation_tta@eonlabs.com so that our engineering team may answer them.

@May @Chen Li

Hi Terry and careers team,

Thank you for sharing the tech challenge with me.

I have a couple of questions and would appreciate it if you could provide some clarification.

  1. The time window of the available data is 2017-2022. How could I provide the normalized data from 2007 till the present?
  2. I am still not sure what exactly the question is. The data values in all three files (monthly, weekly and hourly) are in the [0, 100] range and, as mentioned in the Deep Dive part of the question, they are normalized. What exactly does the team need to know?

It would be great if I could have a (video/voice) chat regarding my questions.

Best

Mandana