Most functional data teams today are split between Data Scientists and Data Engineers, with the two roles typically covering the following skillsets:

Figure: typical skillset split between Data Scientists and Data Engineers (https://s3-us-west-2.amazonaws.com/secure.notion-static.com/10dcad6d-4703-4996-b759-cf30064a916d/figure.png)

However, the ML/data role has evolved significantly over the past few years as more and more of the grunt work has been abstracted away. This is part of a larger trend where software is eating software, affording engineers an unprecedented level of technical abstraction.

The seminal 'Engineers Shouldn’t Write ETL' piece argues for merging the traditional data engineer and data scientist roles into a hybrid. We see this emerging externally via the 'Machine Learning Engineer' title, and at Google I was embedded in a similar team where one was expected to go from an atomic piece of data all the way to a productionized ML model: a new end-to-end engineer.

Figure: the end-to-end engineer's workflow, from raw data to productionized ML model (https://s3-us-west-2.amazonaws.com/secure.notion-static.com/61260ce3-a727-4ffa-b7fe-89ff58143a3f/figure-2.png)

Not only is this class of engineer tasked with productionizing ML models; their workflow also spans exploratory data analysis, building feature-engineering pipelines, and investigating database inconsistencies.

5-10 years back, to even think about productionizing an ML model at any scale, you'd need to spin up Hadoop clusters on your own machines and employ people whose only job was to write MapReduce pipelines. You'd also need dedicated folks to manage your ETLs and to set up and maintain your data warehouse. Now we can use tools such as Fivetran, which removes the need to set up your own ETLs into your warehouse, and to deploy pipelines that process data at scale we can reach for the accessible Prefect + Dask stack (all in Python, too!).
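To give a sense of how thin that layer has become, here is a minimal sketch of an ETL-style flow, assuming the Prefect 1.x API with a local DaskExecutor; the task bodies are hypothetical stand-ins rather than anything from a real pipeline.

```python
from prefect import task, Flow
from prefect.executors import DaskExecutor  # Prefect 1.x; lives elsewhere in older releases


@task
def extract():
    # Hypothetical stand-in for pulling rows from a source system.
    return list(range(10))


@task
def transform(rows):
    # Toy feature-engineering step.
    return [r * 2 for r in rows]


@task
def load(rows):
    # Stand-in for writing results to a warehouse table.
    print(f"loaded {len(rows)} rows")


with Flow("example-etl") as flow:
    rows = extract()
    features = transform(rows)
    load(features)

# Dask handles parallel task execution without any cluster plumbing on our side.
flow.run(executor=DaskExecutor())
```

The same flow can later be pointed at a remote Dask cluster, which is exactly the kind of scaling decision that used to require a dedicated infrastructure team.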

I observed this trend at Google as well, where a lot of the internal MapReduce libraries were sunset in favor of Flume: a data pipeline framework that eliminated much of the boilerplate and unnecessary knobs, letting engineers write robust real-time and batch pipelines quickly. You can see this lineage in Google Cloud's Dataflow offering as well.
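Dataflow runs Apache Beam pipelines, which descend from that same Flume lineage. As a rough illustration of how little boilerplate remains, here is a minimal word-count sketch in the Beam Python SDK; the in-memory input is purely for illustration, and the same code can be handed to a different runner to execute in the cloud.

```python
import apache_beam as beam

# A toy batch pipeline: count words from an in-memory collection.
with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Create" >> beam.Create(["an end to end", "end to end engineer"])
        | "Split" >> beam.FlatMap(str.split)
        | "Pair" >> beam.Map(lambda word: (word, 1))
        | "Count" >> beam.CombinePerKey(sum)
        | "Print" >> beam.Map(print)
    )
```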

Now, the beauty is that this new breed of data tools allows an engineer to get by with only Python and SQL skills: everything else is about interfacing with a tool. You also don't need a thorough understanding of deep learning to train and deploy a Keras model that provides immense value to the business (e.g. a classifier to detect ID documents). Of course, if you eventually want to chase incremental F1-score gains, you might need a researcher to develop custom models. But that need tends to arise only once your company starts processing petabytes of data; until then, decision trees and regressions should suffice for most tasks.
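To make that concrete, here is a minimal sketch of such a classifier, assuming a recent TensorFlow 2.x Keras install and a hypothetical directory of labeled images; the paths, image size, and layer sizes are illustrative rather than a recipe.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Hypothetical dataset: folders "id_document/" and "other/" under data/train.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "data/train", image_size=(128, 128), batch_size=32
)

# A deliberately small convolutional classifier -- no custom research required.
model = tf.keras.Sequential([
    layers.Rescaling(1.0 / 255, input_shape=(128, 128, 3)),
    layers.Conv2D(16, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # probability the image is an ID document
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(train_ds, epochs=5)
model.save("id_document_classifier")
```

A model like this won't win benchmarks, but it is the kind of straightforward, tool-driven work the end-to-end engineer can ship without a research team.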

What this leaves us with is that engineers can focus on solving the interesting problems, leaving the routine, boring chores to be eaten up by abstraction.