by Samee Zahid, Engineering Director

At Chipper, our ML engineers in the Intelligence Org operate across our data, backend, and native stacks. This has proven to be a powerful paradigm for enabling ML-first development across the company: faster productionization of (and decisioning around) risk scoring, customized and smart growth-campaign tooling, a more valuable in-house onboarding KYC suite, and much, much more.

Some background…

Most functional data teams today are split between Data Scientists and Data Engineers, whose roles typically cover the following skillsets:

*(Figure: Data Scientist vs. Data Engineer skillsets; credits to O’Reilly Media)*

However, the ML/data role has evolved significantly over the past few years, as more and more of the grunt work gets abstracted away. This is part of a larger trend where software is eating software, affording engineers an unprecedented level of technical abstraction.

There is the seminal 'Engineers Shouldn’t Write ETL' piece, which argues for merging the two traditional roles of data engineer and data scientist into a hybrid. We see this emerging externally via the 'Machine Learning Engineer' title, and at Google I was embedded in a similar team, where one was expected to go from an atomic piece of data to a productionized ML model: a new end-to-end engineer.

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/32288590-478a-4cf1-86b2-c01650bade91/figure-3.png

Not only is this class of engineer tasked with productionizing ML models, but their workflow also spans exploratory data analysis, building feature-engineering pipelines, and investigating database inconsistencies.

5-10 years back, to even think about productionizing an ML model at any real scale, you'd need to spin up Hadoop clusters on your machines and employ people whose only job was coding up map-reduce pipelines. You'd also need dedicated folks to manage your ETLs and to set up and maintain your data warehouse. Now, we can use tools such as Fivetran, which eliminates the need to set up your own ETLs into the warehouse. To deploy pipelines that can process data at scale, we can use the accessible Prefect + Dask stack (all in Python too!), as sketched below.
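To make that concrete, here is a minimal sketch of what such a pipeline can look like using a Prefect 1.x-style API with a Dask executor. The task names and data are hypothetical; a real feature pipeline would read from and write to the warehouse rather than an in-memory list.

```python
from prefect import task, Flow
from prefect.executors import DaskExecutor


@task
def extract_transactions():
    # Hypothetical source; in practice this would query the warehouse.
    return [{"amount": 120.0, "country": "GH"}, {"amount": 45.5, "country": "UG"}]


@task
def compute_features(transactions):
    # Toy feature engineering: flag large transactions.
    return [{**t, "is_large": t["amount"] > 100} for t in transactions]


@task
def load_features(features):
    # Stand-in for writing back to a feature store / warehouse table.
    print(f"loaded {len(features)} feature rows")


with Flow("transaction-features") as flow:
    txns = extract_transactions()
    feats = compute_features(txns)
    load_features(feats)

if __name__ == "__main__":
    # The Dask executor runs tasks on a local or remote Dask cluster,
    # so the same flow scales out without code changes.
    flow.run(executor=DaskExecutor())
```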

I observed this trend at Google as well, where a lot of the internal Map-Reduce libraries were sunset in favor of Flume: a data pipeline framework which eliminated a lot of the boilerplate and unnecessary knobs, letting engineers code up robust real-time and batch pipelines quickly. You can see this now in Google Cloud's Dataflow offering as well.
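Dataflow runs pipelines written with the open-source Apache Beam SDK. A minimal, hypothetical batch pipeline gives a feel for how little boilerplate is involved (the input data here is made up):

```python
import apache_beam as beam

# A tiny batch pipeline: count transactions per country.
# Swapping the runner (e.g. to Dataflow) moves the same code to the cloud.
with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Create" >> beam.Create([("GH", 1), ("UG", 1), ("GH", 1)])
        | "CountPerCountry" >> beam.CombinePerKey(sum)
        | "Print" >> beam.Map(print)
    )
```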

Now, the beauty is that this new breed of data tools (mentioned above) lets an engineer get by with only Python and SQL skills: everything else is about interfacing with a tool. You also don't need a thorough understanding of deep learning to train and deploy a model that can provide immense value to the business (e.g. a classifier to detect ID documents). Of course, if you eventually want to chase incremental F1-score gains, you might need a researcher to develop custom models. But until then, decision trees and regressions should suffice.
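As an illustration, a baseline along these lines is often enough to start with. The sketch below uses scikit-learn; the features and labels are entirely hypothetical stand-ins for whatever signals a document classifier might actually use.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Hypothetical tabular features (e.g. image dimensions, edge density, OCR hit-rate).
rng = np.random.default_rng(0)
X = rng.random((1_000, 3))
y = (X[:, 2] > 0.5).astype(int)  # stand-in label: "looks like an ID document"

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A plain decision tree: simple to train, deploy, and explain.
model = DecisionTreeClassifier(max_depth=5, random_state=0)
model.fit(X_train, y_train)

print("F1:", f1_score(y_test, model.predict(X_test)))
```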

What this leaves us with is: engineers can focus on solving the interesting problems, and leave the routine chores to be eaten up by abstraction.

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/61260ce3-a727-4ffa-b7fe-89ff58143a3f/figure-2.png