The million dollar engineering problem | Segment Blog

https://segment.com/blog/the-million-dollar-eng-problem/

For an early startup, using the cloud isn’t even a question these days. No RFPs, provisioning orders, or physical shipments of servers. Just the promise of getting up and running on “infinitely scalable” compute power within minutes.

But, the ability to provision thousands of dollars worth of infrastructure with a single API call comes with a very large hidden cost. And it’s something you won’t find on any pricing page.

Because outsourcing infrastructure is so damn easy (RDS, Redshift, S3, etc), it’s easy to fall into a cycle where the first response to any problem is to spend more money__.

And if your startup is trying to move as quickly as possible, the company may soon be staring at a five, six, or seven figure bill at the end of every month.

At Segment, we found ourselves in a similar situation near the end of last year. We were hitting the classic startup scaling problems, and our costs were starting to grow a bit too quickly. So we decided to focus on reducing the primary contributor: our AWS bill.

After a three months of focused work, we managed to cut our AWS bill by over one million dollars annually. Here is the story of how we did it.

Cash rules everything around me

Before diving in, it’s worth explaining the business reasons that really pushed us to build discipline around our infrastructure costs.

The costs for most SaaS products tend to find economies of scale early. If you are just selling software, distribution is essentially free, and you can support millions of users after the initial development. But the cost for infrastructure-as-a-service products (like Segment) tends to grow linearly with adoption. Not sub-linearly.

As a concrete example: a single Salesforce server supports thousands or millions of users, since each user generates a handful of requests per second. A single Segment container, on the other hand, has to process thousands of messages per second–all of which may come from a single customer.

By the end of Q3 2016, two thirds of our cost of goods sold (COGS) was the bill from AWS. Here’s the graph of the spend on a monthly basis, normalized against our May spend.

Our infrastructure cost was unacceptably high, and starting to impact our efforts to create a sustainable long-term business. It was time for a change.

Getting a lay of the land

If the first step in cost reduction is “admitting you have a problem”, the second is “identifying potential savings.” And with AWS, that turns out to be a surprisingly hard thing to do.

How do you determine the costs of an environment that is billed hourly with blended annual commits, auto-scaling instances, and bandwidth costs?

There are plenty of tools out there that promise to help optimize your infrastructure spend, but let’s get this out of the way: there is no magic bullet.

In our case, this meant digging through the bill line-by-line and scrutinizing every single resource.