Common data model mistakes made by startups

After helping build out a few startups’ analytics stacks, one starts to see some patterns. Sometimes these patterns are happy ones. Just about everyone loves the moment when you go from having no idea what’s going on to having a foggy idea of what happened last week.

Other patterns are less awesome, and usually involve decisions around data models or schemas.

It’s important to note that the anti-patterns we’ll discuss below are specific to startups. Some of these patterns are actually good ideas for later-stage companies, but for small, pre-product-market-fit, resource-constrained startups, these are mistakes you don’t need to make.

Without further ado, here are the top five sources of pain in early-stage analytics:

1. Polluting your database with test or fake data

Whether it’s test accounts, staff accounts, different data programs, or orders that come in through feline telepathy, too many companies store data that forces you to exclude certain events or transactions in many or most of your queries.

By polluting your database with test data, you’ve introduced a tax on all analytics (and internal tool building) at your company. You can balance this tax against transactional efficiency or developer productivity. Sometimes this tax is worth it, sometimes it isn’t. For large companies, transactional efficiency is an important enough goal that you can afford to spend a couple of engineers’ or analysts’ time cleaning up the results.

If you’re small, chances are you can’t afford this, and you should probably make the tradeoff somewhere else.
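To make the tax concrete, here is a minimal sketch using an in-memory SQLite database. The table names, the `is_test` flag, and the sample rows are all hypothetical and exist only for illustration. The point is that every revenue query has to remember the join and the filter, and the one that forgets is silently wrong.

```python
import sqlite3

# Hypothetical schema: an `is_test` flag on users marks test and staff accounts.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        id INTEGER PRIMARY KEY,
        email TEXT NOT NULL,
        is_test INTEGER NOT NULL DEFAULT 0
    );
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES users(id),
        amount_cents INTEGER NOT NULL
    );
    INSERT INTO users (id, email, is_test) VALUES
        (1, 'customer@example.com', 0),
        (2, 'qa@example.com', 1);
    INSERT INTO orders (user_id, amount_cents) VALUES
        (1, 2500),     -- a real order
        (2, 999900);   -- a QA order placed against production
""")

# The query someone writes when they forget about test accounts.
naive_revenue = conn.execute(
    "SELECT SUM(amount_cents) FROM orders"
).fetchone()[0]

# The query every analyst has to write instead, every single time.
clean_revenue = conn.execute("""
    SELECT SUM(o.amount_cents)
    FROM orders o
    JOIN users u ON u.id = o.user_id
    WHERE u.is_test = 0
""").fetchone()[0]

print(naive_revenue, clean_revenue)
```

In this toy data the unfiltered total is dominated by the single QA order, which is the kind of discrepancy that tends to surface only after a number has already been shared.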

2. Reconstructing sessions post hoc

A sizable portion of the important questions around user behavior, satisfaction, and value revolve around session metrics. Whether they are called “sessions,” “conversations,” “support contacts,” or something else, these metrics refer to a number of discrete events related to a user that should be grouped together and treated as a single concept. It is, however, frighteningly common for startups’ data models to fail to capture this basic concept in the vocabulary of a business.

Sessions are often reconstructed post hoc, which typically results in a lot of fragility and pain. The exact definition of what comprises a session typically changes as the app itself changes. Additionally, there is often a lot of context around a user’s session on the client, or the server processing the client’s requests. It is far easier to assign a session, support ticket, or conversation ID in your app than it is to try to reconstruct a session after the fact.
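The difference between the two approaches can be sketched in a few lines. The code below is an illustration under stated assumptions, not a recommended implementation: the 30-minute idle threshold, the event dictionaries, and the `session_id` field are all hypothetical. It compares a post hoc reconstruction based on time gaps with a simple grouping on an ID the app assigned when the events happened.

```python
from datetime import datetime, timedelta

def sessions_by_gap(events, gap=timedelta(minutes=30)):
    """Post hoc reconstruction: start a new session whenever the idle
    time between consecutive events exceeds `gap` (an assumed threshold)."""
    sessions = []
    for event in sorted(events, key=lambda e: e["ts"]):
        if sessions and event["ts"] - sessions[-1][-1]["ts"] <= gap:
            sessions[-1].append(event)
        else:
            sessions.append([event])
    return sessions

def sessions_by_id(events):
    """Group on the session ID the app attached to each event."""
    grouped = {}
    for event in events:
        grouped.setdefault(event["session_id"], []).append(event)
    return list(grouped.values())

# One user's events. The app knows the third event belongs to the first
# session (the user sat on a long page), and that the fourth event follows
# a logout and a fresh login.
t0 = datetime(2024, 1, 1, 9, 0)
events = [
    {"ts": t0, "session_id": "s1"},
    {"ts": t0 + timedelta(minutes=10), "session_id": "s1"},
    {"ts": t0 + timedelta(minutes=50), "session_id": "s1"},
    {"ts": t0 + timedelta(minutes=55), "session_id": "s2"},
]

gap_sizes = [len(s) for s in sessions_by_gap(events)]
id_sizes = [len(s) for s in sessions_by_id(events)]
print(gap_sizes, id_sizes)
```

Both methods find two sessions here, but they draw the boundary in different places. The gap heuristic has no access to the logout, and its threshold has to be revisited whenever the app’s behavior changes.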

3. Soft deletes

At scale, deleting rows in a database under significant load is a Bad Thing. Soft deletes are a common schema tool that alleviates the performance hit of deletion and of the subsequent compaction (or vacuuming). Additionally, soft deletes make it easy to un-delete rows, or (in theory) recover deleted data.

On the flip side, soft deletes require every single read query to exclude deleted records. If you consider just the application’s data calls, this might not seem so bad. However, when multiplied across all the analytics queries that you’ll run, this exclusion quickly becomes a serious drag. Soft deletes also introduce yet another place where different users can make different assumptions, which can lead to inconsistent numbers that you’ll need to debug.
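A small sketch of how those inconsistent numbers arise, again with SQLite in memory. The `subscriptions` table and its nullable `deleted_at` column are hypothetical; `deleted_at` is one common way to mark a soft delete, not the only one.

```python
import sqlite3

# Hypothetical schema: a non-NULL `deleted_at` marks a soft-deleted row.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE subscriptions (
        id INTEGER PRIMARY KEY,
        plan TEXT NOT NULL,
        deleted_at TEXT
    );
    INSERT INTO subscriptions (plan, deleted_at) VALUES
        ('pro', NULL),
        ('pro', '2024-03-01'),  -- soft-deleted, but still in the table
        ('free', NULL);
""")

# Analyst A does not know about the soft-delete convention.
count_all = conn.execute(
    "SELECT COUNT(*) FROM subscriptions WHERE plan = 'pro'"
).fetchone()[0]

# Analyst B remembers to exclude deleted rows.
count_live = conn.execute(
    "SELECT COUNT(*) FROM subscriptions "
    "WHERE plan = 'pro' AND deleted_at IS NULL"
).fetchone()[0]

print(count_all, count_live)
```

Neither query fails, and each looks reasonable on its own. The disagreement only shows up when the two numbers land in the same meeting.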

4. Misusing semi-structured data

Semi-structured data (e.g., fields encoded as JSON) can be useful in situations where there are a number of different structures present over time. As databases get larger, semi-structured data can also help avoid the hassles of migrating large tables under heavy read or write load.

However, semi-structured data can also lead to a lot of heartburn when trying to get data out of a database. Typically, semi-structured data has a schema that is enforced only by convention, that might change unpredictably or be wrong due to transient bugs, and that in general requires a lot of post-hoc cleaning to be useful.

And sometimes semi-structured data fields are an excuse to punt on thinking through the structure you need until after you’ve written a feature. In this case, you actually have structured data; it’s just unenforced, prone to bugs, and generally a pain to use. A simple test: if a JSON field always contains the same four keys, you should probably decompose it into proper columns.
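One rough way to run that test is to count the distinct sets of keys that appear in a JSON column. The helper below is a hypothetical sketch that assumes each value is a JSON object stored as text. The sample rows are made up, and real data would come from a query against the column in question.

```python
import json
from collections import Counter

def key_signatures(raw_rows):
    """Count how often each distinct set of top-level keys appears.
    Assumes every row is a JSON object serialized as a string."""
    return Counter(
        tuple(sorted(json.loads(raw).keys())) for raw in raw_rows
    )

# Made-up values from a hypothetical `address` JSON column.
rows = [
    '{"city": "Austin", "state": "TX", "zip": "78701", "country": "US"}',
    '{"city": "Boston", "state": "MA", "zip": "02108", "country": "US"}',
    '{"zip": "94103", "country": "US", "city": "San Francisco", "state": "CA"}',
]

signatures = key_signatures(rows)

# A single signature across all rows suggests the field is really
# structured data and is a candidate for ordinary columns.
looks_structured = len(signatures) == 1
print(signatures, looks_structured)
```

If the counter comes back with many signatures, or one dominant signature plus a long tail of one-off variants, the tail is usually where the transient bugs and unannounced schema changes are hiding.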

5. The “right database for the job” syndrome

The presence of a multitude of different databases used in a company’s technology stack usually indicates one of three scenarios: