Data modeling basics

Intro

Data modeling questions are a fairly normal part of the interview process, and it is worth knowing some basic guidelines for how to organize data. Although this piece mostly covers the database level, the knowledge is useful across the stack. Frontend engineers, for example, frequently have to decide when a piece of logic should be broken out into a reusable component, and that gets into similar territory: we are thinking about how it makes sense to organize our information, what belongs together, and what should be broken up.

Because relational databases are the predominant type of database, we will mainly talk about them, and about good practices for modeling data in them specifically. But it is worth knowing that there are many other types of database out there, and what counts as “good practice” can differ from one database system to another. In addition to relational databases, there are:

Document databases (often what people mean by “NoSQL”)
Key/value stores
Graph databases
Network databases
Object oriented databases
Columnar databases

And there are others. Before you feel pressure to learn all of these, it is worth saying that I’ve been doing this a long time and have only ever dealt significantly with relational databases, key/value stores, and NoSQL. There is also some crossover in functionality between database types. One of my favorite things about MongoDB is the ability to store JSON-like data structures, but PostgreSQL, a relational database, also lets you store JSON if you need to. Some NoSQL databases also support query languages similar to SQL (Structured Query Language), the language of relational databases.

Basic ideas

I feel like data modeling is one of those terms that can sound intimidating before you know what it is. As a junior developer I had several terms like this, where for some reason my mind would go blank whenever I heard them. (“Scripting language” was one of these, even though scripting languages were all I knew 🤷‍♀️.)

If hearing “data modeling” makes you think of training AI models, or of complex statistical models like the ones used for political elections, you can put that to the side. You should be happy to know that data modeling for relational databases more or less follows common sense.

Relational databases and SQL have actually become a favorite technology of mine, because the principles so closely echo common sense. It is also rare for a technology this old to have survived such a rapidly changing landscape: E.F. Codd introduced the relational model in 1970, and relational databases still work well and are still well liked by the people using them. The history behind this technology is interesting, and you can read more about it here.

If I could summarize the “common sense” ideas that govern good data modeling for a relational database for someone with zero background, I’d probably highlight the following:

We want to follow separation of concerns by breaking up data as much as possible into distinct tables
We want to avoid duplicating data, because this makes more work for us if we need to delete or update anything
We want the columns of our tables to be clear and independently meaningful (the meaning of the value in column B should not depend on the value in column A; it should stand on its own)
We want to structure our data with safeguards against “junk data”. For instance, do we want a user’s typo in a state abbreviation to send the value “NZ” instead of “NC” to the database? Probably not, because it may cause our application not to work as expected, and we may have to clean up the bad data later
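
To make a couple of these ideas concrete, here is a minimal sketch using SQLite through Python's built-in sqlite3 module. I picked SQLite only because it runs without any setup. The table names (states, users), the column names, and the try_insert_user helper are all made up for illustration. The sketch stores each state once in its own table, has users point at it, and lets the database reject the “NZ” typo.

```python
import sqlite3

# In-memory database so the sketch leaves nothing behind.
conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# Hypothetical schema: state details live in one table, and users
# refer to a state by its abbreviation instead of repeating its name.
conn.executescript("""
    CREATE TABLE states (
        abbreviation TEXT PRIMARY KEY,
        name         TEXT NOT NULL
    );
    CREATE TABLE users (
        id    INTEGER PRIMARY KEY,
        name  TEXT NOT NULL,
        state TEXT NOT NULL REFERENCES states (abbreviation)
    );
""")

conn.execute("INSERT INTO states VALUES ('NC', 'North Carolina')")


def try_insert_user(name, state):
    """Return True if the row was accepted, False if a constraint rejected it."""
    try:
        conn.execute(
            "INSERT INTO users (name, state) VALUES (?, ?)", (name, state)
        )
        return True
    except sqlite3.IntegrityError:
        # Raised here when the state abbreviation is not in the states table.
        return False


accepted_valid = try_insert_user("Ada", "NC")   # known abbreviation
accepted_typo = try_insert_user("Grace", "NZ")  # typo, no such state

# Because the state name is stored once, changing it is a single update,
# no matter how many users live in that state.
conn.execute("UPDATE states SET name = 'N. Carolina' WHERE abbreviation = 'NC'")

rows = conn.execute("""
    SELECT users.name, states.name
    FROM users
    JOIN states ON states.abbreviation = users.state
    ORDER BY users.id
""").fetchall()

print(accepted_valid, accepted_typo, rows)
```

The foreign key is only one way to set up the safeguard. A CHECK constraint or validation in the application could also catch the typo, but a lookup table has the side benefit of giving the state name a single home.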