Looking at the current state of open-source & commercial solutions that help companies ensure data quality, you may get the impression that there are two very separate approaches you can take.

Data testing approach

Currently, the most popular approach to dealing with data quality problems is data testing. The idea behind it is quite simple: you test your data the same way you would test software, by writing tests that run on tables, files, etc., and fail if something unexpected happens.

There are a couple of popular open-source libraries that help you do that: Great Expectations, Deequ, and dbt. (In dbt, data testing is just one of many features; the other libraries are designed primarily for it.)
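To make this concrete, here is a minimal sketch of what such a test can look like with Great Expectations' pandas-backed API. The file name and column are hypothetical, and the exact API varies between Great Expectations versions:

```python
import pandas as pd
import great_expectations as ge

# Load a plain pandas DataFrame and wrap it so that
# expectation methods become available on it.
orders = pd.read_csv("orders.csv")  # hypothetical file
dataset = ge.from_pandas(orders)

# Assert-style check: returns a result with a "success" flag
# instead of raising immediately.
result = dataset.expect_column_values_to_not_be_null("order_id")
if not result["success"]:
    raise ValueError("Data test failed: order_id contains NULLs")
```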

Apart from these more specialized solutions, there are other ways of running checks on your database. For example, Apache Airflow has operators that let you run any SQL check. It doesn't give you Great Expectations-style asserts, but when you think about it, many of them are pretty simple SQL scripts you could write yourself; a sketch of such a check follows below.
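As an illustration, such a check can be expressed as a short Airflow task. This is only a sketch: the `my_dwh` connection and the `orders` table are hypothetical, and the exact import path of the operator depends on your Airflow version:

```python
from datetime import datetime

from airflow import DAG
# In recent Airflow versions this operator lives in the
# common SQL provider; older releases expose it elsewhere.
from airflow.providers.common.sql.operators.sql import SQLCheckOperator

with DAG(
    dag_id="data_quality_checks",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
) as dag:
    # The task fails if any value in the first returned row is falsy,
    # so this check fails whenever NULL order_ids exist.
    no_null_order_ids = SQLCheckOperator(
        task_id="no_null_order_ids",
        conn_id="my_dwh",  # hypothetical connection id
        sql="SELECT COUNT(*) = 0 FROM orders WHERE order_id IS NULL",
    )
```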

Data monitoring approach

There is another set of solutions in the data quality/data observability space. If you have ever heard of Monte Carlo or BigEye, those are SaaS solutions solving a very similar problem (your data going bad). They don't rely on data testing, but on statistics & ML to detect problems. They usually point out that although data tests are good, they are not enough to ensure data quality. Here is one blog post from Monte Carlo about that; it mentions two areas in which monitoring solutions do better than data testing.

There is an analogy to the software engineering world, where you most often need both a test suite and a monitoring setup; skipping either leaves problems you would never catch.

We believe there are more reasons why data tests alone may not be enough.

The world is not black and white

Data tests always assume that data either meets or doesn't meet some criteria. But sometimes you want to follow specific metrics on your data and investigate for yourself what exactly is happening (usually after an alert or after looking at a visualization).

A simple example is a daily total_row_count metric for your tables: many people want to track it, but tests for it (even if written) are not the only indicator of a problem. Row counts can vary for many different reasons, and to know whether there may be an issue you usually need to compare against past data, trends, etc. Testing alone can be useful for checking that values stay within expected thresholds, but that doesn't give you the whole story (you have no comparison with history), and it's hard to set thresholds that are both tight enough to catch real problems and loose enough not to alert you constantly.
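As a rough illustration of what history-based monitoring adds here, below is a minimal sketch (not re_data's implementation) that compares today's row count against the trailing average of previous days and flags large deviations. The metrics database, table names, and threshold are all hypothetical, and the sketch assumes a couple of weeks of collected history:

```python
import sqlite3
import statistics

# Hypothetical metrics store: one row per (table, day) with the
# measured row count, collected by a separate job.
conn = sqlite3.connect("metrics.db")
rows = conn.execute(
    """
    SELECT row_count
    FROM table_metrics
    WHERE table_name = 'orders'
    ORDER BY measured_on DESC
    LIMIT 15
    """
).fetchall()

# Most recent measurement vs. the previous days.
today, *history = [r[0] for r in rows]

if len(history) >= 2:
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)

    # Flag the metric when today's value is far from the recent trend,
    # rather than comparing it against a single hard-coded threshold.
    z_score = (today - mean) / stdev if stdev else 0.0
    if abs(z_score) > 3:
        print(f"total_row_count anomaly for 'orders': {today} (z={z_score:.1f})")
```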

Merging it

OK, so to sum this up: you may be convinced that you need both data testing & monitoring in your stack. Does that mean you should deploy a pipeline using Great Expectations & buy a Monte Carlo solution?

In our opinion, that's not necessary: although data testing & monitoring are both valuable, you don't need a separate solution for each.

re_data is a framework that combines both data testing & monitoring (and we don't want to stop there when it comes to helping with data quality, but we'll talk about that some other time).