March 2019

Let me tell you about the still-not-defunct real-time log processing pipeline we built at my now-defunct last job. It handled logs from a large number of embedded devices that our ISP operated on behalf of residential customers. (I wrote and presented previously about some of the cool wifi diagnostics that were possible with this data set.)

Lately, I've had a surprisingly large number of conversations about logs processing pipelines. I can find probably 10+ already-funded, seemingly successful startups processing logs, and the Big Name Cloud providers all have some kind of logs thingy, but still, people are not satisfied. It's expensive and slow. And if you complain, you mostly get told that you shouldn't be using unstructured logs anyway, you should be using event streams.

That advice is not wrong, but it's incomplete.

Instead of doing a survey of the whole unhappy landscape, let's just ignore what other people suffer with and talk about what does work. You can probably find, somewhere, something similar to each of the components I'm going to talk about, but you probably can't find a single solution that combines it all with good performance and super-low latency for a reasonable price. At least, I haven't found it. I was a little surprised by this, because I didn't think we were doing anything all that innovative. Apparently I was incorrect.

The big picture

Let's get started. Here's a handy diagram of all the parts we're going to talk about:

The ISP where I worked had a bunch of embedded Linux devices (routers, firewalls, wifi access points, and so on) that we wanted to monitor. The number grew rapidly over time: zero at first, then maybe 10 in our development lab, eventually 100,000, and later many more than that. Let's use a nice round number like 100,000 for this discussion. But keep in mind that this architecture works pretty much the same with any number of devices.

(It's a "distributed system" in the sense of scalability, but it's also the simplest thing that really works for any number of devices more than a handful, which makes it different from many "distributed systems" where you could have solved the problem much more simply if you didn't care about scaling. Since our logs are coming from multiple sources, we can't make it non-distributed, but we can try to minimize the number of parts that have to deal with the extra complexity.)

Now, these are devices we were monitoring, not apps or services or containers or whatever. That means two things: we had to deal with lots of weird problems (like compiler/kernel bugs and hardware failures), and most of the software was off-the-shelf OS stuff we couldn't easily control (or didn't want to rewrite).

(Here's the good news: because embedded devices have all the problems from top to bottom, any solution that works for my masses of embedded devices will work for any other log-pipeline problem you might have. If you're lucky, you can leave out some parts.)

That means the debate about "events" vs "logs" was kind of moot. We didn't control all the parts in our system, so telling us to forget logs and use only structured events doesn't help. udhcpd produces messages the way it wants to produce messages, and that's life. Sometimes the kernel panics and prints whatever it wants to print, and that's life. Move on.

Of course, we also had our own apps, which means we could also produce our own structured events when it was relevant to our own apps. Our team had whole never-ending debates about which is better, logs or events, structured or unstructured. In fact, in a move only overfunded megacorporations can afford, we actually implemented both and ran them both for a long time.

Thus, I can now tell you the final true answer, once and for all: you want structured events in your database.

...but you need to be able to produce them from unstructured logs. And once you can do that, exactly how those structured events are produced (either from logs or directly from structured trace output) turns out to be unimportant.
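To make that concrete, here's a minimal sketch of what "producing structured events from unstructured logs" means. This isn't code from the pipeline; the message text, regex, and field names are invented for illustration:

	import json
	import re

	# Hypothetical raw line, roughly the shape of a udhcpd message; the
	# exact text varies by version, so treat this purely as an example.
	raw = "udhcpd[312]: sending OFFER of 10.1.2.3"

	# One regex per known message shape; anything that doesn't match stays
	# around as a plain-text "msg" field instead of being dropped.
	pattern = re.compile(
	    r"(?P<prog>\w+)\[(?P<pid>\d+)\]: sending (?P<kind>\w+) of (?P<ip>[\d.]+)"
	)

	m = pattern.match(raw)
	event = m.groupdict() if m else {"msg": raw}
	print(json.dumps(event))
	# -> {"prog": "udhcpd", "pid": "312", "kind": "OFFER", "ip": "10.1.2.3"}

The point is just that once a line has been chopped into key/value pairs, the database can't tell (and doesn't care) whether it started life as a structured trace event or as a grepped-apart log line.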

But we're getting ahead of ourselves a bit. Let's take our flow diagram, one part at a time, from left to right.

Userspace and kernel messages, in a single stream

Some people who have been hacking on Linux for a while may know about /proc/kmsg: that's the file good old (pre-systemd) klogd reads kernel messages from, and pumps them to syslogd, which saves them to a file. Nowadays systemd does roughly the same thing but with more d-bus and more corrupted binary log files. Ahem. Anyway. When you run the dmesg command, it reads the same kernel messages (in a slightly different way).
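If you want to see what klogd is doing under the hood, here's a minimal sketch (mine, not code from our pipeline) that reads kernel messages the old-fashioned way. Caveats: /proc/kmsg needs root, the read blocks until a message arrives, and reading it consumes messages, so only one reader (normally klogd or the journal) should do this at a time:

	import sys
	import time

	# Each line arrives roughly as "<N>message", where <N> encodes the
	# syslog priority of the kernel message.
	with open("/proc/kmsg", "r") as kmsg:
	    while True:
	        line = kmsg.readline()   # blocks until the kernel prints something
	        if not line:
	            break
	        sys.stdout.write("%.3f %s" % (time.time(), line))
	        sys.stdout.flush()

(dmesg, by contrast, reads the ring buffer via the syslog(2) syscall or /dev/kmsg, which doesn't consume the messages, so you can run it as often as you like.)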