Python

Rust is rapidly replacing C as the “backend” workhorse language for high-performance Python packages. Why? Let’s start with the motivating problem: Python is easy to write, but slow to run. It’s so slow that you can’t write high-performance libraries in pure Python, and in particular you can’t write data processing libraries that way. And yet, Python is the dominant language for ML and data engineering. So if you want to write a library for data engineers, ML engineers, etc., you’re going to find yourself in a bind: your users expect a Python interface, but the heavy lifting can’t be done in Python itself.

Historically, this meant that would-be library writers had to do one of two things:

  1. Learn to program in C, or
  2. Hope that somebody else learned to program in C, and that they wrote a library you can lean on for the low-level stuff.

“Well,” the C-philes may be asking, “what’s so bad about that?” Maybe most library authors can get the job done by outsourcing the number crunching to numpy, scipy, etc. And in the handful of cases where it’s really necessary, surely they can learn a little C. It builds character.

But in practice, it doesn’t work well. Being able to outsource things to numpy, scipy, etc. is nice when it works, but having to vectorize every function & not being able to write for loops is a pain. Wondering whether something will be GIL-blocked is a pain. Etc., etc. Not everything you want to do fits neatly into existing libraries.
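To make the “can’t just write a for loop” complaint concrete, here is a small sketch in plain Python. The function names are hypothetical and the problem is made up for illustration: a running total that is never allowed to drop below a floor. Each step depends on the previous clipped value, so the obvious whole-array shortcut (take a cumulative sum, then clip) quietly gives a different answer.

```python
from itertools import accumulate


def clipped_running_total(deltas, floor=0.0):
    """Running total that is clipped at `floor` after every single step."""
    total = floor
    out = []
    for d in deltas:
        # Each value depends on the previous, already-clipped value.
        total = max(floor, total + d)
        out.append(total)
    return out


def cumsum_then_clip(deltas, floor=0.0):
    """The whole-array shortcut: cumulative sum first, clip afterwards."""
    return [max(floor, s) for s in accumulate(deltas)]


deltas = [5.0, -8.0, 4.0]
print(clipped_running_total(deltas))  # [5.0, 0.0, 4.0]
print(cumsum_then_clip(deltas))       # [5.0, 0.0, 1.0]
```

The loop version is the natural way to express the logic, but in pure Python it is also the slow way, which is exactly the tension described above.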

OK, so what about Option #1: why not write the library in C, and then use something like ctypes or pybind11 to create the Python bindings? The problem is that if you’re coming from a Python background, programming in C is going to seem very low-level; it takes some work to learn the language. Null pointer dereferences, buffer overflows, memory leaks… these are just a few of the many fun Ways to Shoot Yourself in the Foot with C, all of which are foreign to native Python programmers.
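For a feel of what the binding side looks like, here is a minimal ctypes sketch that calls the C standard library’s `strlen`. It assumes that `ctypes.CDLL(None)` exposes the C runtime’s symbols (true on typical Linux and macOS builds) and falls back to `msvcrt` on Windows; the wrapper name `c_strlen` is made up for this example.

```python
import ctypes
import sys

# Assumption: on Linux/macOS, CDLL(None) gives access to symbols already
# loaded into the process, including libc. On Windows we load msvcrt.
libc = ctypes.CDLL("msvcrt") if sys.platform == "win32" else ctypes.CDLL(None)

# ctypes cannot read C headers, so the signature is declared by hand.
# Getting these wrong is not a Python exception; it is undefined behavior.
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def c_strlen(data: bytes) -> int:
    """Hypothetical wrapper: length of a NUL-terminated byte string."""
    return libc.strlen(data)


print(c_strlen(b"hello"))  # 5
```

Even in this tiny case, the type information lives in two places (the C side and the Python declarations), and nothing checks that they agree.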

If only there were a better way. If only there were a language that’s as fast and memory-efficient as C, but which didn’t require manual memory management or garbage collection. And if only that language had great Python tooling and a thriving community of existing developers. If only.

(“What about Jython?” I can hear the Java-heads saying. No. Stop it. Get some help.)

Rust

Rust is fast, Rust is memory-efficient. Rust makes parallel & concurrent programming easier. It has great tooling and a friendly compiler. It has a large, happy community of developers. Rust will let you run faster, jump higher, and make more friends at school.

Importantly for this article: It’s easier for Python developers to work with Rust than it is for them to work with C. It’s easy to bind Rust to Python with PyO3 and its companion build tool, maturin. Rust is easier to pick up than C, and it’s easier for newcomers to write “safe” code.

As a result, over the last couple of years, we’ve seen several high-performance libraries with Python frontends choose Rust for their backends. For example:

1) **Polars** is a fast, highly parallel, memory-efficient library for working with DataFrames. Polars author Ritchie Vink considered several different languages for Polars, and ultimately chose Rust. Here’s an excerpt explaining why, from a longer conversation:

I think data-engineering/science will remain dominated by a high level language that connects low-level compiled binaries. Multi-threading, performance in that host language doesn't matter, as the work will be passed down to the tool they dispatch to.

So we are down to either C, C++, Rust, Zig, or Fortran.

Zig is very young, but it might be well set up to become the new C.

Rust, IMO is well set up to become the new C++.

Of these languages, Rust has the best tooling. (Maybe Zig will get there.)

The borrow checker guarantees safe memory use AND safe concurrency. This together with its great tooling (crates.io, pyo3), makes it the best language to build low-level tooling in.

Because it is safe by default, new language learners can also easily start building tools and learn incrementally. I see this happening today, with a lot of Python users writing Rust, getting the job done, and being happy with the speedup they gain.

In short, I think Rust has the best correctness guarantees and is a modern systems language.

2) **Lance** is a high-performance, low-cost vector database. The founders of Lance, Chang She and Lei Xu, originally wrote the codebase in C++, and decided to switch to Rust later, even though the team had plenty of C development experience. Here’s Chang’s explanation of why: