What is dlt and Why It Matters

If you've spent any time building data pipelines, you know the drill: write a script to hit an API, flatten the JSON manually, handle pagination, manage credentials, deal with schema changes, and somehow make it all production-ready. It works, until it doesn't.

**dlt (data load tool)** is an open-source Python library built to solve exactly this. Instead of writing infrastructure code from scratch every time, dlt gives you a framework to build reliable, scalable ingestion pipelines with just a few lines of Python.

dlt lives in the EL part of ELT: it handles extraction and loading and leaves transformation to tools like dbt. It's not a SaaS connector platform like Fivetran or Airbyte. It's a library, meaning you own the code, you run it where you want, and you're not paying per row.

What Makes dlt Different

Three things stand out:

  1. Schema inference is automatic. dlt inspects your data and builds the schema for you, including nested structures. No more manually mapping JSON fields to table columns.
  2. It's just Python. No YAML-heavy configs, no proprietary DSL. If you can write a Python function, you can build a dlt source.
  3. Incremental loading out of the box. Tracking what's already been loaded, handling state, and avoiding duplicates are built into the framework, not something you bolt on later.
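To make the first point concrete, here is a toy, standard-library-only illustration of what "inferring a schema from nested data" means. It is not dlt's implementation: the `flatten` and `infer_columns` helpers and the sample rows are made up for this sketch. The only detail borrowed from dlt is the naming convention, where nested dictionary keys become column names joined with a double underscore.

```python
from typing import Any


def flatten(record: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts into one level, joining keys with '__'."""
    flat: dict[str, Any] = {}
    for key, value in record.items():
        column = f"{prefix}__{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse so that {"user": {"login": ...}} becomes "user__login".
            flat.update(flatten(value, column))
        else:
            flat[column] = value
    return flat


def infer_columns(rows: list[dict[str, Any]]) -> dict[str, str]:
    """Map each flattened column to the Python type name first seen for it."""
    columns: dict[str, str] = {}
    for row in rows:
        for column, value in flatten(row).items():
            columns.setdefault(column, type(value).__name__)
    return columns


# Hypothetical rows standing in for an API response.
rows = [
    {"id": 1, "title": "Fix parser", "user": {"login": "alice", "id": 7}},
    {"id": 2, "title": "Add docs", "user": {"login": "bob", "id": 9}, "closed": True},
]

schema = infer_columns(rows)
print(schema)
```

Note how the `closed` column appears only because the second row carries it. That is the kind of schema change dlt is meant to absorb for you. The sketch ignores nested lists, which dlt loads into separate child tables.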

Core Concepts at a Glance

Before writing any pipeline, it helps to have a clear mental model of how dlt thinks about data movement. There are four building blocks you need to understand: sources, resources, destinations, and pipelines.

Data originates in a source, is exposed through one or more resources, is processed by a pipeline, and lands in a destination. That's it.

Source

A source is a logical grouping of resources that belong to one origin system. That system could be a REST API, a database, an S3 bucket, or a local file.

In code, a source is just a Python function decorated with @dlt.source:

import dlt

@dlt.source
def github_source():
    # A source returns the resources it groups together.
    return issues_resource(), pull_requests_resource()