https://blog.kowalczyk.info/article/88aee8f43620471aa9dbcad28368174c/how-i-reverse-engineered-notion-api.html

Notion is working on an official API, but I'm impatient.

I found a Python script that uses Selenium to recursively spider a Notion page and publish it to Firebase Hosting.

While it worked, this approach is limited to getting the verbatim HTML of the pages as rendered by the Notion application.

I wanted to be able to change the look of the page and add elements like headers, footers, and a navigation bar.

I briefly considered trying to reconstruct the structure of the page from the rendered HTML, but at best that would be a lot of ugly guesswork.

Modern Single Page Applications (SPAs) work by getting data from the server in a structured format (most often JSON) and rendering HTML in the browser with JavaScript.

When loading a Notion page, I saw XHR requests like /api/v3/getRecordValues and /api/v3/loadPageChunk.

Luckily for me, the API is not obfuscated. It returns responses as JSON, and it isn't hard to figure out the meaning of the fields.

Working with the original JSON structure is much easier than trying to reconstruct it from rendered HTML.

I could have looked at the API requests between client and server in Chrome dev tools, but that's not the best workflow.

Instead, I wrote a Node.js script that logs all XHR requests the web browser makes when rendering a given page.

Some blocks have properties specific to that block type. For example, a page block has a title property.
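
To make the idea of type-specific properties concrete, here is a minimal Python sketch of reading the title from a page block. The record is hand-written for illustration, not a captured response. The field names id, type and properties, and the nested-list shape of title (a list of spans, each a text string optionally followed by formatting markers), are assumptions about the JSON layout.

```python
from typing import Any

# Hypothetical block record; the field names and the span layout are assumptions.
SAMPLE_PAGE_BLOCK: dict[str, Any] = {
    "id": "88aee8f4-3620-471a-a9db-cad28368174c",
    "type": "page",
    "properties": {
        "title": [["How I reverse engineered "], ["Notion API", [["b"]]]],
    },
}


def plain_title(block: dict[str, Any]) -> str:
    """Join the text of each title span, ignoring any formatting markers."""
    spans = block.get("properties", {}).get("title", [])
    return "".join(span[0] for span in spans if span)


print(plain_title(SAMPLE_PAGE_BLOCK))
```

A block without a title property simply yields an empty string, so the same helper can be called on any block type.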

To get the content of a page, we start with its UUID, which we can find because it's the last part of the page's URL.
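
Extracting the ID from the URL might look like the sketch below. It assumes the URL ends in 32 hexadecimal digits (possibly preceded by a title slug) and that the API expects the usual dashed UUID form. The function name and the example URL are made up for illustration.

```python
import re
import uuid


def page_id_from_url(url: str) -> str:
    """Return the dashed UUID found at the end of a Notion-style page URL."""
    last_part = url.rstrip("/").split("/")[-1]
    # Drop a query string or fragment, then any dashes (slug separators
    # or dashes already present in the ID).
    last_part = re.split(r"[?#]", last_part)[0].replace("-", "")
    match = re.search(r"[0-9a-fA-F]{32}$", last_part)
    if match is None:
        raise ValueError(f"no 32-digit hex ID at the end of {url!r}")
    # uuid.UUID accepts the undashed hex form and prints the dashed one.
    return str(uuid.UUID(match.group(0)))


# Hypothetical URL, used only to show the shape of the input.
print(page_id_from_url(
    "https://www.notion.so/Example-Page-88aee8f43620471aa9dbcad28368174c"
))
```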

We can call the /api/v3/getRecordValues API to get the list of blocks in the page and then /api/v3/loadPageChunk to get the content of those blocks.
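
The two calls could be prepared roughly as follows. This is a sketch under assumptions: the host www.notion.so, the use of POST with a JSON body, and every payload field (requests, table, pageId, limit, cursor, chunkNumber, verticalColumns) reflect what such requests looked like when observed in the browser, and an unofficial API can change without notice. The code only builds the requests; nothing is sent.

```python
import json
import urllib.request

# Assumed host for the endpoints named above.
API_BASE = "https://www.notion.so/api/v3"


def build_request(endpoint: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for one API endpoint."""
    return urllib.request.Request(
        f"{API_BASE}/{endpoint}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def get_record_values_request(page_id: str) -> urllib.request.Request:
    # Assumed payload: ask for the block record with the page's ID.
    payload = {"requests": [{"table": "block", "id": page_id}]}
    return build_request("getRecordValues", payload)


def load_page_chunk_request(
    page_id: str, chunk_number: int = 0, limit: int = 50
) -> urllib.request.Request:
    # Assumed payload: request one chunk of the page's blocks.
    payload = {
        "pageId": page_id,
        "limit": limit,
        "cursor": {"stack": []},
        "chunkNumber": chunk_number,
        "verticalColumns": False,
    }
    return build_request("loadPageChunk", payload)


req = load_page_chunk_request("88aee8f4-3620-471a-a9db-cad28368174c")
print(req.get_method(), req.full_url)
```

Passing one of these objects to urllib.request.urlopen would perform the call; keeping construction separate from sending makes the payloads easy to inspect and adjust when the server's expectations turn out to differ.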

The majority of the work was figuring out what kinds of blocks there are and how they are represented in JSON, then writing code to retrieve the data and present it in a format that is easier to work with than the raw data returned by the server.
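
One way to get an easier format is to turn the flat set of block records into a tree. The sketch below assumes the records are keyed by block ID and that a parent lists its children's IDs under a content key. Both are assumptions, and the Block class, build_tree function and sample records are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Block:
    id: str
    type: str
    properties: dict[str, Any] = field(default_factory=dict)
    children: list["Block"] = field(default_factory=list)


def build_tree(records: dict[str, dict[str, Any]], root_id: str) -> Block:
    """Recursively convert flat block records into a nested Block tree."""
    raw = records[root_id]
    block = Block(
        id=root_id,
        type=raw.get("type", "unknown"),
        properties=raw.get("properties", {}),
    )
    for child_id in raw.get("content", []):
        # Skip children that were not included in the records we have.
        if child_id in records:
            block.children.append(build_tree(records, child_id))
    return block


# Hypothetical records keyed by block ID.
SAMPLE_RECORDS = {
    "p": {"type": "page", "content": ["a", "b", "missing"]},
    "a": {"type": "text", "properties": {"title": [["Hello"]]}},
    "b": {"type": "header", "content": []},
}

tree = build_tree(SAMPLE_RECORDS, "p")
print([child.type for child in tree.children])
```

With a tree like this, rendering code can walk the children of a page in order instead of looking up IDs in the raw response each time.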

A Notion page consists of different kinds of blocks, and we need to know how each kind is represented in the JSON response.