This is a (non-comprehensive) list of ideas and thoughts about the current state of “session” maintenance in Crawlee - and what we can do about it in Crawlee v4.

<aside> 💡

TLDR

Before adding yet another class (UserPool) for manipulating the crawler behaviour, we should imo first check whether these features fit somewhere in the existing classes. If they do, we should refactor those classes rather than bolt on yet another patchy feature.

My personal favourite for refactoring is the Request / Session combo. While it works (somehow), the integration of the Session class into the whole system feels awkward at points.

Dropping Session altogether (see this paragraph for why it’s not that big of a deal) and moving its internals to Request might help us design a well-defined “session”-handling model that’s robust, yet simple enough for users to understand.

</aside>

High-level view

We want to simulate the behaviour of actual users visiting the sites. While we already do this on a per-request basis (we simulate the fingerprints of their browsers), we haven’t paid much attention to long-term behaviour, i.e. behaviour that stays consistent across a series of requests.

Related reading

https://github.com/apify/crawlee/issues/796

https://github.com/apify/crawlee/issues/1573

https://github.com/apify/crawlee-python/issues/1081

https://github.com/apify/crawlee/pull/3048

https://jindrich.bar/misc/userpool-rfc

Functional requirements