This is a (non-comprehensive) list of ideas and thoughts about the current state of “session” maintenance in Crawlee, and what we can do about it in Crawlee v4.
<aside>
💡 TL;DR

Before adding yet another class (`UserPool`) for manipulating the crawler behaviour, we should, in my opinion, first see whether these features fit somewhere in the existing classes. That would call for refactoring, rather than bolting on yet another patchy feature.

My personal favourite candidate for refactoring is the `Request`/`Session` combo. While it works (somehow), the integration of the `Session` class into the whole system feels awkward at points. Dropping `Session` altogether (see this paragraph for why it’s not that big of a deal) and moving its internals to `Request` might help us design a well-defined “session”-handling model that is robust, yet simple enough for users to understand.
</aside>
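To make the TL;DR a bit more concrete, here is a minimal sketch of what a `Request` carrying the former `Session` internals could look like. All names here are hypothetical placeholders for discussion, not current or proposed Crawlee API:

```ts
// Hypothetical shapes only; none of these names exist in Crawlee today.
interface SessionState {
    /** A sticky proxy session essentially boils down to a proxy URL. */
    proxyUrl?: string;
    /** Serialized cookie jar shared by all requests of the same user. */
    cookieJar?: string;
    /** Browser fingerprint, kept stable for the whole run. */
    fingerprint?: Record<string, unknown>;
}

/** A Request that carries the former Session internals directly. */
interface RequestWithUserState {
    url: string;
    /** Identifies the shared per-user state instead of a standalone Session. */
    userId?: string;
    state?: SessionState;
}
```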
High-level view
We want to simulate the behaviour of actual users visiting the sites. While we already do this on a “single request” basis (we simulate the fingerprints of their browsers), we haven’t given much attention to the long-term behaviour, i.e. keeping the simulated behaviour consistent across different requests.
Related reading
https://github.com/apify/crawlee/issues/796
https://github.com/apify/crawlee/issues/1573
https://github.com/apify/crawlee-python/issues/1081
https://github.com/apify/crawlee/pull/3048
https://jindrich.bar/misc/userpool-rfc
Functional requirements
- Introduce the concept of “users” that will allow (see the first sketch below for how these pieces could fit together):
    - Maintaining the proxy session (accessing the website from the same IP address)
        - a proxy session essentially means just the proxy URL
    - Maintaining the HTTP state (sending the same set of cookies for related requests)
        - the user’s cookie jar should be updated with cookies received from crawled pages
        - With browsers, state is also stored in `localStorage`, `sessionStorage`, `indexedDB`… `BrowserCrawler` users might need to own the running browser instance
    - Maintaining a consistent browser fingerprint throughout the whole run
    - Pacing the requests coming from the same user (`sameDomainDelaySecs`, but better)
- Users should be fully (de-)serializable for persistence between migrations (see the `toJSON`/`fromJSON` part of the sketch below).
- It should be possible to “rotate” the user in case we get blocked
    - This means that all requests of the discarded user will be made by a new “stand-in” user.
- It should be possible to associate newly created requests (in `enqueueLinks` and `addRequests`) with a user (the current one?) — see the second sketch below
    - with this, the navigation pattern should appear more human-like
    - open question: is this worth it? It would seem that most of our users can crawl stuff just fine without this feature.
    - open question: is parallel crawling acceptable in this case?
    - open question: how do we reassign the requests of a user who was retired (because of blocking)? See the third sketch below.
        - Ideas:
            - Replace the user (keeping the same id), so requests still point to a valid user
            - If we want to keep the retired user around (a possible memory leak, though), maybe open addressing? I.e., if a request’s user is retired, use some predefined probe sequence (the same for all requests) to look for a new user (creating the user if the probe finds `undefined`)
- A Crawlee user should be able to configure what the fingerprint should look like (mobile only, no Windows users, …)
- The function for creating new users should be configurable (see the last sketch below)
    - the pool should never “run out” of users
- (?) It should allow, for example, logging into a web application
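A minimal sketch of the “user” concept from the first item above. The `User` class and all of its names are placeholders for discussion, not proposed Crawlee API; only the `tough-cookie` calls (`setCookie`, `getCookieString`, `serializeSync`, `deserializeSync`) are real library API:

```ts
import { CookieJar } from 'tough-cookie';

class User {
    /** Timestamp of this user's last request, used for pacing. */
    lastRequestAt = 0;

    constructor(
        readonly id: string,
        /** A sticky proxy session is essentially just a proxy URL. */
        readonly proxyUrl: string,
        /** Generated once, kept consistent throughout the whole run. */
        readonly fingerprint: Record<string, unknown>,
        readonly jar: CookieJar = new CookieJar(),
    ) {}

    /** Update the cookie jar with a Set-Cookie header from a crawled page. */
    async storeCookie(setCookieHeader: string, url: string): Promise<void> {
        await this.jar.setCookie(setCookieHeader, url);
    }

    /** The Cookie header to send with the next request to `url`. */
    async cookieHeader(url: string): Promise<string> {
        return this.jar.getCookieString(url);
    }

    /** Pacing (a `sameDomainDelaySecs` replacement): ms left before this user may fire again. */
    msUntilNextRequest(minDelayMs: number): number {
        return Math.max(0, minDelayMs - (Date.now() - this.lastRequestAt));
    }

    /** Users are fully (de-)serializable, so they survive migrations. */
    toJSON() {
        return {
            id: this.id,
            proxyUrl: this.proxyUrl,
            fingerprint: this.fingerprint,
            cookies: this.jar.serializeSync(),
            lastRequestAt: this.lastRequestAt,
        };
    }

    static fromJSON(data: ReturnType<User['toJSON']>): User {
        const user = new User(data.id, data.proxyUrl, data.fingerprint, CookieJar.deserializeSync(data.cookies));
        user.lastRequestAt = data.lastRequestAt;
        return user;
    }
}
```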
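For the request–user association, a hypothetical usage sketch. Neither the `userId` field on requests nor the `enqueueLinks` option shown here exists in Crawlee today; the types are stand-ins so the snippet is self-contained:

```ts
// Hypothetical: what passing a user through enqueueLinks could look like.
interface EnqueueLinksOptions {
    selector?: string;
    /** Associate the newly created requests with an existing user. */
    userId?: string;
}

type EnqueueLinks = (options: EnqueueLinksOptions) => Promise<void>;

async function requestHandler(ctx: {
    request: { url: string; userId?: string };
    enqueueLinks: EnqueueLinks;
}): Promise<void> {
    // Links discovered on this page will be crawled by the same "user" that
    // opened it, so the navigation pattern appears more human-like.
    await ctx.enqueueLinks({ selector: 'a[href]', userId: ctx.request.userId });
}
```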
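Sketches of the two reassignment ideas from the open question above. `rotateUser`, `resolveUser`, and the pool shape are made-up names for illustration:

```ts
interface PoolUser {
    id: string;
    retired?: boolean;
}

// Idea 1: replace the user in place, keeping the same id. Every request that
// references the id keeps pointing at a valid (stand-in) user.
function rotateUser(pool: Map<string, PoolUser>, retiredId: string, createUser: (id: string) => PoolUser): void {
    pool.set(retiredId, createUser(retiredId));
}

// Idea 2: keep the retired user in the pool (possible memory leak!) and walk
// a probe sequence shared by all requests until a live slot is found,
// creating a user when the probe hits `undefined`.
function resolveUser(pool: Map<string, PoolUser>, requestedId: string, createUser: (id: string) => PoolUser): PoolUser {
    let id = requestedId;
    for (let probe = 1; ; probe++) {
        const candidate = pool.get(id);
        if (candidate === undefined) {
            const fresh = createUser(id);
            pool.set(id, fresh);
            return fresh;
        }
        if (!candidate.retired) return candidate;
        // Deterministic probe sequence: all requests of a retired user end up
        // with the same stand-in.
        id = `${requestedId}#${probe}`;
    }
}
```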
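Finally, a sketch of a configurable user factory with fingerprint constraints. It assumes Apify’s `fingerprint-generator` package, where the `devices` and `operatingSystems` options do exist; the factory shape and the sticky-session proxy URL are hypothetical:

```ts
import { FingerprintGenerator } from 'fingerprint-generator';

const generator = new FingerprintGenerator({
    devices: ['mobile'],                  // mobile-only fingerprints
    operatingSystems: ['android', 'ios'], // e.g. "no Windows users"
});

let counter = 0;

// A plain factory function: the pool can call it on demand, so it never
// "runs out" of users.
function createUser() {
    const id = `user-${counter++}`;
    const { fingerprint } = generator.getFingerprint();
    return {
        id,
        // Hypothetical sticky-session proxy URL: same session param => same IP.
        proxyUrl: `http://proxy.example.com:8000/?session=${id}`,
        fingerprint,
    };
}
```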