Chapter 9: Design a Web Crawler Summary
This chapter covers the design of a scalable web crawler system for search engine indexing, capable of collecting 1 billion HTML pages per month.
Key Requirements & Constraints
- Scale: 1 billion pages/month (~400 QPS average, ~800 QPS peak)
- Storage: 500TB/month, 30PB for 5 years
- Core principles: Scalability, robustness, politeness, and extensibility
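The QPS and storage figures follow from back-of-envelope arithmetic, assuming an average HTML page size of 500 KB and peak traffic at roughly twice the average:

```python
SECONDS_PER_MONTH = 30 * 24 * 3600            # ~2.6 million seconds
PAGES_PER_MONTH = 1_000_000_000
AVG_PAGE_SIZE_KB = 500                        # assumed average HTML page size

qps = PAGES_PER_MONTH / SECONDS_PER_MONTH     # ~386 QPS
peak_qps = 2 * qps                            # ~772 QPS, close to the 800 cited

storage_per_month_tb = PAGES_PER_MONTH * AVG_PAGE_SIZE_KB / 1e9   # KB -> TB
storage_5_years_pb = storage_per_month_tb * 12 * 5 / 1000         # TB -> PB

print(round(qps), storage_per_month_tb, storage_5_years_pb)       # 386 500.0 30.0
```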
High-Level Architecture
The system consists of several key components working together:
- Seed URLs: Starting points divided by locality or topic
- URL Frontier: Queue-based data structure storing URLs to be downloaded (a simple FIFO queue at first, refined below with priority and politeness queues)
- HTML Downloader: Fetches web pages via DNS-resolved URLs
- Content Parser & Storage: Validates and stores content (memory for popular, disk for others)
- Deduplication: "Content Seen?" and "URL Seen?" components using hash values or bloom filters
- URL Extractor & Filter: Extracts links and excludes blacklisted/unwanted content
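The "Content Seen?" and "URL Seen?" checks can both be sketched as a hash-based filter. A plain set is used here for clarity; a production crawler would typically use a Bloom filter to bound memory (class and method names are illustrative):

```python
import hashlib

class SeenFilter:
    """Tracks already-seen items (URLs or page contents) by hash digest,
    so comparing long strings reduces to comparing fixed-size hashes."""

    def __init__(self):
        self._seen = set()

    def seen_before(self, item: str) -> bool:
        """Return True if item was seen before; otherwise record it."""
        digest = hashlib.sha256(item.encode()).hexdigest()
        if digest in self._seen:
            return True
        self._seen.add(digest)
        return False

url_seen = SeenFilter()
url_seen.seen_before("https://example.com/a")   # False on first sight
url_seen.seen_before("https://example.com/a")   # True on repeat
```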
Critical Design Decisions
URL Frontier Design handles three major concerns:
- Prioritization: Uses PageRank, traffic, and update frequency to assign URLs to priority queues
- Politeness: Downloads one page at a time per host with delays, using host-to-queue mapping
- Freshness: Periodically recrawls based on update history
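The politeness mechanism's host-to-queue mapping can be sketched as follows. This is a hypothetical simplification: every URL from the same host lands in the same FIFO queue, so a single worker draining that queue downloads one page at a time per host (real frontiers also enforce a delay between fetches):

```python
from collections import deque
from urllib.parse import urlsplit

class PolitenessRouter:
    """Routes every URL from the same host to the same FIFO queue.
    One worker per queue then guarantees at most one in-flight
    download per host."""

    def __init__(self, num_queues: int):
        self.queues = [deque() for _ in range(num_queues)]
        self.host_to_queue = {}           # host -> queue index
        self._next_queue = 0

    def enqueue(self, url: str):
        host = urlsplit(url).netloc
        if host not in self.host_to_queue:
            # Assign new hosts round-robin; a real mapping table might be
            # persistent and based on consistent hashing instead.
            self.host_to_queue[host] = self._next_queue % len(self.queues)
            self._next_queue += 1
        self.queues[self.host_to_queue[host]].append(url)
```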
Performance Optimizations:
- Distributed crawling across multiple servers
- DNS caching to reduce 10-200ms latency
- Geographic distribution of components
- Short timeouts for unresponsive servers
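The DNS caching idea can be sketched as a TTL-based lookup table, so repeated requests to the same host skip the 10-200 ms resolution cost. The resolver is injected as a parameter and the default TTL is an assumption:

```python
import time

class DNSCache:
    """Caches hostname -> IP mappings with a TTL; entries are re-resolved
    only after they expire."""

    def __init__(self, resolver, ttl_seconds: float = 300.0):
        self.resolver = resolver          # e.g. socket.gethostbyname
        self.ttl = ttl_seconds
        self._cache = {}                  # host -> (ip, expiry time)

    def resolve(self, host: str) -> str:
        entry = self._cache.get(host)
        now = time.monotonic()
        if entry and entry[1] > now:      # cache hit, still fresh
            return entry[0]
        ip = self.resolver(host)          # cache miss: do the real lookup
        self._cache[host] = (ip, now + self.ttl)
        return ip
```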
Robustness Measures:
- Consistent hashing for load distribution
- State persistence for failure recovery
- Exception handling and data validation
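A minimal consistent-hash ring illustrates the load-distribution point: hosts map to crawler servers, and adding or removing a server remaps only a fraction of the hosts. The replica (virtual node) count is an assumption:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Maps keys (e.g. hostnames) to nodes (e.g. crawler servers) on a
    hash ring with virtual nodes for smoother distribution."""

    def __init__(self, nodes, replicas: int = 100):
        self.replicas = replicas
        self._ring = []                   # sorted list of (hash, node)
        for node in nodes:
            self.add(node)

    def _hash(self, key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, node: str):
        for i in range(self.replicas):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def get(self, key: str) -> str:
        """Return the first node clockwise from the key's hash."""
        if not self._ring:
            raise KeyError("empty ring")
        idx = bisect.bisect(self._ring, (self._hash(key), "")) % len(self._ring)
        return self._ring[idx][1]
```

When a server is added, only the keys falling between it and its predecessor on the ring move; all other host-to-server assignments stay put.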
Problematic Content Handling:
- Hash-based redundant content detection
- Spider trap prevention with URL length limits and blacklists
- Filtering of ads, spam, and noise
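The spider-trap heuristics above can be sketched as a simple URL filter. The length cutoff, depth limit, and blacklist entries are all illustrative assumptions:

```python
from urllib.parse import urlsplit

MAX_URL_LENGTH = 2048                         # assumed cutoff for suspicious URLs
MAX_PATH_DEPTH = 20                           # assumed limit on path nesting
BLACKLISTED_HOSTS = {"spam.example.com"}      # hypothetical manual blacklist

def is_probable_trap(url: str) -> bool:
    """Flag URLs that are overly long, too deeply nested, or served by a
    blacklisted host -- cheap heuristics against spider traps."""
    if len(url) > MAX_URL_LENGTH:
        return True
    parts = urlsplit(url)
    if parts.netloc in BLACKLISTED_HOSTS:
        return True
    if parts.path.count("/") > MAX_PATH_DEPTH:
        return True
    return False
```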
Introduction
A web crawler is used for:
- search engine indexing (Googlebot)
- web archiving
- web mining
- web monitoring
Step 1. Understand the Problem and Establish Design Scope
Basic algorithm:
- Given a set of URLs, download all the web pages addressed by the URLs.
- Extract URLs from these web pages.
- Add new URLs to the list of URLs to be downloaded. Repeat these 3 steps.
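The three-step loop above can be sketched as a BFS skeleton. `download` and `extract_urls` are caller-supplied stand-ins for the HTML downloader and URL extractor components:

```python
from collections import deque

def crawl(seed_urls, download, extract_urls, max_pages=1000):
    """BFS over the web graph: download each URL, extract its links, and
    enqueue links not seen before, up to max_pages pages."""
    frontier = deque(seed_urls)
    seen = set(seed_urls)
    pages = []
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        html = download(url)
        pages.append((url, html))
        for link in extract_urls(html):
            if link not in seen:          # skip already-discovered URLs
                seen.add(link)
                frontier.append(link)
    return pages
```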
<aside>
💬
1. Main purpose
Candidate: What is the main purpose of the crawler? Is it used for search engine indexing, data mining, or something else?
Interviewer: Search engine indexing.
2. Scale
Candidate: How many web pages does the web crawler collect per month?
Interviewer: 1 billion pages.
3. Content type
Candidate: What content types are included? HTML only or other content types such as PDFs and images as well?
Interviewer: HTML only.
4. Web page types
Candidate: Shall we consider newly added or edited web pages?
Interviewer: Yes, we should consider the newly added or edited web pages.
5. Storage
Candidate: Do we need to store HTML pages crawled from the web?
Interviewer: Yes, up to 5 years.
6. Duplication
Candidate: How do we handle web pages with duplicate content?
Interviewer: Pages with duplicate content should be ignored.
</aside>
Other noteworthy factors include:
- Scalability: The web is enormous, so crawling should be parallelized across many workers for efficiency.
- Robustness: The crawler must handle bad HTML, unresponsive servers, crashes, malicious links, etc.
- Politeness: The crawler should not make too many requests to a website within a short time interval.
- Extensibility: The system should be flexible enough to support new content types with minimal changes.
Picture of the design