Chapter 9: Design a Web Crawler Summary
This chapter covers the design of a scalable web crawler system for search engine indexing, capable of collecting 1 billion HTML pages per month.
Key Requirements & Constraints
- Scale: 1 billion pages/month (~400 QPS average, ~800 QPS peak)
- Storage: 500TB/month, 30PB for 5 years
- Core principles: Scalability, robustness, politeness, and extensibility
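The QPS and storage figures follow from back-of-envelope arithmetic, assuming an average HTML page size of 500 KB and peak traffic at roughly twice the average:

```python
SECONDS_PER_MONTH = 30 * 24 * 3600            # ~2.6 million seconds
PAGES_PER_MONTH = 1_000_000_000
AVG_PAGE_SIZE_KB = 500                        # assumed average HTML page size

qps = PAGES_PER_MONTH / SECONDS_PER_MONTH     # ~386 QPS
peak_qps = 2 * qps                            # ~772 QPS, close to the 800 cited

storage_per_month_tb = PAGES_PER_MONTH * AVG_PAGE_SIZE_KB / 1e9   # KB -> TB
storage_5_years_pb = storage_per_month_tb * 12 * 5 / 1000         # TB -> PB

print(round(qps), storage_per_month_tb, storage_5_years_pb)       # 386 500.0 30.0
```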
High-Level Architecture
The system consists of several key components working together:
- Seed URLs: Starting points divided by locality or topic
- URL Frontier: Queue-based data structure storing URLs to be downloaded (a simple FIFO queue at first, refined below with priority and politeness queues)
- HTML Downloader: Fetches web pages via DNS-resolved URLs
- Content Parser & Storage: Validates and stores content (memory for popular, disk for others)
- Deduplication: "Content Seen?" and "URL Seen?" components using hash values or bloom filters
- URL Extractor & Filter: Extracts links and excludes blacklisted/unwanted content
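The "Content Seen?" and "URL Seen?" checks can both be sketched as a hash-based filter. A plain set is used here for clarity; a production crawler would typically use a Bloom filter to bound memory (class and method names are illustrative):

```python
import hashlib

class SeenFilter:
    """Tracks already-seen items (URLs or page contents) by hash digest,
    so comparing long strings reduces to comparing fixed-size hashes."""

    def __init__(self):
        self._seen = set()

    def seen_before(self, item: str) -> bool:
        """Return True if item was seen before; otherwise record it."""
        digest = hashlib.sha256(item.encode()).hexdigest()
        if digest in self._seen:
            return True
        self._seen.add(digest)
        return False

url_seen = SeenFilter()
url_seen.seen_before("https://example.com/a")   # False on first sight
url_seen.seen_before("https://example.com/a")   # True on repeat
```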
Critical Design Decisions
URL Frontier Design handles three major concerns:
- Prioritization: Uses PageRank, traffic, and update frequency to assign URLs to priority queues
- Politeness: Downloads one page at a time per host with delays, using host-to-queue mapping
- Freshness: Periodically recrawls based on update history
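The politeness mechanism's host-to-queue mapping can be sketched as follows. This is a hypothetical simplification: every URL from the same host lands in the same FIFO queue, so a single worker draining that queue downloads one page at a time per host (real frontiers also enforce a delay between fetches):

```python
from collections import deque
from urllib.parse import urlsplit

class PolitenessRouter:
    """Routes every URL from the same host to the same FIFO queue.
    One worker per queue then guarantees at most one in-flight
    download per host."""

    def __init__(self, num_queues: int):
        self.queues = [deque() for _ in range(num_queues)]
        self.host_to_queue = {}           # host -> queue index
        self._next_queue = 0

    def enqueue(self, url: str):
        host = urlsplit(url).netloc
        if host not in self.host_to_queue:
            # Assign new hosts round-robin; a real mapping table might be
            # persistent and based on consistent hashing instead.
            self.host_to_queue[host] = self._next_queue % len(self.queues)
            self._next_queue += 1
        self.queues[self.host_to_queue[host]].append(url)
```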
Performance Optimizations:
- Distributed crawling across multiple servers
- DNS caching to reduce 10-200ms latency
- Geographic distribution of components
- Short timeouts for unresponsive servers
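The DNS caching idea can be sketched as a TTL-based lookup table, so repeated requests to the same host skip the 10-200 ms resolution cost. The resolver is injected as a parameter and the default TTL is an assumption:

```python
import time

class DNSCache:
    """Caches hostname -> IP mappings with a TTL; entries are re-resolved
    only after they expire."""

    def __init__(self, resolver, ttl_seconds: float = 300.0):
        self.resolver = resolver          # e.g. socket.gethostbyname
        self.ttl = ttl_seconds
        self._cache = {}                  # host -> (ip, expiry time)

    def resolve(self, host: str) -> str:
        entry = self._cache.get(host)
        now = time.monotonic()
        if entry and entry[1] > now:      # cache hit, still fresh
            return entry[0]
        ip = self.resolver(host)          # cache miss: do the real lookup
        self._cache[host] = (ip, now + self.ttl)
        return ip
```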
Robustness Measures:
- Consistent hashing for load distribution
- State persistence for failure recovery
- Exception handling and data validation
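A minimal consistent-hash ring illustrates the load-distribution point: hosts map to crawler servers, and adding or removing a server remaps only a fraction of the hosts. The replica (virtual node) count is an assumption:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Maps keys (e.g. hostnames) to nodes (e.g. crawler servers) on a
    hash ring with virtual nodes for smoother distribution."""

    def __init__(self, nodes, replicas: int = 100):
        self.replicas = replicas
        self._ring = []                   # sorted list of (hash, node)
        for node in nodes:
            self.add(node)

    def _hash(self, key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, node: str):
        for i in range(self.replicas):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def get(self, key: str) -> str:
        """Return the first node clockwise from the key's hash."""
        if not self._ring:
            raise KeyError("empty ring")
        idx = bisect.bisect(self._ring, (self._hash(key), "")) % len(self._ring)
        return self._ring[idx][1]
```

When a server is added, only the keys falling between it and its predecessor on the ring move; all other host-to-server assignments stay put.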
Problematic Content Handling:
- Hash-based redundant content detection
- Spider trap prevention with URL length limits and blacklists
- Filtering of ads, spam, and noise
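The spider-trap heuristics above can be sketched as a simple URL filter. The length cutoff, depth limit, and blacklist entries are all illustrative assumptions:

```python
from urllib.parse import urlsplit

MAX_URL_LENGTH = 2048                         # assumed cutoff for suspicious URLs
MAX_PATH_DEPTH = 20                           # assumed limit on path nesting
BLACKLISTED_HOSTS = {"spam.example.com"}      # hypothetical manual blacklist

def is_probable_trap(url: str) -> bool:
    """Flag URLs that are overly long, too deeply nested, or served by a
    blacklisted host -- cheap heuristics against spider traps."""
    if len(url) > MAX_URL_LENGTH:
        return True
    parts = urlsplit(url)
    if parts.netloc in BLACKLISTED_HOSTS:
        return True
    if parts.path.count("/") > MAX_PATH_DEPTH:
        return True
    return False
```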
Introduction
A web crawler is used for:
- search engine indexing (Googlebot)
- web archiving
- web mining
- web monitoring
Step 1. Understand the Problem and Establish Design Scope
Basic algorithm:
- Given a set of URLs, download all the web pages addressed by the URLs.
- Extract URLs from these web pages.
- Add new URLs to the list of URLs to be downloaded. Repeat these 3 steps.
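The three-step loop above can be sketched as a BFS skeleton. `download` and `extract_urls` are caller-supplied stand-ins for the HTML downloader and URL extractor components:

```python
from collections import deque

def crawl(seed_urls, download, extract_urls, max_pages=1000):
    """BFS over the web graph: download each URL, extract its links, and
    enqueue links not seen before, up to max_pages pages."""
    frontier = deque(seed_urls)
    seen = set(seed_urls)
    pages = []
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        html = download(url)
        pages.append((url, html))
        for link in extract_urls(html):
            if link not in seen:          # skip already-discovered URLs
                seen.add(link)
                frontier.append(link)
    return pages
```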
<aside>
💬
1. Main purpose
Candidate: What is the main purpose of the crawler? Is it used for search engine indexing, data mining, or something else?
Interviewer: Search engine indexing.
2. Scale
Candidate: How many web pages does the web crawler collect per month?
Interviewer: 1 billion pages.
3. Content type
Candidate: What content types are included? HTML only or other content types such as PDFs and images as well?
Interviewer: HTML only.
4. Web page types
Candidate: Shall we consider newly added or edited web pages?
Interviewer: Yes, we should consider the newly added or edited web pages.
5. Storage
Candidate: Do we need to store HTML pages crawled from the web?
Interviewer: Yes, up to 5 years.
6. Duplication
Candidate: How do we handle web pages with duplicate content?
Interviewer: Pages with duplicate content should be ignored.
</aside>
Other noteworthy factors include:
- Scalability: The web is enormous, so crawling should be parallelized across many workers for efficiency.
- Robustness: The crawler must handle bad HTML, unresponsive servers, crashes, malicious links, etc.
- Politeness: The crawler should not make too many requests to a website within a short time interval.
- Extensibility: The system should be flexible enough to support new content types with minimal changes.
Picture of the design