This paper introduces a dataset of 72,081 wood-engraved images extracted from issues of the Illustrated London News (ILN) published between 1842 and 1890. In the mid-19th century, the ILN revolutionized news consumption by combining text with high-quality wood-engraved illustrations published at scale. While digitization has facilitated text-based analysis of historical periodicals, visual content remains difficult to explore systematically. We address this gap by providing a large-scale dataset of 19th-century news illustrations and their multimodal embeddings. Our methodology involved six steps: 1) collecting 56,699 scanned ILN pages from the Internet Archive; 2) annotating 908 pages to fine-tune a YOLOv8 object detection model; 3) using the fine-tuned model to extract illustrations; 4) applying an Open-CLIP model to generate multimodal embeddings; 5) using Tesseract OCR to convert illustration captions into machine-readable text; and 6) developing a Flask application for text- and image-based multimodal retrieval. The resulting dataset and application allow flexible analysis of 19th-century visual representations of news and suggest new avenues for research in computational humanities, media history, periodical studies, and visual culture studies. By releasing the dataset, the project code, and the embeddings, we aim to facilitate similar efforts with other historical materials and contribute to a broader understanding of visual culture. At the same time, the paper underscores the limitations of interpreting historical imagery with modern AI models, identifying the potential effects of bias and interpretive distortion.
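As an illustration of the retrieval idea behind step 6, the following is a minimal sketch of ranking illustrations by cosine similarity between a query embedding and precomputed image embeddings. It is not taken from the project code: the function names, the illustration identifiers, and the three-dimensional toy vectors are all hypothetical, and real Open-CLIP embeddings have far more dimensions. The sketch assumes the query (text or image) has already been embedded into the same vector space as the illustrations.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_illustrations(query_embedding, illustration_embeddings, top_k=3):
    """Return the top_k (illustration_id, score) pairs, most similar first."""
    scored = [
        (image_id, cosine_similarity(query_embedding, embedding))
        for image_id, embedding in illustration_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical toy data: identifiers and vectors are invented for illustration.
toy_embeddings = {
    "iln_page_0001_fig_1": [0.9, 0.1, 0.0],
    "iln_page_0001_fig_2": [0.0, 1.0, 0.0],
    "iln_page_0002_fig_1": [0.7, 0.7, 0.1],
}
toy_query = [1.0, 0.0, 0.0]  # stands in for an embedded text or image query

results = rank_illustrations(toy_query, toy_embeddings, top_k=2)
for image_id, score in results:
    print(f"{image_id}: {score:.3f}")
```

Because text and images share one embedding space in a CLIP-style model, the same ranking function can serve both text-to-image and image-to-image queries; only the way the query vector is produced differs.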
Year: 2025
Page/Article: 10
Submitted on Nov 18, 2024
Accepted on Jan 21, 2025
Peer Reviewed