Summary

Webrecorder is a suite of open-source tools and packages for capturing authenticated web content. These tools allow users to crawl websites and create a static, preserved record of a site in the form of a **WACZ** file. A WACZ file is a zipped package containing all of the information needed to recreate a webpage as it existed at the moment of capture, and it can be viewed using ReplayWeb.page or a custom viewer that you build into a webpage.

There are also tools for embedding those snapshots in another website, as well as tools for cryptographically signing WACZ records so that the authenticity of those files can be validated down the road.
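As a rough sketch of what embedding looks like, ReplayWeb.page provides an embeddable `<replay-web-page>` web component that can replay a WACZ file inside another page. The file path and URL below are placeholder assumptions, and a working embed also needs ReplayWeb.page's replay service worker available on the host site, so treat this as an outline rather than a drop-in snippet:

```bash
# Sketch: write an HTML page that embeds a WACZ snapshot with the
# ReplayWeb.page web component. The WACZ path and archived URL are
# placeholders; see the ReplayWeb.page embedding docs for full setup.
cat > embed.html <<'EOF'
<!-- Load the ReplayWeb.page embed script -->
<script src="https://cdn.jsdelivr.net/npm/replaywebpage/ui.js"></script>

<!-- Point the component at the WACZ file and the archived page to show -->
<replay-web-page
  source="./archives/example.wacz"
  url="https://example.com/">
</replay-web-page>
EOF
```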

There are two options for collecting the WACZ files. The first is to install a Chrome extension called **ArchiveWeb.page** and manually click through each page. The second option is to use the **Browsertrix Crawler** tool to create the files that can be signed and stored as an archive.

Browsertrix

There are two options for using the Browsertrix crawler to create web archive files: the Browsertrix Cloud or the Browsertrix Crawler CLI tool.

Browsertrix Cloud

Browsertrix Cloud allows the user to configure and run automated crawls from a web UI, setting options for what to crawl and how to crawl it. It also supports signing of .wacz files by Starling Lab with Authsign, certifying that the web crawl archives were created by a reliable source (Starling Lab).

There is a public Browsertrix Cloud maintained by the Webrecorder project, as well as a private instance set up and maintained by Starling Lab for creating private web crawls. With this cloud, we can send commands and automatically scrape and preserve sets of webpages.

Browsertrix-cloud.gif

Browsertrix CLI

The Browsertrix Crawler CLI tool requires Docker, open-source software that runs the crawler in a lightweight, sandboxed container, to be installed on your computer. With this tool, you can run crawls locally using the public Browsertrix Crawler image.
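As a minimal sketch of what such a crawl looks like (the flags below follow the browsertrix-crawler documentation, but the seed URL and collection name are placeholders and exact options may vary between versions):

```bash
# Pull the public Browsertrix Crawler image and run a basic crawl.
# --generateWACZ packages the capture as a zipped .wacz file; the URL
# and collection name are illustrative placeholders.
docker pull webrecorder/browsertrix-crawler

docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler \
  crawl --url https://example.com/ --generateWACZ --collection example-crawl
```

The resulting .wacz file ends up under the mounted crawls directory, in a folder named after the collection.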

Within the sandboxed Docker container on your machine, you can issue commands and configure the crawl with a YAML file; the crawler then crawls the website and outputs the zipped .wacz file. One limitation of this tool is that you cannot add Authsign cryptographic signatures to it.

BrowsertrixCLI.gif

Automation & Custom Configuration

Both of these options allow you to set up an automated web crawl based on certain ‘starting point’ URLs, with custom configuration of how the crawl proceeds across several pages. You can set options like which specific pages you want to crawl, how many links deep you want to crawl, whether you want to block ads, whether to take screenshots, how to customize the output you get from the crawl, and more.
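With the CLI tool, for example, these options can be collected into the YAML configuration file mentioned above. The sketch below is based on browsertrix-crawler's documented options; the seed URL and collection name are placeholders, and option names may differ between crawler versions:

```bash
# Sketch: write a crawl configuration and pass it to the crawler.
# All values are illustrative.
cat > crawl-config.yaml <<'EOF'
collection: example-crawl      # name for this crawl's output
generateWACZ: true             # package the result as a zipped .wacz file
workers: 2                     # number of parallel browser workers
blockAds: true                 # block known ad domains during capture
screenshot: view               # save a screenshot of each page
seeds:
  - url: https://example.com/  # 'starting point' URL for the crawl
    scopeType: prefix          # only follow links under this URL prefix
    depth: 2                   # how many links deep to crawl
EOF

# Run the crawler with the config file mounted into the container.
docker run -v $PWD/crawl-config.yaml:/app/crawl-config.yaml \
  -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler \
  crawl --config /app/crawl-config.yaml
```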

Manual Crawler

**ArchiveWeb.page** is a Chrome extension that, once installed, makes a recording of all of the pages that you visit while it is capturing. You can use the Autopilot feature to automatically scroll through and capture everything rendered on a given page, and then navigate to a view of the archive you have created.

ArchiveWeb.page.gif