A single Lily node is incapable of extracting chain state at the rate the Filecoin blockchain is growing. This causes Lily to skip processing data while watching the chain, which prevents Lily from meeting the goals we are trying to achieve (mostly 1-4).
We have tried to work around this by allowing Lily to spend 2*BlockTime (60s) processing chain state while queueing subsequent work that arrives during processing. The thought was that although Lily may fall behind briefly, it would eventually catch back up, under the assumption that some tipsets take less than a BlockTime (30s) to process. Unfortunately this has not been the case, and we have arrived at the problem outlined here:
> As you approach maximum throughput, average queue size – and therefore average wait time – approaches infinity.
Lily is essentially acting as a queue with a single processor attached, and unfortunately its average capacity utilization is 100%, meaning Lily will never catch up.
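To make the problem concrete, here is a back-of-the-envelope sketch using the textbook single-server (M/M/1) queueing model. Treating tipset arrivals as exactly one per BlockTime and processing times as memoryless are simplifying assumptions; the shape of the result is what matters:

```latex
% Utilization \rho: mean tipset processing time s divided by the
% 30s inter-arrival time (one tipset per BlockTime).
\rho = \frac{\lambda}{\mu} = \frac{s}{30\,\mathrm{s}}

% Expected number of tipsets waiting in an M/M/1 queue:
L_q = \frac{\rho^2}{1 - \rho}

% e.g. s = 27s gives \rho = 0.9 and L_q = 8.1 waiting tipsets;
% as s \to 30s, \rho \to 1 and L_q \to \infty. At the ~100%
% utilization observed above, the backlog grows without bound.
```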
In order to keep up with the chain we need to distribute the work across multiple Lily nodes, as illustrated by the diagram below:
Now that https://github.com/filecoin-project/lily/pull/869 has landed, Lily is ready to operate in such a configuration; a (once) working prototype of this pattern was implemented in https://github.com/filecoin-project/lily/pull/886. That prototype was able to keep up with the chain while operating at a confidence of 1. In order to turn the prototype into a production-ready solution we need to settle on a queue (Redis, RabbitMQ, ActiveMQ, Kafka, etc.) and a worker-pattern implementation (software that creates and assigns work and manages a pool of workers). I will first lay out our options for a worker pattern, since the queue implementation we use will depend on that choice.
A worker pattern has a mechanism (client) to distribute work across multiple machines (workers). The client creates tasks and places them into a queue; a worker pulls a task from the queue, completes it, and repeats. The number of workers can scale horizontally to accommodate the load produced by the client. In the diagram (and prototype) above, Lily can act as both a worker (Lily Daemon) and a client (Lily Daemon Notifier). A task in our case is a TipSet to process. There are a few open-source (OS) solutions that implement this pattern, but we could also build our own. Below I will outline the OS solutions - I don't think we should build our own, but we could fork an existing solution to meet our needs, or even contribute back upstream. A minimal sketch of the pattern follows.
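For illustration only, here is a broker-agnostic Go sketch of the pattern. The `TipSetTask` and `Queue` names are hypothetical stand-ins, not Lily APIs; the in-memory channel queue stands in for whatever broker we pick:

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// TipSetTask is the unit of work: a tipset (identified here only by
// height, for simplicity) that a worker should process.
type TipSetTask struct {
	Height int64
}

// Queue abstracts the broker (Redis, RabbitMQ, etc.). chanQueue below
// is an in-memory stand-in used purely for illustration.
type Queue interface {
	Enqueue(ctx context.Context, t TipSetTask) error
	Dequeue(ctx context.Context) (TipSetTask, bool)
}

type chanQueue struct{ ch chan TipSetTask }

func (q chanQueue) Enqueue(ctx context.Context, t TipSetTask) error {
	select {
	case q.ch <- t:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func (q chanQueue) Dequeue(ctx context.Context) (TipSetTask, bool) {
	select {
	case t, ok := <-q.ch:
		return t, ok
	case <-ctx.Done():
		return TipSetTask{}, false
	}
}

func main() {
	ctx := context.Background()
	q := chanQueue{ch: make(chan TipSetTask, 16)}

	// Workers: pull tasks off the queue and process them. The worker
	// count can grow to match the rate at which tasks arrive.
	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for {
				t, ok := q.Dequeue(ctx)
				if !ok {
					return
				}
				fmt.Printf("worker %d: processing tipset at height %d\n", id, t.Height)
			}
		}(i)
	}

	// Client: creates one task per tipset as the chain advances.
	for h := int64(0); h < 10; h++ {
		if err := q.Enqueue(ctx, TipSetTask{Height: h}); err != nil {
			panic(err)
		}
	}
	close(q.ch)
	wg.Wait()
}
```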
From the README:
Asynq is a Go library for queueing tasks and processing them asynchronously with workers. It's backed by Redis and is designed to be scalable yet easy to get started. High-level overview of how Asynq works:
- Client puts tasks on a queue
- Server pulls tasks off queues and starts a worker goroutine for each task
- Tasks are processed concurrently by multiple workers
This library is currently undergoing heavy development with frequent, breaking API changes. As of today (April 13th 2022), the last commit to master was 12 hours ago. This appears to be an active and healthy project. The prototype mentioned above used this library.
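To ground this, here is a minimal sketch of how the notifier (client) and daemon (worker) sides might use Asynq. The task type name `tipset:process`, the payload shape, and the single-process wiring are illustrative assumptions, not Lily's actual design, and the API shown reflects Asynq as of this writing (it may change given the caveat above):

```go
package main

import (
	"context"
	"encoding/json"
	"log"

	"github.com/hibiken/asynq"
)

// TipSetPayload identifies the tipset to process. Illustrative only;
// Lily's real task payload may differ.
type TipSetPayload struct {
	Height int64 `json:"height"`
}

const redisAddr = "localhost:6379" // assumed Redis location

// enqueueTipSet is the client (notifier) side: it creates one task
// per tipset and puts it on the queue.
func enqueueTipSet(height int64) error {
	client := asynq.NewClient(asynq.RedisClientOpt{Addr: redisAddr})
	defer client.Close()

	payload, err := json.Marshal(TipSetPayload{Height: height})
	if err != nil {
		return err
	}
	_, err = client.Enqueue(asynq.NewTask("tipset:process", payload))
	return err
}

// handleTipSetTask is the worker (daemon) side: Asynq invokes it once
// per dequeued task; this is where extraction work would happen.
func handleTipSetTask(ctx context.Context, t *asynq.Task) error {
	var p TipSetPayload
	if err := json.Unmarshal(t.Payload(), &p); err != nil {
		return err
	}
	log.Printf("processing tipset at height %d", p.Height)
	return nil // a non-nil error causes Asynq to retry the task
}

func main() {
	// Server pulls tasks off the queue and runs handlers concurrently.
	srv := asynq.NewServer(
		asynq.RedisClientOpt{Addr: redisAddr},
		asynq.Config{Concurrency: 10},
	)
	mux := asynq.NewServeMux()
	mux.HandleFunc("tipset:process", handleTipSetTask)

	if err := enqueueTipSet(123456); err != nil {
		log.Fatal(err)
	}
	if err := srv.Run(mux); err != nil {
		log.Fatal(err)
	}
}
```

In production the client and server would run in separate processes (Lily Daemon Notifier and Lily Daemon respectively), sharing only the Redis instance.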
Pros:
Cons: