Top Level Goals
- P0: Improve performance and reliability of data transfer stack in storage and retrieval deals for Estuary
- Metrics:
- Estuary data transfer speeds (sending, receiving)
- Estuary resource usage (RAM)
- Estuary data transfer success rates (need to build dashboards here)
- Estuary uptime related to data transfer (we only have general stats, but this mostly covers issues that cause massive drops in data transfer speeds necessitating a restart)
- Goals:
- Near 100% of data offloaded in Estuary (currently blocked on data transfer speeds)
- 10x improvement in overall Estuary data transfer speeds
- Estuary network connections currently have 20x the bandwidth that is actually being used
- 99.9% uptime (minimum 1 week between restarts related to data transfer)
- 99% Estuary data transfer success rate
- P1: Scale the ability of multiple team members (and network) to make meaningful improvements to the data transfer stack
- Goals:
- By the end of the project, everyone on the team should understand and align on the core design and architecture of the stack.
- By the end of the project, everyone on the team should understand the steps needed to triage issues, identify root causes, and implement solutions.
- There should be a clear set of learning materials and learning path to achieving competence in our data transfer stack so that future contributors can also come on board
- P2: Alignment of data transfer stack with design of Markets V2, future retrieval clients, other usages of go-graphsync
- Goals:
- Define minimal interfaces for data transfer to support current and future clients, and refactor towards those interfaces.
- Use additional clients for performance testing as needed.
Primary High Level Strategies
- Embed in Estuary, with input from other graphsync users - Markets V2, Retrieval Client, Provider, etc:
- Our work should be based on the needs of those who use data transfer. The best way to do this is to embed with those teams, fully or partially. We also want a shared understanding of data transfer.
- Build significant on-demand request introspection into go-graphsync and go-data-transfer, and expose this in Estuary. At any given time, we should be able to collect significant history about what has happened with a transfer. We may also want introspection at the peer level. Introspection tools will be available to miners who want to use them.
- Focus initial performance work on a controlled set of miners: Magik + Sofia Miner initially, expanding to MinerX. Prove improvements can maximize bandwidth with select miners before expanding the pool.
- Revisit go-data-transfer / go-graphsync boundary — Graphsync is a transport. Go-data-transfer is a control protocol (mostly for facilitating optimistic fair exchange). Currently the boundary is extremely complicated, and go-data-transfer does a lot that is not truly transport independent.
- Over-prioritize ramp-up initially, and recognize that initial progress will be slower. Use smaller refactors as a way to ramp up. There is no way we can have real discussions about performance and large refactors without baseline shared competency among the team. It is ok and even expected to have a couple of weeks of small refactors and bug fixes prior to making big decisions. Prioritize refactors that help people understand the code first.
Secondary Strategies
- Clear triage process for incoming Estuary issues
- TPM triages and determines severity as needed
- P0 - drop work and fix it
- P1 - diagnose root cause in 1-2 days, fix in a week
- P2 - diagnose / fix as it reaches top of prioritization queue
- Lead diagnoses issue and proposes fix
- eventually all team members can do this
- Implementation of solution is prioritized separately as needed
- Build regression testing across versions to run in CI for releases
- Possibly refactor go-graphsync toward:
- More deterministic concurrency. The best way to minimize locks, race conditions, etc. is to simply minimize the things that happen asynchronously.
- Simpler network protocol. One request/response per message
- Pushing selector traversal controls into go-ipld-prime, removing complex wrappers of go-ipld-prime in the codebase
Scope Limits
- Initial duration is 3 months. At 3 months, reassess whether goals are met, whether other critical goals have emerged, and whether the project should be extended
- Protocol focus is go-graphsync. Implementation of other protocols is primarily to prove out go-data-transfer support for multiple protocols
- Team at proposed size is NOT also developing the web3 retrieval client. (see Team section)
- Core code repos are go-data-transfer, go-graphsync
- Delve into other dependent repos (go-ipld-prime, blockstore impls, etc.) only as needed to deliver on core goals (flagged potential problem: selector DoS, a large effort that is possibly critical to performance)