Large neuroscience datasets require specialized workflows: breaking datasets into storage-efficient chunks, ensuring persistent availability, validating interoperability specifications, and managing metadata for linked resources.
The Opscientia team has begun testing decentralized file storage workflows with small unit-test datasets, with plans to build up to 284 TB of mixed-content datasets indexed by the metadata aggregator DataLad.
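A first unit-test-scale experiment can exercise the chunking step on its own: split a file into fixed-size blocks and record a content hash per block. The sketch below is a minimal stand-in, assuming a 1 MiB chunk size and plain SHA-256 digests in place of IPFS's UnixFS chunker and real CIDs; the input filename is a hypothetical BIDS-style fixture.

```python
import hashlib
import json
from pathlib import Path

CHUNK_SIZE = 1 << 20  # 1 MiB; assumed, to be tuned against Filecoin sector packing

def chunk_file(path: Path, out_dir: Path) -> dict:
    """Split `path` into fixed-size chunks, write each chunk to
    `out_dir`, and return a manifest of per-chunk digests."""
    out_dir.mkdir(parents=True, exist_ok=True)
    manifest = {"source": path.name, "chunk_size": CHUNK_SIZE, "chunks": []}
    with path.open("rb") as f:
        for i, block in enumerate(iter(lambda: f.read(CHUNK_SIZE), b"")):
            digest = hashlib.sha256(block).hexdigest()  # stand-in for a CID
            (out_dir / f"{digest}.bin").write_bytes(block)
            manifest["chunks"].append({"index": i, "sha256": digest})
    return manifest

if __name__ == "__main__":
    m = chunk_file(Path("sub-01_task-rest_bold.nii.gz"), Path("chunks"))
    Path("manifest.json").write_text(json.dumps(m, indent=2))
```

In production this role is played by IPFS's chunker and CAR packaging; the manifest here only mimics content addressing so the storage and retrieval questions below can be tested end to end.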
Resources
GitHub - opscientia/desci-storage
https://miro.com/app/board/o9J_ltSfj8M=/?invite_link_id=921306634511
Problem Statements
Storage design and optimization
- What is the optimal storage format (batch storage of chunked data) that can be efficiently stored on and retrieved from Filecoin, shared over IPFS, and that plays well with standard specifications?
- How can Filecoin be used to easily migrate and persist data already served over HTTP by single nodes in the network? (See the migration sketch after this list.)
- What are the long-term costs of storage on Filecoin, and how will they scale with the project? How does the cost/performance trade-off compare with legacy storage solutions (institutional or cloud service providers)? A back-of-envelope comparison follows this list.
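For the migration question, one low-friction path is to stream files already served over HTTP straight into a local IPFS (Kubo) node through its RPC API. A sketch assuming a Kubo daemon on the default port 5001 and a hypothetical single-node HTTP archive:

```python
import requests

KUBO_RPC = "http://127.0.0.1:5001/api/v0"  # default Kubo RPC endpoint

def migrate_http_file(url: str) -> str:
    """Stream a file served over HTTP into the local IPFS node
    and return the resulting CID."""
    resp = requests.get(url, stream=True, timeout=60)
    resp.raise_for_status()
    # Kubo's /add endpoint takes a multipart upload over POST and,
    # by default, pins the content on the local node.
    add = requests.post(f"{KUBO_RPC}/add", files={"file": resp.raw})
    add.raise_for_status()
    return add.json()["Hash"]

if __name__ == "__main__":
    # Hypothetical dataset URL; any HTTP-served file works.
    cid = migrate_http_file("https://example.org/datasets/sub-01.tar.gz")
    print("added as", cid)
```

Once content is addressable by CID, wrapping it into Filecoin deals (e.g. via a Lotus node or a deal-making service) is a separate, batched step.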
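The cost question reduces to arithmetic once prices are fixed, so even a back-of-envelope script is useful for tracking how the trade-off moves over time. Every rate below is an illustrative placeholder, not a quoted price:

```python
# Back-of-envelope storage cost comparison; all rates are
# placeholders, not quotes.
DATASET_TB = 284
REPLICAS = 3                    # assumed redundancy across storage providers
FIL_PER_GIB_YEAR = 0.0002       # placeholder Filecoin deal price, in FIL
FIL_USD = 5.0                   # placeholder FIL spot price
CLOUD_USD_PER_GB_MONTH = 0.023  # placeholder cloud object-storage rate

gib = DATASET_TB * 1024  # treating TB as TiB is fine at this precision
filecoin_usd_year = gib * REPLICAS * FIL_PER_GIB_YEAR * FIL_USD
cloud_usd_year = DATASET_TB * 1000 * CLOUD_USD_PER_GB_MONTH * 12

print(f"Filecoin (x{REPLICAS} replicas): ~${filecoin_usd_year:,.0f}/yr")
print(f"Cloud (single copy):        ~${cloud_usd_year:,.0f}/yr")
```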
Hybrid storage
- If participants are pinning subsets of the data but still find certain portions missing, how can they efficiently re-hydrate the cached data from deals on Filecoin? This may be as simple as a retrieval script (see the sketch after this list) or may require research into the developing Filecoin retrieval markets.
- How can we build resilient systems that utilize the best properties of legacy storage solutions with IPFS and Filecoin?
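For the re-hydration question, the simplest retrieval script diffs a published manifest of CIDs against what the local node already pins and fetches only the gaps. The sketch below talks to the Kubo RPC API and assumes a hypothetical manifest listing root CIDs under a `cids` key; a Filecoin retrieval would replace the pin call for content no peer still serves:

```python
import json
import requests
from pathlib import Path

KUBO_RPC = "http://127.0.0.1:5001/api/v0"  # default Kubo RPC endpoint

def pinned_cids() -> set[str]:
    """Return the set of CIDs the local node currently pins."""
    resp = requests.post(f"{KUBO_RPC}/pin/ls")
    resp.raise_for_status()
    return set(resp.json().get("Keys", {}))

def rehydrate(manifest_path: Path) -> list[str]:
    """Pin every manifest CID that is missing from the local node."""
    wanted = set(json.loads(manifest_path.read_text())["cids"])
    missing = sorted(wanted - pinned_cids())
    for cid in missing:
        # Fetches from connected peers / the DHT; a Filecoin
        # retrieval would slot in here for fully cold content.
        requests.post(f"{KUBO_RPC}/pin/add", params={"arg": cid}).raise_for_status()
    return missing

if __name__ == "__main__":
    print("re-hydrated:", rehydrate(Path("manifest.json")))
```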
Retrieval capabilities
- How do we minimize egress fees to keep the data open and public?
- How do we publish Filecoin deal metadata for the public to ensure transparency?
- How do we automate the curation of deals (e.g., renewing expiring storage contracts)? A sketch of an expiry report follows this list.
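Deal transparency and renewal can start as a small report over an exported deal list, published alongside the dataset. The sketch assumes a hypothetical JSON export with `deal_id`, `piece_cid`, and `end_epoch` fields, and uses Filecoin's 30-second epoch time to translate expiry into days:

```python
import json
from pathlib import Path

EPOCH_SECONDS = 30        # Filecoin block time
RENEWAL_WINDOW_DAYS = 30  # assumed lead time for renewing a deal

def expiring_deals(deals_path: Path, current_epoch: int) -> list[dict]:
    """Return deals whose end_epoch falls inside the renewal window."""
    deals = json.loads(deals_path.read_text())
    window = RENEWAL_WINDOW_DAYS * 24 * 3600 // EPOCH_SECONDS
    return [d for d in deals if d["end_epoch"] - current_epoch <= window]

if __name__ == "__main__":
    # current_epoch would normally come from a chain-head query;
    # hard-coded here to keep the sketch self-contained.
    now = 3_500_000
    for d in expiring_deals(Path("deals.json"), now):
        days = (d["end_epoch"] - now) * EPOCH_SECONDS / 86_400
        print(f"deal {d['deal_id']} ({d['piece_cid']}) expires in ~{days:.0f} days")
```

Publishing the same deals.json next to the dataset gives the public a verifiable record of where each piece is stored, addressing the transparency question above.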
Experiments