Decision made to go for Box. Admin users are Alex Butterworth, Anna-Maria Sichani and Arran Rees. Speak to any of them if you want to know more or gain access to the dataset storage area.

Scope of document

This document outlines options for storing large files and datasets (50MB → 10GB) within the Congruence Engine project. The primary use case is third-party datasets used within active investigations or digital pipelines.

Note: It does not cover the day-to-day storage of Word and PDF documents used within the project to communicate proposals and document the project itself, as it is assumed these will live within Basecamp or Notion.

Size of Files

For the purpose of this document, a small dataset is considered to be under 50MB and capable of being uploaded via a web browser and/or transmitted as an email attachment, while a large dataset is considered to be a file or set of files over 1GB that may need specialist upload tools and whose storage size may have cost or bandwidth issues attached to it. As of today the CE project is not proposing to hold datasets in excess of 10GB, such as large volumes of video content; these should be held and processed on an as-needed basis (with separate technical consideration) due to the cost of storing such large volumes of content.
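The size bands above could be sketched in code along the following lines, for example to flag files before upload. This is only an illustration: the function name, the decimal byte units, and the "medium" and "out-of-scope" labels are assumptions, since the text names only the small and large bands.

```python
# Thresholds are taken from the text above.
# Assumption: decimal units (1 MB = 10**6 bytes, 1 GB = 10**9 bytes).
MB = 10**6
GB = 10**9

SMALL_MAX = 50 * MB    # "small": under 50MB
LARGE_MIN = 1 * GB     # "large": over 1GB
PROJECT_MAX = 10 * GB  # above this, data is handled on an as-needed basis


def classify_dataset(size_bytes: int) -> str:
    """Return a size band for a dataset of the given total size in bytes."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    if size_bytes < SMALL_MAX:
        return "small"
    if size_bytes <= LARGE_MIN:
        # Hypothetical label: the document does not name the 50MB to 1GB band.
        return "medium"
    if size_bytes <= PROJECT_MAX:
        return "large"
    # Hypothetical label for data over 10GB, which the project does not plan to hold.
    return "out-of-scope"
```

For instance, classify_dataset(2 * GB) returns "large", while anything above 10GB is flagged for separate technical consideration.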

Primary users

There are a variety of users who may need access to the above datasets, although further clarification (and pragmatism) may be required in order to keep costs and administration within sensible limits. For example, it may make sense to use a more consumer-oriented product like Google Drive / Google Workspace for a particular subset of users, where ease of administration and granular access control are required, while using a service like Amazon AWS S3 for very large datasets, where storage costs are a bigger issue and administration and access are more complex, but where a much smaller number of technical users need access.

Recommendation

Use a service like Google Drive or Dropbox for documents up to 1GB, alongside AWS S3 buckets for very large datasets used by a small set of technical project partners as needed.

Alternatively, use SMG OneDrive, expanded to non-SMG users where required. Again, this could be used alongside AWS S3 buckets for very large datasets used by a small, select group of technical project partners.

Due to the risk of rapidly increasing storage and the associated costs, someone should be tasked with maintaining a broad ledger of what content is being stored, the reasons why, and when the data is no longer needed. This could take the form of a ‘Readme’ file in each folder explaining the data held within that folder, along with some general auditing / housekeeping.
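As a rough illustration of the housekeeping idea, the sketch below walks a locally synced copy of the storage area and produces one ledger row per top-level folder, noting its size and whether a Readme is present. The accepted Readme file names, the function names, and the assumption that the storage is available as a local folder are all hypothetical; this is not an existing project tool.

```python
from pathlib import Path

# Assumed naming convention for the per-folder Readme (compared case-insensitively).
README_NAMES = {"readme", "readme.txt", "readme.md"}


def folder_size(folder: Path) -> int:
    """Total size in bytes of all files under the folder, including subfolders."""
    return sum(p.stat().st_size for p in folder.rglob("*") if p.is_file())


def audit_storage(root: Path) -> list[dict]:
    """Return one ledger row per top-level folder under root, sorted by path."""
    rows = []
    for folder in sorted(p for p in root.iterdir() if p.is_dir()):
        # Only a Readme directly inside the folder counts, not one in a subfolder.
        has_readme = any(
            f.is_file() and f.name.lower() in README_NAMES
            for f in folder.iterdir()
        )
        rows.append(
            {
                "folder": folder.name,
                "bytes": folder_size(folder),
                "has_readme": has_readme,
            }
        )
    return rows
```

Running audit_storage(Path("path/to/synced/area")) periodically and reviewing the rows where has_readme is False would give a simple starting point for the audit.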

Also, Box looks like a very compelling offer / product, but it would be good to better understand the minimum number of users and whether external collaborators are free on the Business Plus plan. With a minimum of three accounts that makes it £60/month - still a modest / competitive cost, but it would be good to understand the need for a minimum of three accounts and how the user accounts differ from external collaborators. Certainly a strong contender if external users aren’t limited, less compelling if it is £20/month per user who needs edit / upload rights.

https://www.box.com/en-gb/pricing
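The two pricing scenarios raised above can be compared with a small calculation. The sketch below uses only the figures quoted in this note (£20/month per user, minimum of three accounts); they have not been checked against current Box pricing, and the function and parameter names are hypothetical.

```python
# Figures as quoted in the note above; not verified against current Box pricing.
PRICE_PER_USER_GBP = 20  # per user, per month
MIN_LICENSED_USERS = 3   # minimum number of accounts on the plan


def monthly_cost_gbp(
    internal_users: int,
    external_editors: int = 0,
    collaborators_free: bool = True,
) -> int:
    """Estimate the monthly cost in GBP under the two scenarios discussed above.

    If collaborators_free is True, external editors do not need a paid account.
    If False, every external user needing edit / upload rights is charged.
    """
    if internal_users < 0 or external_editors < 0:
        raise ValueError("user counts must be non-negative")
    paid = internal_users if collaborators_free else internal_users + external_editors
    # The plan minimum applies even if fewer accounts are actually needed.
    return max(paid, MIN_LICENSED_USERS) * PRICE_PER_USER_GBP
```

With three internal users and ten external editors, this gives £60/month if collaborators are free and £260/month if each external editor needs a paid account, which is why the answer to that question matters.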

Some additional thoughts