This is a space to document the discussions, challenges and decisions made during the project regarding our technical infrastructure.
From early on, the project adopted Notion as a platform for supporting collaborative documentation and knowledge exchange, alongside the existing use of Basecamp for wider project communication. All members of the team can find information about: project meetings; ongoing research investigations; developing project pipelines; events and publications; project partners; and the Congruence Engine approach to ethics. Two sections within the Notion wiki contain detailed information about data, datasets and investigations. The pages on ‘data documentation’ include a database with information about all datasets identified, requested and collected, as well as guidance on data documentation, file-naming conventions and related practices. The nine members of the team with full editing rights (guest editors can be added to particular pages at no cost) have also been adding weekly reflective notes, which describe what they have been working on as well as any thoughts and ideas that have been sparked by collaboration. Supplemented by general team updates and newsletters, this feature has been enormously beneficial in creating a sense of shared ownership across the project.
A huge step towards Congruence Engine’s digital re-orientation has been the integration of a large number of datasets from project partners over the last few months: more than 11,000 files and 129.2 GB of data (with a further 1.5 TB to come soon) have been integrated and curated by the project team. These newly acquired datasets are allowing the team to conduct investigations that span a diverse repertoire of collections and datasets, and to work at scale. To ensure the effective curation and responsible (re)use of datasets within the project, we have developed and employed a comprehensive data management strategy, adopting FAIR principles for all data-related processes, including data ingestion, storage, maintenance, curation and (re)use. Among other things, we have developed a database of datasets (currently hosted in Notion as a CSV file) and enhanced documentation, including a datasheet for each dataset, file-naming conventions and copyright/licensing guidance. Data model cards will also be added as part of the documentation. We plan to treat the data-focused outcomes of investigations as research outputs in their own right, by developing an open-source infrastructure (on GitHub) and a strategy to document and openly disseminate code, datasets and findings. This is very much in line with early decisions for REF2029 aimed at the inclusion of non-traditional outputs in the evolving landscape of interdisciplinary research culture.
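To illustrate how a CSV-backed database of datasets and file-naming conventions can support automated curation checks, here is a minimal Python sketch. The naming pattern (partner code, dataset slug, date, version) and the `filename` column are assumptions for illustration only; the project’s actual convention and register schema may differ.

```python
import csv
import io
import re

# Hypothetical file-naming convention: <partner>_<dataset>_<YYYY-MM-DD>_v<NN>.<ext>
# e.g. "smg_locomotives_2023-04-01_v01.csv". This pattern is illustrative,
# not the project's documented convention.
NAME_PATTERN = re.compile(
    r"^(?P<partner>[a-z]+)_"
    r"(?P<dataset>[a-z0-9-]+)_"
    r"(?P<date>\d{4}-\d{2}-\d{2})_"
    r"v(?P<version>\d{2})\.(?P<ext>[a-z0-9]+)$"
)

def check_filenames(filenames):
    """Return the filenames that violate the naming convention."""
    return [name for name in filenames if not NAME_PATTERN.match(name)]

def audit_register(csv_text):
    """Audit a CSV register of datasets (assumed 'filename' column)
    and return the non-conforming entries."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return check_filenames(row["filename"] for row in rows)
```

A check like this could run whenever the register is exported from Notion, flagging files to rename before they enter longer-term storage.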
For data ingestion and short-term storage, a light version of Box has been employed to address the immediate imperative of gathering data for inspection and assessment, as a first step towards the aggregation of partners’ datasets (master files), with parallel attention to the size of the data, the diversity of file formats and the different licensing frameworks that apply. Additional or alternative requirements for data storage and management will be analysed, and solutions tested, as they arise.
A project-owned GitHub repository is also maintained so that working datasets and code can be shared and developed among team members as part of ongoing investigations. A dedicated data management task has been assigned to the Digital Humanities Research Fellow to ensure smooth and efficient data ingestion and sustainability throughout the project. A responsible open access strategy informs the data management, processing and infrastructure choices throughout the project.