Assessing the FAIRness of the ARCHIVER long-term data preservation services

Digital preservation has emerged in recent years as a fast-moving and growing community of practice that is of ubiquitous relevance, but in which capability is unevenly distributed. Digital preservation in the research community is closely aligned with the FAIR principles and is delivered, albeit unevenly, through a complex specialist infrastructure comprising not only technology but also the capacity of staff and the 'know why' of policy (see Currie, A., & Kilbride, W. (2021). FAIR Forever? Long Term Data Preservation Roles and Responsibilities, Final Report (Version 7). Zenodo. https://doi.org/10.5281/zenodo.4574234).
The European Open Science Cloud (EOSC) initiative has worked extensively to promote and enable access to Open Science data, with the stated aim of ensuring that researchers can maximize the value of their research processes by sharing large-scale Research Infrastructures (RIs). The importance of advanced long-term preservation in enabling the reproducibility of research results is emphasized by the EOSC Strategic Research and Innovation Agenda (SRIA) and by reports of relevant bodies such as the Digital Preservation Coalition.

ARCHIVER is making a substantial contribution to this vision.

Started in January 2019, ARCHIVER is a unique initiative currently running in the EOSC framework that is competitively procuring R&D services for archiving and digital preservation. The ARCHIVER tenderers were selected through an open and competitive procurement process. Between December 2020 and August 2021, three consortia worked on innovative prototype solutions for long-term data preservation, in close collaboration with CERN, EMBL-EBI, DESY and PIC. ARCHIVER procured R&D services that address long-term preservation needs across the entire research data management cycle. The resulting services are sustainable and provide, at scale, the functionality needed to implement FAIR Data Management Plans, using Trustworthy Digital Repositories (TDRs) certified according to best practices (e.g. ISO 16363 and CoreTrustSeal).

As part of its R&D validation process, ARCHIVER needs to assess the FAIRness of the resulting ARCHIVER repository services.

The F-UJI tool, developed in the context of the FAIRsFAIR project, responds to this need: it provides a programmatic assessment of the FAIRness of research data objects, based on metrics developed by FAIRsFAIR, breaking the assessment down into concrete tests that could be included in ARCHIVER.
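
To illustrate what such a programmatic assessment looks like, the sketch below calls a running F-UJI instance over its REST evaluate endpoint. This is a minimal example only: the host, port, credentials and request field names are assumptions about one version of the F-UJI service and may differ in other deployments.

    import requests

    # Assumed endpoint of a locally running F-UJI server; adjust host, port
    # and credentials to match your own deployment.
    FUJI_API = "http://localhost:1071/fuji/api/v1/evaluate"
    FUJI_AUTH = ("username", "password")  # placeholder credentials

    def assess_fairness(identifier: str) -> dict:
        """Ask F-UJI to evaluate one research data object (a DOI or landing-page URL)."""
        payload = {
            "object_identifier": identifier,  # the object to be assessed
            "test_debug": True,               # include per-test debug messages
            "use_datacite": True,             # allow metadata lookup via DataCite
        }
        response = requests.post(FUJI_API, json=payload, auth=FUJI_AUTH, timeout=300)
        response.raise_for_status()
        return response.json()

    if __name__ == "__main__":
        # Example: assess the report cited above by its Zenodo DOI.
        report = assess_fairness("https://doi.org/10.5281/zenodo.4574234")
        # The response contains one entry per FAIRsFAIR metric plus a summary;
        # the exact field names depend on the F-UJI version in use.
        print(report.get("summary", {}))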


The initial assessment started by gathering some basic information about the current repositories from the organisations involved in the ARCHIVER project, namely EMBL-EBI, DESY, PIC and CERN, in order to become familiar with the tool.

The following information was shared for each repository (a sketch of one such profile follows the list):

  • Data domains (scientific discipline, community)
  • Assessment Target (e.g. subset of data holdings)
  • Data access level (e.g. if restrictions in place)
  • Meta(data) dissemination (OAI-PMH, REST, Content Negotiation, Schema.org)
  • Metadata standards (e.g. DDI, Dublin Core, schema.org)
  • Semantics (SPARQL endpoint, Vocabularies)
  • Data formats (e.g. discipline specific formats)
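
To make this concrete, the information gathered for one repository could be recorded in a simple structure like the hypothetical one below. The field names mirror the bullet points above; the values are illustrative (loosely based on the EMBL-EBI entry in the list of test datasets) and do not represent an ARCHIVER-defined schema.

    # Hypothetical record of one repository profile; field names mirror the
    # bullet list above and the values are illustrative, not an ARCHIVER schema.
    repository_profile = {
        "provider": "EMBL-EBI",
        "data_domains": ["life sciences", "genomics"],
        "assessment_target": "subset of public data holdings",
        "data_access_level": "public, no restrictions",
        "metadata_dissemination": ["OAI-PMH", "REST", "Content Negotiation", "Schema.org"],
        "metadata_standards": ["Dublin Core", "schema.org"],
        "semantics": {"sparql_endpoint": None, "vocabularies": []},
        "data_formats": ["discipline-specific formats"],
    }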

The following datasets were used for a preliminary test of the FAIR assessment tool (a batch-assessment sketch follows the list):

  • EMBL-EBI: The ‘1000 genomes’ dataset, containing 1000 human genomes, all publicly available with no restrictions
  • DESY: Serial femtosecond crystallography data and metadata, including links to the CrystFEL Beam File, CrystFEL Geometry File, processing scripts and diffraction patterns
  • CERN: Audiovisual recordings of conference talks; an example of a CMS collision dataset in AOD format; an example of a CMS simulated dataset in AODSIM format; a simple example of an OPERA neutrino event dataset
  • PIC: A simulated (“fake”) dataset mimicking one night of raw data from the MAGIC Telescopes
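
Building on the earlier F-UJI sketch, a preliminary test like this one could be scripted as a batch run over the four datasets. The identifiers below are deliberate placeholders, since the story does not give the DOIs or landing pages of the test datasets; assess_fairness() is the function defined in the sketch above.

    # Hypothetical batch assessment of the four test datasets; replace the
    # placeholder identifiers with the datasets' actual DOIs or landing pages.
    test_datasets = {
        "EMBL-EBI": "<identifier of the 1000 genomes dataset>",
        "DESY": "<identifier of the serial femtosecond crystallography dataset>",
        "CERN": "<identifier of a CERN Open Data record>",
        "PIC": "<identifier of the simulated MAGIC raw-data night>",
    }

    for provider, identifier in test_datasets.items():
        report = assess_fairness(identifier)  # from the earlier sketch
        print(provider, report.get("summary", {}))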
