Archiving and preservation for research environments

Archiving Genomic and Imaging Data

Natural Sciences
Engineering and Technology
Medical and Health Sciences
Agricultural Sciences

CRG / CNAG (Centre for Genomic Regulation / Centro Nacional de Análisis Genómico)

Organisation type: 
Research institution
Organisation size: 
Large organisation
Organisation Profile: 

The Centre for Genomic Regulation (CRG) is an international biomedical research institute of excellence, created in December 2000. It is a non-profit foundation funded by the Catalan Government through the Department of Business & Knowledge and the Department of Health, the Spanish Ministry of Science & Innovation, the "la Caixa" Banking Foundation, and includes the participation of Pompeu Fabra University.

The mission of the CRG is to discover and advance knowledge for the benefit of society, public health and economic prosperity. The CRG believes that the medicine of the future depends on the groundbreaking science of today. This requires an interdisciplinary scientific team focused on understanding the complexity of life from the genome to the cell to a whole organism and its interaction with the environment, offering an integrated view of genetic diseases.

The CNAG-CRG is a non-profit organization funded by the Spanish Ministry of Economic Affairs & Digital Transformation and the Catalan Government through the Economy and Knowledge Department and the Health Department. Competitive grants and contractual research with the private sector provide additional funds. On 1 July 2015, the CNAG was integrated into the CRG.

The CNAG-CRG was created in 2009 with the mission to carry out projects in DNA sequencing and analysis in collaboration with researchers from Catalonia, Spain and the international research community in order to ensure the competitiveness of our country in the strategic area of genomics. It started operations in March 2010 with twelve latest-generation sequencing systems, which have enabled the center to build a sequencing capacity of over 1000 Gbases/day, the equivalent of completely sequencing ten human genomes every 24 hours. This capacity positions the CNAG-CRG as one of the largest European centers in terms of sequencing capacity. The Center has a staff of highly qualified individuals, 50% of whom hold PhD degrees. The bioinformatics team, together with our outstanding computing infrastructure (9 petabytes of data storage and over 3000 computing cores), also positions the CNAG-CRG as a center of excellence in data analysis.

The CNAG-CRG takes part in genome sequencing and analysis projects in areas as diverse as cancer genetics, rare disorders, host-pathogen interactions, the preservation of endangered species, evolutionary studies and the improvement of species of agricultural interest, in collaboration with scientists from universities, hospitals, research centers and companies in the sector of biotechnology and pharma.

Problem definition

The CRG and CNAG generate large amounts of biological data using state-of-the-art instrumentation, including next-generation sequencers, high-resolution microscopes and mass spectrometers. The data is used for cutting-edge research in the following fields:

  • Bioinformatics and Genomics
  • Cell and Developmental Biology
  • Gene Regulation, Stem Cells and Cancer
  • Systems Biology
  • Rare Diseases

As well as generating data for the institutes’ researchers, the following core facilities provide services to external stakeholders and collaborators:

  • Advanced Light Microscopy Unit
  • Biomolecular Screening & Protein Technologies Unit
  • Proteomics Unit
  • Tissue Engineering Unit
  • Flow Cytometry Analysis and Cell Sorting Unit
  • Bioinformatics Unit
  • Genomics Unit

Various types of sequencing services are offered:

  • DNA Sequencing
  • Long Read Sequencing
  • Transcriptome Sequencing
  • Epigenetic Sequencing
  • Single Cell Sequencing
  • 3D Genomics

Currently, CRG and CNAG have over 14 petabytes of storage across various systems at the institute. Data must be stored securely for long periods so that it can be re-analysed as new methods and techniques are developed. Beyond pure research, the institute's instruments are also being applied in the clinical domain, which necessitates encryption of data and compliance with regulatory requirements such as the GDPR.

Since CRG and CNAG provide services to, and collaborate with, a large community external to the institute, they also need a means of sharing data securely with that community and giving it performant access.

Envisaged timeline for implementation of the use case

Currently, CRG and CNAG data are stored in various disk- and tape-based systems. The institute maintains replicas in two data centres in Barcelona, 7 km apart. Keeping up with the data generation rate and managing the multitude of different storage systems is a challenge for the IT team. CRG and CNAG are looking at ways to consolidate and rationalise these systems to simplify management and improve cost-efficiency.

As well as needing safe, secure and cost-effective storage, the institute is actively looking at ways to provide FAIR access to data and to give external collaborators secure and performant access. CRG and CNAG need a system that can cater to all of the following elements:

  • Safe storage: protection against silent data corruption, multiple replicas across multiple locations, snapshotting and checksumming
  • Secure storage: flexible access controls, federated identity management, encryption and regulatory compliance
  • FAIR storage: linking data with rich metadata to allow complex querying
  • Accessible storage: efficient bulk transfer of data using fast protocols such as GridFTP
  • Cost-effective storage: low cost per terabyte per month and minimal additional costs (e.g. transfer costs, API usage costs). It would be preferable to have everything amortized into a single figure rather than having to estimate costs from multiple line items.
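
To make the "safe storage" element concrete, the sketch below shows one common way of detecting silent data corruption between two replicas: record a checksum for every file in the primary copy, then recompute and compare on the replica. It is only an illustration and not a description of the institute's systems. The choice of SHA-256, the function names and the directory layout are all assumptions made for this example.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks so large files never sit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def find_mismatches(manifest: dict[str, str], replica_root: Path) -> list[str]:
    """Return relative paths that are missing or differ in the replica."""
    bad = []
    for rel_path, expected in manifest.items():
        candidate = replica_root / rel_path
        if not candidate.is_file() or sha256_of(candidate) != expected:
            bad.append(rel_path)
    return bad
```

In practice the manifest would be stored alongside the data and re-checked on a schedule. With over a billion files, a real deployment would also need to parallelise the hashing, which this sketch does not attempt.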

Given the complexity of these problems and the amount of work needed to move from the current systems to a new one, we estimate a timeline of around 2 years for development and implementation.

Data and metadata Characteristics

The greatest proportion of the data is in standard NGS formats such as FASTQ and BAM files. CRG and CNAG are increasingly dealing with large amounts of FAST5 files generated by nanopore sequencing. There is also a large amount of imaging data in the form of tif, jpg, nd2 and raw files.

In total, the institute has over 1.5 billion files ranging in size from tens of kilobytes up to a few hundred gigabytes. Such large numbers of small files pose difficult technical challenges.

Data holdings at CRG and CNAG are currently growing at a rate of around 2 petabytes a year, but this is likely to increase substantially over the coming years. Various LIMS and database systems are in use for storing metadata associated with the data generation pipelines.
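
For capacity planning, the growth figure can be turned into a rough projection. The sketch below is illustrative only. It takes the 14 petabytes mentioned in the problem definition as the starting point (an assumption, since that figure describes storage rather than data actually held) and compares constant growth of 2 petabytes a year with a purely hypothetical scenario in which the yearly growth itself rises by 25% each year.

```python
def project_storage_pb(current_pb: float, growth_pb_per_year: float,
                       years: int, growth_increase: float = 0.0) -> list[float]:
    """Return projected holdings in PB at the end of each year.

    growth_increase is the assumed fractional rise in the yearly growth
    (0.0 keeps growth constant at growth_pb_per_year).
    """
    totals = []
    total = current_pb
    rate = growth_pb_per_year
    for _ in range(years):
        total += rate
        totals.append(round(total, 1))
        rate *= 1.0 + growth_increase
    return totals


# Figures taken from the text: about 14 PB today, growing by about 2 PB a year.
constant = project_storage_pb(14.0, 2.0, years=5)
# Hypothetical scenario: yearly growth rises by 25% each year (assumed value).
accelerating = project_storage_pb(14.0, 2.0, years=5, growth_increase=0.25)

print("constant growth:    ", constant)
print("accelerating growth:", accelerating)
```

Under constant growth the holdings would reach about 18 petabytes by the end of the 2-year implementation timeline. Any accelerating scenario pushes that figure higher, so a new system would need headroom beyond the current footprint.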

Cost requirements

CRG and CNAG have carried out a detailed, audited analysis of storage costs so that internal and external billing for cost recovery can be done accurately. Since the institute currently hosts its own data, it does not incur any transfer or API costs.
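
When comparing the current self-hosted arrangement with an externally hosted service, transfer and API charges can be folded into a single effective figure per terabyte per month. The sketch below shows one way to do that. Every price and volume in it is a made-up placeholder chosen for illustration, not a figure from CRG, CNAG or any provider.

```python
def effective_cost_per_tb_month(stored_tb: float,
                                storage_price_per_tb: float,
                                egress_tb: float,
                                egress_price_per_tb: float,
                                api_requests: int,
                                price_per_million_requests: float) -> float:
    """Fold storage, transfer and API charges into one EUR/TB/month figure."""
    if stored_tb <= 0:
        raise ValueError("stored_tb must be positive")
    monthly_total = (
        stored_tb * storage_price_per_tb
        + egress_tb * egress_price_per_tb
        + (api_requests / 1_000_000) * price_per_million_requests
    )
    return monthly_total / stored_tb


# All values below are hypothetical placeholders.
example = effective_cost_per_tb_month(
    stored_tb=14_000,                # about 14 PB expressed in TB
    storage_price_per_tb=5.0,        # assumed EUR per TB per month
    egress_tb=200,                   # assumed monthly outbound transfer
    egress_price_per_tb=20.0,        # assumed EUR per TB transferred
    api_requests=500_000_000,        # assumed monthly request count
    price_per_million_requests=4.0,  # assumed EUR per million requests
)
print(f"effective cost: {example:.2f} EUR per TB per month")
```

The effective figure can then be compared directly with the audited internal cost, where the transfer and API terms are zero.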

Benefits and expected impact

The main tangible benefits would be:

  • Consolidation and rationalisation of storage systems
  • Reduction in costs of storage on a euro-per-terabyte-per-month basis
  • Integration of data and metadata according to FAIR principles
  • Providing a means of secure and performant access to external stakeholders

In terms of intangible benefits to CRG and CNAG's researchers and the wider community, the institute would expect storage to become less of a worry, allowing researchers to concentrate on the science instead of dealing with the problems of ‘storage juggling’. Greater visibility of what is stored would also allow more collaboration and could create new links between groups that would not previously have been possible. Finally, access to a secure and regulatory-compliant storage solution would allow the institute to increase its cooperation with local hospitals and to take on more clinically significant projects.