The Centre for Genomic Regulation (CRG) is an international biomedical research institute of excellence, created in December 2000. It is a non-profit foundation funded by the Catalan Government through the Department of Business & Knowledge and the Department of Health, the Spanish Ministry of Science & Innovation, and the "la Caixa" Banking Foundation, with the participation of Pompeu Fabra University.
The mission of the CRG is to discover and advance knowledge for the benefit of society, public health and economic prosperity. The CRG believes that the medicine of the future depends on the groundbreaking science of today. This requires an interdisciplinary scientific team focused on understanding the complexity of life from the genome to the cell to a whole organism and its interaction with the environment, offering an integrated view of genetic diseases.
The CNAG-CRG is a non-profit organization funded by the Spanish Ministry of Economic Affairs & Digital Transformation and the Catalan Government through the Economy and Knowledge Department and the Health Department. Competitive grants and contractual research with the private sector provide additional funds. The CNAG has been integrated into the CRG since 1 July 2015.
The CNAG-CRG was created in 2009 with the mission to carry out projects in DNA sequencing and analysis in collaboration with researchers from Catalonia, Spain and the international research community, in order to ensure the competitiveness of our country in the strategic area of genomics. It started operations in March 2010 with twelve latest-generation sequencing systems, which have enabled the center to build a sequencing capacity of over 1000 Gbases/day, the equivalent of completely sequencing ten human genomes every 24 hours. This capacity positions the CNAG-CRG as one of the largest European centers in terms of sequencing capacity. The Center has a staff of highly qualified individuals, 50% of whom hold PhD degrees. The bioinformatics team, together with our outstanding computing infrastructure (9 petabytes of data storage and over 3000 computing cores), also positions the CNAG-CRG as a center of excellence in data analysis.
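The capacity figures above can be sanity-checked with a little arithmetic. The sketch below is illustrative only: the throughput and genome count come from the text, while the human genome size of roughly 3.1 Gbases is an assumption added here, and the function name is hypothetical.

```python
# Back-of-the-envelope check of the sequencing capacity quoted in the text.
GBASES_PER_DAY = 1000.0      # from the text: "over 1000 Gbases/day"
GENOMES_PER_DAY = 10         # from the text: "ten human genomes every 24 hours"
HUMAN_GENOME_GBASES = 3.1    # ASSUMPTION: approximate haploid human genome size


def implied_coverage(gbases_per_day: float, genomes_per_day: int,
                     genome_gbases: float) -> float:
    """Return the mean sequencing depth per genome implied by the figures."""
    gbases_per_genome = gbases_per_day / genomes_per_day
    return gbases_per_genome / genome_gbases


if __name__ == "__main__":
    depth = implied_coverage(GBASES_PER_DAY, GENOMES_PER_DAY, HUMAN_GENOME_GBASES)
    per_genome = GBASES_PER_DAY / GENOMES_PER_DAY
    print(f"{per_genome:.0f} Gbases per genome, roughly {depth:.0f}x coverage")
```

Under that assumption, "completely sequencing" a genome corresponds to about 100 Gbases of raw sequence per genome, which is in the range of the roughly 30x depth commonly used for whole-genome sequencing.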
The CNAG-CRG takes part in genome sequencing and analysis projects in areas as diverse as cancer genetics, rare disorders, host-pathogen interactions, the preservation of endangered species, evolutionary studies and the improvement of species of agricultural interest, in collaboration with scientists from universities, hospitals, research centers and companies in the sector of biotechnology and pharma.
The CRG and CNAG generate large amounts of biological data using state-of-the-art instrumentation, including next-generation sequencers, high-resolution microscopes and mass spectrometers. The data is used for cutting-edge research in the following fields:
As well as generating data for the institutes’ researchers, the following core facilities provide services to external stakeholders and collaborators:
Various types of sequencing services are offered:
Currently, CRG and CNAG have over 14 petabytes of storage in various systems at the institute. Data must be stored securely for long periods of time so that it can be re-analysed as new methods and techniques are developed. As well as pure research, the instruments at the institute are being applied in the clinical domain, which requires the encryption of data and compliance with regulatory requirements such as the GDPR.
Since CRG and CNAG provide services to, and collaborate with, a large community external to the institute, they also need a means of sharing data securely with these external users while giving them performant access.
Currently, CRG and CNAG data are stored in various disk- and tape-based systems. The institute maintains replicas in 2 data centres in Barcelona, 7 km apart. Keeping up with the data generation rate and managing the multitude of different storage systems present a challenge for the IT team. CRG and CNAG are looking at ways to consolidate and rationalise these systems to simplify management and improve cost-efficiency.
As well as needing to provide safe, secure and cost-effective storage, the institute is actively looking at ways to provide FAIR access to data and to give external collaborators secure and performant access. CRG and CNAG need a system that can cater to all of the following elements:
Given the complexity of these problems and the amount of work needed to move from the current systems to a new one, we estimate a timeline of around 2 years for development and implementation.
Most of the data is in standard NGS formats such as fastq and bam files. CRG and CNAG are increasingly dealing with large amounts of fast5 files generated by nanopore sequencing. There is also a large amount of imaging data in the form of tif, jpg, nd2 and raw files.
In total, the institute has over 1.5 billion files ranging in size from tens of kilobytes up to a few hundred gigabytes. Such large numbers of small files pose difficult technical challenges.
The data held by CRG and CNAG is currently growing at a rate of around 2 petabytes a year, but this is likely to increase substantially over the coming years. Various LIMS and database systems are in use for storing metadata associated with the data generation pipelines.
CRG and CNAG have carried out a detailed, audited analysis of storage costs so that both internal and external billing for cost recovery can be done accurately. Since the institute currently hosts its own data, it does not incur any transfer costs or API costs.
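The growth figures can be turned into a rough capacity projection. The following is a minimal sketch, not a sizing model: it takes the 14 petabytes and 2 petabytes a year quoted in the text, treats the 14 petabytes as the current data volume (an assumption; the text describes it as storage across various systems and does not say whether replicas are included), and uses a hypothetical `growth_increase` parameter to model a growth rate that itself rises year on year.

```python
def project_storage(current_pb: float, growth_pb_per_year: float, years: int,
                    growth_increase: float = 0.0) -> list[float]:
    """Return the projected total (in PB) at the end of each year.

    growth_increase is a HYPOTHETICAL fractional year-on-year increase in the
    growth rate itself (0.5 means each year adds 50% more than the last).
    """
    totals = []
    total = current_pb
    rate = growth_pb_per_year
    for _ in range(years):
        total += rate                    # data added during this year
        totals.append(total)
        rate *= 1.0 + growth_increase    # next year's growth may be larger
    return totals


if __name__ == "__main__":
    # Figures from the text: 14 PB today, about 2 PB/year, 2-year timeline.
    print("flat growth:        ", project_storage(14.0, 2.0, 2))
    # ASSUMPTION for illustration only: growth rate rises by 50% each year.
    print("accelerating growth:", project_storage(14.0, 2.0, 2, growth_increase=0.5))
```

With flat growth the total reaches about 18 petabytes by the end of a 2-year implementation period; any acceleration in the growth rate, which the text anticipates, pushes the requirement higher.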
The main tangible benefits would be
In terms of intangible benefits to the researchers and wider community of CRG and CNAG, the institute would expect storage to become less of a worry, allowing researchers to concentrate on the science instead of having to deal with the problems of ‘storage juggling’. Greater visibility of what is stored would also allow more collaboration and could create new links between groups that would not previously have been possible. Finally, access to a secure storage solution that complies with regulatory requirements would allow the institute to increase its cooperation with local hospitals and to take on more clinically significant projects.