Archiving and preservation for research environments

OMC process FAQ

To stimulate an open dialogue with companies interested in the ARCHIVER project, all information given in answers to questions raised by potential suppliers during the Open Market Consultation process has been documented and published in this section.

 

Questions and Answers

 

Procurement / Legal questions

What will be the time period during which contractors have to commercially exploit the results of the PCP before ownership of the results transfers to the Buyers Group?

This time period has not been fixed yet. However, as an indication, in Helix Nebula Science Cloud (HNSciCloud), the last PCP project coordinated by CERN, the contractors were given 2 years to commercially exploit their results. The time period for ARCHIVER will be confirmed in the final draft of the Framework Agreement released with the Request for Tenders.

How will the actual price and virtual price be used to evaluate tenders?

It is likely that only the actual price will be taken into consideration for the evaluation of the bids. The final evaluation criteria and formula will be indicated in the Request for Tenders.

If firms reply to the Request for Tender in a Consortium, do the firms need to be from the same country?

No. Firms replying to the Request for Tender in a Consortium can be from the same country or from different countries. There is no restriction on where firms responding to the Request for Tender have to be located. However, it is a requirement of the PCP that the majority of the R&D activities, including the main researchers working on the contracts, be located in EU Member States or Horizon 2020 Associated Countries.

Any Results we develop would be significantly based on our Background IP. Considering that, how is it possible for you to use and possibly sublicense the Results?

The Buyers Group will require a sub-licensable licence to use the Results for the purposes of the Framework Agreement and for their own non-commercial use. This would include a sub-licensable right to use any Background IP or Sideground IP owned by the contractor that is necessary for the use of the Results for the aforementioned non-commercial purposes.
In addition, the Buyers Group may require a sub-licensable licence to exploit the Results.
This would include a sub-licensable right to exploit any Background IP or Sideground IP owned by the contractor that is necessary for the exploitation of the Results. However, any such licence would be granted under fair and reasonable conditions which may include financial compensation for the contractor. It is also intended that a one year “embargo” period would apply before the Buyers Group may require any such licence.
Please refer to the draft Framework Agreement section 7.3 for further details. In order to distinguish between Background IP and Results, the first deliverable you will have to provide under the PCP contracts is a declaration of your Background IP.

Is it the lead Contractor or every member of the Consortium that needs to fulfil the selection criteria? 

For some selection criteria, it will be sufficient that the Consortium as a whole meets the specified criteria (i.e. at least one member of the Consortium meets them). For others, each member of the Consortium will be required to meet the specified criteria. This will be clearly specified in the Request for Tender documents. We do not plan to impose selection criteria that must be met specifically by the lead Contractor.

Will the bids submitted to the future Request for Tenders be published?

No. The Request for Tenders is published openly, but the submitted bids are treated confidentially. The Request for Tenders will contain details about confidentiality and the circumstances in which the Buyers Group will use and share information in bids.

Contractors keep ownership of the results developed in the PCP project, but it was previously mentioned that open source solutions are favoured. How are these two things compatible?

The contractors keep ownership of the Results developed in the PCP. However, points will be attributed in the evaluation of the bid if the proposed services favour the use of open source licences. More information about the scoring of bids will be provided in the award criteria in the Request for Tenders.

Right now, in the draft tender documents on your website, the numbers are not provided (XXX) for the weighting of the Award Criteria. Will this information be provided?

Yes, the Buyers Group will agree on the numbers and publish them in the Request for Tenders released in October 2019.

What is the maximum number of tenderers per phase?

We anticipate working with a minimum of four tenderers in the design phase, three in the prototype phase and two in the pilot phase. We may contract with more if the budget is sufficient and a sufficient number of compliant tenders are received.

The requirements from the different deployment scenarios vary a lot. Can we bid for just one of the deployment scenarios?

No, the Tender requirements are the same for all Tenderers. All the selection criteria have to be met, as a minimum basic requirement, and the selection criteria cover the minimum level needed across all deployment scenarios. We would also emphasise that, while the precise needs across deployment scenarios differ, there are also many commonalities. We believe that future (post-PCP) market applications of your solutions, including for other customers, will need to address all of these commonalities to be commercially successful.

That said, we do expect variation in the solutions that are developed during the PCP and it is likely that different Tenderers will want to focus on different aspects of the overall R&D challenge. You will specify in your Tender the area or areas of the R&D challenge you propose to focus on. The award criteria will be used to attribute a quality score to your Tender based on this information. Scoring guidelines and weighting for each award criterion will be indicated in the Request for Tenders so that you can build your Tender accordingly.

Given that there is some variation across deployment scenarios, should our bid be based on a single service or a suite of services?

We do not intend to specify one or the other.

 

Can a country outside Europe participate in the ARCHIVER project?

Organisations from non-European countries can participate in the ARCHIVER project. Demand-side organisations are welcome to join the Early Adopters Programme by expressing interest using the online form before 22nd September. No geographic restrictions apply to organisations taking part in the Early Adopters Programme. Supply-side organisations can respond to the ARCHIVER tender on the condition that the majority of the R&D activities, including the principal researchers working on the PCP contract, are located in EU Member States or H2020 Associated Countries.

Questions related to the Deployment Scenarios (generic)

Where is the data from the deployment scenarios ingested from? Is the ingest done after data calibration/validation? Are there more scenarios?

Typically, raw data has to be calibrated and validated before ANY scientific process can take place. The goal is not simply to pour as many bits as possible into a bucket; the goal is to ingest those bits together with all the necessary associated information into a long-term OAIS preservation archive.
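As a minimal illustration only (not an ARCHIVER specification), "ingesting bits together with the associated information" can be pictured as packaging the data files with a fixity manifest and descriptive metadata into a simple Submission Information Package (SIP), the OAIS term for what a producer hands to the archive. All names below are invented for the sketch:

```python
import hashlib
import json
import tarfile
from pathlib import Path

def build_sip(data_files, metadata, sip_path):
    """Package data files plus descriptive metadata into a simple SIP.

    A fixity manifest (name, size, checksum) travels with the bits so the
    archive can verify them at ingest and during later scrubbing.
    """
    manifest = []
    for f in map(Path, data_files):
        # For very large files a streaming hash would be used instead.
        digest = hashlib.sha256(f.read_bytes()).hexdigest()
        manifest.append({"name": f.name, "sha256": digest,
                         "size": f.stat().st_size})
    with tarfile.open(sip_path, "w:gz") as tar:
        for f in map(Path, data_files):
            tar.add(f, arcname=f"data/{f.name}")
        # The associated information is stored alongside the bits.
        info = Path("sip_info.json")
        info.write_text(json.dumps({"manifest": manifest,
                                    "metadata": metadata}, indent=2))
        tar.add(info, arcname="sip_info.json")
        info.unlink()
```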

Is the replication of the data partial? Are there multiple different scenarios?

There are multiple different cases for data replication. For the majority of the deployment scenarios, the Buyers Group has several copies held within the same system but no external disaster recovery mechanism.

In which scenario(s) are we looking into full remote archive deployments?

All of the proposed deployment scenarios would benefit from full remote archive deployments.

What are the data access patterns requirements (how diverse and how complex for the deployments/use cases presented)?

The data access patterns vary drastically from one deployment scenario to another. There are cases where data are very rarely recalled, e.g. less than once a year, but there are also cases in which data might need to be accessed on a daily basis.

Is there a difference between ingest under OAIS and simple ingest?

The Buyers Group have agreed to follow the OAIS reference model as the “best way” of ensuring long-term preservation of data. “Simple ingest” suggests that some or most of the OAIS guidelines would be skipped. That is not the intention underlying the proposed project.

Should one of the R&D challenges be ensuring a link between the data and the research institute that created the data?

Yes, although different behaviours are expected in the different deployment scenarios.

The presentation in Geneva mentioned unstructured data. This has implications for managing personal data. Is it part of the project?

In some use cases, the data will include personal data, and one use case deals directly with this question. Handling personal data according to European legislation (GDPR) is a requirement of the project.

The USA has led the open data movement for a long time. Is there a model that will be followed in the project?

The EC is pushing for the open data movement, specifically for data produced by the public sector.

There are no archivist organisations in the ARCHIVER consortium. Why?

There is at least one archivist organisation interested in being an Early Adopter of the services resulting from ARCHIVER.

Are companies expected to do the risk evaluation in the planning poker based on their own experience or based on what is available on the market?

As the R&D challenges are complex and no single company can currently meet them all, we need to take into account not only individual experience but also the wider knowledge of the market and the current state of the art.

In the tender documents, it seems that the archive is not running on the Buyers Group's infrastructure. So how is the following atomic use case related: “As a Collaboration Data Manager, I can provide a transparent service (F and A from FAIR) to the user by deploying a federated storage environment between multiple research centre archives and commercial archives and by providing catalogues that contain data I own and data that is managed/produced and stored, so that data stored in different locations can be searched and downloaded via the archive in a seamless way, irrespective of where it is maintained/produced”?

There are different requirements foreseen in each of the deployment scenarios.

We can consider three types of relevant data for this project: structured scientific data, communications data and other supporting data. Is it just scientific data?

All types of data are in scope. More information about the data types is available in the deployment scenario slides from the OMC event in Stansted: https://www.archiver-project.eu/open-market-consultation-event-london-stansted-airport

Are banking organisations going to be early adopters?

ARCHIVER is not talking to the banking sector. Early adopters will be public organisations in the research domain.

Does ARCHIVER want to achieve long-term data preservation or only data preservation? This is important as it has implications for the format.

Some use cases are based on long-term data preservation, e.g. BaBar. Others, e.g. EMBL, are more focused on storage. Some use cases use custom research data formats (such as ROOT), while others use widespread formats (such as JPEG, TIFF, PDF). In the latter case, proper full-scale long-term data preservation handling, including format conversions, is more necessary than in the former. The award criteria will reflect the respective weights of the different elements of the R&D challenge.

N.B. format conversion of (HEP) scientific data ALSO requires changes in the software. We (CERN) have done it at the scale of one per mille of the current LHC data. At the time, it took considerable resources to perform: one year to plan and test and one year to execute. Whilst these numbers do not scale with data volume (thanks to advancements in technology), this is NOT a trivial operation!

Q&A related to the Deployment Scenarios for Astronomy (PIC):

What does scrubbing mean?

Data scrubbing is an error correction technique that uses a background task to periodically inspect storage for errors, then correct detected errors using redundant data. Data scrubbing reduces the likelihood that single correctable errors will accumulate, leading to reduced risks of uncorrectable errors. (Adapted from Wikipedia.)
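As an illustration only (not an ARCHIVER requirement), such a background scrubbing task can be sketched in a few lines of Python; the checksum store and the repair action are hypothetical:

```python
import hashlib
import time
from pathlib import Path

def scrub(archive_root, checksums, interval_seconds=86400):
    """Background scrubbing task: periodically re-hash every archived file
    and compare against the checksum recorded at ingest time.

    `checksums` maps a relative file path to its expected SHA-256 digest.
    """
    while True:
        for path, expected in checksums.items():
            actual = hashlib.sha256(
                Path(archive_root, path).read_bytes()).hexdigest()
            if actual != expected:
                # A real system would repair from a redundant copy here.
                print(f"CORRUPTION detected in {path}: {actual} != {expected}")
        time.sleep(interval_seconds)
```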

What is the role of LDAP in the context of PIC use cases?

In some deployment scenarios, the user authentication and user authorization for scientists working on a given project is centralized in a single existing LDAP server operated by the buyer. In such cases, this server will be made available through the network, using industry-standard secure methodologies, for binding as an AuthN/AuthZ provider to the supplier’s servers which provide data access. This binding may be direct, through a supplier-provided proxy, or through a supplier-provided credential translation service. The end result in all cases should be that users identify themselves through the existing, familiar mechanism in order to gain access to data for which they are authorized and which is stored in the supplier’s service.
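For illustration, a minimal direct binding against such a buyer-operated LDAP server might look like the following Python sketch using the ldap3 library; the server hostname and DN layout are invented placeholders, not ARCHIVER specifications:

```python
from ldap3 import Server, Connection, ALL

# Hypothetical buyer-operated LDAP endpoint and user DN layout.
LDAP_HOST = "ldap.example-buyer.org"
USER_DN_TEMPLATE = "uid={username},ou=people,dc=example-buyer,dc=org"

def authenticate(username, password):
    """Return True if the buyer's LDAP server accepts a simple bind with
    the user's credentials (authentication only; authorization would be
    derived from group attributes in the same directory)."""
    server = Server(LDAP_HOST, use_ssl=True, get_info=ALL)
    conn = Connection(server,
                      user=USER_DN_TEMPLATE.format(username=username),
                      password=password)
    ok = conn.bind()
    conn.unbind()
    return ok
```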

In PIC's use cases, are ACLs enforced at the file level, folder level or collection level?

For PIC’s use cases, it is sufficient to enforce Access Control at the folder level, through ACL or any other mechanism with similar functionality. A folder in this context is defined as a convenient way to refer to or interact with a group of files. For PIC’s use cases, collections are defined through metadata queries whose results are lists of files. The user will then attempt to access the files in the list, succeeding if allowed by the permission of the folder where the file is stored. An alternative, acceptable implementation would be to have file-level access control specification, but in this case the supplier would have to provide tools to easily set and modify access control specifications on lists of files resulting from metadata queries.
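A minimal sketch of this folder-level model, with a hypothetical ACL table and an example metadata-query result, could look as follows; it is illustrative only:

```python
from pathlib import PurePosixPath

# Hypothetical ACL table: folder path -> set of groups allowed to read it.
FOLDER_ACLS = {
    "/astro/survey-a": {"survey-a-members"},
    "/astro/public": {"anyone"},
}

def can_read(user_groups, file_path):
    """Folder-level access control: a user may read a file if one of
    their groups is authorised on the file's parent folder."""
    folder = str(PurePosixPath(file_path).parent)
    allowed = FOLDER_ACLS.get(folder, set())
    return "anyone" in allowed or bool(user_groups & allowed)

# A collection is the list of files returned by a metadata query; access
# then succeeds or fails per file, based on the containing folder.
results = ["/astro/survey-a/img001.fits", "/astro/public/catalog.csv"]
readable = [f for f in results if can_read({"survey-a-members"}, f)]
```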

Q&A related to the Deployment Scenarios for High Energy Physics (CERN): 

The CERN Open Data Portal deployment scenario is referring to the XRootD protocol. Can you provide more information?

The XRootD software framework is a fully generic suite for fast, low-latency and scalable data access, which can natively serve any kind of data organized as a hierarchical, file-system-like namespace based on the concept of a directory. More information on XRootD can be found on the relevant webpage: http://xrootd.org/. Please note that the XRootD protocol in the CERN Open Data deployment scenario is only necessary for the "live reuse" use case, and only in the data-recall direction. The data-upload direction can use any standard protocol such as HTTP. Moreover, in the basic "cloud archiving" use case, only the Service Managers will access the data on the Archive, and support for the XRootD protocol is not necessary.
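For illustration, recalling data over the XRootD protocol from Python can look like the sketch below, which assumes the official XRootD Python bindings are installed; the file path is a placeholder, not a real dataset:

```python
# Requires the XRootD Python bindings (e.g. pip install xrootd).
from XRootD import client

f = client.File()
# root:// URL: placeholder path on a hypothetical open-data endpoint.
status, _ = f.open("root://eospublic.cern.ch//eos/opendata/example.root")
if status.ok:
    status, data = f.read(offset=0, size=1024)  # first 1 KiB of the file
    print(len(data), "bytes read over the XRootD protocol")
    f.close()
```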

For BaBar, what’s the restart capability requirement?

It should be possible to restart the ingest process at a reasonable check-point. This might be implemented by restarting the ingest of the current file, directory or other reasonable unit.
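A minimal sketch of such a file-level check-point, assuming a simple JSON checkpoint file and a caller-supplied ingest function, might look like this:

```python
import json
from pathlib import Path

CHECKPOINT = Path("ingest_checkpoint.json")  # hypothetical location

def ingest_all(files, ingest_one):
    """Restartable ingest: record each successfully ingested file so a
    crashed run can resume from a check-point instead of from scratch."""
    done = set(json.loads(CHECKPOINT.read_text())) if CHECKPOINT.exists() else set()
    for f in files:
        if f in done:
            continue  # already ingested in a previous run
        ingest_one(f)
        done.add(f)
        CHECKPOINT.write_text(json.dumps(sorted(done)))
```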

Re BaBar: what are the requirements for the access of the data?

The current request is simply for an archive copy of the data, in response to the SLAC Directorate’s statement that the data can no longer be hosted at SLAC. Other copies of the data are likely to exist in the short term, and for short-term (re-)analysis it is these copies that are likely to be targeted (hosted at institutes that are members of the BaBar Collaboration and therefore have the necessary ancillary infrastructure).

Re BaBar: How can you compare data?

Physicists will compare their current work with previous analyses. This includes statistical comparisons with data from other experiments, e.g. an analysis from BaBar may be compared with an analysis from Belle II. (Some of the BaBar data is unique - further details in the original request for Tina Cartaro.) It is important to note that the “comparisons” do not expect bit-level agreement and are often eye-ball comparisons of histograms or other plots. This is the same technique as used to validate new software releases within on-going experiments.

Re BaBar: Is the file format such that 1 bit of corruption invalidates the whole file or just a subset?

The question is unclear; we are not sure whether it relates to failures of the fixity checks. HEP (High Energy Physics) has traditionally used file formats designed for unreliable tapes (e.g. Hydra, Zebra) that were resilient to errors: in case of problems, typically the current “event” would be skipped. As HEP moved away from tape towards disk as the primary medium for production and analysis, the I/O software has (probably) lost some of this capability. This highlights the tension between performance for on-going production and analysis and the lower performance needs of long-term re-usability.

Re BaBar: Please give more details about security for BaBar.

The security requirements are minimal, as the physics data do not include confidential or personal data.

CERN digital memory: Is it live data or historical data?

Live.

Personal information in CERN digital memory. Is there a need for GDPR content review as the data is moved into Archive?

CERN is not subject to GDPR but ensures the adoption of best practices for the processing operations of Personal Data governed by CERN Operational Circular 11 (OC11). This will need to be ensured by your solution.

These projects (BaBar and CERN Digital Memory) seem to not include any R&D.

We believe that proven production functionality in the PB range for complex data types, integrated in the EOSC context, ensuring the full preservation life cycle, etc., does not currently exist. If it does, please provide some references. ARCHIVER is an opportunity to demonstrate that functionality.

Is CernVM-FS a requirement? Is it acceptable if other software provides the same functionality?

For the CERN Open data "cloud archiving" use case, the Archive does not need to support CernVM-FS, since the data will be accessed only by the Service Managers. For the CERN Open Data "live reuse" use case, the Archive will have to support CernVM-FS in order to serve the virtual machines and the software necessary for running example open data analysis completely decoupled from the usual CERN infrastructure.

PLEASE NOTE that CVMFS and CernVM have been offered by EGI (European Grid Infrastructure) - and hence also in the EOSC context - for more than a decade. They are used to snapshot the software and the necessary associated environment and are widely used both within and outside HEP.

Q&A related to the Deployment Scenarios for Life Sciences (EMBL-EBI): 

What is the distribution pattern for the Life Sciences deployment scenario?

Our users come from pretty much everywhere; you can see a live map at https://www.ebi.ac.uk/web/livemap/. They are heavily concentrated in Europe, the USA and China, but also come from the southern hemisphere. Essentially anyone doing research into genomics, proteomics or related fields is very likely to download data from us at some point, if not regularly.

Is Food and Drug Administration (FDA) regulation relevant for the Life Sciences Deployment Scenarios?

No, though we have strict legal requirements for protecting some of our data - e.g. certain human genome sequences that must be accessed according to well-defined protocols.

In the following atomic use case: “As a user, I can deploy my own instances (development, testing & production) of the archive for multiple communities, e.g. on top of my own infrastructure, so that I can handle the diversity of different communities & use cases and don't have to trust on monolithic instance” - are we talking about software running on the local infrastructure or about an archive?

This means that we are not tied to using a particular platform for running an instance of an archive. I.e. I want the archive to be delivered as a platform/framework that can be deployed on any suitable hardware, much the same as I can install Kubernetes or OpenStack on a variety of base systems. This is for two reasons: 1) I want to avoid vendor lock-in, and 2) I want to be able to deploy an instance that a new community can play with in isolation, so they can experience it and familiarise themselves with it without having to commit.

How is access to the information in the EMBL use case managed?

A lot of the data is public. A lot is also confidential. See slide 11 of the EMBL presentation from the Stansted OMC event for more information. In general, the metadata associated with the bulk data is public. Management of the EMBL metadata is outside the scope of ARCHIVER.

Is the time for the embargo period on the EMBL data defined, or is it sometimes linked to an event (e.g. a study publication) without a fixed deadline?

Embargo periods will not necessarily be fixed in time. They could be (e.g. for 6 months from submission of the data), or they could be bound to external events (e.g. until I publish my thesis).
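To make the two cases concrete, a data model for embargoes would need to represent both forms; the Python sketch below is purely illustrative and not part of any EMBL specification:

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

@dataclass
class Embargo:
    """An embargo is either fixed in time or bound to an external event."""
    lift_date: Optional[date] = None      # e.g. submission date + 6 months
    lifting_event: Optional[str] = None   # e.g. "thesis-published"

    def is_lifted(self, today: date, occurred_events: set) -> bool:
        if self.lift_date is not None:
            return today >= self.lift_date
        return self.lifting_event in occurred_events

# Fixed-duration embargo: roughly 6 months from a submission date.
six_months = Embargo(lift_date=date(2020, 1, 15) + timedelta(days=182))
# Event-bound embargo with no fixed deadline.
until_thesis = Embargo(lifting_event="thesis-published")
```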

Looking after metadata goes hand in hand with data preservation. How can it be out of scope?

There are at least two forms of metadata: domain-specific and system-specific. The system-specific metadata will be things like the creation time of a file, its size, its checksum, path/URI and name. These are things the archive should know and manage for us. The domain-specific metadata is things like what data type it is (DNA sequence, protein sequence, medical image…), and how it was obtained (lab protocols, types of instruments and procedures, etc.).

We do not expect ARCHIVER to manage our domain-specific metadata. We already have portals which allow searching our metadata in the ways we need, and we do not intend to move away from them anytime soon. Our portals allow users to discover data and then retrieve them by giving them a URI, and it is there that the archive comes into play.
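As a purely illustrative sketch of this split, the system-specific metadata the archive would manage might be modelled as follows; the field names and values are assumptions, not EMBL's schema:

```python
from dataclasses import dataclass

@dataclass
class SystemMetadata:
    """Metadata the archive itself should know and manage."""
    uri: str         # path/URI within the archive
    name: str
    size_bytes: int
    sha256: str      # fixity checksum recorded at ingest
    created: str     # creation time, ISO 8601

# Domain-specific metadata (data type, lab protocols, instruments...)
# stays in EMBL's own portals; the archive only resolves the URI.
record = SystemMetadata(
    uri="archive://embl/sequences/run-42.fastq.gz",  # invented example
    name="run-42.fastq.gz",
    size_bytes=7_340_032,
    sha256="e3b0c44298fc1c14",  # truncated placeholder digest
    created="2019-09-01T12:00:00Z",
)
```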

Collaboration aspect: where is authorisation managed in a group of collaborators?

Authorisation will typically be at the level of the portal accessing the data. This will be done with standard protocols: SAML, OAuth, etc. We have our own identity providers, and it is up to any portal that guards the data to authenticate users against those providers.

Group membership will also typically be implemented in the portal itself. An external tool that could manage groups for us would be of interest, I don’t believe we currently have anything like that.

Are the EMBL applications you already have Linux-based or something else?

Authentication services are classic web-based APIs. The applications behind them are all Linux-based, or at least overwhelmingly so.

What are the access patterns to EMBL data like?

We do not have clear information about this; however, most of our data is still used. About a PB of data is downloaded per month, across some 20 billion requests. Because that traffic comes from all over the world, we can expect the patterns to be fairly flat, definitely not strongly peaked.

Are you planning to provide information about each buyer's access patterns, the types of data, the metadata in scope, etc.?

We are preparing technical summaries of each of the buyers’ requirements for publication.

Is EMBL performing a cleaning of the data?

A curation process already exists (on POSIX storage, before the data is moved to containers), so the cleaning process will not be part of the ARCHIVER project. Once archived, EMBL may provide a new version of the files at some later date, but this will be a new file, with new file-related metadata, and domain-specific metadata specifying its relationship to the previous version of the file. An example of such a file is the human reference genome. Every so often a new, updated version is released. The old one is deprecated, but not deleted: many ongoing analyses will still need it, and future analyses may need to come back to it to understand any discrepancies between studies over time.
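A sketch of how such version links might be recorded is shown below; the URIs and field names are invented for illustration (GRCh37/GRCh38 are real releases of the human reference genome):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ArchivedFileVersion:
    """A new curated release is a new file with new file-level metadata,
    plus a pointer linking it to the version it supersedes."""
    uri: str
    version: str
    supersedes: Optional[str] = None  # URI of the deprecated, never-deleted version

grch37 = ArchivedFileVersion(uri="archive://embl/ref/GRCh37.fa.gz",
                             version="GRCh37")
grch38 = ArchivedFileVersion(uri="archive://embl/ref/GRCh38.fa.gz",
                             version="GRCh38",
                             supersedes=grch37.uri)
```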

Does EMBL currently keep data in cloud or on premises?

On premises.

What if we propose to carry out the curation process for EMBL as part of our solution?

We would not rule this out, though we would be concerned about its financial impact. What EMBL is most interested in is the development of a storage and archive platform that is more financially attractive than our current in-house service.

Does EMBL encrypt data to manage access to it or for another reason?

Encryption ensures that the data can be stored elsewhere in a safe way. The encryption key is managed by a party outside of the organisation.

What are the expected functionalities and performance of the storage model?

We are not averse to the idea of tiering, but it is not clear to us that this would help, because we have a very long tail of access. We expect to use mostly warm storage.

What are your expectations beyond the infrastructure level (scale, ingest rate…)? Are you expecting the data to follow the FAIR principles?

FAIR will be implemented in metadata portals to allow data discovery. In addition, FAIR has to be embedded in the applications which for EMBL are outside the scope of ARCHIVER (in the context of the EMBL deployment scenarios).

Do you expect accounting for individual users?

User types will be grouped, and accounting will be done differently for the different group types.

Q&A related to the Deployment Scenarios for Photon-Neutron Sciences (DESY):

In the deployment scenarios from DESY, it is required that a user can manage/add new versions (states) of derived/added data to existing archives. Does this correspond to storing deltas?

Yes, but at the file level (not the bit level): the delta can contain all changed files and all new files.
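Assuming per-version manifests mapping file paths to checksums (an assumption for illustration, not a DESY requirement), a file-level delta can be computed as in the following sketch:

```python
def file_level_delta(old_manifest, new_manifest):
    """Compute a file-level (not bit-level) delta between two archive
    versions, given {path: checksum} manifests for each version."""
    changed = [p for p, c in new_manifest.items()
               if p in old_manifest and old_manifest[p] != c]
    added = [p for p in new_manifest if p not in old_manifest]
    return {"changed": changed, "added": added}

delta = file_level_delta(
    {"run1.dat": "aaa", "run2.dat": "bbb"},
    {"run1.dat": "aaa", "run2.dat": "ccc", "run3.dat": "ddd"},
)
# -> {'changed': ['run2.dat'], 'added': ['run3.dat']}
```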

Q&A related to GÉANT Connectivity:

What Quality of Service (QoS) guarantees does GÉANT have?

As GÉANT is not a commercial network operator, it does not provide guarantees or SLAs but Service Level Targets. GÉANT Service Level Targets can be found here. Information about the QoS can be found in the Monthly Service reports. However, the QoS also depends on the specific path or network segment. More information on the process to connect to GÉANT can be found in section 2.2.1 of the Draft Functional Specification.

We can deploy already onto AWS and Azure. There is connectivity from those providers to GÉANT, and an agreement allowing users of GÉANT to purchase from these providers at reduced costs. How does this fit together with ARCHIVER?

GÉANT has put in place the IaaS Framework, under which several cloud providers have been selected. Any country and NREN that is part of the framework can purchase via it. ARCHIVER will not buy resources from the GÉANT IaaS Framework for this project.

Are there cloud providers that peer with the GÉANT network but are not part of the IaaS framework?

Yes, such as AWS (only resellers are present in the GÉANT IaaS framework), Exoscale (not present in the GÉANT IaaS framework) and T-Systems (present in the framework, but for Germany only).

In HNSciCloud, there were some problems with connecting to the GÉANT network on the buyers’ side: the interconnect link was not able to carry IP traffic above 5 Gb/s. Will there be a dedicated circuit to address this in ARCHIVER?

Not initially. Capacity is reserved for R&I institutions (at the two ends of the GÉANT backbone). However, the assumption is that there is no need for reserved capacity in the GÉANT backbone, as there is virtually always free capacity available. Reserved capacity might be needed at the ends of the network path. Monitoring will be put in place by installing perfSONAR probes.

If a company peers with an NREN (e.g. GRNET), is that sufficient to be connected to the GÉANT network?

There are essentially three different ways of peering with GÉANT: connection via an NREN, direct peering with GÉANT, or connection at Internet Exchange (IX) locations. Please refer to the presentation on networking from the event in Stansted for more information: https://www.archiver-project.eu/open-market-consultation-event-london-stansted-airport



Any further questions can be sent to procurement.service@cern.ch using the form contained in Appendix C of the Request for Tenders, before 10 March 2020 at 16:00 (Europe/Zurich time zone).