In recent years scientific projects and research laboratories have produced an ever-increasing amount of data, as the outcome of simulated experiments, streams coming from sensors, and results of refinement processes such as data mining and map-reduce. This enormous number of digital objects, often referred to as the data deluge, poses serious challenges, and they will become even more serious in the near future. Consider, in the field of high energy physics, an experiment like the Large Hadron Collider, which will generate 15 PB per year: a large amount, yet a small one compared with the expected outcome of the experiments of neuroscientists, who have set themselves the goal of creating a connectome, a complete map of the brain's neural circuitry. The size of a complete human brain neural map is expected to be on the order of a thousand exabytes.

This prospect has raised awareness of the importance of data management, because a large number of pieces of information without proper metadata, contextualization or persistent identifiers is useless: such data cannot be shared, combined with other data sets, or safely stored and recovered after many years. We have therefore decided to address the growing requests of the scientific communities by creating a project able to tackle these challenges, and the result is the new Data Repository service.
The CINECA Data Repository is thus a service to store and maintain scientific data sets, built so that a user can safely back up data and at the same time manage them through a variety of clients, such as a web browser, a graphical desktop and command line interfaces. The service is implemented with iRODS (integrated Rule-Oriented Data System, https://www.irods.org), which relies on plain filesystems as the back-end to store data, represented as collections of objects, and on databases for the metadata. The service's architecture has been carefully designed to scale to millions of files and petabytes of data, combining robustness and versatility, and to offer the scientific communities a complete set of features to manage the data life-cycle (a few illustrative client-side sketches follow the list):
Upload/Download: the system supports high-performance transfer protocols, such as GridFTP and the iRODS multi-threaded transfer mechanism, as well as widely interoperable ones like HTTP.
Metadata management: each object can be associated with specific metadata represented as triplets (name, value, unit), or simply tagged and commented. This can be done at any time, not only at first upload.
Preservation: long-term accessibility is guaranteed by a seamless archiving process, which moves collections of data from the on-line storage space to a tape-based off-line space and back, according to general or per-project policies.
Stage-in/stage-out: the service can move data sets requested as input for computations to the HPC machines' local storage space, commonly named “scratch”, and move the results back as soon as they are available.
Sharing: the capability to share single data objects or whole collections is implemented via a unix-like ownership model, which makes it possible to grant access to single users or groups. Moreover, a ticket-based approach is used to provide temporary access tokens with limited rights.
Searching: the data are indexed, and searches can be based on the objects' location or on the associated metadata.
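As a concrete illustration of the upload and metadata features, the following is a minimal sketch based on python-irodsclient, one of the client libraries that speak the iRODS protocol the service is built on; the host name, zone, credentials and paths are hypothetical placeholders, not the actual service endpoints. The same put/get calls also underlie the stage-in/stage-out workflow.

    # Minimal sketch: connect with username/password, upload a file and
    # attach a (name, value, unit) metadata triplet. All connection
    # details below are placeholders, not the real service endpoints.
    from irods.session import iRODSSession

    session = iRODSSession(host='repo.example.cineca.it', port=1247,
                           user='alice', password='secret', zone='CINECA')

    # Upload a local result file into the repository (download works
    # symmetrically through the same data_objects interface).
    session.data_objects.put('results.dat', '/CINECA/home/alice/results.dat')

    # Metadata can be attached at any time, not only at first upload.
    obj = session.data_objects.get('/CINECA/home/alice/results.dat')
    obj.metadata.add('temperature', '300', 'K')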
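Sharing works along the same lines; continuing with the session above, this sketch grants another user read access and issues a temporary read-only ticket. The user name and token string are again placeholders, and the ticket call assumes a python-irodsclient version that exposes ticket issuing.

    from irods.access import iRODSAccess
    from irods.ticket import Ticket

    # Unix-like ownership model: grant user 'bob' read access.
    session.permissions.set(
        iRODSAccess('read', '/CINECA/home/alice/results.dat', 'bob', 'CINECA'))

    # Ticket-based access: whoever supplies the token 'review-token'
    # gains temporary, read-only access to the object.
    Ticket(session, 'review-token').issue('read',
                                          '/CINECA/home/alice/results.dat')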
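Finally, metadata-based searching can be sketched with the same library's general query interface, which locates objects by their annotations rather than by their path:

    from irods.column import Criterion
    from irods.models import Collection, DataObject, DataObjectMeta

    # Find every object annotated with temperature = 300, wherever it lives.
    query = session.query(Collection.name, DataObject.name).filter(
        Criterion('=', DataObjectMeta.name, 'temperature'),
        Criterion('=', DataObjectMeta.value, '300'))
    for row in query:
        print(row[Collection.name] + '/' + row[DataObject.name])

    session.cleanup()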
The flexibility of such a system can be extended with custom solutions thanks to a rule engine embedded in the service, and it is complemented by support for different authentication mechanisms, such as username/password and certificates. The service will be made available in pre-production in the coming weeks for a preliminary set of scientific projects, and then moved to production at the beginning of next year (2013).
For further information about the service, please contact us at data-repository@cineca.it or hpc-service@cineca.it.