This article is part of our feature story: How CERN manages its data
Research at CERN has implications that go far beyond basic physics. Remember 1989: it was a CERN researcher, Tim Berners-Lee, who invented the World Wide Web to enable the remote sharing of information between scientists. More recently, the harnessing of particle acceleration gave birth to the PET scanners found in a growing number of hospitals, as well as to cutting-edge nuclear-material detection equipment.
Today, given the volume of data produced by the Centre's various facilities, CERN also stands out as a pioneer in Big Data management. This article is not short of numbers, but if there is one that should not be overlooked, it is the 25 Petabytes of usable data generated last year alone – and that is after 99% of the information from the LHC detectors was discarded as not pertinent enough to store.
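Those two numbers together imply a striking raw volume: if the 25 PB kept represent only the 1% deemed pertinent, the detectors' raw output was roughly two orders of magnitude larger. A quick back-of-the-envelope check:

```python
# The 25 PB stored is the ~1% of detector output that survives filtering,
# so the raw volume before selection was about a hundred times larger.
kept_pb = 25
kept_fraction = 0.01            # 99% of detector information is discarded
raw_pb = kept_pb / kept_fraction
print(raw_pb)                   # prints 2500.0 (i.e. ~2.5 exabytes of raw output)
```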
CERN has opened its library of information to researchers at about 150 sites worldwide. Obviously, serving data on that scale requires the raw experimental data to be stored outside a relational DBMS. It resides in ROOT files, which are particularly well suited to scientific analysis thanks to their write-once consistency model, while persistence is managed by a custom framework called POOL (Pool of Persistent Objects for LHC). It is the metadata describing this treasury that are managed in an Oracle 11gR2 database, and therefore with transactional consistency.
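The division of labour described above – write-once data files on one side, transactionally consistent metadata on the other – can be sketched with a minimal catalogue. This is purely illustrative: SQLite stands in for the Oracle database, and the table, columns and file path are invented for the example.

```python
import sqlite3

# Hypothetical schema: a relational catalogue (SQLite standing in for
# Oracle 11gR2) tracks metadata about write-once raw-data files.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE file_catalog (
        file_id   INTEGER PRIMARY KEY,
        path      TEXT NOT NULL,     -- location of the (immutable) ROOT file
        run       INTEGER NOT NULL,  -- data-taking run number
        n_events  INTEGER NOT NULL,
        checksum  TEXT NOT NULL
    )
""")
conn.execute(
    "INSERT INTO file_catalog (path, run, n_events, checksum) VALUES (?, ?, ?, ?)",
    ("/data/run181/events_0001.root", 181, 250_000, "ad3f12"),
)
conn.commit()

# Transactional metadata query: which files belong to run 181?
rows = conn.execute(
    "SELECT path, n_events FROM file_catalog WHERE run = ?", (181,)
).fetchall()
print(rows)  # [('/data/run181/events_0001.root', 250000)]
```

The files themselves are never updated in place, so only the catalogue needs the full machinery of transactions.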
Typically, CERN's raw data is analyzed in batch mode. Everyone knows that this type of processing can quickly become time- and resource-consuming. That is why the Centre's data scientists are working on different ways to optimize not only access but also queries. These efforts are undertaken together with various big players (Intel, HP, Oracle…), within the framework of the OpenLab project (FP7). The specific goal is to speed up data partitioning so that it keeps pace with the growth in production volumes.
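The idea behind partitioning for batch processing can be shown in a few lines: split the records across workers so each batch job touches only its own slice. The partition count, record shape and modulo scheme below are assumptions for illustration, not CERN's actual scheme.

```python
# Hash-partitioning sketch: assign event records to one of N batch
# workers so partitions stay balanced as production volume grows.
N_PARTITIONS = 4

def partition_for(event_id: int, n: int = N_PARTITIONS) -> int:
    """Map an event ID to a partition by simple modulo hashing."""
    return event_id % n

events = range(1000)                          # stand-in for event records
buckets = {p: [] for p in range(N_PARTITIONS)}
for e in events:
    buckets[partition_for(e)].append(e)

sizes = {p: len(b) for p, b in buckets.items()}
print(sizes)  # {0: 250, 1: 250, 2: 250, 3: 250}
```

Because each partition can be processed independently, adding workers scales the batch throughput almost linearly.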
Among the paths studied, NoSQL technologies such as Hadoop and Amazon's Dynamo DBMS appear to be the most promising. Why? Firstly because they scale out particularly well, especially when distributed across a large number of nodes. Secondly because, as such, they free the team from managing the complexity of relational databases. Thirdly because they lend themselves perfectly to storage on commodity servers – standard equipment always being preferred by CERN's IT managers.
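Dynamo's scalability rests on consistent hashing: keys are placed on a ring of virtual nodes, so adding or removing a server moves only a small fraction of the keys rather than reshuffling everything. A minimal sketch of the technique (node names and virtual-node count are arbitrary):

```python
import bisect
import hashlib

def h(s: str) -> int:
    """Deterministic hash of a string onto the ring."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    """Consistent-hash ring with virtual nodes, Dynamo-style."""
    def __init__(self, nodes, vnodes=100):
        # Each physical node appears `vnodes` times on the ring.
        self.ring = sorted((h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self.keys = [k for k, _ in self.ring]

    def node_for(self, key: str) -> str:
        # First virtual node clockwise from the key's hash owns the key.
        i = bisect.bisect(self.keys, h(key)) % len(self.keys)
        return self.ring[i][1]

ring = Ring(["node-a", "node-b", "node-c"])
print(ring.node_for("event-42"))  # one of node-a / node-b / node-c
```

This is exactly the property that makes scaling out on commodity hardware cheap: capacity is added a node at a time, with minimal data movement.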
The work going on today focuses on developing procedures that follow the links within the Oracle databases to extract the pertinent information, process it in Hadoop (or as pure business intelligence, as businesses have long done) and then load it back into other Oracle databases. Returning to Oracle has several benefits, among them the robustness of the system, its rich functional capabilities, and the fact that it is universally known and mastered. As for the most commonly cited drawbacks, namely the processing and storage capacities required, CERN does not consider them insurmountable. Given the decreasing cost of specialized cloud storage, the gain in usability seems worth it.
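The round trip described above – extract from a relational store, aggregate in a map/reduce style, load the result back – can be sketched end to end. Again, SQLite stands in for Oracle, the in-process `Counter` stands in for a Hadoop job, and all table and column names are invented:

```python
import sqlite3
from collections import Counter

# 1. Extract: pull records out of the (stand-in) relational source.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE events (run INTEGER, detector TEXT)")
src.executemany("INSERT INTO events VALUES (?, ?)",
                [(181, "ATLAS"), (181, "CMS"), (182, "ATLAS")])

# 2. Process: "map" emits one count per record, "reduce" sums per run.
counts = Counter(run for (run, _) in src.execute("SELECT run, detector FROM events"))

# 3. Load: write the aggregated summary back into a relational target.
dst = sqlite3.connect(":memory:")
dst.execute("CREATE TABLE run_summary (run INTEGER PRIMARY KEY, n_events INTEGER)")
dst.executemany("INSERT INTO run_summary VALUES (?, ?)", counts.items())
print(dict(dst.execute("SELECT * FROM run_summary")))  # {181: 2, 182: 1}
```

The payoff of step 3 is exactly the one the article names: once results are back in the relational world, every existing Oracle tool and skill applies to them.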
Let's not forget either that the bulk of the LHC data analysis is carried out by a global network of more than 150 computing centres (the Worldwide LHC Computing Grid, WLCG), whose capacity keeps growing. Add to this a budget devoted to the use of external cloud services such as Amazon S3 and the strategy makes sense. We know that one of the major obstacles on the way to the next scale is the speed at which data can be transferred from one server to another. In this respect, research into saving time and extending in-situ processing functionalities will probably benefit the entire community.
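A back-of-the-envelope calculation shows why transfer speed, not storage, is the bottleneck. Assuming (for illustration only) a dedicated 10 Gbit/s link, moving a year's worth of retained data would take the better part of a year:

```python
# How long to move 25 PB over a single 10 Gbit/s link?
# (The link speed is an assumption chosen for illustration.)
petabytes = 25
bits = petabytes * 1e15 * 8          # total bits to transfer
link_bps = 10e9                      # 10 Gbit/s
seconds = bits / link_bps
days = seconds / 86_400              # seconds per day
print(f"{days:.0f} days")            # prints "231 days"
```

Numbers like this explain the interest in moving computation to the data (in-situ processing) rather than the other way around.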
© HPC Today 2020 - All rights reserved.