How CERN manages its data

By Frédéric Milliot | January 01, 2014

After three years of faithful service, and following the more than probable discovery of the Higgs boson, the LHC is beginning its first long shutdown. It is the ideal occasion to discover how, on a day-to-day basis, CERN meets the challenge of scientific Big Data…

This feature story also includes the following articles:
   • Data Models for the LHC
   • A Unique Laboratory for Scientific Big Data
   • 100 Petabytes to Maintain
   • CASTOR, CERN’s Storage Manager

As you read this, more than a thousand researchers all over the world are working live on data from CERN’s LHC (Large Hadron Collider). While the Centre is universally recognized for its frontier work in fundamental physics, it is also beginning to be recognized for its computing infrastructure. With usable experimental data production exceeding 30 petabytes a year, the problem is far from simple, all the more so because this wealth of knowledge must be shared without constraint of place or technology. Since CERN’s case is in many respects representative of the problems that academic and private organizations face in managing large volumes of information, it seemed appropriate to look into the matter. We now invite you on a guided tour of CERN’s data installations.

Big Science > Big Data

“The biggest challenge here in CERN’s IT department,” points out Frederic Hemmer, the engineer in charge, “is obviously the volume of the data, particularly in production.” The Centre must collect, analyze, store and protect this data so that it is available 24/7 to the research community. In such a context, the choice of an agile infrastructure (which we detail below) has proven sound: to date, CERN has never lost a file, for any reason. And the workload never lets up: “Even when the LHC is not operating,” Frederic Hemmer emphasizes, “IT does not stop. Our analyses are continuous, here at the Centre and all over the world.”

Concretely, CERN must anticipate the very particular, and often rather inventive, needs of users who conduct experiments whose technical requirements are generally unpredictable. This leads the IT teams to innovate day after day, under extreme constraints on results and with an operational budget that remains very… constant.

Fig.1 – In the production line of experimental data, a multilevel filter system eliminates more than 99% of the information collected. Only the remaining information will be stored for analysis. (CERN document)
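To see how a multilevel filter can discard more than 99% of what it receives, the following is a minimal sketch in Python. The number of levels and the fraction each level keeps are purely hypothetical values chosen for illustration; they are not CERN figures. The point is only that the fractions multiply, so a few moderately selective levels add up to a very severe overall cut.

```python
from math import prod

# Hypothetical per-level keep fractions (NOT CERN figures): each level
# passes on only a fraction of what it receives from the previous one.
KEEP_FRACTIONS = [0.10, 0.20, 0.25]


def surviving_fraction(keep_fractions):
    """Fraction of the collected data that survives every filter level."""
    return prod(keep_fractions)


def eliminated_percent(keep_fractions):
    """Percentage of the collected data discarded by the whole cascade."""
    return 100.0 * (1.0 - surviving_fraction(keep_fractions))


if __name__ == "__main__":
    print(f"kept:       {surviving_fraction(KEEP_FRACTIONS):.2%}")
    print(f"eliminated: {eliminated_percent(KEEP_FRACTIONS):.1f}%")
```

With these illustrative values, three levels that each keep between 10% and 25% of their input together eliminate well over 99% of the original data.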

Oracle everywhere

Free to choose the basic elements of its technology stack, CERN takes care to maintain a balance between performance, reliability and scalability. For the data, Oracle was chosen as the database as far back as 1982, and its use has since been extended to every aspect of the organization and operation of the Centre, including the systems that control the accelerators during experiments. According to the technicians responsible for these decisions, Oracle meets the prerequisites in terms of functionality, availability and sizing. They appreciate that the software infrastructure includes the tools needed to manage, protect and distribute the data.

On the storage side, NetApp technologies make up the bulk of the new facilities. For NetApp, such a showcase is not without technical challenges. For example, collisions of heavy ions (generally lead nuclei) are complex experiments whose data production rates are very difficult to estimate in advance but can reach 6 GB per second. Yet, still according to Frederic Hemmer, “data being the very reason for our existence, the mission of the IT teams is multidimensional. We must allow their almost immediate use, guarantee their immortality, manage equipment and software upgrades in a non-disruptive way and make sure that the infrastructure offers virtually infinite space…”
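To put the 6 GB per second peak in perspective, here is a rough back-of-the-envelope sketch in Python. It relates that peak to the 30 petabytes of usable data per year quoted in the introduction. It assumes decimal units (1 GB = 10⁹ bytes, 1 PB = 10¹⁵ bytes) and a 365-day year, and it compares two figures that do not necessarily describe the same stage of the data chain, so it should be read as an order of magnitude only.

```python
# Assumption: decimal units, as the article does not say otherwise.
GB = 10**9
TB = 10**12
PB = 10**15

PEAK_RATE = 6 * GB        # bytes per second, peak quoted for heavy-ion runs
ANNUAL_VOLUME = 30 * PB   # bytes per year, figure quoted in the introduction

SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY


def volume_per_hour(rate):
    """Bytes produced in one hour at a sustained rate (bytes/second)."""
    return rate * 3600


def days_to_reach(volume, rate):
    """Days of sustained output at `rate` needed to produce `volume`."""
    return volume / rate / SECONDS_PER_DAY


def average_rate(volume, seconds=SECONDS_PER_YEAR):
    """Average rate (bytes/second) that yields `volume` over `seconds`."""
    return volume / seconds


if __name__ == "__main__":
    print(f"one hour at peak: {volume_per_hour(PEAK_RATE) / TB:.1f} TB")
    print(f"days at peak for 30 PB: {days_to_reach(ANNUAL_VOLUME, PEAK_RATE):.1f}")
    print(f"average rate for 30 PB/year: {average_rate(ANNUAL_VOLUME) / GB:.2f} GB/s")
```

Under these assumptions, an hour at peak rate yields about 21.6 TB, and a little under two months of uninterrupted peak output would match the annual volume, which corresponds to an average of just under 1 GB per second over the year.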

The LHC depends on its DBMS

Such a statement is usually accepted without much thought, but here it must be taken literally: even the slightest problem in collecting or managing data means stopping the collider. Technically, the collider is managed through two databases. The first, ACCCON, stores the setup and control elements of the installation. To monitor the LHC permanently, the operators refine its configuration in real time through an array of control monitors grouped in a dedicated room. If the database is unavailable, even for a few seconds, the collider becomes uncontrollable. The experiment in progress must then be stopped, which means dumping the beams into enormous graphite blocks to disperse their energy and protect the ring. The extreme temperatures involved are liable to damage the magnets (each of which costs more than a million dollars), which would require repairs likely to make the LHC unavailable for months.

The second database, ACCLOG, is a registry of the inputs coming from the thousands of sensors that are also part of the LHC. It holds the long-term logs of the state of the magnets and the moving parts, particularly the collimators that protect the beams by eliminating scattered particles. With more than 4.3 trillion rows, this database is by far the largest (and the fastest growing) of all the Centre’s information systems. And of course, since it determines the calibration of the whole infrastructure, it is essential to keeping the LHC online.
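For readers who want a concrete picture of what a sensor logging registry looks like, the sketch below shows the general idea in Python. Everything in it is an assumption made for illustration: it uses an in-memory SQLite database as a stand-in for Oracle, and the table name `sensor_log`, its columns and the sensor identifiers are hypothetical and do not reflect the actual ACCLOG schema.

```python
import sqlite3


def open_log():
    """Create an in-memory log store (SQLite stand-in, hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE sensor_log (
               sensor_id TEXT NOT NULL,   -- hypothetical sensor identifier
               ts        REAL NOT NULL,   -- timestamp, seconds since epoch
               value     REAL NOT NULL,   -- measured value
               PRIMARY KEY (sensor_id, ts)
           )"""
    )
    return conn


def record(conn, sensor_id, ts, value):
    """Append one reading: one row per sensor per timestamp."""
    conn.execute(
        "INSERT INTO sensor_log VALUES (?, ?, ?)", (sensor_id, ts, value)
    )


def history(conn, sensor_id, start, end):
    """Readings of one sensor with start <= ts < end, oldest first."""
    cur = conn.execute(
        "SELECT ts, value FROM sensor_log "
        "WHERE sensor_id = ? AND ts >= ? AND ts < ? ORDER BY ts",
        (sensor_id, start, end),
    )
    return cur.fetchall()


if __name__ == "__main__":
    log = open_log()
    record(log, "magnet-001.temperature", 0.0, 1.90)   # made-up readings
    record(log, "magnet-001.temperature", 1.0, 1.91)
    record(log, "collimator-07.position", 0.5, 12.3)
    print(history(log, "magnet-001.temperature", 0.0, 2.0))
```

Because every reading of every sensor becomes a row, a store of this kind grows continuously for as long as the machine is monitored, which is consistent with ACCLOG being the fastest growing of the Centre’s databases.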


Numbers unlike anything else…

• 2,500 employees
• 10,000+ researchers and students
• 608 associate universities
• 113 nationalities on site

• 828 racks
• 11,728 servers
• 15,694 processors
• 64,238 cores
• 56,014 memory modules
• 158 TiB* memory capacity
• 64,109 drives
• 63,289 TiB* of raw disk capacity
• 3,749 RAID controllers
• 1,800 drive failures / year
• 160 tape drives
• 45,000 cartridges
• 56,000 cartridge slots
• 73,000 TiB* tape capacity

• 24 high-speed routers (640 Mbps – 2.4 Tbps)
• 350 Ethernet switches
• 2,000 10 Gbps ports

• 2.4556 MW (IT use)
• 120 MW (LHC use)

* 1 TiB (tebibyte) = 2⁴⁰ bytes, i.e. 1,099,511,627,776 bytes, or 1,024 gibibytes
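The figures above mix binary units (TiB) with the decimal petabytes used elsewhere in this article. The short Python sketch below converts the storage figures to decimal petabytes using the footnote's definition, and derives an annual drive failure rate from the two drive figures in the list. The failure rate is a simple ratio of the quoted numbers, not a statistic published by CERN.

```python
TIB = 2**40    # bytes in one tebibyte, per the footnote
PB = 10**15    # bytes in one (decimal) petabyte

# Figures taken from the list above.
RAW_DISK_TIB = 63_289
TAPE_TIB = 73_000
DRIVES = 64_109
DRIVE_FAILURES_PER_YEAR = 1_800


def tib_to_bytes(tib):
    """Convert tebibytes to bytes."""
    return tib * TIB


def tib_to_pb(tib):
    """Convert tebibytes to decimal petabytes."""
    return tib_to_bytes(tib) / PB


def annual_failure_percent(failures, drives):
    """Share of the drive population failing per year, in percent."""
    return 100.0 * failures / drives


if __name__ == "__main__":
    print(f"raw disk: {tib_to_pb(RAW_DISK_TIB):.1f} PB")
    print(f"tape:     {tib_to_pb(TAPE_TIB):.1f} PB")
    rate = annual_failure_percent(DRIVE_FAILURES_PER_YEAR, DRIVES)
    print(f"drive failures: {rate:.1f}% per year")
```

In decimal terms, the raw disk capacity comes to roughly 69.6 PB and the tape capacity to roughly 80.3 PB, while 1,800 failures among 64,109 drives amounts to about 2.8% of the drives per year, or close to five replacements a day.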