CERN is a heavy user of Ceph, with about 100PB of data across CephFS, object stores (used as a backend for S3), and block storage (mostly used for VMs). CVMFS (https://cernvm.cern.ch/fs/) is used to distribute the software stacks of the LHC experiments across the WLCG (Worldwide LHC Computing Grid), and is backed by S3 on Ceph for its storage needs. Physics data, however, is stored on EOS (https://eos.web.cern.ch), and CERN just recently crossed the 1EB mark of raw disk storage managed by EOS. EOS is also the storage solution for CERNBox (https://cernbox.web.cern.ch/), which holds user data. Data analyses use ROOT and read the data remotely from EOS with XRootD (https://github.com/xrootd/xrootd), as EOS is itself built on top of XRootD. XRootD is very efficient at reading data across the network compared to other solutions, and it is also used by experiments beyond high energy physics, for example by LSST in its clustered database, Qserv (https://qserv.lsst.io).
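To give an idea of what that remote access looks like from the analysis side, here is a minimal sketch of a ROOT macro reading a file over the XRootD protocol; the URL and the tree name "Events" are placeholders, not a real dataset:

    // read_remote.C -- sketch of reading from EOS over XRootD with ROOT.
    // The URL and tree name below are placeholders, not a real dataset.
    #include "TFile.h"
    #include "TTree.h"
    #include <memory>

    void read_remote() {
       // TFile::Open understands root:// URLs and fetches data via XRootD,
       // so only the byte ranges actually requested travel over the network.
       std::unique_ptr<TFile> f{TFile::Open("root://eospublic.cern.ch//eos/path/to/file.root")};
       if (!f || f->IsZombie()) return;
       if (auto *tree = f->Get<TTree>("Events")) tree->Print();
    }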
I think this is good advice overall. I wrote a CMake script that does most of the heavy lifting for XRootD (see https://news.ycombinator.com/item?id=39657703). The CI is then just a couple of lines: one to install the dependencies using the packaging tools, and another to call that script. So don't underestimate the convenience that packaging can give you when installing dependencies.
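For a rough idea, the job boils down to something like the lines below (a sketch, not the actual XRootD setup; the spec file and script names are placeholders):

    # Hypothetical CI steps; file names are placeholders.
    dnf builddep -y xrootd.spec   # install build dependencies via the packaging tools
    ctest -V -S ci.cmake          # configure, build, and test via the CMake script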
I was just taking a look and couldn't help but notice the switch statement in your operator[], which likely causes a lot of unnecessary bad speculation at runtime:
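The pattern is roughly the following (a sketch modeled after CLHEP's Hep3Vector, not the exact code):

    // Sketch of the pattern (modeled after CLHEP's Hep3Vector, not the exact code).
    class Vec3 {
       double dx, dy, dz;
    public:
       double operator[](int i) const {
          // Every access goes through a branch the CPU has to predict.
          switch (i) {
             case 0: return dx;
             case 1: return dy;
             case 2: return dz;
             default: return 0.0; // error handling omitted
          }
       }
    };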
Many people believe the C++ compiler will magically optimize the switch away, but in some cases, as in the CLHEP-style example above, it doesn't happen, so you end up with bad performance.
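A branch-free alternative is to keep the components in contiguous storage and index into it directly; again just a sketch:

    // Branch-free alternative (sketch): contiguous storage, direct indexing.
    #include <array>

    class Vec3 {
       std::array<double, 3> v{};
    public:
       double operator[](int i) const {
          // Compiles down to a single indexed load, nothing to mispredict.
          return v[i];
       }
    };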
It was a nice guest post on the website about Eclipse, but most people just use gdb. It is now possible to step through ROOT macros with gdb by exporting CLING_DEBUG=1. See https://indico.jlab.org/event/459/contributions/11563/
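The workflow looks roughly like this (a sketch; mymacro.C and the line number are placeholders):

    $ export CLING_DEBUG=1            # make cling emit debug info for JIT'ed macro code
    $ gdb --args root.exe -b -q mymacro.C
    (gdb) set breakpoint pending on   # the macro is only compiled at runtime
    (gdb) break mymacro.C:10          # hypothetical line inside the macro
    (gdb) run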
> In general, managed tools will give you stronger governance and access controls compared to open source solutions. For businesses dealing with sensitive data that requires a robust security model, commercial solutions may be worth investing in, as they can provide an added layer of reassurance and a stronger audit trail.
There are definitely open source solutions capable of managing vast amounts of data securely. The storage group at CERN develops EOS (a distributed filesystem based on the XRootD framework) and CERNBox, which puts a nice web interface on top of it. See https://github.com/xrootd/xrootd and https://github.com/cern-eos/eos for more information. See also https://techweekstorage.web.cern.ch, a recent event we held together with CS3 at CERN.
Not only that, open source and proprietary software both generally handle the common case well, because otherwise nobody would use them.
It's when you start doing something outside the norm that you notice a difference. Neither of them will be perfect when you're the first person trying to do something with the software, but for proprietary software that's game over, because you can't fix it yourself.
Your options are to use something off the shelf and end up with a brittle, janky setup, or to use open source and end up with a brittle, janky setup that is at least more customized to your workflows... It's a tradeoff, though: all the hosting and security work that comes with open source can be a huge time sink.
You don't actually have to do any of that work if you don't want to. Half the open source software companies have that as their business model -- you can take the code and do it yourself or you can buy a support contract and they do it for you. But then you can make your own modifications even if you're paying someone to handle the rest of it.