Fast Merging of RNTuple Data Sets
Description
The ROOT Software Framework is the cornerstone of many software stacks used by High Energy Physics (HEP) experiments, at CERN and other prestigious laboratories. It provides components which are fundamental for the entire data processing chain, from particle collisions to final publications, including final user data analysis, including modern machine learning techniques.
The RNTuple classes provide ROOT’s new, experimental I/O subsystem for high-energy physics data. The RNTuple data layout is columnar and supports nested types (e.g., vectors of floats), conceptually similar to Apache Arrow or Apache Parquet.
A frequent operation is merging of data sets, in particular at the end of distributed data pipelines. For instance, users might use a cluster of nodes to filter a large data set. The input file is split for the individual nodes and each node produces a new RNTuple file containing a subset of its input part. These individual filter outputs should then be merged back into a single file. A naive approach would read each and every row from every file and write it out to a result file. Fast merging, in contrast, makes use of the low level RNTuple data format. Instead of parsing individual rows, data can be merged as a meta-data only operation where only headers and footers are processed but the actual data blocks are copied as is.
The project should, in a first step, develop a fast merge routine for ROOT RNTuple data sets. In a second step, this facility should be used by ROOT’s hadd program, a general merge utility for ROOT files
Task ideas
- Given two RNTuple data sets, implement schema comparison using the header information to determine whether the data sets can be merged
- Given two compatible RNTuple data sets, implement a merge routine that copies the data blocks and adjusts the footer
- In the ROOT hadd utility, add code to identify RNTuple data sets
- Use the RNTuple merger to handle RNTuple data sets in ROOT files processed with hadd
- Stretch goal: investigate fast merging on a copy-on-write file system, where the merged file could be created by cloning the data blocks instead of copying
Expected results
Working implementation of an RNTuple data set merger and additions to hadd in order to correctly handle RNTuples.
Evaluation Task
Interested students please contact Jakob (jblomer@cern.ch) for a warm-up and evaluation task.
Requirements
C++, low-level data storage and access programming