IRIS-HEP - Uproot-Dask
Description
Uproot is a pure Python reader and writer of the ROOT file format, the primary format for High Energy Physics data. It converts ROOT files on disk to and from NumPy arrays, Awkward Arrays, and Pandas DataFrames for analysis.
Uproot is primarily “eager”: a request to read ROOT data into memory is executed immediately. The one exception is the function uproot.lazy. This function is currently supported only by Uproot’s Awkward Array backend, and it relies on the VirtualArray and PartitionedArray array types, which are being deprecated in favor of a new dask-awkward collection type. Dask is an industry-standard library for delayed and distributed computation in Python.
The goal of this project is to reimplement uproot.lazy (now uproot.dask) using the new dask.awkward collection. The new function should also support the NumPy and Pandas backends, leveraging the dask.array and dask.dataframe collection types in Dask.
This project is part of a larger effort to update Uproot 4 to use Awkward Array version 2, of which the migration to Dask is one part. uproot.dask will be a key feature of Uproot version 5.
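The difference between eager and lazy reading described above can be illustrated with a minimal sketch. It uses only the Python standard library; read_chunk, LazyChunk, and read_log are hypothetical names invented for this illustration and are not part of Uproot, Dask, or Awkward Array. A real implementation would hand this bookkeeping to a Dask collection rather than a hand-written class.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Records every read that actually happens, so laziness can be observed.
read_log: List[str] = []


def read_chunk(name: str, start: int, stop: int) -> List[int]:
    """Hypothetical stand-in for reading entries [start, stop) of one column.

    A real reader would do file I/O and decompression here.
    """
    read_log.append(f"{name}[{start}:{stop}]")
    return list(range(start, stop))


@dataclass
class LazyChunk:
    """Remembers how to produce a chunk without producing it."""

    func: Callable[..., List[int]]
    args: Tuple

    def compute(self) -> List[int]:
        # The read happens only when compute() is called.
        return self.func(*self.args)


# Eager: the data is read as soon as the function is called.
eager = read_chunk("pt", 0, 3)

# Lazy: three partitions are described, but nothing is read yet.
lazy_parts = [LazyChunk(read_chunk, ("pt", i, i + 3)) for i in (0, 3, 6)]
reads_before_compute = len(read_log)  # only the eager read so far

# Evaluating the partitions triggers the three deferred reads.
lazy_result = [x for part in lazy_parts for x in part.compute()]
reads_after_compute = len(read_log)
```

The point of the sketch is only that building lazy_parts costs no I/O; the reads are deferred until a result is requested, which is what allows a scheduler to decide where and when each partition is loaded.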
Project tasks
As written, these tasks are not sequential: implementation will likely be iterative, and your understanding should grow throughout the process.
- Understand the current implementation of uproot.lazy: how Uproot reads data from local and remote files, how it schedules I/O and interpretation tasks internally, and how it uses Awkward version 1 to build a lazy array.
- Understand the new dask.awkward collection type and how to substitute it for the deprecated VirtualArray and PartitionedArray node types.
- Understand the metadata available in ROOT files and possibly modernize the uproot3.numentries function to scan a large set of files for metadata.
- Implement the new uproot.dask for the NumPy backend (first; should be the easiest of the three).
- Implement it for the Awkward Array backend, using all of the above.
- Implement it for DataFrames.
- Test in a variety of cases: only constructing the DAG, synchronous and futures-based evaluation on one core, multithreading, multiprocessing, and distributed. All single-process tests should be added to Uproot’s continuous testing suite.
- Document the new function with inline docstrings, to populate the online reference, and add “how-to” explanations in the Getting started guide.
- Present and/or demo the new function, fully integrated into Dask (with the dashboard and other diagnostics).
- Collaborate with the larger Uproot 5 project.
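The testing task above asks that the same graph give the same answer under different execution modes. The following is a rough sketch of that kind of check, written with the standard library only: graph construction is kept separate from evaluation, and a synchronous run is compared with a thread-pool run. The names load_partition, build_graph, run_synchronous, and run_threaded are hypothetical, and the dictionary-of-tuples graph is only an assumption made for illustration; in the actual project, Dask's own schedulers would take the place of these helpers.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

Task = Tuple[Callable[[int, int], List[int]], int, int]


def load_partition(start: int, stop: int) -> List[int]:
    """Hypothetical stand-in for reading and interpreting one partition."""
    return [x * x for x in range(start, stop)]


def build_graph(n_entries: int, step: int) -> Dict[str, Task]:
    """Only construct the graph: no partition is loaded here."""
    return {
        f"part-{i}": (load_partition, start, min(start + step, n_entries))
        for i, start in enumerate(range(0, n_entries, step))
    }


def run_synchronous(graph: Dict[str, Task]) -> List[int]:
    """Evaluate every task in order, in the calling thread."""
    out: List[int] = []
    for func, start, stop in graph.values():
        out.extend(func(start, stop))
    return out


def run_threaded(graph: Dict[str, Task], workers: int = 4) -> List[int]:
    """Evaluate tasks in a thread pool, then concatenate in graph order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(func, start, stop)
                   for func, start, stop in graph.values()]
        out: List[int] = []
        for future in futures:
            out.extend(future.result())
    return out


graph = build_graph(n_entries=10, step=4)
sync_result = run_synchronous(graph)
threaded_result = run_threaded(graph)
```

Collecting the futures in submission order, rather than completion order, is what keeps the threaded result identical to the synchronous one; a test suite would assert that equality for each execution mode.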
Expected results
The new uproot.dask function should return a high-level Dask DAG node, suitable for further processing in dask.array, dask.dataframe, or the new dask.awkward. All modes of processing (synchronous, asynchronous, multithreaded, multiprocessing, and distributed) should work without unnecessary copying of data. This should be demonstrated with diagnostics.
Evaluation Task
Interested students should contact Jim (pivarski@princeton.edu) for an evaluation task.
Requirements
- Strong Python skills
- Familiarity with Dask
Mentors
Links
Additional Information
- Difficulty level (low / medium / high): high
- Duration: 350 hours
- Mentor availability: May-September (whole GSoC period)