Awkward Array Operations
Description
Most particle physics analysis today is performed by physicists writing programs to traverse nested data structures. These one-time analysis programs suffer from several issues:
- data structures linked with pointers/references are non-sequential in memory;
- dynamic memory allocation (to make nested data) is not even possible on some devices, such as GPUs;
- branchy code (frequent if statements) makes poor use of Single Instruction Multiple Data (SIMD) devices;
- there’s a steep trade-off between interactive analysis in Python and fast execution in C++.
In other academic fields and in data science, these issues are avoided by expressing analysis logic in SQL or a suite of array operations in MATLAB or Numpy. Particle physics, however, relies crucially on variable-sized, nested data structures that don’t fit neatly into tables or arrays. Every proton collision at the LHC produces a different number of electrons, gluons, and quarks with complex interrelationships.
We have been developing extensions to array programming concepts for nested, heterogeneous, and cross-linked data in a library called awkward-array. This library follows the syntax of Numpy, but for complex structures:
>>> import awkward
>>> array = awkward.fromiter(
...     [[1.1, 2.2, None, 3.3, None],
...      [4.4, [5.5]],
...      [{"x": 6, "y": {"z": 7}}, None, {"x": 8, "y": {"z": 9}}]])
>>> (array + 100).tolist()
[[101.1, 102.2, None, 103.3, None],
 [104.4, [105.5]],
 [{'x': 106, 'y': {'z': 107}}, None, {'x': 108, 'y': {'z': 109}}]]
As in Numpy, a single expression performs a calculation across the whole dataset, which eases the trade-off between interactivity and performance. The data are stored contiguously by type (column-oriented), a layout that is fully portable to GPUs. The set of awkward-array operations is broader than what flat-array processing needs, and we are discovering new operations by translating traditional particle physics programs into array-centric scripts.
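To make "contiguous by type" concrete, here is a minimal sketch of one way a jagged list can be laid out as flat buffers. It uses only Numpy; the names `content`, `offsets`, and `to_list` are illustrative assumptions and are not taken from the awkward-array API.

```python
import numpy as np

# Hypothetical jagged data: three "events" holding 3, 0, and 2 values.
nested = [[1.1, 2.2, 3.3], [], [4.4, 5.5]]

# Column-oriented layout: every value lives in one contiguous buffer...
content = np.array([x for sub in nested for x in sub], dtype=np.float64)

# ...and a second buffer records where each sublist starts and stops.
counts = np.array([len(sub) for sub in nested], dtype=np.int64)
offsets = np.concatenate(([0], np.cumsum(counts)))


def to_list(content, offsets):
    """Rebuild nested Python lists from the flat layout (illustrative helper)."""
    return [content[offsets[i]:offsets[i + 1]].tolist()
            for i in range(len(offsets) - 1)]


# An elementwise operation only touches the flat buffer; offsets are reused.
shifted = content + 100
```

Because the nesting lives entirely in `offsets`, the arithmetic runs over a plain one-dimensional array, which is the property that makes this kind of layout suitable for SIMD units and GPUs.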
In this project, we would like you to create a library of precompiled awkward-array operations. Our current implementation of awkward-array is built from Numpy primitives, which is portable but not as efficient as dedicated, precompiled routines because each Numpy call makes a separate pass over memory, flushing the CPU cache. The project will focus on good software engineering principles to build a maintainable infrastructure. We don’t expect an optimized implementation of every operation by the end of the summer, just a clearly organized space to put new implementations when we need them.
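The difference between composing Numpy primitives and writing a dedicated routine can be sketched with a per-sublist sum over a flat `content` buffer and an `offsets` array. Both function names below are hypothetical, and neither is the library's actual implementation. The explicit loop is written in Python only to show the single-pass access pattern that a precompiled C++ routine would follow.

```python
import numpy as np


def jagged_sum_numpy(content, offsets):
    """Per-sublist sums composed from Numpy primitives.

    Each primitive is its own pass over memory and allocates a temporary.
    """
    # Pass over the whole content buffer to build a running total.
    running = np.concatenate(([0.0], np.cumsum(content)))
    # Further passes: gather the totals at sublist boundaries, then subtract.
    return running[offsets[1:]] - running[offsets[:-1]]


def jagged_sum_loop(content, offsets):
    """The same sums computed in one pass over the content buffer."""
    out = np.zeros(len(offsets) - 1, dtype=content.dtype)
    for i in range(len(offsets) - 1):
        total = 0.0
        for j in range(offsets[i], offsets[i + 1]):
            total += content[j]
        out[i] = total
    return out
```

The two functions should agree up to floating-point rounding, since the cumulative-sum version adds values in a different order than the direct loop.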
Task ideas
- Extend the Python package infrastructure to allow layered installation: Numpy-only, precompiled routines for CPUs, and perhaps precompiled routines for GPUs. We expect pybind11 would be a good glue between Python and C++ (and possibly CUDA).
- Develop a robust suite of unit tests to continuously verify equivalence of the implementations.
- Document the organization of the new libraries for future developers.
- Possibly incorporate a SIMD library, such as Vc, VecOps, or xsimd, for CPU vectorization.
- Possibly incorporate awkward-array operations into Pandas.
(Not all of the above are possible in one summer.)
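As a rough illustration of how the first two task ideas might fit together, the sketch below falls back to a Numpy implementation when no precompiled module can be imported, and checks that two implementations give identical results. The module name `awkward_cpu_kernels` and the functions `load_backend`, `counts`, and `check_equivalent` are assumptions made for this example; the actual package layout is part of what the project would design.

```python
import importlib

import numpy as np


def load_backend(preferred=("awkward_cpu_kernels",)):
    """Return the first importable precompiled module, or None for Numpy-only.

    The default module name is hypothetical.
    """
    for name in preferred:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    return None


def counts_numpy(offsets):
    """Reference implementation: sublist lengths from an offsets array."""
    return offsets[1:] - offsets[:-1]


def counts(offsets, backend=None):
    """Use the backend's `counts` if it provides one, else the Numpy version."""
    impl = getattr(backend, "counts", None) if backend is not None else None
    return impl(offsets) if impl is not None else counts_numpy(offsets)


def check_equivalent(reference, candidate, inputs):
    """Raise AssertionError if the two implementations ever disagree."""
    for args in inputs:
        np.testing.assert_array_equal(reference(*args), candidate(*args))
```

A test suite could call `check_equivalent` for every operation and every installed layer, so a new precompiled routine is accepted only if it reproduces the Numpy reference on the shared inputs.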
Expected results
By the end of the summer, we would like to see a well-established library structure. Even if the set of implemented operations is incomplete, it should be clear how the library will grow and be maintained.
Desirable Skills
- Good software engineering practices; well-organized coding.
- Rigorous testing and documentation.
- Fluency in modern C++ programming and familiarity with Python packaging. (Experience with CUDA/GPU programming is optional.)
- Interest in data structures and optimization. (Experience in code optimization is optional: we’re more interested in well-organized code than fast code.)
Mentors
Links
- awkward-array repository
- Introduction to array-centric analysis for physicists
- Presentation on array-centric analysis for physicists