Awkward Array Operations

Description

Most particle physics analysis today is performed by physicists writing programs to traverse nested data structures. These one-time analysis programs suffer from several issues:

data structures linked with pointers/references are non-sequential in memory;
dynamic memory allocation (to make nested data) is not even possible on some devices, such as GPUs;
branchy code (frequent if statements) makes poor use of Single Instruction Multiple Data (SIMD) devices;
there’s a steep trade-off between interactive analysis in Python and fast execution in C++.

In other academic fields and in data science, these issues are avoided by expressing analysis logic in SQL or as a suite of array operations in MATLAB or Numpy. Particle physics, however, relies crucially on variable-sized, nested data structures that don’t fit neatly into tables or arrays. Every proton collision at the LHC produces a different number of electrons, gluons, and quarks with complex interrelationships.

We have been developing extensions to array programming concepts for nested, heterogeneous, and cross-linked data in a library called awkward-array. This library follows the syntax of Numpy, but for complex structures:

>>> import awkward

>>> array = awkward.fromiter(
[[1.1, 2.2, None, 3.3, None],
 [4.4, [5.5]],
 [{"x": 6, "y": {"z": 7}}, None, {"x": 8, "y": {"z": 9}}]])

>>> (array + 100).tolist()
[[101.1, 102.2, None, 103.3, None],
 [104.4, [105.5]],
 [{'x': 106, 'y': {'z': 107}}, None, {'x': 108, 'y': {'z': 109}}]]

As in Numpy, a single expression performs a calculation across a whole dataset, which alleviates the trade-off between interactivity and performance. The data are stored contiguously by type (column-oriented), a layout that is fully portable to GPUs. Our set of awkward-array operations is broader than what flat-array processing requires, and we are discovering new operations by translating traditional particle physics programs into array-centric scripts.
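The column-oriented layout mentioned above can be illustrated with a minimal sketch. It assumes one common way of storing variable-length lists: a flat `content` buffer holding every value, plus an `offsets` array marking where each sub-list starts and stops. The names `content`, `offsets`, and `to_lists` are illustrative only and are not taken from the awkward-array API.

```python
import numpy as np

# Three sub-lists of different lengths: [1.1, 2.2, 3.3], [], [4.4, 5.5]
content = np.array([1.1, 2.2, 3.3, 4.4, 5.5])  # all values, contiguous in memory
offsets = np.array([0, 3, 3, 5])               # sub-list i is content[offsets[i]:offsets[i + 1]]

def to_lists(content, offsets):
    """Rebuild nested Python lists from the flat buffers (hypothetical helper)."""
    return [content[start:stop].tolist()
            for start, stop in zip(offsets[:-1], offsets[1:])]

# An elementwise operation only touches the flat content buffer;
# the offsets that describe the nesting are reused unchanged.
shifted = content + 100

nested = to_lists(content, offsets)          # [[1.1, 2.2, 3.3], [], [4.4, 5.5]]
nested_shifted = to_lists(shifted, offsets)  # same structure, values shifted by 100
```

Because the arithmetic runs over one contiguous buffer, the same expression could in principle be handed to any backend that understands flat arrays, which is the sense in which such a layout is portable.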

In this project, we would like you to create a library of precompiled awkward-array operations. Our current implementation of awkward-array is built from Numpy primitives. This is portable but less efficient than dedicated, precompiled routines: each Numpy call makes a separate pass over memory, so data rarely stays in the CPU cache from one call to the next. The project will focus on good software engineering principles to build a maintainable infrastructure. We don’t expect an optimized implementation of every operation by the end of the summer, just a clearly organized space to put new implementations when we need them.
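To illustrate the difference between composing Numpy primitives and a dedicated routine, here is a hedged sketch of one operation: summing each sub-list of a jagged array stored as a flat `content` buffer plus `offsets`. Both functions and the storage layout are assumptions made for this example, not the library's actual code. The explicit loop is written in Python only to show the single-pass access pattern; it would be slow as written and stands in for what a compiled kernel might do.

```python
import numpy as np

content = np.array([1.1, 2.2, 3.3, 4.4, 5.5])
offsets = np.array([0, 3, 3, 5])  # sub-lists: [1.1, 2.2, 3.3], [], [4.4, 5.5]

def sum_numpy_primitives(content, offsets):
    """Per-sub-list sum built from Numpy calls; each call walks memory again."""
    running = np.cumsum(content)                 # one pass over content
    running = np.concatenate(([0.0], running))   # a copy of the intermediate
    stops = running[offsets[1:]]                 # gather
    starts = running[offsets[:-1]]               # gather
    return stops - starts                        # one more pass over the results

def sum_single_pass(content, offsets):
    """Same result from one walk over content, as a compiled kernel might do."""
    out = np.zeros(len(offsets) - 1, dtype=content.dtype)
    for i in range(len(offsets) - 1):
        total = 0.0
        for j in range(offsets[i], offsets[i + 1]):
            total += content[j]
        out[i] = total
    return out

# The two approaches agree up to floating-point rounding.
assert np.allclose(sum_numpy_primitives(content, offsets),
                   sum_single_pass(content, offsets))
```

The first version allocates several temporary arrays and reads them back, while the second reads each input value once and writes each output once.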

Task ideas

Extend the Python package infrastructure to allow layered installation: Numpy-only, precompiled routines for CPUs, and perhaps precompiled routines for GPUs. We expect pybind11 would be a good glue between Python and C++ (and possibly CUDA).
Develop a robust suite of unit tests to continuously verify equivalence of the implementations.
Document the organization of the new libraries for future developers.
Possibly incorporate a SIMD library, such as Vc, VecOps, or xsimd, for CPU vectorization.
Possibly incorporate awkward-array operations into Pandas.

(Not all of the above are possible in one summer.)
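As a rough sketch of how the layered installation and the equivalence tests from the list above could fit together, the following example registers several implementations of one small operation (the length of each sub-list) and checks them against the Numpy reference on random inputs. Everything here is an assumption for illustration: the module name `_awkward_cpu_kernels`, its `counts` function, and the helper names are hypothetical, and the loop version merely stands in for a compiled routine.

```python
import importlib
import numpy as np

def counts_numpy(offsets):
    """Reference implementation: length of each sub-list via Numpy."""
    offsets = np.asarray(offsets)
    return offsets[1:] - offsets[:-1]

def counts_loop(offsets):
    """Stand-in for a compiled kernel: same result from an explicit loop."""
    return np.array([offsets[i + 1] - offsets[i]
                     for i in range(len(offsets) - 1)], dtype=np.int64)

def load_backend(module_name="_awkward_cpu_kernels"):
    """Return the optional compiled module if installed, else None.

    The module name is hypothetical; a real package would choose its own.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None

# Numpy is always available; the other layers are added only if present.
implementations = {"numpy": counts_numpy, "loop": counts_loop}
backend = load_backend()
if backend is not None:
    implementations["cpu"] = backend.counts  # hypothetical compiled function

def check_equivalence(implementations, trials=100, seed=12345):
    """Compare every implementation with the Numpy reference on random input."""
    rng = np.random.RandomState(seed)
    for _ in range(trials):
        lengths = rng.randint(0, 5, size=rng.randint(0, 10))
        offsets = np.concatenate(([0], np.cumsum(lengths)))
        expected = implementations["numpy"](offsets)
        for name, function in implementations.items():
            assert np.array_equal(function(offsets), expected), name
    return trials

check_equivalence(implementations)
```

Keeping the Numpy version as the reference means a new compiled routine only has to be added to the registry to be covered by the same test.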

Expected results

By the end of the summer, we would like to see a well-established library structure. Even if the set of implemented operations is incomplete, it should be clear how the library will grow and be maintained.

Desirable Skills

Good software engineering practices; well-organized coding.
Rigorous testing and documentation.
Fluency in modern C++ programming and familiarity with Python packaging. (Experience with CUDA/GPU programming is optional.)
Interest in data structures and optimization. (Experience in code optimization is optional: we’re more interested in well-organized code than fast code.)

