Probabilistic circuit for lossless HEP data compression

Short description of the project

Neural data compression is an efficient way to reduce the cost and computational resources of data storage in many LHC experiments. However, it struggles to reconstruct the compressed data precisely, as most neural compression algorithms are lossy and lose information during decompression. On the other hand, lossless neural compression schemes (VAE, IDF) achieve a lower compression ratio and are not fast enough for file IO. This project aims to overcome these disadvantages by using a probabilistic circuit for HEP data compression.

Task ideas

Implement the probabilistic circuit in PyTorch
Train the model and compress HEP data (Higgs data, TopQuark Dataset)
Measure the cost and quantify the optimal compression ratio of the probabilistic circuit
Benchmark the circuit and compare the results with autoencoder (AE) and Transformer baselines
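
As a rough illustration of the tasks above, the following sketch builds a toy probabilistic circuit (one sum node mixing two product nodes over categorical leaves) and estimates the compression ratio it could reach. It is written in plain Python rather than PyTorch to stay dependency-free, and it is not the project's implementation. All names (SUM_WEIGHTS, LEAVES, circuit_prob, ideal_code_length_bits, compression_ratio), the 4-bin discretization, and the parameter values are assumptions made for illustration only. The estimate relies on the standard result that an entropy coder driven by a model p needs about -log2 p(x) bits to encode x losslessly.

```python
import math

# Hypothetical toy circuit over two discretized variables (X0, X1),
# each taking one of 4 bin indices. All numbers are illustrative.
SUM_WEIGHTS = [0.6, 0.4]  # mixture weights of the root sum node
LEAVES = [
    # product node 0: one categorical leaf per variable
    [[0.7, 0.1, 0.1, 0.1], [0.1, 0.7, 0.1, 0.1]],
    # product node 1: one categorical leaf per variable
    [[0.1, 0.1, 0.1, 0.7], [0.1, 0.1, 0.7, 0.1]],
]


def circuit_prob(x):
    """Exact probability of the tuple of bin indices x under the circuit."""
    total = 0.0
    for weight, leaves in zip(SUM_WEIGHTS, LEAVES):
        prod = 1.0
        for var, value in enumerate(x):
            prod *= leaves[var][value]  # product node multiplies its leaves
        total += weight * prod          # sum node takes a weighted sum
    return total


def ideal_code_length_bits(data):
    """Ideal lossless code length of the dataset, in bits, under the circuit."""
    return sum(-math.log2(circuit_prob(x)) for x in data)


def compression_ratio(data, bits_per_value=2):
    """Raw size divided by ideal coded size (4 bins need 2 bits per value)."""
    raw_bits = len(data) * len(data[0]) * bits_per_value
    return raw_bits / ideal_code_length_bits(data)


# Made-up sample that roughly follows the circuit's distribution.
sample = [(0, 1), (0, 1), (3, 2), (0, 1)]
ratio = compression_ratio(sample)
print(f"estimated compression ratio: {ratio:.2f}")
```

The ratio is above 1 only when the data are well described by the circuit; in the actual project the parameters would be learned from the HEP datasets, and a real entropy coder adds a small overhead on top of the ideal code length.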

Expected results

Improved compression performance, with documentation and figures of merit that may include:

An implemented model of the probabilistic circuit
Documentation of the benchmark and of the HEP data compression experiments

Requirements

Required: Good knowledge of UNIX, Python, matplotlib, PyTorch, Julia, pandas, ROOT.

Mentors

Leonid Didukh
Caterina Doglioni - CERN

Additional Information

Difficulty level (low / medium / high): medium
Duration: 350 hours
Mentor availability: June-October (with 3 weeks of mentor vacation, during which the student will work independently with minimal guidance)

Corresponding Project

baler

Participating Organizations

CERN