Probabilistic circuit for lossless HEP data compression
Short description of the project
Neural data compression is an efficient way to reduce the cost and computational resources of data storage in many LHC experiments. However, it often cannot reconstruct the compressed data exactly, because most neural compression algorithms lose information during decompression. On the other hand, lossless neural compression schemes (VAE, IDF) have a lower compression ratio and are not fast enough for file I/O. The goal of this project is to overcome these disadvantages of neural compression algorithms by using probabilistic circuits for HEP data compression.
Task ideas
- Implement the probabilistic circuit using PyTorch
- Train the model on HEP data (Higgs data, TopQuark dataset) and use it to compress that data
- Measure the cost and quantify the optimal compression ratio of the probabilistic circuit
- Benchmark the model and compare the results with autoencoder (AE) and Transformer baselines
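As a rough illustration of the first and third tasks, the sketch below evaluates a very small probabilistic circuit (one sum node over two product nodes with categorical leaves) and estimates the compression ratio it could reach. It is written in plain Python rather than PyTorch so that it runs without dependencies, and all weights, leaf probabilities, and function names are made up for the example. It assumes an ideal entropy coder that spends about -log2 p(x) bits per record; no actual encoder or decoder is implemented here.

```python
import math

# Hypothetical toy circuit over two discrete variables (x0, x1), each in {0, 1, 2, 3}.
# Structure: one sum node (mixture) over two product nodes,
# each product node multiplying one categorical leaf per variable.
WEIGHTS = [0.6, 0.4]  # sum-node weights, chosen by hand, sum to 1
LEAVES = [
    # product node 0: categorical leaves for x0 and x1
    [[0.7, 0.1, 0.1, 0.1], [0.25, 0.25, 0.25, 0.25]],
    # product node 1: categorical leaves for x0 and x1
    [[0.1, 0.1, 0.1, 0.7], [0.1, 0.2, 0.3, 0.4]],
]


def circuit_prob(x):
    """Evaluate the circuit bottom-up: leaves, then product nodes, then the sum node."""
    total = 0.0
    for weight, leaves in zip(WEIGHTS, LEAVES):
        prod = 1.0
        for leaf, value in zip(leaves, x):
            prod *= leaf[value]  # leaf probability of the observed value
        total += weight * prod
    return total


def ideal_code_length_bits(data):
    """Bits an ideal entropy coder would need under this model: sum of -log2 p(x)."""
    return sum(-math.log2(circuit_prob(x)) for x in data)


def compression_ratio(data, bits_per_symbol=2):
    """Raw size (fixed-width encoding, 2 bits per 4-valued symbol) over ideal coded size."""
    raw_bits = len(data) * len(data[0]) * bits_per_symbol
    return raw_bits / ideal_code_length_bits(data)


if __name__ == "__main__":
    records = [(0, 0), (0, 1), (3, 3), (0, 2)]  # made-up sample, not HEP data
    print("ideal code length (bits):", ideal_code_length_bits(records))
    print("estimated compression ratio:", compression_ratio(records))
```

In a real implementation the weights and leaf parameters would be learned from the training data, and the circuit would be evaluated in log space on batched tensors for numerical stability and speed.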
Expected results
Improved compression performance, with documentation and figures of merit that may include:
- Implemented model of the probabilistic circuit
- Documentation of the benchmarks and compression experiments on HEP data
Requirements
Required: Good knowledge of UNIX, Python, matplotlib, PyTorch, Julia, pandas, ROOT.
Links
Previous work:
- ROOT
- Jupyter
- Lossless compression with probabilistic circuits
- iFlow: Numerically Invertible Flows for Efficient Lossless Compression via a Uniform Coder
- Integer Discrete Flows and Lossless Compression
Mentors
- Leonid Didukh
- Caterina Doglioni - CERN
Additional Information
- Difficulty level (low / medium / high): medium
- Duration: 350 hours
- Mentor availability: June-October (with a 3-week mentor vacation during which the student will work independently with minimal guidance)