Data compression

The lh5 package gives the user a lot of flexibility in choosing how to compress LGDOs, on disk or in memory, through traditional HDF5 filters or custom waveform compression algorithms.

[1]:
from __future__ import annotations

import lgdo
import numpy as np

import lh5

Let’s start by creating a dummy LGDO Table:

[2]:
data = lgdo.Table(
    size=1000,
    col_dict={
        "col1": lgdo.Array(np.arange(0, 100, 0.1)),
        "col2": lgdo.Array(np.random.default_rng().random(1000)),
    },
)
data
[2]:
Table(dict={'col1': Array([ 0. 0.1 ... 99.8 99.9], attrs={'datatype': 'array<1>{real}'}), 'col2': Array([0.862781 0.02697042 ... 0.13516798 0.96446903], attrs={'datatype': 'array<1>{real}'})}, attrs={'datatype': 'table{col1,col2}'})

and writing it to disk with default settings:

[3]:
lh5.write(data, "data", "data.lh5", wo_mode="of")
lh5.show("data.lh5")
/
└── data · table{col1,col2}
    ├── col1 · array<1>{real}
    └── col2 · array<1>{real}

Let’s inspect the data on disk:

[4]:
import h5py


def show_h5ds_opts(obj):
    with h5py.File("data.lh5") as f:
        print(obj)
        for attr in ["compression", "compression_opts", "shuffle", "chunks"]:
            print(">", attr, ":", f[obj].__getattribute__(attr))
        print("> size :", f[obj].id.get_storage_size(), "B")


show_h5ds_opts("data/col1")
data/col1
> compression : gzip
> compression_opts : 4
> shuffle : True
> chunks : (1000,)
> size : 494 B

Looks like the data is compressed with Gzip (compression level 4) by default! This default setting is stored in the global lh5.io.settings.DEFAULT_HDF5_SETTINGS variable:

[5]:
lh5.io.settings.DEFAULT_HDF5_SETTINGS
[5]:
{'shuffle': True, 'compression': 'gzip'}

This dictionary specifies the default keyword arguments forwarded to h5py.Group.create_dataset() and can be overridden by the user.

Important: do not import DEFAULT_HDF5_SETTINGS into your own namespace. Instead, import lh5.io.settings and modify lh5.io.settings.DEFAULT_HDF5_SETTINGS, otherwise your changes won’t have any effect.
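The reason is ordinary Python name binding rather than anything specific to lh5. The sketch below uses a hypothetical stand-in module (built with types.ModuleType, not the real lh5.io.settings) and assumes that the library looks up the module attribute every time it writes:

```python
import types

# Hypothetical stand-in for a settings module (NOT the real lh5.io.settings)
settings = types.ModuleType("settings")
settings.DEFAULT_HDF5_SETTINGS = {"shuffle": True, "compression": "gzip"}


def writer_defaults():
    """Mimic library code that reads the module attribute at call time."""
    return settings.DEFAULT_HDF5_SETTINGS


# equivalent of `from settings import DEFAULT_HDF5_SETTINGS`:
# a new local name bound to the same dictionary
DEFAULT_HDF5_SETTINGS = settings.DEFAULT_HDF5_SETTINGS

# rebinding the local name leaves the module attribute untouched
DEFAULT_HDF5_SETTINGS = {"compression": "lzf"}
seen_after_local_rebind = writer_defaults()

# rebinding through the module is what the "library" actually sees
settings.DEFAULT_HDF5_SETTINGS = {"compression": "lzf"}
seen_after_module_rebind = writer_defaults()

print(seen_after_local_rebind)
print(seen_after_module_rebind)
```

Only the second assignment changes what writer_defaults() returns, which is why the settings must be modified through the module.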

Examples:

[6]:
# use another built-in filter
lh5.io.settings.DEFAULT_HDF5_SETTINGS = {"compression": "lzf"}

# specify filter name and options
lh5.io.settings.DEFAULT_HDF5_SETTINGS = {"compression": "gzip", "compression_opts": 7}

# specify a registered filter provided by hdf5plugin
import hdf5plugin

lh5.io.settings.DEFAULT_HDF5_SETTINGS = {"compression": hdf5plugin.Blosc()}

# shuffle bytes before compressing (typically better compression ratio with no performance penalty)
lh5.io.settings.DEFAULT_HDF5_SETTINGS = {"shuffle": True, "compression": "lzf"}

Let’s now re-write the data with the updated default settings:

[7]:
lh5.write(data, "data", "data.lh5", wo_mode="of")
show_h5ds_opts("data/col1")
data/col1
> compression : lzf
> compression_opts : None
> shuffle : True
> chunks : (1000,)
> size : 675 B

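Nice. The data is now compressed with LZF after byte shuffling and takes 675 B on disk, far less than the 8000 B occupied by 1000 uncompressed 64-bit floats. LZF favors speed over compression ratio, hence the larger size compared to Gzip.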
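To get a feeling for what byte shuffling does, the following standard-library sketch mimics the idea outside of HDF5: byte 0 of every value is stored first, then byte 1 of every value, and so on, so that similar bytes end up next to each other. Here zlib stands in for the Gzip filter, and the helper names shuffle_bytes and unshuffle_bytes are made up for this illustration. The printed sizes will differ from those reported by HDF5.

```python
import struct
import zlib

ITEMSIZE = 8  # bytes per 64-bit float

# values like those in col1: 0.0, 0.1, ..., 99.9
values = [i * 0.1 for i in range(1000)]
raw = struct.pack(f"<{len(values)}d", *values)


def shuffle_bytes(buf, itemsize=ITEMSIZE):
    """Group the k-th byte of every item together (k = 0 .. itemsize-1)."""
    return b"".join(buf[k::itemsize] for k in range(itemsize))


def unshuffle_bytes(buf, itemsize=ITEMSIZE):
    """Invert shuffle_bytes(), restoring the original item layout."""
    n_items = len(buf) // itemsize
    return bytes(buf[k * n_items + i] for i in range(n_items) for k in range(itemsize))


shuffled = shuffle_bytes(raw)
size_plain = len(zlib.compress(raw, 4))
size_shuffled = len(zlib.compress(shuffled, 4))

print("uncompressed      :", len(raw), "B")
print("zlib, no shuffle  :", size_plain, "B")
print("zlib, with shuffle:", size_shuffled, "B")
```

Shuffling is lossless and cheap to invert, which is why it can be combined with any compression filter.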

To reset the HDF5 settings to default values:

[8]:
lh5.io.settings.DEFAULT_HDF5_SETTINGS = lh5.io.settings.default_hdf5_settings()

Last but not least, create_dataset() keyword arguments can be passed to write(). They will be forwarded as is, overriding default settings.

[9]:
lh5.write(data, "data", "data.lh5", wo_mode="of", shuffle=True, compression="gzip")
show_h5ds_opts("data/col1")
data/col1
> compression : gzip
> compression_opts : 4
> shuffle : True
> chunks : (1000,)
> size : 494 B

Object-specific compression settings are supported via the hdf5_settings LGDO attribute:

[10]:
data["col2"].attrs["hdf5_settings"] = {"compression": "gzip"}
lh5.write(data, "data", "data.lh5", wo_mode="of")

show_h5ds_opts("data/col1")
show_h5ds_opts("data/col2")
data/col1
> compression : gzip
> compression_opts : 4
> shuffle : True
> chunks : (1000,)
> size : 494 B
data/col2
> compression : gzip
> compression_opts : 4
> shuffle : True
> chunks : (1000,)
> size : 7032 B

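This mechanism allows storing table columns with different compression settings. In this example the setting chosen for col2 coincides with the default Gzip filter, so both columns report the same options.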

Note: since any h5py.Group.create_dataset() keyword argument can be used in write() or set in the hdf5_settings attribute, other HDF5 dataset settings can be configured, like the chunk size.

[11]:
lh5.write(data, "data", "data.lh5", wo_mode="of", chunks=2)
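HDF5 applies filters to each chunk independently, so the chunk shape affects how well the data compresses. A back-of-the-envelope sketch of what chunks=2 means for the 1000-row columns used here (plain arithmetic with no lh5 or h5py calls; 8 bytes per value assumes 64-bit floats):

```python
import math

n_rows = 1000    # rows in each column of the dummy table
itemsize = 8     # bytes per value, assuming 64-bit floats
chunk_rows = 2   # value passed as chunks=2

n_chunks = math.ceil(n_rows / chunk_rows)  # chunks needed to hold one column
chunk_bytes = chunk_rows * itemsize        # uncompressed size of a single chunk

print(n_chunks, "chunks of", chunk_bytes, "B each")
```

With only a handful of bytes per chunk, a compression filter has very little redundancy to exploit and the per-chunk bookkeeping grows, so such small chunks are rarely a good choice in practice.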

Waveform compression

Fast custom waveform compression routines are implemented in the lh5.compression subpackage.

Let’s try them out on some waveform test data:

[12]:
from legendtestdata import LegendTestData

ldata = LegendTestData()
wfs = lh5.read(
    "geds/raw/waveform",
    ldata.get_path("lh5/LDQTA_r117_20200110T105115Z_cal_geds_raw.lh5"),
)
wfs
[12]:
WaveformTable(dict={'t0': Array([0. 0. ... 0. 0.], attrs={'datatype': 'array<1>{real}', 'units': 'ns'}), 'dt': Array([16. 16. ... 16. 16.], attrs={'datatype': 'array<1>{real}', 'units': 'ns'}), 'values': ArrayOfEqualSizedArrays([[13712 13712 ... 15380 15400] [13072 13072 ... 18819 18806] ... [9962 9962 ... 13326 13269] [16918 16918 ... 18877 18933]], attrs={'datatype': 'array_of_equalsized_arrays<1,1>{real}'})}, attrs={'datatype': 'table{dt,t0,values}'})

Let’s encode the waveform values with the RadwareSigcompress codec.

Note: samples from these test waveforms must be shifted by -32768 for compatibility reasons, see lh5.compression.radware.encode().
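As a rough illustration of what such a shift achieves, the sketch below assumes that the codec works on signed 16-bit integers and that codec_shift is added to each sample before encoding and subtracted again after decoding. Both points are assumptions to be checked against the function documentation. The sample values are illustrative and the helper name fits_int16 is hypothetical.

```python
INT16_MIN, INT16_MAX = -32768, 32767
codec_shift = -32768

# illustrative unsigned 16-bit samples, some of them above INT16_MAX
samples = [0, 13712, 18819, 36233, 65535]


def fits_int16(x):
    """True if x is representable as a signed 16-bit integer."""
    return INT16_MIN <= x <= INT16_MAX


shifted = [s + codec_shift for s in samples]    # assumed pre-encoding step
restored = [s - codec_shift for s in shifted]   # assumed post-decoding step

print([fits_int16(s) for s in samples])
print([fits_int16(s) for s in shifted])
```

After the shift every value of the unsigned 16-bit range lands inside the signed 16-bit range, and undoing the shift recovers the original samples exactly.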

[13]:
from lh5.compression import RadwareSigcompress, encode

enc_values = encode(wfs.values, RadwareSigcompress(codec_shift=-32768))
enc_values
[13]:
ArrayOfEncodedEqualSizedArrays(encoded_data=VectorOfVectors(flattened_data=Array([0x15 0xd8 ... 0x8b 0xa4], attrs={'datatype': 'array<1>{real}'}), cumulative_length=_OffsetArrayView([6584 12528 ... 609112 615660], attrs={'datatype': 'array<1>{real}'}), attrs={'datatype': 'array<1>{array<1>{real}}'}), decoded_size=Scalar(value=5592, attrs={'datatype': 'real'}), attrs={'datatype': 'array_of_encoded_equalsized_arrays<1,1>{real}', 'codec': 'radware_sigcompress', 'codec_shift': -32768})

The output LGDO is an ArrayOfEncodedEqualSizedArrays, which is basically an array of bytes representing the compressed data. How big is this compressed object in bytes?

[14]:
enc_values.encoded_data.flattened_data.nda.nbytes
[14]:
615660

How big was the original data structure?

[15]:
wfs.values.nda.nbytes
[15]:
1118400

It shrank to about 55% of its original size!

Let’s now make a WaveformTable object wrapping these encoded values, instead of the uncompressed ones, and dump it to disk.

[16]:
enc_wfs = lgdo.WaveformTable(
    values=enc_values,
    t0=wfs.t0,
    dt=wfs.dt,
)
lh5.write(enc_wfs, "waveforms", "data.lh5", wo_mode="o")
lh5.show("data.lh5", attrs=True)
/
├── data · table{col1,col2}
│   ├── col1 · array<1>{real}
│   └── col2 · array<1>{real}
└── waveforms · table{dt,t0,values}
    ├── dt · array<1>{real} ── {'units': 'ns'}
    ├── t0 · array<1>{real} ── {'units': 'ns'}
    └── values · array_of_encoded_equalsized_arrays<1,1>{real} ── {'codec': 'radware_sigcompress', 'codec_shift': np.int64(-32768)}
        ├── decoded_size · real
        └── encoded_data · array<1>{array<1>{real}}
            ├── cumulative_length · array<1>{real}
            └── flattened_data · array<1>{real}

The LH5 structure is more complex now. Note how the compression settings are stored as HDF5 attributes.
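The encoded_data layout (flattened_data plus cumulative_length) can be pictured with a small pure-Python sketch. It assumes that cumulative_length[i] is the end offset of the i-th encoded waveform inside flattened_data, which is consistent with the values printed earlier (6584, 12528, ..., 615660, the last one being the total number of encoded bytes). The byte values and the helper name get_vector are made up.

```python
# toy stand-ins for the two datasets stored under encoded_data
flattened_data = bytes([0x15, 0xD8, 0x01, 0x02, 0x03, 0x8B, 0xA4])
cumulative_length = [2, 5, 7]  # end offset of each encoded waveform


def get_vector(i):
    """Return the bytes of the i-th variable-length vector."""
    start = cumulative_length[i - 1] if i > 0 else 0
    stop = cumulative_length[i]
    return flattened_data[start:stop]


vectors = [get_vector(i) for i in range(len(cumulative_length))]
lengths = [len(v) for v in vectors]

print(lengths)
```

Storing one flat byte array and one offset array keeps variable-length compressed waveforms in two regular one-dimensional datasets.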

Warning: HDF5 compression is never applied to waveforms compressed with these custom filters.

Let’s try to read the data back in memory:

[17]:
obj = lh5.read("waveforms", "data.lh5")
obj.values
[17]:
ArrayOfEqualSizedArrays([[13712 13712 13683 ... 15445 15380 15400]
                         [13072 13072 12992 ... 18842 18819 18806]
                         [13575 13575 13496 ... 18409 18384 18457]
                         ...
                         [15405 15405 15366 ... 36208 36214 36233]
                         [9962 9962 9949 ... 13269 13326 13269]
                         [16918 16918 16962 ... 18715 18877 18933]], attrs={'codec': 'radware_sigcompress', 'codec_shift': -32768, 'datatype': 'array_of_equalsized_arrays<1,1>{real}'})

Wait, this is not the compressed data we just wrote to disk: it got decompressed on the fly! It is still possible to return the compressed data, though:

[18]:
obj = lh5.read("waveforms", "data.lh5", decompress=False)
obj.values
[18]:
ArrayOfEncodedEqualSizedArrays(encoded_data=VectorOfVectors(flattened_data=Array([0x15 0xd8 ... 0x8b 0xa4], attrs={'datatype': 'array<1>{real}'}), cumulative_length=_OffsetArrayView([6584 12528 ... 609112 615660], attrs={'datatype': 'array<1>{real}'}), attrs={'datatype': 'array<1>{array<1>{real}}'}), decoded_size=Scalar(value=np.int64(5592), attrs={'datatype': 'real'}), attrs={'codec': 'radware_sigcompress', 'codec_shift': -32768, 'datatype': 'array_of_encoded_equalsized_arrays<1,1>{real}'})

And then decompress it manually:

[19]:
from lh5.compression import decode

decode(obj.values)
[19]:
ArrayOfEqualSizedArrays([[13712 13712 13683 ... 15445 15380 15400]
                         [13072 13072 12992 ... 18842 18819 18806]
                         [13575 13575 13496 ... 18409 18384 18457]
                         ...
                         [15405 15405 15366 ... 36208 36214 36233]
                         [9962 9962 9949 ... 13269 13326 13269]
                         [16918 16918 16962 ... 18715 18877 18933]], attrs={'codec': 'radware_sigcompress', 'codec_shift': -32768, 'datatype': 'array_of_equalsized_arrays<1,1>{real}'})

Waveform compression settings can also be specified at the LGDO level by attaching a compression attribute to the values attribute of a WaveformTable object:

[20]:
from lh5.compression import ULEB128ZigZagDiff

wfs.values.attrs["compression"] = ULEB128ZigZagDiff()
lh5.write(wfs, "waveforms", "data.lh5", wo_mode="of")

obj = lh5.read("waveforms", "data.lh5", decompress=False)
obj.values.attrs["codec"]
[20]:
'uleb128_zigzag_diff'
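The codec name hints at three classic steps: take differences between consecutive samples, map the signed differences to unsigned integers with ZigZag encoding, and store each result as a variable-length ULEB128 integer. The following pure-Python sketch illustrates that general technique. It is not the library's implementation, its byte layout is not guaranteed to match, and all function names are made up.

```python
def zigzag(n):
    """Map signed to unsigned integers: 0, -1, 1, -2, 2 -> 0, 1, 2, 3, 4."""
    return 2 * n if n >= 0 else -2 * n - 1


def unzigzag(z):
    """Invert zigzag()."""
    return z // 2 if z % 2 == 0 else -(z + 1) // 2


def uleb128(value):
    """Encode a non-negative integer in 7-bit groups, least significant first."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # continuation bit: more groups follow
        else:
            out.append(byte)
            return bytes(out)


def zzd_encode(samples):
    """Difference, ZigZag and ULEB128-encode a sequence of integers."""
    out, prev = bytearray(), 0
    for s in samples:
        out += uleb128(zigzag(s - prev))
        prev = s
    return bytes(out)


def zzd_decode(data):
    """Invert zzd_encode()."""
    samples, prev, value, shift = [], 0, 0, 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            prev += unzigzag(value)  # undo ZigZag, then the differencing
            samples.append(prev)
            value, shift = 0, 0
    return samples


wf = [13712, 13712, 13683, 15445, 15380, 15400]
encoded = zzd_encode(wf)
print(len(encoded), "B encoded vs", 2 * len(wf), "B as 16-bit integers")
```

Small sample-to-sample differences fit in a single byte, which is why slowly varying waveforms compress well with this kind of scheme.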

This page has been automatically generated by nbsphinx and can be run as a Jupyter notebook available in the legend-lh5io repository.