Data compression
lh5 gives users considerable flexibility in how LGDOs are compressed, on disk or in memory, through standard HDF5 filters or custom waveform compression algorithms.
[1]:
from __future__ import annotations
import lgdo
import numpy as np
import lh5
Let’s start by creating a dummy LGDO Table:
[2]:
data = lgdo.Table(
    size=1000,
    col_dict={
        "col1": lgdo.Array(np.arange(0, 100, 0.1)),
        "col2": lgdo.Array(np.random.default_rng().random(1000)),
    },
)
data
[2]:
Table(dict={'col1': Array([ 0. 0.1 ... 99.8 99.9], attrs={'datatype': 'array<1>{real}'}), 'col2': Array([0.862781 0.02697042 ... 0.13516798 0.96446903], attrs={'datatype': 'array<1>{real}'})}, attrs={'datatype': 'table{col1,col2}'})
and writing it to disk with default settings:
[3]:
lh5.write(data, "data", "data.lh5", wo_mode="of")
lh5.show("data.lh5")
/
└── data · table{col1,col2}
├── col1 · array<1>{real}
└── col2 · array<1>{real}
Let’s inspect the data on disk:
[4]:
import h5py
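def show_h5ds_opts(obj):
    """Print the HDF5 filter settings and on-disk size of a dataset in data.lh5."""
    with h5py.File("data.lh5") as f:
        print(obj)
        for attr in ["compression", "compression_opts", "shuffle", "chunks"]:
            print(">", attr, ":", getattr(f[obj], attr))
        # storage size is the space actually allocated on disk, after filters
        print("> size :", f[obj].id.get_storage_size(), "B")

show_h5ds_opts("data/col1")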
data/col1
> compression : gzip
> compression_opts : 4
> shuffle : True
> chunks : (1000,)
> size : 494 B
Looks like the data is compressed with Gzip (compression level 4) by default! This default setting is stored in the global lh5.io.settings.DEFAULT_HDF5_SETTINGS variable:
[5]:
lh5.io.settings.DEFAULT_HDF5_SETTINGS
[5]:
{'shuffle': True, 'compression': 'gzip'}
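This dictionary specifies the default keyword arguments forwarded to h5py.Group.create_dataset() and can be overridden by the user. Since it does not set compression_opts, the level 4 reported above is h5py's own default for Gzip.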
Important: do not import DEFAULT_HDF5_SETTINGS into your own namespace. Instead, import lh5.io.settings and modify lh5.io.settings.DEFAULT_HDF5_SETTINGS. Otherwise, changes won’t have any effect.
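The reason behind this warning is ordinary Python name binding rather than anything specific to HDF5. The sketch below uses a hypothetical stand-in module built with types.ModuleType (it is not lh5.io.settings) and assumes, as the warning implies, that the library looks the variable up on the module each time it writes:

```python
import types

# Hypothetical stand-in for a settings module such as lh5.io.settings.
settings = types.ModuleType("settings")
settings.DEFAULT_HDF5_SETTINGS = {"shuffle": True, "compression": "gzip"}


def current_compression():
    """Mimic library code that reads the variable from the module at call time."""
    return settings.DEFAULT_HDF5_SETTINGS["compression"]


# Equivalent of `from settings import DEFAULT_HDF5_SETTINGS`:
# this creates a second, local name bound to the same dictionary.
DEFAULT_HDF5_SETTINGS = settings.DEFAULT_HDF5_SETTINGS

# Rebinding the local name leaves the module attribute untouched.
DEFAULT_HDF5_SETTINGS = {"compression": "lzf"}
after_local_rebind = current_compression()

# Rebinding the attribute on the module is what the "library" code sees.
settings.DEFAULT_HDF5_SETTINGS = {"compression": "lzf"}
after_module_rebind = current_compression()

print(after_local_rebind, after_module_rebind)
```

Only the second assignment changes what current_compression() returns, which is why the settings must be modified through the module.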
Examples:
[6]:
# use another built-in filter
lh5.io.settings.DEFAULT_HDF5_SETTINGS = {"compression": "lzf"}
# specify filter name and options
lh5.io.settings.DEFAULT_HDF5_SETTINGS = {"compression": "gzip", "compression_opts": 7}
# specify a registered filter provided by hdf5plugin
import hdf5plugin
lh5.io.settings.DEFAULT_HDF5_SETTINGS = {"compression": hdf5plugin.Blosc()}
# shuffle bytes before compressing (typically better compression ratio with no performance penalty)
lh5.io.settings.DEFAULT_HDF5_SETTINGS = {"shuffle": True, "compression": "lzf"}
Let’s now re-write the data with the updated default settings:
[7]:
lh5.write(data, "data", "data.lh5", wo_mode="of")
show_h5ds_opts("data/col1")
data/col1
> compression : lzf
> compression_opts : None
> shuffle : True
> chunks : (1000,)
> size : 675 B
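Nice. The data is now compressed with LZF, with byte shuffling still enabled. For this column LZF needs 675 B where the default Gzip needed 494 B: LZF is designed for speed rather than for the best compression ratio.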
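Why byte shuffling tends to help can be shown without HDF5 at all. The following is a standard-library toy, not the HDF5 shuffle filter itself: the helper names shuffle and unshuffle, the test data (the integers 0 to 999 stored as 64-bit values) and the use of zlib at level 4 are all choices made for this illustration.

```python
import struct
import zlib

ITEM_SIZE = 8  # bytes per 64-bit integer
values = list(range(1000))
raw = struct.pack(f"<{len(values)}q", *values)  # little-endian int64 buffer


def shuffle(buf: bytes, item_size: int) -> bytes:
    """Group byte 0 of every item, then byte 1 of every item, and so on."""
    return b"".join(buf[k::item_size] for k in range(item_size))


def unshuffle(buf: bytes, item_size: int) -> bytes:
    """Invert shuffle(): scatter each byte plane back into its item slots."""
    n_items = len(buf) // item_size
    out = bytearray(len(buf))
    for k in range(item_size):
        out[k::item_size] = buf[k * n_items:(k + 1) * n_items]
    return bytes(out)


plain_size = len(zlib.compress(raw, 4))
shuffled_size = len(zlib.compress(shuffle(raw, ITEM_SIZE), 4))
print(f"deflate only: {plain_size} B, shuffle + deflate: {shuffled_size} B")
```

After shuffling, the six always-zero high bytes of every value form one long run and the low bytes repeat with a period of 256, so the compressor finds much longer matches than in the interleaved layout. How large the gain is on real data depends on the data type and its value distribution.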
To reset the HDF5 settings to default values:
[8]:
lh5.io.settings.DEFAULT_HDF5_SETTINGS = lh5.io.settings.default_hdf5_settings()
Last but not least, create_dataset() keyword arguments can be passed to write(). They will be forwarded as is, overriding default settings.
[9]:
lh5.write(data, "data", "data.lh5", wo_mode="of", shuffle=True, compression="gzip")
show_h5ds_opts("data/col1")
data/col1
> compression : gzip
> compression_opts : 4
> shuffle : True
> chunks : (1000,)
> size : 494 B
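The override rule described above (keyword arguments win over the defaults) amounts to a dictionary merge. A minimal sketch, in which resolve_hdf5_settings is a hypothetical helper written for this page and not a function of the library:

```python
def resolve_hdf5_settings(defaults: dict, **write_kwargs) -> dict:
    """Hypothetical helper: keyword arguments take precedence over defaults."""
    # Later entries win in a dict merge; `defaults` itself is not modified.
    return {**defaults, **write_kwargs}


defaults = {"shuffle": True, "compression": "gzip"}
resolved = resolve_hdf5_settings(defaults, compression="lzf", chunks=(100,))
print(resolved)
```

Settings that are not mentioned in the call (here shuffle) keep their default value, while new keys such as chunks are simply added.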
Object-specific compression settings are supported via the hdf5_settings LGDO attribute:
[10]:
data["col2"].attrs["hdf5_settings"] = {"compression": "gzip"}
lh5.write(data, "data", "data.lh5", wo_mode="of")
show_h5ds_opts("data/col1")
show_h5ds_opts("data/col2")
data/col1
> compression : gzip
> compression_opts : 4
> shuffle : True
> chunks : (1000,)
> size : 494 B
data/col2
> compression : gzip
> compression_opts : 4
> shuffle : True
> chunks : (1000,)
> size : 7032 B
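The settings attached to col2 apply to that column only. In this example they coincide with the defaults used for col1, so both columns report the same filter. The much larger size of col2 comes from its content: random numbers compress poorly.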
Note: since any h5py.Group.create_dataset() keyword argument can be used in write() or set in the hdf5_settings attribute, other HDF5 dataset settings can be configured, like the chunk size.
[11]:
lh5.write(data, "data", "data.lh5", wo_mode="of", chunks=2)
Waveform compression
lh5 implements fast custom waveform compression routines in the lh5.compression subpackage.
Let’s try them out on some waveform test data:
[12]:
from legendtestdata import LegendTestData
ldata = LegendTestData()
wfs = lh5.read(
    "geds/raw/waveform",
    ldata.get_path("lh5/LDQTA_r117_20200110T105115Z_cal_geds_raw.lh5"),
)
wfs
[12]:
WaveformTable(dict={'t0': Array([0. 0. ... 0. 0.], attrs={'datatype': 'array<1>{real}', 'units': 'ns'}), 'dt': Array([16. 16. ... 16. 16.], attrs={'datatype': 'array<1>{real}', 'units': 'ns'}), 'values': ArrayOfEqualSizedArrays([[13712 13712 ... 15380 15400] [13072 13072 ... 18819 18806] ... [9962 9962 ... 13326 13269] [16918 16918 ... 18877 18933]], attrs={'datatype': 'array_of_equalsized_arrays<1,1>{real}'})}, attrs={'datatype': 'table{dt,t0,values}'})
Let’s encode the waveform values with the RadwareSigcompress codec.
Note: samples from these test waveforms must be shifted by -32768 for compatibility reasons, see lgdo.compression.radware.encode().
[13]:
from lh5.compression import RadwareSigcompress, encode
enc_values = encode(wfs.values, RadwareSigcompress(codec_shift=-32768))
enc_values
[13]:
ArrayOfEncodedEqualSizedArrays(encoded_data=VectorOfVectors(flattened_data=Array([0x15 0xd8 ... 0x8b 0xa4], attrs={'datatype': 'array<1>{real}'}), cumulative_length=_OffsetArrayView([6584 12528 ... 609112 615660], attrs={'datatype': 'array<1>{real}'}), attrs={'datatype': 'array<1>{array<1>{real}}'}), decoded_size=Scalar(value=5592, attrs={'datatype': 'real'}), attrs={'datatype': 'array_of_encoded_equalsized_arrays<1,1>{real}', 'codec': 'radware_sigcompress', 'codec_shift': -32768})
The output LGDO is an ArrayOfEncodedEqualSizedArrays, which is basically an array of bytes representing the compressed data. How big is this compressed object in bytes?
[14]:
enc_values.encoded_data.flattened_data.nda.nbytes
[14]:
615660
How big was the original data structure?
[15]:
wfs.values.nda.nbytes
[15]:
1118400
It shrank to about 55% of its original size!
Let’s now make a WaveformTable object wrapping these encoded values, instead of the uncompressed ones, and dump it to disk.
[16]:
enc_wfs = lgdo.WaveformTable(
    values=enc_values,
    t0=wfs.t0,
    dt=wfs.dt,
)
lh5.write(enc_wfs, "waveforms", "data.lh5", wo_mode="o")
lh5.show("data.lh5", attrs=True)
/
├── data · table{col1,col2}
│ ├── col1 · array<1>{real}
│ └── col2 · array<1>{real}
└── waveforms · table{dt,t0,values}
├── dt · array<1>{real} ── {'units': 'ns'}
├── t0 · array<1>{real} ── {'units': 'ns'}
└── values · array_of_encoded_equalsized_arrays<1,1>{real} ── {'codec': 'radware_sigcompress', 'codec_shift': np.int64(-32768)}
├── decoded_size · real
└── encoded_data · array<1>{array<1>{real}}
├── cumulative_length · array<1>{real}
└── flattened_data · array<1>{real}
The LH5 structure is more complex now. Note how the compression settings are stored as HDF5 attributes.
Warning: HDF5 compression is never applied to waveforms compressed with these custom filters.
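The encoded_data node in the tree above is a vector of vectors: flattened_data holds the compressed bytes of all waveforms back to back, and cumulative_length records where each waveform ends. The sketch below mimics that layout in plain Python with made-up byte strings. The function get_vector is a name chosen for this illustration and is not the library's accessor.

```python
from itertools import accumulate

# Toy stand-ins for three compressed waveforms of different lengths.
encoded_waveforms = [b"\x15\xd8\x01", b"\x02", b"\x8b\xa4\x10\x07"]

# All bytes concatenated, plus the running total of lengths.
flattened_data = b"".join(encoded_waveforms)
cumulative_length = list(accumulate(len(w) for w in encoded_waveforms))


def get_vector(i: int) -> bytes:
    """Return the i-th vector by slicing between two consecutive offsets."""
    start = cumulative_length[i - 1] if i > 0 else 0
    stop = cumulative_length[i]
    return flattened_data[start:stop]


print(cumulative_length)
print([get_vector(i) for i in range(len(cumulative_length))])
```

Because compressed waveforms generally have different lengths, this layout lets them be stored in a single contiguous dataset without padding.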
Let’s try to read the data back in memory:
[17]:
obj = lh5.read("waveforms", "data.lh5")
obj.values
[17]:
ArrayOfEqualSizedArrays([[13712 13712 13683 ... 15445 15380 15400]
[13072 13072 12992 ... 18842 18819 18806]
[13575 13575 13496 ... 18409 18384 18457]
...
[15405 15405 15366 ... 36208 36214 36233]
[9962 9962 9949 ... 13269 13326 13269]
[16918 16918 16962 ... 18715 18877 18933]], attrs={'codec': 'radware_sigcompress', 'codec_shift': -32768, 'datatype': 'array_of_equalsized_arrays<1,1>{real}'})
Wait, this is not the compressed data we just wrote to disk: it got decompressed on the fly! It is still possible to get the compressed data back as is, though:
[18]:
obj = lh5.read("waveforms", "data.lh5", decompress=False)
obj.values
[18]:
ArrayOfEncodedEqualSizedArrays(encoded_data=VectorOfVectors(flattened_data=Array([0x15 0xd8 ... 0x8b 0xa4], attrs={'datatype': 'array<1>{real}'}), cumulative_length=_OffsetArrayView([6584 12528 ... 609112 615660], attrs={'datatype': 'array<1>{real}'}), attrs={'datatype': 'array<1>{array<1>{real}}'}), decoded_size=Scalar(value=np.int64(5592), attrs={'datatype': 'real'}), attrs={'codec': 'radware_sigcompress', 'codec_shift': -32768, 'datatype': 'array_of_encoded_equalsized_arrays<1,1>{real}'})
And then decompress it manually:
[19]:
from lh5.compression import decode
decode(obj.values)
[19]:
ArrayOfEqualSizedArrays([[13712 13712 13683 ... 15445 15380 15400]
[13072 13072 12992 ... 18842 18819 18806]
[13575 13575 13496 ... 18409 18384 18457]
...
[15405 15405 15366 ... 36208 36214 36233]
[9962 9962 9949 ... 13269 13326 13269]
[16918 16918 16962 ... 18715 18877 18933]], attrs={'codec': 'radware_sigcompress', 'codec_shift': -32768, 'datatype': 'array_of_equalsized_arrays<1,1>{real}'})
Waveform compression settings can also be specified at the LGDO level by attaching a compression attribute to the values attribute of a WaveformTable object:
[20]:
from lh5.compression import ULEB128ZigZagDiff
wfs.values.attrs["compression"] = ULEB128ZigZagDiff()
lh5.write(wfs, "waveforms", "data.lh5", wo_mode="of")
obj = lh5.read("waveforms", "data.lh5", decompress=False)
obj.values.attrs["codec"]
[20]:
'uleb128_zigzag_diff'
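The codec name hints at three classic steps: take differences between consecutive samples, map the signed differences to unsigned integers with ZigZag, and store each one as a variable-length ULEB128 integer. The pure-Python sketch below illustrates that general idea only. It is not the implementation in lh5.compression, all function names are made up for this page, and the library's actual byte layout may differ.

```python
def zigzag(n: int) -> int:
    """Map signed to unsigned: 0, -1, 1, -2, 2 ... -> 0, 1, 2, 3, 4 ..."""
    return 2 * n if n >= 0 else -2 * n - 1


def unzigzag(z: int) -> int:
    """Inverse of zigzag()."""
    return (z >> 1) ^ -(z & 1)


def uleb128_encode(value: int) -> bytes:
    """Encode a non-negative integer, 7 bits per byte, low bits first."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_waveform(samples: list[int]) -> bytes:
    """Difference, ZigZag and ULEB128-encode a sequence of samples."""
    out = bytearray()
    previous = 0
    for sample in samples:
        out += uleb128_encode(zigzag(sample - previous))
        previous = sample
    return bytes(out)


def decode_waveform(data: bytes) -> list[int]:
    """Invert encode_waveform()."""
    samples, previous, value, shift = [], 0, 0, 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7  # continuation byte: keep accumulating bits
        else:
            previous += unzigzag(value)
            samples.append(previous)
            value, shift = 0, 0
    return samples


# A few sample values taken from the waveform printout earlier on this page.
samples = [13712, 13712, 13683, 15445, 15380, 15400]
encoded = encode_waveform(samples)
print(len(encoded), "bytes instead of", 2 * len(samples))
```

Small sample-to-sample differences fit in a single byte, so slowly varying waveforms compress well, while large jumps cost two or three bytes each.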
This page has been automatically generated by nbsphinx and can be run as a Jupyter notebook available in the legend-lh5io repository.