Handling LH5 data

LEGEND stores its data in HDF5 format, a high-performance data format becoming popular in experimental physics. LEGEND Data Objects (LGDO) are represented as HDF5 objects according to a custom specification, documented here.

Reading data from disk

Let’s start by downloading a small test LH5 file with the pylegendtestdata package (it takes a while depending on your internet connection):

[1]:
from __future__ import annotations

from legendtestdata import LegendTestData

ldata = LegendTestData()
lh5_file = ldata.get_path("lh5/LDQTA_r117_20200110T105115Z_cal_geds_raw.lh5")

We can use lh5.ls() [docs] to inspect the file contents:

[2]:
import lh5

lh5.ls(lh5_file)
[2]:
['geds']

This particular file contains an HDF5 group (they behave like directories). The second argument of ls() can be used to inspect a group (without the trailing /, only the group name is returned, if existing):

[3]:
lh5.ls(lh5_file, "geds/")  # returns ['geds/raw'], which is a group again
lh5.ls(lh5_file, "geds/raw/")
[3]:
['geds/raw/baseline',
 'geds/raw/channel',
 'geds/raw/energy',
 'geds/raw/ievt',
 'geds/raw/numtraces',
 'geds/raw/packet_id',
 'geds/raw/timestamp',
 'geds/raw/tracelist',
 'geds/raw/waveform',
 'geds/raw/wf_max',
 'geds/raw/wf_std']

Note: Alternatively to ls(), show() [docs] prints a nice representation of the LH5 file contents (with LGDO types) on screen:

[4]:
lh5.show(lh5_file)
/
└── geds · HDF5 group
    └── raw · table{packet_id,ievt,timestamp,numtraces,tracelist,baseline,energy,channel,wf_max,wf_std,waveform}
        ├── baseline · array<1>{real}
        ├── channel · array<1>{real}
        ├── energy · array<1>{real}
        ├── ievt · array<1>{real}
        ├── numtraces · array<1>{real}
        ├── packet_id · array<1>{real}
        ├── timestamp · array<1>{real}
        ├── tracelist · array<1>{array<1>{real}}
        │   ├── cumulative_length · array<1>{real}
        │   └── flattened_data · array<1>{real}
        ├── waveform · table{t0,dt,values}
        │   ├── dt · array<1>{real}
        │   ├── t0 · array<1>{real}
        │   └── values · array_of_equalsized_arrays<1,1>{real}
        ├── wf_max · array<1>{real}
        └── wf_std · array<1>{real}

The group contains several LGDOs. Let’s read them in memory. read() [docs] reads an LGDO from disk and returns the object in memory. Let’s try to read geds/raw:

[5]:
lh5.read("geds/raw", lh5_file)
[5]:
Table(dict={'packet_id': Array([1 2 ... 99 100], attrs={'datatype': 'array<1>{real}'}), 'ievt': Array([0 0 ... 3 32], attrs={'datatype': 'array<1>{real}'}), 'timestamp': Array([0.79465985 0.7968994 ... 0.974689 0.9786208 ], attrs={'datatype': 'array<1>{real}', 'units': 's'}), 'numtraces': Array([1 1 ... 1 1], attrs={'datatype': 'array<1>{real}'}), 'tracelist': VectorOfVectors(flattened_data=Array([53 60 ... 30 53], attrs={'datatype': 'array<1>{real}'}), cumulative_length=_OffsetArrayView([1 2 ... 99 100], attrs={'datatype': 'array<1>{real}'}), attrs={'datatype': 'array<1>{array<1>{real}}'}), 'baseline': Array([13722 13044 ... 9931 17013], attrs={'datatype': 'array<1>{real}'}), 'energy': Array([3304 8642 ... 6014 3410], attrs={'datatype': 'array<1>{real}'}), 'channel': Array([53 60 ... 30 53], attrs={'datatype': 'array<1>{real}'}), 'wf_max': Array([16352 20549 ... 14317 19567], attrs={'datatype': 'array<1>{real}'}), 'wf_std': Array([1028.1815 3084.8018 ... 1876.9403 1065.3331], attrs={'datatype': 'array<1>{real}'}), 'waveform': WaveformTable(dict={'t0': Array([0. 0. ... 0. 0.], attrs={'datatype': 'array<1>{real}', 'units': 'ns'}), 'dt': Array([16. 16. ... 16. 16.], attrs={'datatype': 'array<1>{real}', 'units': 'ns'}), 'values': ArrayOfEqualSizedArrays([[13712 13712 ... 15380 15400] [13072 13072 ... 18819 18806] ... [9962 9962 ... 13326 13269] [16918 16918 ... 18877 18933]], attrs={'datatype': 'array_of_equalsized_arrays<1,1>{real}'})}, attrs={'datatype': 'table{dt,t0,values}'})}, attrs={'datatype': 'table{baseline,channel,energy,ievt,numtraces,packet_id,timestamp,tracelist,waveform,wf_max,wf_std}'})

As shown by the type signature, it is interpreted as a Table with 100 rows. Its contents (or “columns”) can be therefore viewed as LGDO objects of the same length. For example timestamp:

[6]:
lh5.read("geds/raw/timestamp", lh5_file)
[6]:
Array([0.79465985 0.7968994  0.79960424 ... 0.97331905 0.974689
       0.9786208 ], attrs={'datatype': 'array<1>{real}', 'units': 's'})

is an LGDO Array with 100 elements.

read() also allows to perform more advanced data reading. For example, let’s read only rows from 15 to 25:

[7]:
obj = lh5.read("geds/raw/timestamp", lh5_file, start_row=15, n_rows=10)
print(obj)
[0.82679445 0.8307392  0.8298773  0.830739   0.8339691  0.83487684
 0.83510256 0.83612865 0.83797085 0.8406608 ] with attrs={'units': 's'}

Or, let’s read only columns timestamp and energy from the geds/raw table and rows [1, 3, 7, 9, 10, 15]:

[8]:
obj = lh5.read(
    "geds/raw", lh5_file, field_mask=("timestamp", "energy"), idx=[1, 3, 7, 9, 10, 15]
)
print(obj)
 timestamp  energy
  0.796899    8642
  0.799604   13015
  0.812317   22085
  0.813282   26636
  0.813520    2648
  0.826794    7799

with attrs['timestamp']={'units': 's'}

As you might have noticed, read() loads all the requested data in memory at once. This can be a problem when dealing with large datasets. LH5Iterator [docs] makes it possible to handle data one chunk at a time (sequentially) to avoid running out of memory:

[9]:
from lh5 import LH5Iterator

for lh5_obj in LH5Iterator(lh5_file, "geds/raw/energy", buffer_len=20):
    print(f"energy = {lh5_obj} ({len(lh5_obj)} rows)")
energy = [3304 8642 9177 ... 8289 7091 4084] (20 rows)
energy = [11546  2873  3193 ...  4114 29557 12309] (20 rows)
energy = [ 6455  3302  5314 ... 37333  4262 15131] (20 rows)
energy = [ 6117  3358  3132 ...  2949  3691 10402] (20 rows)
energy = [ 4088 41153 34295 ... 37877  6014  3410] (20 rows)

If working with many files at the same time, theLH5Store [docs] class might come handy:

[10]:
from lh5 import LH5Store

store = LH5Store(
    keep_open=True
)  # with keep_open=True, files are kept open inside the store
store.read("geds/raw", lh5_file)
[10]:
Table(dict={'packet_id': Array([1 2 ... 99 100], attrs={'datatype': 'array<1>{real}'}), 'ievt': Array([0 0 ... 3 32], attrs={'datatype': 'array<1>{real}'}), 'timestamp': Array([0.79465985 0.7968994 ... 0.974689 0.9786208 ], attrs={'datatype': 'array<1>{real}', 'units': 's'}), 'numtraces': Array([1 1 ... 1 1], attrs={'datatype': 'array<1>{real}'}), 'tracelist': VectorOfVectors(flattened_data=Array([53 60 ... 30 53], attrs={'datatype': 'array<1>{real}'}), cumulative_length=_OffsetArrayView([1 2 ... 99 100], attrs={'datatype': 'array<1>{real}'}), attrs={'datatype': 'array<1>{array<1>{real}}'}), 'baseline': Array([13722 13044 ... 9931 17013], attrs={'datatype': 'array<1>{real}'}), 'energy': Array([3304 8642 ... 6014 3410], attrs={'datatype': 'array<1>{real}'}), 'channel': Array([53 60 ... 30 53], attrs={'datatype': 'array<1>{real}'}), 'wf_max': Array([16352 20549 ... 14317 19567], attrs={'datatype': 'array<1>{real}'}), 'wf_std': Array([1028.1815 3084.8018 ... 1876.9403 1065.3331], attrs={'datatype': 'array<1>{real}'}), 'waveform': WaveformTable(dict={'t0': Array([0. 0. ... 0. 0.], attrs={'datatype': 'array<1>{real}', 'units': 'ns'}), 'dt': Array([16. 16. ... 16. 16.], attrs={'datatype': 'array<1>{real}', 'units': 'ns'}), 'values': ArrayOfEqualSizedArrays([[13712 13712 ... 15380 15400] [13072 13072 ... 18819 18806] ... [9962 9962 ... 13326 13269] [16918 16918 ... 18877 18933]], attrs={'datatype': 'array_of_equalsized_arrays<1,1>{real}'})}, attrs={'datatype': 'table{dt,t0,values}'})}, attrs={'datatype': 'table{baseline,channel,energy,ievt,numtraces,packet_id,timestamp,tracelist,waveform,wf_max,wf_std}'})

Have a look at the API reference for more documentation.

There are also some more complex LGDO objects, for example the Histogram. The LH5 data structure of a histogram is fixed and cannot be amended without losing its data type. It also cannot be partially read or streamed.

[11]:
histogram_file = ldata.get_path("lh5/lgdo-histograms.lh5")
histogram = lh5.read("test_histogram_range_w_attrs", histogram_file)
print(histogram)
{
 'axis_0': first=-5.0, last=5.0, step=0.5, closedleft=True with attrs={'units': 'm'},
 'axis_1': first=-5.0, last=5.0, step=0.5, closedleft=True with attrs={'units': 'm'},
}

But a convenient way to get all necessary details is available, encapsulating the complexity of the underlying structure that is stored in the file, for example (also see the [docs] for all available properties):

[12]:
histogram.binning[0].first, histogram.binning[0].last, histogram.isdensity
[12]:
(np.float64(-5.0), np.float64(5.0), np.False_)

Reading LH5 data to alternative formats

Each LGDO is equipped with a class method called view_as() [docs], which allows the user to “view” the data (i.e. avoiding copying data as much as possible) in a different, third-party format.

LGDOs generally support viewing as NumPy (np), Pandas (pd) or Awkward (ak) data structures, with some exceptions. We strongly recommend having a look at the view_as() API docs of each LGDO type for more details (for Table.view_as() [docs], for example).

Note: To obtain a copy of the data in the selected third-party format, the user can call the appropriate third-party copy method on the view (e.g. pandas.DataFrame.copy(), if viewing the data as a Pandas dataframe).

Let’s play around with our good old table, can we view it as a Pandas dataframe?

[13]:
obj = lh5.read("geds/raw", lh5_file)
df = obj.view_as("pd")
df
[13]:
packet_id ievt timestamp numtraces tracelist baseline energy channel wf_max wf_std waveform_t0 waveform_dt waveform_values
0 1 0 0.794660 1 [53] 13722 3304 53 16352 1028.181519 0.0 16.0 [13712 13712 13683 ... 15445 15380 15400]
1 2 0 0.796899 1 [60] 13044 8642 60 20549 3084.801758 0.0 16.0 [13072 13072 12992 ... 18842 18819 18806]
2 3 0 0.799604 1 [40] 13508 9177 40 20119 2849.593750 0.0 16.0 [13575 13575 13496 ... 18409 18384 18457]
3 4 0 0.799604 1 [41] 11891 13015 41 21215 4023.562256 0.0 16.0 [11862 11862 11808 ... 18946 18861 18834]
4 5 1 0.801617 1 [60] 14353 3794 60 17150 1183.935913 0.0 16.0 [14432 14432 14409 ... 16380 16403 16410]
... ... ... ... ... ... ... ... ... ... ... ... ... ...
95 96 30 0.965948 1 [53] 15597 5361 53 29471 947.814392 0.0 16.0 [27818 27818 27755 ... 26594 26562 26493]
96 97 44 0.971051 1 [60] 14599 3748 60 17324 1151.771362 0.0 16.0 [14471 14471 14515 ... 16583 16582 16609]
97 98 31 0.973319 1 [53] 15438 37877 53 42319 11780.090820 0.0 16.0 [15405 15405 15366 ... 36208 36214 36233]
98 99 3 0.974689 1 [30] 9931 6014 30 14317 1876.940308 0.0 16.0 [9962 9962 9949 ... 13269 13326 13269]
99 100 32 0.978621 1 [53] 17013 3410 53 19567 1065.333130 0.0 16.0 [16918 16918 16962 ... 18715 18877 18933]

100 rows × 13 columns

Yes! But how are the nested objects being handled?

Nested tables have been flattened by prefixing their column names with the table object name (obj.waveform.values becomes df.waveform_values) and multi-dimensional columns are represented by Awkward arrays:

[14]:
df.waveform_values
[14]:
0     [13712 13712 13683 ... 15445 15380 15400]
1     [13072 13072 12992 ... 18842 18819 18806]
2     [13575 13575 13496 ... 18409 18384 18457]
3     [11862 11862 11808 ... 18946 18861 18834]
4     [14432 14432 14409 ... 16380 16403 16410]
                        ...
95    [27818 27818 27755 ... 26594 26562 26493]
96    [14471 14471 14515 ... 16583 16582 16609]
97    [15405 15405 15366 ... 36208 36214 36233]
98       [9962 9962 9949 ... 13269 13326 13269]
99    [16918 16918 16962 ... 18715 18877 18933]
Name: waveform_values, Length: 100, dtype: awkward

But what if we wanted to have the waveform values as a NumPy array?

[15]:
obj.waveform.values.view_as("np")
[15]:
array([[13712, 13712, 13683, ..., 15445, 15380, 15400],
       [13072, 13072, 12992, ..., 18842, 18819, 18806],
       [13575, 13575, 13496, ..., 18409, 18384, 18457],
       ...,
       [15405, 15405, 15366, ..., 36208, 36214, 36233],
       [ 9962,  9962,  9949, ..., 13269, 13326, 13269],
       [16918, 16918, 16962, ..., 18715, 18877, 18933]],
      shape=(100, 5592), dtype=uint16)

Can we just view the full table as a huge Awkward array? Of course:

[16]:
obj.view_as("ak")
[16]:
[{packet_id: 1, ievt: 0, timestamp: 0.795, numtraces: 1, tracelist: [53], ...},
 {packet_id: 2, ievt: 0, timestamp: 0.797, numtraces: 1, tracelist: [60], ...},
 {packet_id: 3, ievt: 0, timestamp: 0.8, numtraces: 1, tracelist: [40], ...},
 {packet_id: 4, ievt: 0, timestamp: 0.8, numtraces: 1, tracelist: [41], ...},
 {packet_id: 5, ievt: 1, timestamp: 0.802, numtraces: 1, tracelist: [60], ...},
 {packet_id: 6, ievt: 2, timestamp: 0.807, numtraces: 1, tracelist: [60], ...},
 {packet_id: 7, ievt: 3, timestamp: 0.812, numtraces: 1, tracelist: [64], ...},
 {packet_id: 8, ievt: 0, timestamp: 0.812, numtraces: 1, tracelist: [47], ...},
 {packet_id: 9, ievt: 1, timestamp: 0.813, numtraces: 1, tracelist: [53], ...},
 {packet_id: 10, ievt: 4, timestamp: 0.813, numtraces: 1, tracelist: [60], ...},
 ...,
 {packet_id: 92, ievt: 42, timestamp: 0.96, numtraces: 1, tracelist: [60], ...},
 {packet_id: 93, ievt: 43, timestamp: 0.965, numtraces: 1, tracelist: ..., ...},
 {packet_id: 94, ievt: 28, timestamp: 0.963, numtraces: 1, tracelist: ..., ...},
 {packet_id: 95, ievt: 29, timestamp: 0.966, numtraces: 1, tracelist: ..., ...},
 {packet_id: 96, ievt: 30, timestamp: 0.966, numtraces: 1, tracelist: ..., ...},
 {packet_id: 97, ievt: 44, timestamp: 0.971, numtraces: 1, tracelist: ..., ...},
 {packet_id: 98, ievt: 31, timestamp: 0.973, numtraces: 1, tracelist: ..., ...},
 {packet_id: 99, ievt: 3, timestamp: 0.975, numtraces: 1, tracelist: [30], ...},
 {packet_id: 100, ievt: 32, timestamp: 0.979, numtraces: 1, ...}]
--------------------------------------------------------------------------------
backend: cpu
nbytes: 1.1 MB
type: 100 * {
    packet_id: uint32,
    ievt: int32,
    timestamp: float32,...

Note that viewing a VectorOfVector as an Awkward array is a nearly zero-copy operation and opens a new avenue of fast computational possibilities thanks to Awkward:

[17]:
import awkward as ak

# tracelist is a VoV on disk
trlist = obj.tracelist.view_as("ak")
ak.mean(trlist)
[17]:
np.float64(54.31)

Last but not least, we support attaching physical units (that might be stored in the units attribute of an LGDO) to data views through Pint, if the third-party format allows it:

[18]:
df = obj.view_as("pd", with_units=True)
df.timestamp.dtype
[18]:
pint[s][Float64]

Note that we also provide the read_as() [docs] shortcut to save some typing, for users that would like to read LH5 data on disk straight into some third-party format:

[19]:
lh5.read_as("geds/raw", lh5_file, "pd", with_units=True)
[19]:
packet_id ievt timestamp numtraces tracelist baseline energy channel wf_max wf_std waveform_t0 waveform_dt waveform_values
0 1 0 0.79466 1 [53] 13722 3304 53 16352 1028.181519 0.0 16.0 [13712 13712 13683 ... 15445 15380 15400]
1 2 0 0.796899 1 [60] 13044 8642 60 20549 3084.801758 0.0 16.0 [13072 13072 12992 ... 18842 18819 18806]
2 3 0 0.799604 1 [40] 13508 9177 40 20119 2849.593750 0.0 16.0 [13575 13575 13496 ... 18409 18384 18457]
3 4 0 0.799604 1 [41] 11891 13015 41 21215 4023.562256 0.0 16.0 [11862 11862 11808 ... 18946 18861 18834]
4 5 1 0.801617 1 [60] 14353 3794 60 17150 1183.935913 0.0 16.0 [14432 14432 14409 ... 16380 16403 16410]
... ... ... ... ... ... ... ... ... ... ... ... ... ...
95 96 30 0.965948 1 [53] 15597 5361 53 29471 947.814392 0.0 16.0 [27818 27818 27755 ... 26594 26562 26493]
96 97 44 0.971051 1 [60] 14599 3748 60 17324 1151.771362 0.0 16.0 [14471 14471 14515 ... 16583 16582 16609]
97 98 31 0.973319 1 [53] 15438 37877 53 42319 11780.090820 0.0 16.0 [15405 15405 15366 ... 36208 36214 36233]
98 99 3 0.974689 1 [30] 9931 6014 30 14317 1876.940308 0.0 16.0 [9962 9962 9949 ... 13269 13326 13269]
99 100 32 0.978621 1 [53] 17013 3410 53 19567 1065.333130 0.0 16.0 [16918 16918 16962 ... 18715 18877 18933]

100 rows × 13 columns

Some types also support other specialized python libraries. For example, the Histogram [docs] type allows us to easily show the data using the hist package:

[20]:
histogram.view_as("hist")
[20]:
-5 5 -5 5 Axis 0 Axis 1
Regular(20, -5, 5, underflow=False, overflow=False, label='Axis 0')
Regular(20, -5, 5, underflow=False, overflow=False, label='Axis 1')

Double() Σ=5000.0

Note: In this case, a true copy of the data is made. This is a limitation imposed by Hist’s library design.

Histograms are also a example of a LGDO type that does not support all of the usual types for its view_as function (i.e. pd or ak are unsupported).

Writing data to disk

Let’s start by creating some LGDOs:

[21]:
import numpy as np
from lgdo import Array, Scalar, WaveformTable

rng = np.random.default_rng(12345)

scalar = Scalar("made with legend-pydataobj!")
array = Array(rng.random(size=10))
wf_table = WaveformTable(values=rng.integers(low=1000, high=5000, size=(10, 1000)))

The write() [docs] function makes it possible to write LGDO objects on disk. Let’s start by writing scalar with name message in a file named my_data.lh5 in the current directory:

[22]:
lh5.write(scalar, name="message", lh5_file="my_objects.lh5", wo_mode="overwrite_file")

Let’s now inspect the file contents:

[23]:
lh5.show("my_objects.lh5")
/
└── message · string

The string object has been written at the root of the file /. Let’s now write also array and wf_table, this time in a HDF5 group called closet:

[24]:
lh5.write(array, name="numbers", group="closet", lh5_file="my_objects.lh5")
lh5.write(wf_table, name="waveforms", group="closet", lh5_file="my_objects.lh5")
lh5.show("my_objects.lh5")
/
├── closet · struct{numbers}
│   ├── numbers · array<1>{real}
│   └── waveforms · table{dt,t0,values}
│       ├── dt · array<1>{real}
│       ├── t0 · array<1>{real}
│       └── values · array_of_equalsized_arrays<1,1>{real}
└── message · string

Everything looks right!

Note: lh5.write() allows for more advanced usage, like writing only some rows of the input object or appending to existing array-like structures. Have a look at the [docs] for more information.


This page has been automatically generated by nbsphinx and can be run as a Jupyter notebook available in the legend-lh5io repository.