Handling LH5 data¶

LEGEND stores its data in HDF5 format, a high-performance data format becoming popular in experimental physics. LEGEND Data Objects (LGDO) are represented as HDF5 objects according to a custom specification, documented here.

Reading data from disk¶

Let’s start by downloading a small test LH5 file with the pylegendtestdata package (it takes a while depending on your internet connection):

[1]:

from __future__ import annotations

from legendtestdata import LegendTestData

ldata = LegendTestData()
lh5_file = ldata.get_path("lh5/LDQTA_r117_20200110T105115Z_cal_geds_raw.lh5")

We can use lh5.ls() [docs] to inspect the file contents:

[2]:

import lh5

lh5.ls(lh5_file)

[2]:

['geds']

This particular file contains an HDF5 group (they behave like directories). The second argument of ls() can be used to inspect a group (without the trailing /, only the group name is returned, if existing):

[3]:

lh5.ls(lh5_file, "geds/")  # returns ['geds/raw'], which is a group again
lh5.ls(lh5_file, "geds/raw/")

[3]:

['geds/raw/baseline',
 'geds/raw/channel',
 'geds/raw/energy',
 'geds/raw/ievt',
 'geds/raw/numtraces',
 'geds/raw/packet_id',
 'geds/raw/timestamp',
 'geds/raw/tracelist',
 'geds/raw/waveform',
 'geds/raw/wf_max',
 'geds/raw/wf_std']

Note: Alternatively to ls(), show() [docs] prints a nice representation of the LH5 file contents (with LGDO types) on screen:

[4]:

lh5.show(lh5_file)

/
└── geds · HDF5 group
    └── raw · table{packet_id,ievt,timestamp,numtraces,tracelist,baseline,energy,channel,wf_max,wf_std,waveform}
        ├── baseline · array<1>{real}
        ├── channel · array<1>{real}
        ├── energy · array<1>{real}
        ├── ievt · array<1>{real}
        ├── numtraces · array<1>{real}
        ├── packet_id · array<1>{real}
        ├── timestamp · array<1>{real}
        ├── tracelist · array<1>{array<1>{real}}
        │   ├── cumulative_length · array<1>{real}
        │   └── flattened_data · array<1>{real}
        ├── waveform · table{t0,dt,values}
        │   ├── dt · array<1>{real}
        │   ├── t0 · array<1>{real}
        │   └── values · array_of_equalsized_arrays<1,1>{real}
        ├── wf_max · array<1>{real}
        └── wf_std · array<1>{real}

The group contains several LGDOs. Let’s read them in memory. read() [docs] reads an LGDO from disk and returns the object in memory. Let’s try to read geds/raw:

[5]:

lh5.read("geds/raw", lh5_file)

[5]:

Table(dict={'packet_id': Array([1 2 ... 99 100], attrs={'datatype': 'array<1>{real}'}), 'ievt': Array([0 0 ... 3 32], attrs={'datatype': 'array<1>{real}'}), 'timestamp': Array([0.79465985 0.7968994 ... 0.974689 0.9786208 ], attrs={'datatype': 'array<1>{real}', 'units': 's'}), 'numtraces': Array([1 1 ... 1 1], attrs={'datatype': 'array<1>{real}'}), 'tracelist': VectorOfVectors(flattened_data=Array([53 60 ... 30 53], attrs={'datatype': 'array<1>{real}'}), cumulative_length=_OffsetArrayView([1 2 ... 99 100], attrs={'datatype': 'array<1>{real}'}), attrs={'datatype': 'array<1>{array<1>{real}}'}), 'baseline': Array([13722 13044 ... 9931 17013], attrs={'datatype': 'array<1>{real}'}), 'energy': Array([3304 8642 ... 6014 3410], attrs={'datatype': 'array<1>{real}'}), 'channel': Array([53 60 ... 30 53], attrs={'datatype': 'array<1>{real}'}), 'wf_max': Array([16352 20549 ... 14317 19567], attrs={'datatype': 'array<1>{real}'}), 'wf_std': Array([1028.1815 3084.8018 ... 1876.9403 1065.3331], attrs={'datatype': 'array<1>{real}'}), 'waveform': WaveformTable(dict={'t0': Array([0. 0. ... 0. 0.], attrs={'datatype': 'array<1>{real}', 'units': 'ns'}), 'dt': Array([16. 16. ... 16. 16.], attrs={'datatype': 'array<1>{real}', 'units': 'ns'}), 'values': ArrayOfEqualSizedArrays([[13712 13712 ... 15380 15400] [13072 13072 ... 18819 18806] ... [9962 9962 ... 13326 13269] [16918 16918 ... 18877 18933]], attrs={'datatype': 'array_of_equalsized_arrays<1,1>{real}'})}, attrs={'datatype': 'table{dt,t0,values}'})}, attrs={'datatype': 'table{baseline,channel,energy,ievt,numtraces,packet_id,timestamp,tracelist,waveform,wf_max,wf_std}'})

As shown by the type signature, it is interpreted as a Table with 100 rows. Its contents (or “columns”) can be therefore viewed as LGDO objects of the same length. For example timestamp:

[6]:

lh5.read("geds/raw/timestamp", lh5_file)

[6]:

Array([0.79465985 0.7968994  0.79960424 ... 0.97331905 0.974689
       0.9786208 ], attrs={'datatype': 'array<1>{real}', 'units': 's'})

is an LGDO Array with 100 elements.

read() also allows to perform more advanced data reading. For example, let’s read only rows from 15 to 25:

[7]:

obj = lh5.read("geds/raw/timestamp", lh5_file, start_row=15, n_rows=10)
print(obj)

[0.82679445 0.8307392  0.8298773  0.830739   0.8339691  0.83487684
 0.83510256 0.83612865 0.83797085 0.8406608 ] with attrs={'units': 's'}

Or, let’s read only columns timestamp and energy from the geds/raw table and rows [1, 3, 7, 9, 10, 15]:

[8]:

obj = lh5.read(
    "geds/raw", lh5_file, field_mask=("timestamp", "energy"), idx=[1, 3, 7, 9, 10, 15]
)
print(obj)

 timestamp  energy
  0.796899    8642
  0.799604   13015
  0.812317   22085
  0.813282   26636
  0.813520    2648
  0.826794    7799

with attrs['timestamp']={'units': 's'}

As you might have noticed, read() loads all the requested data in memory at once. This can be a problem when dealing with large datasets. LH5Iterator [docs] makes it possible to handle data one chunk at a time (sequentially) to avoid running out of memory:

[9]:

from lh5 import LH5Iterator

for lh5_obj in LH5Iterator(lh5_file, "geds/raw/energy", buffer_len=20):
    print(f"energy = {lh5_obj} ({len(lh5_obj)} rows)")

energy = [3304 8642 9177 ... 8289 7091 4084] (20 rows)
energy = [11546  2873  3193 ...  4114 29557 12309] (20 rows)
energy = [ 6455  3302  5314 ... 37333  4262 15131] (20 rows)
energy = [ 6117  3358  3132 ...  2949  3691 10402] (20 rows)
energy = [ 4088 41153 34295 ... 37877  6014  3410] (20 rows)

If working with many files at the same time, theLH5Store [docs] class might come handy:

[10]:

from lh5 import LH5Store

store = LH5Store(
    keep_open=True
)  # with keep_open=True, files are kept open inside the store
store.read("geds/raw", lh5_file)

[10]:

Table(dict={'packet_id': Array([1 2 ... 99 100], attrs={'datatype': 'array<1>{real}'}), 'ievt': Array([0 0 ... 3 32], attrs={'datatype': 'array<1>{real}'}), 'timestamp': Array([0.79465985 0.7968994 ... 0.974689 0.9786208 ], attrs={'datatype': 'array<1>{real}', 'units': 's'}), 'numtraces': Array([1 1 ... 1 1], attrs={'datatype': 'array<1>{real}'}), 'tracelist': VectorOfVectors(flattened_data=Array([53 60 ... 30 53], attrs={'datatype': 'array<1>{real}'}), cumulative_length=_OffsetArrayView([1 2 ... 99 100], attrs={'datatype': 'array<1>{real}'}), attrs={'datatype': 'array<1>{array<1>{real}}'}), 'baseline': Array([13722 13044 ... 9931 17013], attrs={'datatype': 'array<1>{real}'}), 'energy': Array([3304 8642 ... 6014 3410], attrs={'datatype': 'array<1>{real}'}), 'channel': Array([53 60 ... 30 53], attrs={'datatype': 'array<1>{real}'}), 'wf_max': Array([16352 20549 ... 14317 19567], attrs={'datatype': 'array<1>{real}'}), 'wf_std': Array([1028.1815 3084.8018 ... 1876.9403 1065.3331], attrs={'datatype': 'array<1>{real}'}), 'waveform': WaveformTable(dict={'t0': Array([0. 0. ... 0. 0.], attrs={'datatype': 'array<1>{real}', 'units': 'ns'}), 'dt': Array([16. 16. ... 16. 16.], attrs={'datatype': 'array<1>{real}', 'units': 'ns'}), 'values': ArrayOfEqualSizedArrays([[13712 13712 ... 15380 15400] [13072 13072 ... 18819 18806] ... [9962 9962 ... 13326 13269] [16918 16918 ... 18877 18933]], attrs={'datatype': 'array_of_equalsized_arrays<1,1>{real}'})}, attrs={'datatype': 'table{dt,t0,values}'})}, attrs={'datatype': 'table{baseline,channel,energy,ievt,numtraces,packet_id,timestamp,tracelist,waveform,wf_max,wf_std}'})

Have a look at the API reference for more documentation.

There are also some more complex LGDO objects, for example the Histogram. The LH5 data structure of a histogram is fixed and cannot be amended without losing its data type. It also cannot be partially read or streamed.

[11]:

histogram_file = ldata.get_path("lh5/lgdo-histograms.lh5")
histogram = lh5.read("test_histogram_range_w_attrs", histogram_file)
print(histogram)

{
 'axis_0': first=-5.0, last=5.0, step=0.5, closedleft=True with attrs={'units': 'm'},
 'axis_1': first=-5.0, last=5.0, step=0.5, closedleft=True with attrs={'units': 'm'},
}

But a convenient way to get all necessary details is available, encapsulating the complexity of the underlying structure that is stored in the file, for example (also see the [docs] for all available properties):

[12]:

histogram.binning[0].first, histogram.binning[0].last, histogram.isdensity

[12]:

(np.float64(-5.0), np.float64(5.0), np.False_)

Reading LH5 data to alternative formats¶

Each LGDO is equipped with a class method called view_as() [docs], which allows the user to “view” the data (i.e. avoiding copying data as much as possible) in a different, third-party format.

LGDOs generally support viewing as NumPy (np), Pandas (pd) or Awkward (ak) data structures, with some exceptions. We strongly recommend having a look at the view_as() API docs of each LGDO type for more details (for Table.view_as() [docs], for example).

Note: To obtain a copy of the data in the selected third-party format, the user can call the appropriate third-party copy method on the view (e.g. pandas.DataFrame.copy(), if viewing the data as a Pandas dataframe).

Let’s play around with our good old table, can we view it as a Pandas dataframe?

[13]:

obj = lh5.read("geds/raw", lh5_file)
df = obj.view_as("pd")
df

[13]:

	packet_id	ievt	timestamp	numtraces	tracelist	baseline	energy	channel	wf_max	wf_std	waveform_t0	waveform_dt	waveform_values
0	1	0	0.794660	1	[53]	13722	3304	53	16352	1028.181519	0.0	16.0	[13712 13712 13683 ... 15445 15380 15400]
1	2	0	0.796899	1	[60]	13044	8642	60	20549	3084.801758	0.0	16.0	[13072 13072 12992 ... 18842 18819 18806]
2	3	0	0.799604	1	[40]	13508	9177	40	20119	2849.593750	0.0	16.0	[13575 13575 13496 ... 18409 18384 18457]
3	4	0	0.799604	1	[41]	11891	13015	41	21215	4023.562256	0.0	16.0	[11862 11862 11808 ... 18946 18861 18834]
4	5	1	0.801617	1	[60]	14353	3794	60	17150	1183.935913	0.0	16.0	[14432 14432 14409 ... 16380 16403 16410]
...	...	...	...	...	...	...	...	...	...	...	...	...	...
95	96	30	0.965948	1	[53]	15597	5361	53	29471	947.814392	0.0	16.0	[27818 27818 27755 ... 26594 26562 26493]
96	97	44	0.971051	1	[60]	14599	3748	60	17324	1151.771362	0.0	16.0	[14471 14471 14515 ... 16583 16582 16609]
97	98	31	0.973319	1	[53]	15438	37877	53	42319	11780.090820	0.0	16.0	[15405 15405 15366 ... 36208 36214 36233]
98	99	3	0.974689	1	[30]	9931	6014	30	14317	1876.940308	0.0	16.0	[9962 9962 9949 ... 13269 13326 13269]
99	100	32	0.978621	1	[53]	17013	3410	53	19567	1065.333130	0.0	16.0	[16918 16918 16962 ... 18715 18877 18933]

100 rows × 13 columns

Yes! But how are the nested objects being handled?

Nested tables have been flattened by prefixing their column names with the table object name (obj.waveform.values becomes df.waveform_values) and multi-dimensional columns are represented by Awkward arrays:

[14]:

df.waveform_values

[14]:

0     [13712 13712 13683 ... 15445 15380 15400]
1     [13072 13072 12992 ... 18842 18819 18806]
2     [13575 13575 13496 ... 18409 18384 18457]
3     [11862 11862 11808 ... 18946 18861 18834]
4     [14432 14432 14409 ... 16380 16403 16410]
                        ...
95    [27818 27818 27755 ... 26594 26562 26493]
96    [14471 14471 14515 ... 16583 16582 16609]
97    [15405 15405 15366 ... 36208 36214 36233]
98       [9962 9962 9949 ... 13269 13326 13269]
99    [16918 16918 16962 ... 18715 18877 18933]
Name: waveform_values, Length: 100, dtype: awkward

But what if we wanted to have the waveform values as a NumPy array?

[15]:

obj.waveform.values.view_as("np")

[15]:

array([[13712, 13712, 13683, ..., 15445, 15380, 15400],
       [13072, 13072, 12992, ..., 18842, 18819, 18806],
       [13575, 13575, 13496, ..., 18409, 18384, 18457],
       ...,
       [15405, 15405, 15366, ..., 36208, 36214, 36233],
       [ 9962,  9962,  9949, ..., 13269, 13326, 13269],
       [16918, 16918, 16962, ..., 18715, 18877, 18933]],
      shape=(100, 5592), dtype=uint16)

Can we just view the full table as a huge Awkward array? Of course:

[16]:

obj.view_as("ak")

[16]:

[{packet_id: 1, ievt: 0, timestamp: 0.795, numtraces: 1, tracelist: [53], ...},
 {packet_id: 2, ievt: 0, timestamp: 0.797, numtraces: 1, tracelist: [60], ...},
 {packet_id: 3, ievt: 0, timestamp: 0.8, numtraces: 1, tracelist: [40], ...},
 {packet_id: 4, ievt: 0, timestamp: 0.8, numtraces: 1, tracelist: [41], ...},
 {packet_id: 5, ievt: 1, timestamp: 0.802, numtraces: 1, tracelist: [60], ...},
 {packet_id: 6, ievt: 2, timestamp: 0.807, numtraces: 1, tracelist: [60], ...},
 {packet_id: 7, ievt: 3, timestamp: 0.812, numtraces: 1, tracelist: [64], ...},
 {packet_id: 8, ievt: 0, timestamp: 0.812, numtraces: 1, tracelist: [47], ...},
 {packet_id: 9, ievt: 1, timestamp: 0.813, numtraces: 1, tracelist: [53], ...},
 {packet_id: 10, ievt: 4, timestamp: 0.813, numtraces: 1, tracelist: [60], ...},
 ...,
 {packet_id: 92, ievt: 42, timestamp: 0.96, numtraces: 1, tracelist: [60], ...},
 {packet_id: 93, ievt: 43, timestamp: 0.965, numtraces: 1, tracelist: ..., ...},
 {packet_id: 94, ievt: 28, timestamp: 0.963, numtraces: 1, tracelist: ..., ...},
 {packet_id: 95, ievt: 29, timestamp: 0.966, numtraces: 1, tracelist: ..., ...},
 {packet_id: 96, ievt: 30, timestamp: 0.966, numtraces: 1, tracelist: ..., ...},
 {packet_id: 97, ievt: 44, timestamp: 0.971, numtraces: 1, tracelist: ..., ...},
 {packet_id: 98, ievt: 31, timestamp: 0.973, numtraces: 1, tracelist: ..., ...},
 {packet_id: 99, ievt: 3, timestamp: 0.975, numtraces: 1, tracelist: [30], ...},
 {packet_id: 100, ievt: 32, timestamp: 0.979, numtraces: 1, ...}]
--------------------------------------------------------------------------------
backend: cpu
nbytes: 1.1 MB
type: 100 * {
    packet_id: uint32,
    ievt: int32,
    timestamp: float32,...

Note that viewing a VectorOfVector as an Awkward array is a nearly zero-copy operation and opens a new avenue of fast computational possibilities thanks to Awkward:

[17]:

import awkward as ak

# tracelist is a VoV on disk
trlist = obj.tracelist.view_as("ak")
ak.mean(trlist)

[17]:

np.float64(54.31)

Last but not least, we support attaching physical units (that might be stored in the units attribute of an LGDO) to data views through Pint, if the third-party format allows it:

[18]:

df = obj.view_as("pd", with_units=True)
df.timestamp.dtype

[18]:

pint[s][Float64]

Note that we also provide the read_as() [docs] shortcut to save some typing, for users that would like to read LH5 data on disk straight into some third-party format:

[19]:

lh5.read_as("geds/raw", lh5_file, "pd", with_units=True)

[19]:

	packet_id	ievt	timestamp	numtraces	tracelist	baseline	energy	channel	wf_max	wf_std	waveform_t0	waveform_dt	waveform_values
0	1	0	0.79466	1	[53]	13722	3304	53	16352	1028.181519	0.0	16.0	[13712 13712 13683 ... 15445 15380 15400]
1	2	0	0.796899	1	[60]	13044	8642	60	20549	3084.801758	0.0	16.0	[13072 13072 12992 ... 18842 18819 18806]
2	3	0	0.799604	1	[40]	13508	9177	40	20119	2849.593750	0.0	16.0	[13575 13575 13496 ... 18409 18384 18457]
3	4	0	0.799604	1	[41]	11891	13015	41	21215	4023.562256	0.0	16.0	[11862 11862 11808 ... 18946 18861 18834]
4	5	1	0.801617	1	[60]	14353	3794	60	17150	1183.935913	0.0	16.0	[14432 14432 14409 ... 16380 16403 16410]
...	...	...	...	...	...	...	...	...	...	...	...	...	...
95	96	30	0.965948	1	[53]	15597	5361	53	29471	947.814392	0.0	16.0	[27818 27818 27755 ... 26594 26562 26493]
96	97	44	0.971051	1	[60]	14599	3748	60	17324	1151.771362	0.0	16.0	[14471 14471 14515 ... 16583 16582 16609]
97	98	31	0.973319	1	[53]	15438	37877	53	42319	11780.090820	0.0	16.0	[15405 15405 15366 ... 36208 36214 36233]
98	99	3	0.974689	1	[30]	9931	6014	30	14317	1876.940308	0.0	16.0	[9962 9962 9949 ... 13269 13326 13269]
99	100	32	0.978621	1	[53]	17013	3410	53	19567	1065.333130	0.0	16.0	[16918 16918 16962 ... 18715 18877 18933]

100 rows × 13 columns

Some types also support other specialized python libraries. For example, the Histogram [docs] type allows us to easily show the data using the hist package:

[20]:

histogram.view_as("hist")

[20]:

Regular(20, -5, 5, underflow=False, overflow=False, label='Axis 0')
Regular(20, -5, 5, underflow=False, overflow=False, label='Axis 1')

Double() Σ=5000.0

Note: In this case, a true copy of the data is made. This is a limitation imposed by Hist’s library design.

Histograms are also a example of a LGDO type that does not support all of the usual types for its view_as function (i.e. pd or ak are unsupported).

Writing data to disk¶

Let’s start by creating some LGDOs:

[21]:

import numpy as np
from lgdo import Array, Scalar, WaveformTable

rng = np.random.default_rng(12345)

scalar = Scalar("made with legend-pydataobj!")
array = Array(rng.random(size=10))
wf_table = WaveformTable(values=rng.integers(low=1000, high=5000, size=(10, 1000)))

The write() [docs] function makes it possible to write LGDO objects on disk. Let’s start by writing scalar with name message in a file named my_data.lh5 in the current directory:

[22]:

lh5.write(scalar, name="message", lh5_file="my_objects.lh5", wo_mode="overwrite_file")

Let’s now inspect the file contents:

[23]:

lh5.show("my_objects.lh5")

/
└── message · string

The string object has been written at the root of the file /. Let’s now write also array and wf_table, this time in a HDF5 group called closet:

[24]:

lh5.write(array, name="numbers", group="closet", lh5_file="my_objects.lh5")
lh5.write(wf_table, name="waveforms", group="closet", lh5_file="my_objects.lh5")
lh5.show("my_objects.lh5")

/
├── closet · struct{numbers}
│   ├── numbers · array<1>{real}
│   └── waveforms · table{dt,t0,values}
│       ├── dt · array<1>{real}
│       ├── t0 · array<1>{real}
│       └── values · array_of_equalsized_arrays<1,1>{real}
└── message · string

Everything looks right!

Note: lh5.write() allows for more advanced usage, like writing only some rows of the input object or appending to existing array-like structures. Have a look at the [docs] for more information.

This page has been automatically generated by nbsphinx and can be run as a Jupyter notebook available in the legend-lh5io repository.