Handling LH5 data
LEGEND stores its data in HDF5 format, a high-performance data format becoming popular in experimental physics. LEGEND Data Objects (LGDO) are represented as HDF5 objects according to a custom specification, documented here.
Reading data from disk
Let’s start by downloading a small test LH5 file with the pylegendtestdata package (it takes a while depending on your internet connection):
[1]:
from __future__ import annotations
from legendtestdata import LegendTestData
ldata = LegendTestData()
lh5_file = ldata.get_path("lh5/LDQTA_r117_20200110T105115Z_cal_geds_raw.lh5")
We can use lh5.ls() [docs] to inspect the file contents:
[2]:
import lh5
lh5.ls(lh5_file)
[2]:
['geds']
This particular file contains an HDF5 group (groups behave like directories). The second argument of ls() selects a group to inspect. With a trailing /, the contents of the group are listed; without it, only the group name is returned, if it exists:
[3]:
lh5.ls(lh5_file, "geds/") # returns ['geds/raw'], which is a group again
lh5.ls(lh5_file, "geds/raw/")
[3]:
['geds/raw/baseline',
'geds/raw/channel',
'geds/raw/energy',
'geds/raw/ievt',
'geds/raw/numtraces',
'geds/raw/packet_id',
'geds/raw/timestamp',
'geds/raw/tracelist',
'geds/raw/waveform',
'geds/raw/wf_max',
'geds/raw/wf_std']
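The trailing-slash convention can be illustrated with a toy model that works on a plain set of path strings. This is only a sketch of the behavior described above: toy_ls is a hypothetical helper, not part of the lh5 API, and it never touches an HDF5 file.

```python
# Toy model of ls()-style listing over a flat set of object paths.
# `toy_ls` is a hypothetical helper for illustration, not part of the lh5 API.


def toy_ls(paths, target=""):
    """List entries following the trailing-slash convention described above."""
    if target and not target.endswith("/"):
        # No trailing slash: return the name itself, if it exists.
        return [target] if target in paths else []
    # Trailing slash (or root): return the direct children only.
    depth = target.count("/") + 1  # number of path components of a direct child
    return sorted(
        p for p in paths if p.startswith(target) and len(p.split("/")) == depth
    )


paths = {"geds", "geds/raw", "geds/raw/energy", "geds/raw/timestamp"}

root = toy_ls(paths)
name_only = toy_ls(paths, "geds")
children = toy_ls(paths, "geds/")
columns = toy_ls(paths, "geds/raw/")
print(root, name_only, children, columns)
```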
Note: As an alternative to ls(), show() [docs] prints a nice representation of the LH5 file contents (with LGDO types) on screen:
[4]:
lh5.show(lh5_file)
/
└── geds · HDF5 group
└── raw · table{packet_id,ievt,timestamp,numtraces,tracelist,baseline,energy,channel,wf_max,wf_std,waveform}
├── baseline · array<1>{real}
├── channel · array<1>{real}
├── energy · array<1>{real}
├── ievt · array<1>{real}
├── numtraces · array<1>{real}
├── packet_id · array<1>{real}
├── timestamp · array<1>{real}
├── tracelist · array<1>{array<1>{real}}
│ ├── cumulative_length · array<1>{real}
│ └── flattened_data · array<1>{real}
├── waveform · table{t0,dt,values}
│ ├── dt · array<1>{real}
│ ├── t0 · array<1>{real}
│ └── values · array_of_equalsized_arrays<1,1>{real}
├── wf_max · array<1>{real}
└── wf_std · array<1>{real}
The group contains several LGDOs. Let’s read them into memory. read() [docs] reads an LGDO from disk and returns the object in memory. Let’s try to read geds/raw:
[5]:
lh5.read("geds/raw", lh5_file)
[5]:
Table(dict={'packet_id': Array([1 2 ... 99 100], attrs={'datatype': 'array<1>{real}'}), 'ievt': Array([0 0 ... 3 32], attrs={'datatype': 'array<1>{real}'}), 'timestamp': Array([0.79465985 0.7968994 ... 0.974689 0.9786208 ], attrs={'datatype': 'array<1>{real}', 'units': 's'}), 'numtraces': Array([1 1 ... 1 1], attrs={'datatype': 'array<1>{real}'}), 'tracelist': VectorOfVectors(flattened_data=Array([53 60 ... 30 53], attrs={'datatype': 'array<1>{real}'}), cumulative_length=_OffsetArrayView([1 2 ... 99 100], attrs={'datatype': 'array<1>{real}'}), attrs={'datatype': 'array<1>{array<1>{real}}'}), 'baseline': Array([13722 13044 ... 9931 17013], attrs={'datatype': 'array<1>{real}'}), 'energy': Array([3304 8642 ... 6014 3410], attrs={'datatype': 'array<1>{real}'}), 'channel': Array([53 60 ... 30 53], attrs={'datatype': 'array<1>{real}'}), 'wf_max': Array([16352 20549 ... 14317 19567], attrs={'datatype': 'array<1>{real}'}), 'wf_std': Array([1028.1815 3084.8018 ... 1876.9403 1065.3331], attrs={'datatype': 'array<1>{real}'}), 'waveform': WaveformTable(dict={'t0': Array([0. 0. ... 0. 0.], attrs={'datatype': 'array<1>{real}', 'units': 'ns'}), 'dt': Array([16. 16. ... 16. 16.], attrs={'datatype': 'array<1>{real}', 'units': 'ns'}), 'values': ArrayOfEqualSizedArrays([[13712 13712 ... 15380 15400] [13072 13072 ... 18819 18806] ... [9962 9962 ... 13326 13269] [16918 16918 ... 18877 18933]], attrs={'datatype': 'array_of_equalsized_arrays<1,1>{real}'})}, attrs={'datatype': 'table{dt,t0,values}'})}, attrs={'datatype': 'table{baseline,channel,energy,ievt,numtraces,packet_id,timestamp,tracelist,waveform,wf_max,wf_std}'})
As shown by the type signature, it is interpreted as a Table with 100 rows. Its contents (or “columns”) can therefore be viewed as LGDO objects of the same length. For example, timestamp:
[6]:
lh5.read("geds/raw/timestamp", lh5_file)
[6]:
Array([0.79465985 0.7968994 0.79960424 ... 0.97331905 0.974689
0.9786208 ], attrs={'datatype': 'array<1>{real}', 'units': 's'})
The result is an LGDO Array with 100 elements.
read() also supports more advanced data reading. For example, let’s read only 10 rows starting from row 15 (rows 15 through 24):
[7]:
obj = lh5.read("geds/raw/timestamp", lh5_file, start_row=15, n_rows=10)
print(obj)
[0.82679445 0.8307392 0.8298773 0.830739 0.8339691 0.83487684
0.83510256 0.83612865 0.83797085 0.8406608 ] with attrs={'units': 's'}
Or, let’s read only columns timestamp and energy from the geds/raw table and rows [1, 3, 7, 9, 10, 15]:
[8]:
obj = lh5.read(
    "geds/raw", lh5_file, field_mask=("timestamp", "energy"), idx=[1, 3, 7, 9, 10, 15]
)
print(obj)
timestamp energy
0.796899 8642
0.799604 13015
0.812317 22085
0.813282 26636
0.813520 2648
0.826794 7799
with attrs['timestamp']={'units': 's'}
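To make the selection semantics concrete, here is a plain-Python analogy on a dictionary of lists. It is a sketch under stated assumptions: select is a hypothetical helper that borrows the parameter names used above (field_mask, start_row, n_rows, idx), the data are made up, and nothing here reflects how read() is implemented.

```python
# Plain-Python analogy for the row/column selection shown above.
# `select` is a hypothetical helper; the real read() works on LH5 files.


def select(table, field_mask=None, start_row=0, n_rows=None, idx=None):
    """Return a sub-table (dict of lists) restricted to some columns and rows."""
    columns = field_mask if field_mask is not None else tuple(table)
    if idx is None:
        total = len(next(iter(table.values())))
        stop = total if n_rows is None else min(start_row + n_rows, total)
        idx = range(start_row, stop)
    return {col: [table[col][i] for i in idx] for col in columns}


# Made-up 30-row table, for illustration only.
table = {
    "energy": list(range(100, 130)),
    "channel": [i % 3 for i in range(30)],
}

# start_row=15, n_rows=10 selects rows 15 through 24 (10 rows).
window = select(table, field_mask=("energy",), start_row=15, n_rows=10)
# An explicit index list picks arbitrary rows.
picked = select(table, field_mask=("energy",), idx=[1, 3, 7])
print(window["energy"])
print(picked["energy"])
```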
As you might have noticed, read() loads all the requested data in memory at once. This can be a problem when dealing with large datasets. LH5Iterator [docs] makes it possible to handle data one chunk at a time (sequentially) to avoid running out of memory:
[9]:
from lh5 import LH5Iterator
for lh5_obj in LH5Iterator(lh5_file, "geds/raw/energy", buffer_len=20):
    print(f"energy = {lh5_obj} ({len(lh5_obj)} rows)")
energy = [3304 8642 9177 ... 8289 7091 4084] (20 rows)
energy = [11546 2873 3193 ... 4114 29557 12309] (20 rows)
energy = [ 6455 3302 5314 ... 37333 4262 15131] (20 rows)
energy = [ 6117 3358 3132 ... 2949 3691 10402] (20 rows)
energy = [ 4088 41153 34295 ... 37877 6014 3410] (20 rows)
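The chunking idea itself is easy to sketch on an in-memory list. The generator below is a hypothetical stand-in, not the LH5Iterator implementation: it only shows how 100 rows split into five chunks of 20, while the real iterator reads each chunk from disk.

```python
# Minimal sketch of sequential chunked access over an in-memory sequence.
# `iter_chunks` is a hypothetical helper, not part of the lh5 API.


def iter_chunks(data, buffer_len):
    """Yield consecutive slices of at most buffer_len rows."""
    for start in range(0, len(data), buffer_len):
        yield data[start : start + buffer_len]


energies = list(range(100))  # stand-in for a 100-row column
chunk_sizes = [len(chunk) for chunk in iter_chunks(energies, buffer_len=20)]
print(chunk_sizes)
```

In this sketch, a final chunk that does not fill the buffer is simply shorter than buffer_len.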
When working with many files at the same time, the LH5Store [docs] class might come in handy:
[10]:
from lh5 import LH5Store
# with keep_open=True, files are kept open inside the store
store = LH5Store(keep_open=True)
store.read("geds/raw", lh5_file)
[10]:
Table(dict={'packet_id': Array([1 2 ... 99 100], attrs={'datatype': 'array<1>{real}'}), 'ievt': Array([0 0 ... 3 32], attrs={'datatype': 'array<1>{real}'}), 'timestamp': Array([0.79465985 0.7968994 ... 0.974689 0.9786208 ], attrs={'datatype': 'array<1>{real}', 'units': 's'}), 'numtraces': Array([1 1 ... 1 1], attrs={'datatype': 'array<1>{real}'}), 'tracelist': VectorOfVectors(flattened_data=Array([53 60 ... 30 53], attrs={'datatype': 'array<1>{real}'}), cumulative_length=_OffsetArrayView([1 2 ... 99 100], attrs={'datatype': 'array<1>{real}'}), attrs={'datatype': 'array<1>{array<1>{real}}'}), 'baseline': Array([13722 13044 ... 9931 17013], attrs={'datatype': 'array<1>{real}'}), 'energy': Array([3304 8642 ... 6014 3410], attrs={'datatype': 'array<1>{real}'}), 'channel': Array([53 60 ... 30 53], attrs={'datatype': 'array<1>{real}'}), 'wf_max': Array([16352 20549 ... 14317 19567], attrs={'datatype': 'array<1>{real}'}), 'wf_std': Array([1028.1815 3084.8018 ... 1876.9403 1065.3331], attrs={'datatype': 'array<1>{real}'}), 'waveform': WaveformTable(dict={'t0': Array([0. 0. ... 0. 0.], attrs={'datatype': 'array<1>{real}', 'units': 'ns'}), 'dt': Array([16. 16. ... 16. 16.], attrs={'datatype': 'array<1>{real}', 'units': 'ns'}), 'values': ArrayOfEqualSizedArrays([[13712 13712 ... 15380 15400] [13072 13072 ... 18819 18806] ... [9962 9962 ... 13326 13269] [16918 16918 ... 18877 18933]], attrs={'datatype': 'array_of_equalsized_arrays<1,1>{real}'})}, attrs={'datatype': 'table{dt,t0,values}'})}, attrs={'datatype': 'table{baseline,channel,energy,ievt,numtraces,packet_id,timestamp,tracelist,waveform,wf_max,wf_std}'})
Have a look at the API reference for more documentation.
There are also some more complex LGDO objects, for example the Histogram. The LH5 data structure of a histogram is fixed and cannot be amended without losing its data type. It also cannot be partially read or streamed.
[11]:
histogram_file = ldata.get_path("lh5/lgdo-histograms.lh5")
histogram = lh5.read("test_histogram_range_w_attrs", histogram_file)
print(histogram)
{
'axis_0': first=-5.0, last=5.0, step=0.5, closedleft=True with attrs={'units': 'm'},
'axis_1': first=-5.0, last=5.0, step=0.5, closedleft=True with attrs={'units': 'm'},
}
A convenient interface nevertheless exposes all the necessary details and hides the complexity of the underlying structure stored in the file. For example (see also the [docs] for all available properties):
[12]:
histogram.binning[0].first, histogram.binning[0].last, histogram.isdensity
[12]:
(np.float64(-5.0), np.float64(5.0), np.False_)
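The first, last and step values printed for each axis fully determine a regular binning. As a small worked example, the bin edges can be rebuilt from them in plain Python. The regular_edges helper is hypothetical and not part of lgdo.

```python
# Rebuild the bin edges of a regular axis from first/last/step.
# `regular_edges` is a hypothetical helper for illustration only.


def regular_edges(first, last, step):
    """Return the bin edges of a regular binning from first to last."""
    n_bins = round((last - first) / step)
    return [first + i * step for i in range(n_bins + 1)]


# Values taken from the axis description printed above.
edges = regular_edges(first=-5.0, last=5.0, step=0.5)
n_bins = len(edges) - 1
print(n_bins, edges[0], edges[-1])
```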
Reading LH5 data to alternative formats
Each LGDO is equipped with a method called view_as() [docs], which allows the user to “view” the data (i.e. avoiding copying data as much as possible) in a different, third-party format.
LGDOs generally support viewing as NumPy (np), Pandas (pd) or Awkward (ak) data structures, with some exceptions. We strongly recommend having a look at the view_as() API docs of each LGDO type for more details (for Table.view_as() [docs], for example).
Note: To obtain a copy of the data in the selected third-party format, the user can call the appropriate third-party copy method on the view (e.g. pandas.DataFrame.copy(), if viewing the data as a Pandas dataframe).
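The difference between a view and a copy can be demonstrated with the Python standard library alone. This is an analogy only, assuming nothing about lgdo internals: memoryview plays the role of the view and a second array plays the role of the copy.

```python
# Standard-library analogy for "view" versus "copy" semantics.
from array import array

samples = array("H", [13712, 13683, 15445])  # unsigned 16-bit values
view = memoryview(samples)  # shares the buffer with `samples`
snapshot = array("H", samples)  # independent copy of the data

samples[0] = 0  # modify the original buffer in place
print(view[0], snapshot[0])  # the view sees the change, the copy does not
```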
Let’s play around with our good old table. Can we view it as a Pandas dataframe?
[13]:
obj = lh5.read("geds/raw", lh5_file)
df = obj.view_as("pd")
df
[13]:
| | packet_id | ievt | timestamp | numtraces | tracelist | baseline | energy | channel | wf_max | wf_std | waveform_t0 | waveform_dt | waveform_values |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0.794660 | 1 | [53] | 13722 | 3304 | 53 | 16352 | 1028.181519 | 0.0 | 16.0 | [13712 13712 13683 ... 15445 15380 15400] |
| 1 | 2 | 0 | 0.796899 | 1 | [60] | 13044 | 8642 | 60 | 20549 | 3084.801758 | 0.0 | 16.0 | [13072 13072 12992 ... 18842 18819 18806] |
| 2 | 3 | 0 | 0.799604 | 1 | [40] | 13508 | 9177 | 40 | 20119 | 2849.593750 | 0.0 | 16.0 | [13575 13575 13496 ... 18409 18384 18457] |
| 3 | 4 | 0 | 0.799604 | 1 | [41] | 11891 | 13015 | 41 | 21215 | 4023.562256 | 0.0 | 16.0 | [11862 11862 11808 ... 18946 18861 18834] |
| 4 | 5 | 1 | 0.801617 | 1 | [60] | 14353 | 3794 | 60 | 17150 | 1183.935913 | 0.0 | 16.0 | [14432 14432 14409 ... 16380 16403 16410] |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 96 | 30 | 0.965948 | 1 | [53] | 15597 | 5361 | 53 | 29471 | 947.814392 | 0.0 | 16.0 | [27818 27818 27755 ... 26594 26562 26493] |
| 96 | 97 | 44 | 0.971051 | 1 | [60] | 14599 | 3748 | 60 | 17324 | 1151.771362 | 0.0 | 16.0 | [14471 14471 14515 ... 16583 16582 16609] |
| 97 | 98 | 31 | 0.973319 | 1 | [53] | 15438 | 37877 | 53 | 42319 | 11780.090820 | 0.0 | 16.0 | [15405 15405 15366 ... 36208 36214 36233] |
| 98 | 99 | 3 | 0.974689 | 1 | [30] | 9931 | 6014 | 30 | 14317 | 1876.940308 | 0.0 | 16.0 | [9962 9962 9949 ... 13269 13326 13269] |
| 99 | 100 | 32 | 0.978621 | 1 | [53] | 17013 | 3410 | 53 | 19567 | 1065.333130 | 0.0 | 16.0 | [16918 16918 16962 ... 18715 18877 18933] |
100 rows × 13 columns
Yes! But how are the nested objects being handled?
Nested tables have been flattened by prefixing their column names with the table object name (obj.waveform.values becomes df.waveform_values) and multi-dimensional columns are represented by Awkward arrays:
[14]:
df.waveform_values
[14]:
0 [13712 13712 13683 ... 15445 15380 15400]
1 [13072 13072 12992 ... 18842 18819 18806]
2 [13575 13575 13496 ... 18409 18384 18457]
3 [11862 11862 11808 ... 18946 18861 18834]
4 [14432 14432 14409 ... 16380 16403 16410]
...
95 [27818 27818 27755 ... 26594 26562 26493]
96 [14471 14471 14515 ... 16583 16582 16609]
97 [15405 15405 15366 ... 36208 36214 36233]
98 [9962 9962 9949 ... 13269 13326 13269]
99 [16918 16918 16962 ... 18715 18877 18933]
Name: waveform_values, Length: 100, dtype: awkward
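The column-name flattening described above can be mimicked on nested dictionaries. The sketch below only illustrates the naming rule (parent name, underscore, child name). flatten_columns is a hypothetical helper and the values are made up.

```python
# Flatten nested "tables" (dicts) by prefixing child names with the parent name.
# `flatten_columns` is a hypothetical helper, not part of lgdo.


def flatten_columns(table, prefix=""):
    """Flatten nested dicts into one level, joining names with an underscore."""
    flat = {}
    for name, value in table.items():
        key = f"{prefix}{name}"
        if isinstance(value, dict):
            flat.update(flatten_columns(value, prefix=f"{key}_"))
        else:
            flat[key] = value
    return flat


# Made-up two-row table with a nested waveform sub-table.
nested = {
    "energy": [3304, 8642],
    "waveform": {"t0": [0.0, 0.0], "dt": [16.0, 16.0], "values": [[1, 2], [3, 4]]},
}
flat = flatten_columns(nested)
print(list(flat))
```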
But what if we wanted to have the waveform values as a NumPy array?
[15]:
obj.waveform.values.view_as("np")
[15]:
array([[13712, 13712, 13683, ..., 15445, 15380, 15400],
[13072, 13072, 12992, ..., 18842, 18819, 18806],
[13575, 13575, 13496, ..., 18409, 18384, 18457],
...,
[15405, 15405, 15366, ..., 36208, 36214, 36233],
[ 9962, 9962, 9949, ..., 13269, 13326, 13269],
[16918, 16918, 16962, ..., 18715, 18877, 18933]],
shape=(100, 5592), dtype=uint16)
Can we just view the full table as a huge Awkward array? Of course:
[16]:
obj.view_as("ak")
[16]:
[{packet_id: 1, ievt: 0, timestamp: 0.795, numtraces: 1, tracelist: [53], ...},
{packet_id: 2, ievt: 0, timestamp: 0.797, numtraces: 1, tracelist: [60], ...},
{packet_id: 3, ievt: 0, timestamp: 0.8, numtraces: 1, tracelist: [40], ...},
{packet_id: 4, ievt: 0, timestamp: 0.8, numtraces: 1, tracelist: [41], ...},
{packet_id: 5, ievt: 1, timestamp: 0.802, numtraces: 1, tracelist: [60], ...},
{packet_id: 6, ievt: 2, timestamp: 0.807, numtraces: 1, tracelist: [60], ...},
{packet_id: 7, ievt: 3, timestamp: 0.812, numtraces: 1, tracelist: [64], ...},
{packet_id: 8, ievt: 0, timestamp: 0.812, numtraces: 1, tracelist: [47], ...},
{packet_id: 9, ievt: 1, timestamp: 0.813, numtraces: 1, tracelist: [53], ...},
{packet_id: 10, ievt: 4, timestamp: 0.813, numtraces: 1, tracelist: [60], ...},
...,
{packet_id: 92, ievt: 42, timestamp: 0.96, numtraces: 1, tracelist: [60], ...},
{packet_id: 93, ievt: 43, timestamp: 0.965, numtraces: 1, tracelist: ..., ...},
{packet_id: 94, ievt: 28, timestamp: 0.963, numtraces: 1, tracelist: ..., ...},
{packet_id: 95, ievt: 29, timestamp: 0.966, numtraces: 1, tracelist: ..., ...},
{packet_id: 96, ievt: 30, timestamp: 0.966, numtraces: 1, tracelist: ..., ...},
{packet_id: 97, ievt: 44, timestamp: 0.971, numtraces: 1, tracelist: ..., ...},
{packet_id: 98, ievt: 31, timestamp: 0.973, numtraces: 1, tracelist: ..., ...},
{packet_id: 99, ievt: 3, timestamp: 0.975, numtraces: 1, tracelist: [30], ...},
{packet_id: 100, ievt: 32, timestamp: 0.979, numtraces: 1, ...}]
--------------------------------------------------------------------------------
backend: cpu
nbytes: 1.1 MB
type: 100 * {
packet_id: uint32,
ievt: int32,
timestamp: float32,...
Note that viewing a VectorOfVectors as an Awkward array is a nearly zero-copy operation and opens a new avenue of fast computational possibilities thanks to Awkward:
[17]:
import awkward as ak
# tracelist is a VoV on disk
trlist = obj.tracelist.view_as("ak")
ak.mean(trlist)
[17]:
np.float64(54.31)
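The show() output earlier lists two datasets under tracelist: flattened_data and cumulative_length. The following sketch shows how such a pair can encode variable-length vectors, which is why handing the data to Awkward needs little or no copying. The unflatten helper and the sample values are illustrative assumptions, not lgdo code or data from the test file.

```python
# Rebuild variable-length vectors from a flat array plus cumulative lengths.
# `unflatten` is a hypothetical helper; the values below are made up.


def unflatten(flattened_data, cumulative_length):
    """Return the list of vectors encoded by the two flat arrays."""
    vectors, start = [], 0
    for stop in cumulative_length:
        vectors.append(flattened_data[start:stop])
        start = stop
    return vectors


flattened_data = [53, 60, 40, 41, 60, 7]
cumulative_length = [1, 3, 3, 6]  # vector i ends at cumulative_length[i]

vectors = unflatten(flattened_data, cumulative_length)
print(vectors)
```

A repeated value in cumulative_length (3, 3 here) encodes an empty vector.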
Last but not least, we support attaching physical units (that might be stored in the units attribute of an LGDO) to data views through Pint, if the third-party format allows it:
[18]:
df = obj.view_as("pd", with_units=True)
df.timestamp.dtype
[18]:
pint[s][Float64]
Note that we also provide the read_as() [docs] shortcut to save some typing, for users that would like to read LH5 data on disk straight into some third-party format:
[19]:
lh5.read_as("geds/raw", lh5_file, "pd", with_units=True)
[19]:
| | packet_id | ievt | timestamp | numtraces | tracelist | baseline | energy | channel | wf_max | wf_std | waveform_t0 | waveform_dt | waveform_values |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0.79466 | 1 | [53] | 13722 | 3304 | 53 | 16352 | 1028.181519 | 0.0 | 16.0 | [13712 13712 13683 ... 15445 15380 15400] |
| 1 | 2 | 0 | 0.796899 | 1 | [60] | 13044 | 8642 | 60 | 20549 | 3084.801758 | 0.0 | 16.0 | [13072 13072 12992 ... 18842 18819 18806] |
| 2 | 3 | 0 | 0.799604 | 1 | [40] | 13508 | 9177 | 40 | 20119 | 2849.593750 | 0.0 | 16.0 | [13575 13575 13496 ... 18409 18384 18457] |
| 3 | 4 | 0 | 0.799604 | 1 | [41] | 11891 | 13015 | 41 | 21215 | 4023.562256 | 0.0 | 16.0 | [11862 11862 11808 ... 18946 18861 18834] |
| 4 | 5 | 1 | 0.801617 | 1 | [60] | 14353 | 3794 | 60 | 17150 | 1183.935913 | 0.0 | 16.0 | [14432 14432 14409 ... 16380 16403 16410] |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 96 | 30 | 0.965948 | 1 | [53] | 15597 | 5361 | 53 | 29471 | 947.814392 | 0.0 | 16.0 | [27818 27818 27755 ... 26594 26562 26493] |
| 96 | 97 | 44 | 0.971051 | 1 | [60] | 14599 | 3748 | 60 | 17324 | 1151.771362 | 0.0 | 16.0 | [14471 14471 14515 ... 16583 16582 16609] |
| 97 | 98 | 31 | 0.973319 | 1 | [53] | 15438 | 37877 | 53 | 42319 | 11780.090820 | 0.0 | 16.0 | [15405 15405 15366 ... 36208 36214 36233] |
| 98 | 99 | 3 | 0.974689 | 1 | [30] | 9931 | 6014 | 30 | 14317 | 1876.940308 | 0.0 | 16.0 | [9962 9962 9949 ... 13269 13326 13269] |
| 99 | 100 | 32 | 0.978621 | 1 | [53] | 17013 | 3410 | 53 | 19567 | 1065.333130 | 0.0 | 16.0 | [16918 16918 16962 ... 18715 18877 18933] |
100 rows × 13 columns
Some types also support other specialized Python libraries. For example, the Histogram [docs] type allows us to easily show the data using the hist package:
[20]:
histogram.view_as("hist")
[20]:
Regular(20, -5, 5, underflow=False, overflow=False, label='Axis 1')
Double() Σ=5000.0
Note: In this case, a true copy of the data is made. This is a limitation imposed by Hist’s library design.
Histograms are also an example of an LGDO type that does not support all of the usual formats in view_as() (i.e. pd and ak are unsupported).
Writing data to disk
Let’s start by creating some LGDOs:
[21]:
import numpy as np
from lgdo import Array, Scalar, WaveformTable
rng = np.random.default_rng(12345)
scalar = Scalar("made with legend-pydataobj!")
array = Array(rng.random(size=10))
wf_table = WaveformTable(values=rng.integers(low=1000, high=5000, size=(10, 1000)))
The write() [docs] function makes it possible to write LGDO objects to disk. Let’s start by writing scalar with name message in a file named my_objects.lh5 in the current directory:
[22]:
lh5.write(scalar, name="message", lh5_file="my_objects.lh5", wo_mode="overwrite_file")
Let’s now inspect the file contents:
[23]:
lh5.show("my_objects.lh5")
/
└── message · string
The string object has been written at the root of the file (/). Let’s now also write array and wf_table, this time in an HDF5 group called closet:
[24]:
lh5.write(array, name="numbers", group="closet", lh5_file="my_objects.lh5")
lh5.write(wf_table, name="waveforms", group="closet", lh5_file="my_objects.lh5")
lh5.show("my_objects.lh5")
/
├── closet · struct{numbers}
│ ├── numbers · array<1>{real}
│ └── waveforms · table{dt,t0,values}
│ ├── dt · array<1>{real}
│ ├── t0 · array<1>{real}
│ └── values · array_of_equalsized_arrays<1,1>{real}
└── message · string
Everything looks right!
Note: lh5.write() allows for more advanced usage, like writing only some rows of the input object or appending to existing array-like structures. Have a look at the [docs] for more information.
This page has been automatically generated by nbsphinx and can be run as a Jupyter notebook available in the legend-lh5io repository.