lh5.io package¶
Routines for reading and writing LEGEND Data Objects in HDF5 files.
Currently the primary on-disk format for LGDO objects is LEGEND HDF5 (LH5) files. IO is done via the class store.LH5Store. LH5 files can also be browsed easily in Python like any HDF5 file using h5py.
Subpackages¶
Submodules¶
lh5.io.concat module¶
- lh5.io.concat._get_lgdos(file, obj_list)¶
Get names of array-like LGDO objects present in the file.
- lh5.io.concat._get_obj_list(lh5_files, include_list=None, exclude_list=None)¶
Extract a list of lh5 objects to concatenate.
- lh5.io.concat._inplace_table_filter(name, table, obj_list)¶
Filter objects nested in this LGDO.
- lh5.io.concat._remove_nested_fields(lgdos, obj_list)¶
Remove (nested) table fields based on obj_list.
- lh5.io.concat.lh5concat(lh5_files, output, overwrite=False, *, include_list=None, exclude_list=None, progress=False)¶
Concatenate LGDO Arrays, VectorOfVectors and Tables in LH5 files.
- Parameters:
lh5_files (list) – list of input files to concatenate.
output (str) – path to the output file.
overwrite (bool) – if True, overwrite the output file if it already exists.
include_list (list | None) – patterns for tables to include.
exclude_list (list | None) – patterns for tables to exclude.
progress (bool) – if True, display a progress bar.
lh5.io.core module¶
- lh5.io.core.read(name, lh5_file, start_row=0, n_rows=9223372036854775807, idx=None, use_h5idx=False, field_mask=None, obj_buf=None, obj_buf_start=0, decompress=True, locking=False)¶
Read LH5 object data from a file.
- Parameters:
name (str) – Name of the LH5 object to be read (including its group path).
lh5_file (str | Path | File | Sequence[str | Path | File]) – The file(s) containing the object to be read out. If a list of files, array-like object data will be concatenated into the output object.
start_row (int) – Starting entry for the object read (for array-like objects). For a list of files, only applies to the first file.
n_rows (int) – The maximum number of rows to read (for array-like objects). The actual number of rows read will be returned as one of the return values (see below).
idx (ArrayLike | Sequence[ArrayLike]) – For NumPy-style “fancy indexing” for the read to select only some rows, e.g. after applying some cuts to particular columns. Only selection along the first axis is supported, so tuple arguments must be one-tuples. A 2D array of shape (N, 2) can be used to provide a list of ranges. To use with a list of files, pass in a list/tuple of idx’s (one for each file) or use a long contiguous list (e.g. built from a previous identical read). If used in conjunction with start_row and n_rows, idx will be sliced to obey those constraints, where n_rows is interpreted as the (max) number of selected values (in idx) to be read out.
Note
If a list of arrays is provided with the same length as the number of files, it will be split among the files even if it was intended as a list of ranges. In that case, explicitly provide the list as a numpy array of shape (n, 2). If a list of arrays has a length different from the number of files, an attempt is made to interpret it as a shape (n, 2) array.
use_h5idx (bool) – deprecated and has no effect.
field_mask (Mapping[str, bool] | Sequence[str] | None) – For tables and structs, determines which fields get read out. Nested struct elements can be accessed by using / as a separator (e.g. refer to field a inside the table table which is stored inside the struct struct as struct/table/a). If a dict is used, a default dict will be made with the default set to the opposite of the first element in the dict. This way if one specifies a few fields as False, all but those fields will be read out, while if one specifies just a few fields as True, only those fields will be read out. If a list is provided, the listed fields will be set to True, while the rest will default to False.
obj_buf (LGDO) – Read directly into memory provided in obj_buf. Note: the buffer will be resized to accommodate the data retrieved.
obj_buf_start (int) – Start location in obj_buf for read. For concatenating data to array-like objects.
decompress (bool) – Decompress data encoded with LGDO’s compression routines right after reading. The option has no effect on data encoded with HDF5 built-in filters, which is always decompressed upstream by HDF5.
locking (bool) – Lock the HDF5 file while reading.
- Returns:
object – the read-out object
- Return type:
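The field_mask rules described for read() can be illustrated with a small stand-alone sketch. This is not the package's implementation: normalize_field_mask is a hypothetical helper name, it uses only the standard library, and treating a missing mask as "read everything" is an assumption.

```python
from collections import defaultdict
from collections.abc import Mapping


def normalize_field_mask(field_mask):
    """Hypothetical helper mirroring the documented field_mask rules."""
    if field_mask is None:
        # Assumption: no mask means every field is read.
        return defaultdict(lambda: True)
    if isinstance(field_mask, Mapping):
        if not field_mask:
            return defaultdict(lambda: True)
        # The default is the opposite of the first value in the dict.
        default = not next(iter(field_mask.values()))
        mask = defaultdict(lambda: default)
        mask.update(field_mask)
        return mask
    # A sequence of names: listed fields are True, all others False.
    mask = defaultdict(lambda: False)
    mask.update({name: True for name in field_mask})
    return mask


keep_two = normalize_field_mask(["energy", "waveform/values"])
drop_one = normalize_field_mask({"timestamp": False})
```

With these masks, keep_two selects only the two listed fields, while drop_one selects every field except timestamp.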
- lh5.io.core.read_as(name, lh5_file, library, **kwargs)¶
Read LH5 data from disk straight into a third-party data format view.
This function is nothing more than a shortcut chained call to read() and to LGDO.view_as().
- Parameters:
- Return type:
See also
read, LGDO.view_as
- lh5.io.core.write(obj, name, lh5_file, group='/', start_row=0, n_rows=None, wo_mode='append', write_start=0, page_buffer=0, **h5py_kwargs)¶
Write an LGDO into an LH5 file.
If the obj LGDO has a compression attribute, its value is interpreted as the algorithm to be used to compress obj before writing to disk. The type of compression can be:
- string, kwargs dictionary, hdf5plugin filter
interpreted as the name of a built-in or custom HDF5 compression filter ("gzip", "lzf", hdf5plugin filter object etc.) and passed directly to h5py.Group.create_dataset().
- WaveformCodec object
If obj is a WaveformTable and obj.values holds the attribute, compress values using this algorithm. More documentation about the supported waveform compression algorithms at lgdo.compression.
If the obj LGDO has an hdf5_settings attribute holding a dictionary, it is interpreted as a set of keyword arguments to be forwarded directly to h5py.Group.create_dataset() (exactly like the first format of compression above). This is the preferred way to specify HDF5 dataset options such as chunking etc. If compression options are specified, they take precedence over those set with the compression attribute.
Note
The compression LGDO attribute takes precedence over the default HDF5 compression settings. The hdf5_settings attribute takes precedence over compression. These attributes are not written to disk.
Note
HDF5 compression is skipped for the encoded_data.flattened_data dataset of VectorOfEncodedVectors and ArrayOfEncodedEqualSizedArrays.
- Parameters:
obj (LGDO) – LH5 object. If object is array-like, writes n_rows starting from start_row in obj.
name (str) – name of the object in the output HDF5 file.
lh5_file (str | Path | File) – HDF5 file name or h5py.File object.
group (str | Group) – HDF5 group name or h5py.Group object in which obj should be written.
start_row (int) – first row in obj to be written.
n_rows (int | None) – number of rows in obj to be written.
wo_mode (str) –
- write_safe or w: only proceed with writing if the object does not already exist in the file.
- append or a: append along axis 0 (the first dimension) of array-like objects and array-like subfields of structs. Scalar objects get overwritten.
- overwrite or o: replace data in the file if present, starting from write_start. Note: overwriting with write_start = end of array is the same as append.
- overwrite_file or of: delete file if present prior to writing to it. write_start should be 0 (it’s ignored).
- append_column or ac: append fields/columns from a Struct obj (and derived types such as Table) only if there is an existing Struct in the lh5_file with the same name. If there are matching fields, it errors out. If appending to a Table and the size of the new column is different from the size of the existing table, it errors out.
write_start (int) – row in the output file (if already existing) to start overwriting from.
page_buffer (int) – enable paged aggregation with a buffer of this size in bytes. Only used when creating a new file. Useful when writing a file with a large number of small datasets. This is a short-hand for (fs_strategy="page", fs_page_size=page_buffer).
**h5py_kwargs – additional keyword arguments forwarded to h5py.Group.create_dataset() to specify, for example, an HDF5 compression filter to be applied before writing non-scalar datasets. Note: compression is ignored if compression is specified as an obj attribute.
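The precedence rules stated for write() (package defaults, then the compression attribute, then the hdf5_settings attribute) can be sketched as a plain dictionary merge. This is a hedged illustration, not the package's code: resolve_dataset_kwargs is a hypothetical helper, the local DEFAULT_HDF5_SETTINGS only copies the documented default values, WaveformCodec objects are not handled, and placing the caller's h5py_kwargs just above the defaults is an assumption.

```python
# Stand-in for lh5.io.settings.DEFAULT_HDF5_SETTINGS (values taken from the docs).
DEFAULT_HDF5_SETTINGS = {"compression": "gzip", "shuffle": True}


def resolve_dataset_kwargs(attrs, **h5py_kwargs):
    """Hypothetical helper: merge settings from lowest to highest precedence."""
    resolved = dict(DEFAULT_HDF5_SETTINGS)
    # Assumption: caller keyword arguments override the package defaults.
    resolved.update(h5py_kwargs)
    # The compression attribute overrides defaults and keyword arguments.
    compression = attrs.get("compression")
    if isinstance(compression, str):
        resolved["compression"] = compression
    elif isinstance(compression, dict):
        resolved.update(compression)
    # The hdf5_settings attribute has the final say.
    resolved.update(attrs.get("hdf5_settings", {}))
    return resolved


plain = resolve_dataset_kwargs({})
with_attr = resolve_dataset_kwargs({"compression": "lzf"}, compression="gzip")
```

Here plain keeps the defaults, while with_attr ends up with LZF compression because the attribute wins over the keyword argument.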
lh5.io.datatype module¶
- lh5.io.datatype._lgdo_datatype_map: dict[str, LGDO] = {<class 'lgdo.types.array.Array'>: '^array<\\d+>\\{.+\\}$', <class 'lgdo.types.arrayofdetectorids.ArrayOfDetectorIDs'>: '^array<\\d+>\\{detectorid\\}$', <class 'lgdo.types.arrayofequalsizedarrays.ArrayOfEqualSizedArrays'>: '^array_of_equalsized_arrays<1,1>\\{.+\\}$', <class 'lgdo.types.encoded.ArrayOfEncodedEqualSizedArrays'>: '^array_of_encoded_equalsized_arrays<1,1>\\{.+\\}$', <class 'lgdo.types.encoded.VectorOfEncodedVectors'>: '^array<1>\\{encoded_array<1>\\{.+\\}\\}$', <class 'lgdo.types.fixedsizearray.FixedSizeArray'>: '^fixedsize_array<\\d+>\\{.+\\}$', <class 'lgdo.types.histogram.Histogram'>: '^struct\\{(?:binning,weights,isdensity|binning,isdensity,weights|weights,binning,isdensity|weights,isdensity,binning|isdensity,binning,weights|isdensity,weights,binning)\\}$', <class 'lgdo.types.scalar.Scalar'>: '^real$|^bool$|^complex$|^bool$|^string$', <class 'lgdo.types.struct.Struct'>: '^struct\\{.*\\}$', <class 'lgdo.types.table.Table'>: '^table\\{.*\\}$', <class 'lgdo.types.vectorofvectors.VectorOfVectors'>: '^array<1>\\{array<1>\\{.+\\}\\}$'}¶
Mapping between LGDO types and regular expressions defining the corresponding datatype string.
- lh5.io.datatype.datatype(expr)¶
Return the LGDO type corresponding to a datatype string.
- Return type:
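As an illustration of how a datatype string can be resolved against the regular expressions listed for _lgdo_datatype_map, the sketch below matches a string against a subset of those patterns. It is an assumption-laden stand-in: class names are plain strings, datatype_name is a hypothetical helper, and the order in which patterns are tried (most specific first) is chosen here because the Array pattern also matches a VectorOfVectors string.

```python
import re

# Subset of the documented patterns; names are strings instead of LGDO classes.
# Assumption: specific patterns are tried before the generic Array pattern.
DATATYPE_PATTERNS = [
    ("Scalar", r"^real$|^bool$|^complex$|^string$"),
    ("VectorOfEncodedVectors", r"^array<1>\{encoded_array<1>\{.+\}\}$"),
    ("VectorOfVectors", r"^array<1>\{array<1>\{.+\}\}$"),
    ("ArrayOfEqualSizedArrays", r"^array_of_equalsized_arrays<1,1>\{.+\}$"),
    ("Table", r"^table\{.*\}$"),
    ("Struct", r"^struct\{.*\}$"),
    ("Array", r"^array<\d+>\{.+\}$"),
]


def datatype_name(expr):
    """Hypothetical helper: return the first type name whose pattern matches."""
    expr = expr.strip()
    for name, pattern in DATATYPE_PATTERNS:
        if re.search(pattern, expr):
            return name
    raise ValueError(f"unknown datatype string: {expr!r}")


flat = datatype_name("array<1>{real}")
jagged = datatype_name("array<1>{array<1>{real}}")
```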
- lh5.io.datatype.get_nested_datatype_string(expr)¶
Matches the content of the outermost curly brackets.
- Return type:
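Matching the content of the outermost curly brackets can be done with a single greedy regular expression. The sketch below is a stand-alone illustration under assumptions: nested_datatype_string is a hypothetical name, and returning None when there are no brackets is a choice made here, not documented behaviour.

```python
import re


def nested_datatype_string(expr):
    """Hypothetical helper: content of the outermost {...}, or None."""
    # The greedy ".*" runs to the last "}", so inner bracket pairs are kept.
    match = re.search(r"\{(.*)\}", expr)
    return match.group(1) if match else None


inner = nested_datatype_string("array<1>{array<1>{real}}")
fields = nested_datatype_string("table{t0,dt,values}")
```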
lh5.io.exceptions module¶
lh5.io.iterator module¶
- class lh5.io.iterator.LH5Iterator(lh5_files, groups, *, base_path='', entry_list=None, entry_mask=None, i_start=0, n_entries=None, field_mask=None, group_data=None, buffer_len='100*MB', file_cache=10, ds_map=None, friend=None, friend_prefix='', friend_suffix='', safe_mode=True, h5py_open_mode='r')¶
Bases: Iterator
Iterate over chunks of entries from LH5 files.
The iterator reads buffer_len entries at a time from one or more files. The LGDO instance returned at each iteration is reused to avoid reallocations, so copy the data if it should be preserved.
Examples
Iterate through a table one chunk at a time and call process on each chunk:
    from lh5 import LH5Iterator

    for table in LH5Iterator("data.lh5", "geds/raw/energy", buffer_len=100):
        process(table)
LH5Iterator can also be used for random access:
    it = LH5Iterator(files, groups)
    table = it.read(i_entry)
In case of multiple files or an entry selection, i_entry refers to the global event index across all files.
When instantiating an iterator you must provide a list of files and the HDF5 groups to read. Optional parameters allow field masking, event selection and pairing the iterator with a “friend” iterator that is read in parallel. Several properties are available to obtain the provenance of the data currently loaded:
- current_i_entry – index within the entry list of the first entry in the buffer
- current_local_entries – entry numbers relative to the file the data came from
- current_global_entries – entry number relative to the full dataset
- current_files – file name corresponding to each entry in the buffer
- current_groups – group name corresponding to each entry in the buffer
Constructor for LH5Iterator. Must provide a file or collection of files, and an lh5 group or collection of groups to read data from.
Collections of files and groups can be nested. At the top level, we expect the same number of entries (one set of files to one set of groups). For each corresponding pair of sets, we will loop over each pairing of a file and group, with an inner loop over the groups to minimize the opening of files. Wildcards used for files will be expanded and applied in the inner loop (i.e. each file in a wildcard will read the same groups). If groups is an un-nested collection of strings, use all groups for all files.
Examples
Read “ch1/table” and “ch2/table” from “file.lh5”:
LH5Iterator("/path/to/file.lh5", ["ch1/table", "ch2/table"])
Read “ch1” from all lh5 files in “/path1”, then read “ch1” and “ch2” from “/path2/file.lh5”, and then read “ch1/2/3” from both “file1.lh5” and “file2.lh5”:
LH5Iterator( ["/path1/*.lh5", "/path2/file.lh5", ["/path3/file1.lh5", "/path3/file2.lh5"]], ["ch1/table", ["ch1/table", "ch2/table"], ["ch1/table", "ch2/table", "ch3/table"]] )
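The pairing of nested file and group collections described above can be sketched in plain Python. This is only an illustration of the stated rules: pair_files_and_groups is a hypothetical helper, wildcards and environment variables are not expanded, and no files are opened.

```python
def pair_files_and_groups(lh5_files, groups):
    """Hypothetical helper: expand nested files/groups into (file, group) pairs."""
    if isinstance(lh5_files, str):
        lh5_files = [lh5_files]
    if isinstance(groups, str):
        groups = [groups]
    # Un-nested collection of strings: use all groups for all files.
    if all(isinstance(g, str) for g in groups):
        groups = [list(groups)] * len(lh5_files)
    if len(lh5_files) != len(groups):
        raise ValueError("expected one set of groups per set of files")

    pairs = []
    for file_set, group_set in zip(lh5_files, groups):
        file_set = [file_set] if isinstance(file_set, str) else file_set
        group_set = [group_set] if isinstance(group_set, str) else group_set
        for fname in file_set:       # outer loop over files
            for group in group_set:  # inner loop over groups
                pairs.append((fname, group))
    return pairs


simple = pair_files_and_groups("/path/to/file.lh5", ["ch1/table", "ch2/table"])
nested = pair_files_and_groups(
    ["a.lh5", ["b.lh5", "c.lh5"]],
    ["ch1/table", ["ch1/table", "ch2/table"]],
)
```

Keeping the groups in the inner loop means each file is visited once for all of its groups, which matches the stated goal of minimizing file opens.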
- Parameters:
lh5_files (str | Collection[str | Collection[str]]) – file(s) to read from (see above). May include wildcards and environment variables.
groups (str | Collection[str | Collection[str]]) – HDF5 group(s) to read (see above).
base_path (str) – directory path prepended to all file names.
entry_list (Collection[int] | Collection[Collection[int]]) – list of entry numbers to read. If a nested list is provided, expect one top-level list for each file, containing a list of local entries. If a list of ints is provided, use global entries.
entry_mask (Collection[bool] | Collection[Collection[bool]]) – mask of entries to read. If a list of arrays is provided, expect one for each file. Ignore if a selection list is provided.
i_start (int) – index of first entry to start at when iterating
n_entries (int) – number of entries to read before terminating iteration
field_mask (Mapping[str, bool] | Collection[str]) – mask of which fields to read. See LH5Store.read() for more details.
group_data (Mapping[Collection] | ak.Array) – mapping of values corresponding to each provided lh5 group. Values will be duplicated for each entry in each dataset, corresponding to the correct group, and added to the output table. This should have same structure as groups.
buffer_len (int) – number of entries in tables yielded by iterator. Can be provided as a value with a unit of memory; in this case, use the estimated number of rows that will yield tables that require the provided memory. Defaults to "100*MB".
file_cache (int) – maximum number of files to keep open at a time
ds_map (NDArray[int]) – cumulative entries in datasets corresponding to file/group pairs. This can be provided on construction to speed up random or sparse access; otherwise, we sequentially read the size of each group. WARNING: no checks for accuracy are performed so only use this if you know what you are doing!
friend (Collection[LH5Iterator]) – a “friend” LH5Iterator that will be joined to this one, and read in parallel. The friend should have the same length and entry list. Each iteration will return a single LH5 Table containing columns from both iterators. The buffer_len will be set to the minimum of the two.
friend_prefix (str) – prefix for fields in friend iterator for resolving naming conflicts
friend_suffix (str) – suffix for fields in friend iterator for resolving naming conflicts
safe_mode (bool) – if True and a friend iterator has a different number of files, groups, or elements in a dataset, raise an Exception.
h5py_open_mode (str) – file open mode used when acquiring file handles. r (default) opens files read-only while a allows opening files for write-appending as well.
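To show why a cumulative ds_map speeds up random access, the sketch below maps a global entry number to a dataset index and a local entry with a binary search. It is a hedged illustration: locate_entry is a hypothetical helper, and it assumes that ds_map[i] holds the total number of entries up to and including dataset i.

```python
from bisect import bisect_right


def locate_entry(ds_map, global_entry):
    """Hypothetical helper: map a global entry to (dataset index, local entry)."""
    if not 0 <= global_entry < ds_map[-1]:
        raise IndexError(f"entry {global_entry} is out of range")
    # First dataset whose cumulative count exceeds the global entry.
    i_ds = bisect_right(ds_map, global_entry)
    first = ds_map[i_ds - 1] if i_ds > 0 else 0
    return i_ds, global_entry - first


# Three datasets holding 10, 15 and 15 entries.
cumulative = [10, 25, 40]
start_of_second = locate_entry(cumulative, 10)
last_entry = locate_entry(cumulative, 39)
```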
- _generate_workers(n_workers)¶
Create n_workers copies of this iterator, dividing the datasets (file/group pairs) between them. These are intended for parallel use.
- _select_datasets(i_beg, i_end)¶
Reduce list of files and groups; used by _generate_workers
- add_friend(friend, prefix='', suffix='')¶
Add a friend which will be iterated alongside this, returning a Table joining the contents of each.
- Parameters:
friend (LH5Iterator) – LH5Iterator to be friended to this one
prefix (str) – string prepended to field names; useful for disambiguating conflicts
suffix (str) – string appended to field names; useful for disambiguating conflicts
- property buffer_len¶
- property current_files: ndarray[tuple[Any, ...], dtype[str]]¶
Return list of file names for entries in buffer
- property current_global_entries: ndarray[tuple[Any, ...], dtype[int]]¶
Return list of global file entries in buffer
- property current_groups: ndarray[tuple[Any, ...], dtype[str]]¶
Return list of group names for entries in buffer
- property current_local_entries: ndarray[tuple[Any, ...], dtype[int]]¶
Return list of local dataset entries in buffer
- get_group_data(i_ds)¶
Get group data for dataset i_ds
- Return type:
Record | None
- hist(ax, where=None, keys=None, processes=None, executor=None, progress=True, **hist_kwargs)¶
Fill a histogram from data produced by a query selecting on where. If where is None, fill with all data fetched by iterator.
Examples
Build a 1D histogram of values in col3 with a string selection:
    h = lh5_it.hist(
        hist.axis.Regular(100, 0, 500, label="col3"),
        where="(col1 == 0) & (col2 > 100)",
        keys="col3",
    )
Build a 2D histogram with a value axis and string-category axis after applying some processing:
    def get_val(lh5_tab, lh5_it):
        ...  # process data
        return value, category

    h = lh5_it.hist(
        [
            hist.axis.Regular(100, 0, 500, label="Value"),
            hist.axis.StrCategory([], growth=True, label="Category"),
        ],
        where=get_val,
    )
- Parameters:
ax (Hist | axis | Collection[axis]) – hist.axis object(s) used to construct the histogram. Can provide a hist.Hist which will be filled as well.
where (Callable | str) – A filter function for selecting data entries to put into the histogram. Can be:
- A function that returns reduced data, with signature fun(lh5_obj: Table, it: LH5Iterator). Can return:
  - numpy.ndarray: if 1D, list of values; if 2D, list of lists of values in same order as axes
  - Collection[ArrayLike]: return list of values in same order as axes
  - Mapping[str, ArrayLike]: mapping from axis name to values
  - pandas.DataFrame: treat as mapping from column name to values
- A string expression. This will call eval, with the table columns provided as local variables formatted as awkward.Array(), and access to awkward (or ak) and numpy (or np).
keys (Collection[str] | str) – list of field keys corresponding to axes. Use if where returns a mapping with names different from axis names.
processes (Executor | int) – number of processes. If None, use number equal to threads available to executor (if provided), or else do not parallelize
executor (Executor) – concurrent.futures.Executor object for managing parallelism. If None, create a concurrent.futures.ProcessPoolExecutor with number of processes equal to processes.
progress (progress.Progress | console.Console | bool) – if True draw progress bar; can also provide an existing rich Progress or Console object
hist_kwargs – additional keyword arguments for constructing hist.Hist.
- Return type:
Hist
- map(fun, aggregate=None, init=None, begin=None, terminate=None, processes=None, executor=None, progress_queue=None, job_id=0)¶
Map function over iterator blocks.
Returns order-preserving list of outputs. Can be multi-threaded provided there are no attempts to modify existing objects. Multi-threading splits the iterator into multiple independent streams with an approximately equal number of files/groups, concurrently processed under a single program multiple data model. Results will be returned asynchronously for each process.
Example
Process a table and sum the products at the end:
    def process(lh5_tab, lh5_it):
        ...  # process the table
        return result_of_processing

    results = lh5_it.map(process, processes=4)  # results are an iterator over lists
    result = sum(val for result in results for val in result)
Process a table as above, using aggregate to sum the results:
    def process(lh5_tab, lh5_it):
        ...  # process the table
        return result_of_processing

    result = lh5_it.map(process, aggregate=np.add, init=0, processes=4)
Process a table using a more arbitrary output:
    class Result:
        def __init__(self):
            ...  # initialize

        @staticmethod
        def process_table(tab, it):
            ...  # process the table

        def aggregate(self, result):
            ...  # add data from processing into object

    result = lh5_it.map(
        Result.process_table, aggregate=Result.aggregate, init=Result(), processes=4
    )
- Parameters:
fun (Callable[Table, LH5Iterator, Any]) – function with signature fun(lh5_obj: Table, it: LH5Iterator) -> Any. Outputs of the function will be collected in a list and returned.
aggregate (Callable) – function used to iteratively combine outputs of fun for each block of data. Should have two inputs; first input should be the type of the aggregate, and second of the type returned by fun. This function can either return the result, or perform the aggregation in-place on the first element and return None. If using multi-processing, map will return an async-iterator over the aggregated results from each process. If None, do not aggregate; the return value will instead iterate over the result for each block.
init (Any) – initial value used for aggregation. If using an aggregating function and init is None, perform a deep copy of the first element
begin (Callable[LH5Iterator]) – function with signature fun(it: LH5Iterator) that is run before we loop through a chunk of the iterator
terminate (Callable[LH5Iterator]) – function with the signature fun(it: LH5Iterator) that is run after we finish looping through a chunk of the iterator
processes (int) – number of processes. If None, use number equal to threads available to executor (if provided), or else do not parallelize
executor (Executor) – concurrent.futures.Executor object for managing parallelism. If None, create a concurrent.futures.ProcessPoolExecutor with number of processes equal to processes.
progress_queue (Queue) – multiprocessing.Queue object to which progress information will be communicated back to main process. Returns a mapping with keys:
- task_id: the job_id passed to this function
- total: total number of datasets to be processed
- completed: number of datasets that have been processed
- entries: number of entries that have been processed
- status: “Initializing”, “Processing”, “Terminating” or “Finished”
job_id (int | Collection[int]) – index of first process (see task_id above; subsequent processes will increment by 1) or list of task_ids for each process
- Return type:
Iterator[Any]
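The aggregation contract described for map (an aggregator may return the combined value or update its first argument in place and return None; a missing init triggers a deep copy of the first output) can be sketched without the package. fold_blocks is a hypothetical helper operating on an ordinary iterable of per-block outputs, not the library's implementation.

```python
import copy


def fold_blocks(outputs, aggregate=None, init=None):
    """Hypothetical helper: combine per-block outputs as described for map."""
    if aggregate is None:
        return list(outputs)  # no aggregation: one result per block
    outputs = iter(outputs)
    if init is None:
        try:
            # Deep copy so the first block's output is not modified in place.
            total = copy.deepcopy(next(outputs))
        except StopIteration:
            return None
    else:
        total = init
    for out in outputs:
        result = aggregate(total, out)
        # In-place aggregators return None and mutate `total` instead.
        if result is not None:
            total = result
    return total


summed = fold_blocks([1, 2, 3], aggregate=lambda acc, val: acc + val, init=0)
merged = fold_blocks([[1], [2, 3]], aggregate=list.extend)
```

The first call uses a returning aggregator, the second an in-place one; both styles go through the same loop.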
- query(where, *, fields=None, processes=None, executor=None, library=None, progress=True)¶
Query the data files in the iterator
Returns the selected data as a single table in one of several formats.
Examples
Query data using a string selection:
tab = lh5_it.query("(col1 == 0) & (col2 > 100)")
Query data using a function:
def select(lh5_tab, lh5_it): ...process data and produce a new table return result tab = lh5_it.query(select)
- Parameters:
where (Callable | str) – A filter function for selecting data entries. Can be:
- A function that returns reduced data, with signature fun(lh5_obj: Table, it: LH5Iterator). Can return:
  - numpy.ndarray: if 1D, list of values; if 2D, list of lists of values in same order as axes
  - Collection[ArrayLike]: return list of values in same order as axes
  - Mapping[str, ArrayLike]: mapping from axis name to values
  - pandas.DataFrame: pandas dataframe. Treat as mapping from column name to values
- A string expression. This will call eval, with the table columns provided as local variables formatted as awkward.Array(), and access to awkward (or ak) and numpy (or np). Return a table formatted according to library
fields (Collection[str] | Mapping[str, str | None]) – list of fields to return. If None, return all fields in field_mask. If a mapping is provided, key corresponds to name of field in this iterator, and value is an alias to name in returned table; if alias is None, do not rename.
processes (Executor | int) – number of processes. If None, use number equal to threads available to executor (if provided), or else do not parallelize
executor (Executor) – concurrent.futures.Executor object for managing parallelism. If None, create a concurrent.futures.ProcessPoolExecutor with number of processes equal to processes.
library (str) – library to convert the columns to when using a string expression for where. See Table.eval().
progress (Progress | Console | bool) – if True draw progress bar; can also provide an existing rich Progress or Console object
- read(i_entry, n_entries=None)¶
Read a chunk of events starting at global entry i_entry.
- Return type:
Table
- reset_field_mask(mask, warn_missing=True)¶
Replaces the field mask of this iterator and any friends with mask.
- If None, set this and all friends to have no mask.
- If a collection of strings or mapping from strings to bools, set the mask for this and all friends; in the case of a conflict, use first column found. If a prefix or suffix is included for the friend, it must be included in this mask.
- If a collection of collections, use the first item to set this mask, and subsequent items to set friend masks. In this case, do not include prefixes or suffixes in names.
- set_group_data(group_data)¶
Set group data which will be joined to each table based on the current file/group. Should have same structure of groups, with one entry per file and either one subentry per group or one entry that will be broadcast across all groups.
Note: unlike in the constructor, the group_data will not be broadcast over all files!
- class lh5.io.iterator.MapProgress(tasks, prog=None, update_period=0.1)¶
Bases: Thread
Helper for tracking progress of threads in map
Basic Usage:
    with MapProgress(task_list) as prog:
        iter.map(..., progress_queue=prog.queue)
- Parameters:
tasks (list | int) – list of descriptions to prepend to progress bars. Can also provide the number of tasks, in which case description will be set to "#i".
prog (progress.Progress | console.Console) – rich Progress or Console object to add bars to. Use to customize the bar.
update_period (float) – period in seconds between progress bar updates.
- run()¶
Method representing the thread’s activity.
You may override this method in a subclass. The standard run() method invokes the callable object passed to the object’s constructor as the target argument, if any, with sequential and keyword arguments taken from the args and kwargs arguments, respectively.
- lh5.io.iterator._append_copy(list, val)¶
Helper for aggregating tables in query
- lh5.io.iterator._identity(val, _)¶
- lh5.io.iterator._map_helper(fun, aggregator, init, begin, terminate, it, i_job, *, progress_queue=None)¶
Helper for executing init, begin and terminate functions when calling map
lh5.io.settings module¶
- lh5.io.settings.DEFAULT_HDF5_SETTINGS: dict[str, ...] = {'compression': 'gzip', 'shuffle': True}¶
Global dictionary storing the default HDF5 settings for writing data to disk.
Modify this global variable before writing data to disk with this package.
Examples
>>> import lh5
>>> lh5.DEFAULT_HDF5_SETTINGS["compression"] = "lzf"
>>> lh5.write(data, "data", "file.lh5")  # compressed with LZF
- lh5.io.settings.default_hdf5_settings()¶
Returns the HDF5 settings for writing data to disk reset to the package defaults.
Examples
>>> import lh5
>>> lh5.DEFAULT_HDF5_SETTINGS["compression"] = "lzf"
>>> lh5.write(data, "data", "file.lh5")  # compressed with LZF
>>> lh5.DEFAULT_HDF5_SETTINGS = lh5.default_hdf5_settings()
>>> lh5.write(data, "data", "file.lh5", wo_mode="of")  # compressed with default settings (GZIP)
lh5.io.store module¶
This module implements routines for reading and writing LEGEND Data Objects in HDF5 files.
- class lh5.io.store.LH5Store(base_path='', keep_open=False, locking=False, default_mode='r')¶
Bases: object
Class to represent a store of LEGEND HDF5 files. The two main methods implemented by the class are read() and write().
Examples
>>> from lh5 import LH5Store
>>> store = LH5Store()
>>> obj = store.read("/geds/waveform", "file.lh5")
>>> type(obj)
lgdo.waveformtable.WaveformTable
- Parameters:
base_path (str | Path) – directory path to prepend to LH5 files.
keep_open (bool) – whether to keep files open by storing the h5py objects as class attributes. If keep_open is an int, keep only the n most recently opened files; if True, no limit
locking (bool) – whether to lock files when reading
default_mode (str) – default mode in which to open files with this LH5Store. See h5py.File documentation. If default_mode is "r", use "a" when calling LH5Store.write.
- get_buffer(name, lh5_file, size=None, field_mask=None)¶
Returns an LH5 object appropriate for use as a pre-allocated buffer in a read loop. Sets size to size if object has a size.
- Return type:
LGDO
- gimme_file(lh5_file, mode=None, page_buffer=0, **file_kwargs)¶
Returns an h5py file object from the store or creates a new one.
- Parameters:
mode (str) – mode in which to open file. See h5py.File documentation. If None, use default provided at construction
page_buffer (int) – enable paged aggregation with a buffer of this size in bytes. Only used when creating a new file. Useful when writing a file with a large number of small datasets. This is a short-hand for (fs_strategy="page", fs_page_size=page_buffer).
file_kwargs – Keyword arguments for h5py.File
- Return type:
File
- gimme_group(group, base_group, grp_attrs=None, overwrite=False)¶
Returns an existing h5py group from a base group or creates a new one.
See also
- Return type:
Group
- read(name, lh5_file, start_row=0, n_rows=9223372036854775807, idx=None, use_h5idx=False, field_mask=None, obj_buf=None, obj_buf_start=0, decompress=True, **file_kwargs)¶
Read LH5 object data from a file in the store.
See also
- read_n_rows(name, lh5_file)¶
Look up the number of rows in an Array-like object called name in lh5_file.
Return None if it is a Scalar or a Struct.
- Return type:
int | None
- read_size_in_bytes(name, lh5_file)¶
Look up the size (in bytes) of the object in memory. Will recursively crawl through all objects in a Struct or Table.
- Return type:
- write(obj, name, lh5_file, group='/', start_row=0, n_rows=None, wo_mode=None, write_start=0, page_buffer=0, **h5py_kwargs)¶
Write an LGDO into an LH5 file.
See also
lh5.io.tools module¶
- lh5.io.tools.ls(lh5_file, lh5_group='', recursive=False)¶
Return a list of LH5 groups in the input file and group, similar to ls or h5ls. Supports wildcards in group names.
- lh5.io.tools.show(lh5_file, lh5_group='/', attrs=False, indent='', header=True, depth=None, detail=False)¶
Print a tree of LH5 file contents with LGDO datatype.
- Parameters:
lh5_group (str) – print only contents of this HDF5 group.
attrs (bool) – print the HDF5 attributes too.
indent (str) – indent the diagram with this string.
header (bool) – print lh5_group at the top of the diagram.
depth (int | None) – maximum tree depth of groups to print
detail (bool) – whether to print additional information about how the data is stored
Examples
>>> from lgdo import show
>>> show("file.lh5", "/geds/raw")
/geds/raw
├── channel · array<1>{real}
├── energy · array<1>{real}
├── timestamp · array<1>{real}
├── waveform · table{t0,dt,values}
│   ├── dt · array<1>{real}
│   ├── t0 · array<1>{real}
│   └── values · array_of_equalsized_arrays<1,1>{real}
└── wf_std · array<1>{real}
lh5.io.truncate module¶
Truncate lh5 files. Useful for generating test data.
- class lh5.io.truncate.EvtBasedTruncator(slice_)¶
Bases: LGDOTruncator
- _is_protocol = False¶
- class lh5.io.truncate.HitBasedTruncator(table_key_trunc, row_in_table_trunc)¶
Bases: LGDOTruncator
- _is_protocol = False¶
- _row_indices(lgdo)¶
- Return type:
Array
- class lh5.io.truncate.LGDOTruncator(*args, **kwargs)¶
Bases: LGDOMappable, Protocol
Try to get info on how many rows to read before performing the actual read, to improve performance.
- _is_protocol = True¶
- lh5.io.truncate.create_evt_ordered_truncation_func(length_or_slice)¶
- Return type:
- lh5.io.truncate.create_hit_ordered_truncation_func(tcm_file, length_or_slice)¶
- Return type:
- lh5.io.truncate.map_lgdo_arrays(func, lgdo, name, *, include_list=None, exclude_list=None)¶
Map a function acting on awkward arrays contained in the LGDO tree onto the tree.
The tree structure itself is not altered (compare to map in functional languages), except if branches are excluded (explicitly or because they are not in the include_list passed). Attributes are propagated unchanged.
- Return type:
LGDO | None
- lh5.io.truncate.map_lgdo_arrays_on_file(infile, outfile, func, overwrite=False, *, include_list=None, exclude_list=None)¶
Run func on all VectorOfVectors and all arrays in the file.
The first argument passed to func is the name of the LGDO, the second is the awkward array contained within.
- lh5.io.truncate.truncate(infile, outfile, length_or_slice, overwrite=False, *, tcm_file=None, include_list=None, exclude_list=None, file_type=None)¶
Truncate an LH5 file and write the result to a new file.
This function produces a truncated copy of infile at outfile by applying length_or_slice to array-like LGDOs contained in the file. There are two supported truncation modes depending on the ordering of the input file:
- evt-ordered (file types: evt, tcm, any-evt): arrays are truncated by applying length_or_slice directly to each array (simple slicing).
- hit-ordered (file types: raw, dsp, hit, any-hit): rows belong to channels and the truncation must preserve only those rows that fall into the requested length_or_slice of the corresponding TCM mapping. For this mode a tcm_file must be provided; the function reads hardware_tcm_1/row_in_table and hardware_tcm_1/table_key from the TCM file to build the per-channel row selection.
- Parameters:
infile (str) – Path to the input LH5 file to truncate.
outfile (str) – Path to the resulting truncated LH5 file to create.
length_or_slice (int | slice) – Integer number of rows to keep (keeps first N rows) or a slice object to select rows. The semantics differ slightly between evt- and hit-ordered files (see above).
overwrite (bool) – If True, the output file will be overwritten when created; otherwise a safe write/append strategy is used.
tcm_file (str | None) – Path to a TCM LH5 file. Required for hit-ordered truncation to map channel keys to row indices.
include_list (list[str] | None) – Optional list of fnmatch patterns selecting LGDO paths to include.
exclude_list (list[str] | None) – Optional list of fnmatch patterns selecting LGDO paths to exclude.
file_type (str | None) – Optional override of the file type (auto-deduced from infile when not provided). Use values like evt, tcm, raw, dsp, hit, or the generic any-evt/any-hit.
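The evt-ordered mode and the fnmatch-based selection can be illustrated on plain Python lists. This is a sketch under assumptions rather than the package's code: the three helper names are hypothetical, the data is a {path: list} mapping instead of an LH5 file, and the way include and exclude patterns combine (a path must match some include pattern and no exclude pattern) is assumed.

```python
from fnmatch import fnmatchcase


def as_slice(length_or_slice):
    """An integer N keeps the first N rows; a slice is used unchanged."""
    if isinstance(length_or_slice, slice):
        return length_or_slice
    return slice(0, length_or_slice)


def is_selected(path, include_list=None, exclude_list=None):
    """Assumed semantics: match any include pattern and no exclude pattern."""
    if include_list is not None and not any(
        fnmatchcase(path, pattern) for pattern in include_list
    ):
        return False
    return not any(fnmatchcase(path, pattern) for pattern in exclude_list or [])


def truncate_evt_ordered(arrays, length_or_slice, include_list=None, exclude_list=None):
    """Hypothetical helper acting on a {path: list} mapping, not on LH5 files."""
    rows = as_slice(length_or_slice)
    return {
        path: values[rows]
        for path, values in arrays.items()
        if is_selected(path, include_list, exclude_list)
    }


data = {
    "evt/energy": [1, 2, 3, 4],
    "evt/timestamp": [10, 20, 30, 40],
    "tcm/row": [0, 1, 2, 3],
}
first_two = truncate_evt_ordered(
    data, 2, include_list=["evt/*"], exclude_list=["*/timestamp"]
)
middle = truncate_evt_ordered(data, slice(1, 3))
```

Hit-ordered files need the additional TCM lookup described above and are not covered by this sketch.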
- lh5.io.truncate.truncate_array_channel(input_array, table_key_trunc, row_in_table_trunc, channel_key)¶
lh5.io.utils module¶
Implements utilities for LEGEND Data Objects.
- lh5.io.utils.expand_path(path, substitute=None, list=False, base_path=None)¶
Expand (environment) variables and wildcards to return absolute paths.
- Parameters:
path (str | Path) – name of path, which may include environment variables and wildcards.
list (bool) – if True, return a list. If False, return a string; if False and a unique file is not found, raise an exception.
substitute (dict[str, str] | None) – use this dictionary to substitute variables. Environment variables take precedence.
base_path (str | Path | None) – name of base path. Returned paths will be relative to base.
- Returns:
path or list of paths – Unique absolute path, or list of all absolute paths
- Return type:
- lh5.io.utils.expand_vars(expr, substitute=None)¶
Expand (environment) variables.
Note
Malformed variable names and references to non-existing variables are left unchanged.
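The behaviour in the note (unknown or malformed references are left unchanged) resembles string.Template.safe_substitute from the standard library, which the sketch below uses. It is a stand-in under assumptions: expand_vars_sketch and its environ parameter are hypothetical, and letting environment variables override the substitute dictionary follows the precedence stated for expand_path. Unlike a real shell-style expansion, Template also collapses "$$" into "$".

```python
import os
from string import Template


def expand_vars_sketch(expr, substitute=None, environ=None):
    """Hypothetical stand-in for expand_vars built on string.Template."""
    environ = os.environ if environ is None else environ
    # Environment values are merged last so that they take precedence.
    values = {**(substitute or {}), **environ}
    # safe_substitute leaves unknown or malformed references untouched.
    return Template(expr).safe_substitute(values)


fake_env = {"LH5_DIR": "/data"}
expanded = expand_vars_sketch("$LH5_DIR/${RUN}.lh5", {"RUN": "r001"}, environ=fake_env)
untouched = expand_vars_sketch("$UNKNOWN_VAR/file.lh5", environ={})
```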
- lh5.io.utils.fmtbytes(num, suffix='B')¶
Returns formatted f-string for printing human-readable number of bytes.
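A human-readable byte formatter of this kind is usually a short loop over unit prefixes. The sketch below shows one common approach and is not taken from the package: format_bytes is a hypothetical name, and the base of 1024, the prefix list and the one-decimal format are all assumptions.

```python
def format_bytes(num, suffix="B"):
    """Hypothetical helper: divide by 1024 until the value drops below 1024."""
    for unit in ("", "k", "M", "G", "T", "P", "E", "Z"):
        if abs(num) < 1024:
            return f"{num:.1f} {unit}{suffix}"
        num /= 1024
    return f"{num:.1f} Y{suffix}"


small = format_bytes(512)
medium = format_bytes(1536)
large = format_bytes(100 * 1024**2)
```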
- lh5.io.utils.get_buffer(name, lh5_file, size=None, field_mask=None)¶
Returns an LGDO appropriate for use as a pre-allocated buffer.
Sets size to size if object has a size.
- Return type:
LGDO
- lh5.io.utils.get_h5_group(group, base_group, grp_attrs=None, overwrite=False)¶
Returns an existing h5py group from a base group or creates a new one. Can also set (or replace) group attributes.