A small Python 3 script I wrote for personal use when loading data stored across multiple HDF5 subfiles, as is the case in many of the cosmological simulations I have encountered. It contains a single class, Read,
which creates an object through which one can access information stored in one or more HDF5 files. Data can be loaded either sequentially or in parallel (via the multiprocessing module). Loading in parallel can drastically cut down loading times, particularly when the files are not already in the filesystem cache.
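For context, parallel loading of split HDF5 files boils down to mapping per-file reads over a pool of worker processes and concatenating the results. Below is a minimal sketch of that general technique (the function names are illustrative, not the package's actual internals):

import multiprocessing as mp

import h5py
import numpy as np

def _read_one(args):
    # Open one subfile and read the requested entry fully into memory.
    path, entry = args
    with h5py.File(path, 'r') as f:
        return f[entry][...]

def read_parallel(paths, entry):
    # Map per-file reads over a process pool, then stitch the pieces together.
    with mp.Pool() as pool:
        chunks = pool.map(_read_one, [(p, entry) for p in paths])
    return np.concatenate(chunks)

(On platforms that spawn rather than fork worker processes, the call to read_parallel should sit under an if __name__ == '__main__': guard.)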
This package requires the following modules: h5py (3.1.0), numpy (1.19.0), multiprocessing (part of the Python standard library) and tqdm (4.62.0). The code has only been tested with those versions.
Check the ./examples/ folder to access the test HDF5 files. An interactive Jupyter notebook version of this tutorial is available in the same location.
import hdf5Lib
path_to_file = './examples/single_file/file.hdf5'
file = hdf5Lib.Read(path_to_file)
Once the object has been created, we can check which data entries are in the file and inspect their attributes.
# Prints all entries accessible at the top of the tree.
file.print_entries()
# Prints entries accessible in dataset_a.
file.print_entries('dataset_a')
# Prints attributes of dataset_a.
file.print_attributes('dataset_a')
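These helpers presumably wrap h5py's native group and attribute interfaces; for reference, the same information can be listed with plain h5py like so:

import h5py

with h5py.File(path_to_file, 'r') as f:
    print(list(f.keys()))               # entries at the top of the tree
    print(list(f['dataset_a'].keys()))  # entries inside dataset_a
    print(dict(f['dataset_a'].attrs))   # attributes of dataset_a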
If we want to get the value of an attribute, we simply specify the dataset and the attribute we wish to retrieve.
pi = file.get_attribute('dataset_a', 'pi')
h = file.get_attribute('dataset_a/subdataset_1','hubbleParam')
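Again for reference, the equivalent calls in plain h5py would be:

import h5py

with h5py.File(path_to_file, 'r') as f:
    pi = f['dataset_a'].attrs['pi']
    h = f['dataset_a/subdataset_1'].attrs['hubbleParam']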
The code can handle cases where the data has been split across many different (sub)files. In such cases, the file paths can be specified in two ways:
# List with each entry being the path to each individual file
path_to_files = ['./examples/split_files/subfile_%.2d.hdf5' % i for i in range(50)]
file = hdf5Lib.Read(path_to_files)
# Alternatively, provide a string-formatted path and the number of files the
# data has been split across (internally, it does the above)
path_to_files = './examples/split_files/subfile_%.2d.hdf5'
number_files = 50
file = hdf5Lib.Read(path_to_files, number_files=number_files)
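Internally, the second form presumably amounts to expanding the format string into the explicit list, along the lines of:

# Illustrative: expand the formatted path into one path per subfile.
paths = [path_to_files % i for i in range(number_files)]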
Loading data can be done either serially or in parallel.
path_to_files = ['./examples/split_files/subfile_%.2d.hdf5' % i for i in range(50)]
# Reads sequentially using a single processor
file_serial_mode = hdf5Lib.Read(path_to_files, parallel=False)
data = file_serial_mode['dataset_a']
# Reads in parallel using as many processes as possible.
file_parallel_mode = hdf5Lib.Read(path_to_files, parallel=True)
data = file_parallel_mode['dataset_a']
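A quick way to gauge the speed-up on your own data is to time both modes (note that the first read of a file warms the filesystem cache, so repeat the measurement for a fair comparison):

import time

start = time.perf_counter()
data = file_serial_mode['dataset_a']
print('serial:   %.2f s' % (time.perf_counter() - start))

start = time.perf_counter()
data = file_parallel_mode['dataset_a']
print('parallel: %.2f s' % (time.perf_counter() - start))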