Use hdf5 references for arrays #118

Merged · 93 commits · Dec 19, 2024
cd342fd
updated plugin structure
aalbino2 Aug 19, 2024
df6219b
added pynxtools dependency
aalbino2 Aug 19, 2024
ce1e60b
Apply suggestions from code review
aalbino2 Aug 20, 2024
de7b48e
Add sections for RSM and 1D which uses HDF5 references
ka-sarthak Sep 5, 2024
47198b9
Abstract out data interaction using setter and getter; allows to use …
ka-sarthak Sep 5, 2024
44ea11f
Use arrays, not references, in the `archive.results` section
ka-sarthak Sep 5, 2024
6b66448
Lock the state for using nexus file and corresponding references
ka-sarthak Sep 5, 2024
8d736b6
Populate results without references
ka-sarthak Sep 5, 2024
cd36d15
Make a general reader for raw files
ka-sarthak Nov 20, 2024
1d612f1
Remove nexus flags
ka-sarthak Nov 20, 2024
f796a79
Add quantity for auxiliary file
ka-sarthak Nov 20, 2024
6c59dad
Fix rebase
ka-sarthak Dec 3, 2024
ad98a38
Make integration_time as hdf5reference
ka-sarthak Dec 4, 2024
1072dc0
Reset results (refactor)
ka-sarthak Dec 4, 2024
65c8659
Add backward compatibility
ka-sarthak Dec 4, 2024
d5445ff
Refactor reader
ka-sarthak Dec 4, 2024
6f16f44
add missing imports
ka-sarthak Dec 4, 2024
d170d30
AttrDict class
ka-sarthak Dec 4, 2024
4f7bc83
Make concept map global
ka-sarthak Dec 4, 2024
a3295b7
Add function to remove nexus annotations in concept map
ka-sarthak Dec 4, 2024
89cdbc9
Move try block inside walk_through_object
ka-sarthak Dec 4, 2024
54902f8
Fix imports
ka-sarthak Dec 4, 2024
393c572
Add methods for generating hdf5 file
ka-sarthak Dec 4, 2024
d2d1d1e
Rename auxiliary file
ka-sarthak Dec 4, 2024
30caf99
Expect aux file to be .nxs in the beginning
ka-sarthak Dec 4, 2024
2f42f70
Add attributes for hdf5: data_dict, dataset_paths
ka-sarthak Dec 4, 2024
2fef457
Method for adding a quantity to hdf5_data_dict
ka-sarthak Dec 4, 2024
893d819
Abstract out methods for creating files based on hdf5_data_dict
ka-sarthak Dec 4, 2024
921a347
Add dataset_paths for nexus
ka-sarthak Dec 4, 2024
9f0cedf
Some reverting back
ka-sarthak Dec 4, 2024
f4fa2bd
Minor fixes
ka-sarthak Dec 4, 2024
97a4083
Refactor populate_hdf5_data_dict: store a reference to be made later
ka-sarthak Dec 5, 2024
191cd9d
Handle shift from nxs to hdf5
ka-sarthak Dec 5, 2024
8f4c4f4
Set hdf5 references after aux file is created
ka-sarthak Dec 5, 2024
5f378d2
Cleaning
ka-sarthak Dec 5, 2024
fdd2222
Fixing
ka-sarthak Dec 5, 2024
f49e168
Redefine result sections instead of extending
ka-sarthak Dec 5, 2024
f6f351c
Remove plotly plots from ELN
ka-sarthak Dec 6, 2024
7cd535e
Read util for hdf5 ref
ka-sarthak Dec 6, 2024
9d67aa2
Fixing
ka-sarthak Dec 6, 2024
9ad1efe
Move hdf5 handling into a util class
ka-sarthak Dec 6, 2024
8b8b9f4
Refactor instance variables
ka-sarthak Dec 6, 2024
382231b
Reset data dicts and reference after each writing
ka-sarthak Dec 6, 2024
f867a9a
Fixing
ka-sarthak Dec 6, 2024
2b75a79
Overwrite dataset if it already exists
ka-sarthak Dec 6, 2024
97f4a3f
Refactor add_dataset
ka-sarthak Dec 6, 2024
10bf72e
Reorganize and docstrings
ka-sarthak Dec 6, 2024
b5b5666
Rename variable
ka-sarthak Dec 6, 2024
88a9dde
Add read_dataset method
ka-sarthak Dec 6, 2024
bbd0a86
Cleaning
ka-sarthak Dec 6, 2024
94bdde7
Adapting schema with hdf5 handler
ka-sarthak Dec 6, 2024
a859be2
Comments, minor refactoring
ka-sarthak Dec 6, 2024
fea1724
Fixing; add `hdf5_handler` as an attribute for archive
ka-sarthak Dec 6, 2024
df6b571
Reorganization
ka-sarthak Dec 6, 2024
ae96f5d
Fixing
ka-sarthak Dec 6, 2024
e6cd6b8
Refactoring
ka-sarthak Dec 6, 2024
7b7db2b
Cleaning
ka-sarthak Dec 6, 2024
11e4d38
Try block for using hdf5 handler: don't fail early, as later normaliza…
ka-sarthak Dec 7, 2024
de4f605
Extract units from dataset attrs when reading
ka-sarthak Dec 7, 2024
7f12438
Fixing
ka-sarthak Dec 7, 2024
d7f69f7
Linting
ka-sarthak Dec 7, 2024
be9dfa8
Make archive_path optional in add_dataset
ka-sarthak Dec 9, 2024
fb9f1c7
Rename class
ka-sarthak Dec 9, 2024
51915c3
attrs for add_dataset; use it for units
ka-sarthak Dec 9, 2024
090bc18
Add add_attribute method
ka-sarthak Dec 9, 2024
6b2a95e
Refactor add_attribute
ka-sarthak Dec 9, 2024
3207e8e
Add plot attributes: 1D
ka-sarthak Dec 9, 2024
dc3781c
Refactor hdf5 states
ka-sarthak Dec 10, 2024
24b4a9e
Add back plotly figures
ka-sarthak Dec 10, 2024
17a0088
rename auxiliary file name if changed by handler
ka-sarthak Dec 10, 2024
c3302c4
Add referenced plots
ka-sarthak Dec 10, 2024
e993deb
Allow hard link using internal reference
ka-sarthak Dec 10, 2024
d753f4a
Add sections for plots
ka-sarthak Dec 10, 2024
762fe86
Comment out validation
ka-sarthak Dec 10, 2024
18ed92c
Add archive paths for the plot subsections
ka-sarthak Dec 11, 2024
ff286c6
Add back validation with flag
ka-sarthak Dec 11, 2024
374b6bb
Use nexus flag
ka-sarthak Dec 11, 2024
9464f11
Add interpolated intensity data into h5 for qspace plots
ka-sarthak Dec 13, 2024
baac2da
Use prefix to reduce len of string
ka-sarthak Dec 13, 2024
d705957
Store regularized linespace of q vectors; revise descriptions
ka-sarthak Dec 13, 2024
47df0cc
Remove plotly plots
ka-sarthak Dec 13, 2024
fbc87d1
Bring plots to overview
ka-sarthak Dec 13, 2024
c3f2ff3
Fix tests
ka-sarthak Dec 13, 2024
3ba656f
Linting; remove attr arg from add_dataset
ka-sarthak Dec 13, 2024
4a0852c
Review: move none check into method
ka-sarthak Dec 16, 2024
864d53b
Review: use 'with' for opening h5 file
ka-sarthak Dec 17, 2024
122b65f
Review: make internal states as private vars
ka-sarthak Dec 17, 2024
67ee382
Add pydantic basemodel for dataset
ka-sarthak Dec 17, 2024
645348a
Use data from variables if available for reading
ka-sarthak Dec 17, 2024
64eae6c
Review: remove lazy arg
ka-sarthak Dec 18, 2024
e3164ff
Move DatasetModel outside Handler class
ka-sarthak Dec 18, 2024
ac39a30
Remove None from get, as it is already a default
ka-sarthak Dec 19, 2024
3fa6263
Merge if conditions
ka-sarthak Dec 19, 2024
308 changes: 308 additions & 0 deletions src/nomad_measurements/utils.py
@@ -15,12 +15,19 @@
# See the License for the specific language governing permissions and
# limitations under the License.
#
import collections
import os.path
import re
from typing import (
TYPE_CHECKING,
Any,
)

import h5py
import numpy as np
import pint
from nomad.datamodel.hdf5 import HDF5Reference
from nomad.units import ureg

if TYPE_CHECKING:
from nomad.datamodel.data import (
@@ -153,3 +160,304 @@ def get_bounding_range_2d(ax1, ax2):
]

return ax1_range, ax2_range


class HDF5Handler:
"""
Class for handling the creation of auxiliary files to store big data arrays outside
the main archive file (e.g. HDF5, NeXus).
"""

def __init__(
self,
filename: str,
archive: 'EntryArchive',
logger: 'BoundLogger',
valid_dataset_paths: list = None,
nexus: bool = False,
):
"""
Initialize the handler.

Args:
filename (str): The name of the auxiliary file.
archive (EntryArchive): The NOMAD archive.
logger (BoundLogger): A structlog logger.
valid_dataset_paths (list): The list of valid dataset paths.
nexus (bool): If True, the file is created as a NeXus file.
"""
if not filename.endswith(('.nxs', '.h5')):
raise ValueError('Only .h5 or .nxs files are supported.')

self.data_file = filename
self.archive = archive
self.logger = logger
self.valid_dataset_paths = []
if valid_dataset_paths:
self.valid_dataset_paths = valid_dataset_paths
self.nexus = nexus

self.hdf5_datasets = collections.OrderedDict()
self.hdf5_attributes = collections.OrderedDict()
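
# A minimal construction sketch (hypothetical file name; `archive` and
# `logger` as passed into a NOMAD normalizer):
#
#     handler = HDF5Handler('measurement.h5', archive, logger)
#
# Datasets and attributes added via the methods below are buffered in these
# ordered dicts until `write_file` (or `read_dataset`) flushes them to the
# upload's raw file.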

def add_dataset( # noqa: PLR0913
self,
path: str,
data: Any,
archive_path: str = None,
internal_reference: bool = False,
validate_path: bool = True,
lazy: bool = True,
):
"""
Add a dataset to the HDF5 file. The dataset is written lazily (default) when
either `read_dataset` or `write_file` method is called. The `path` is validated
against the `valid_dataset_paths` if provided before adding the data.

Args:
path (str): The dataset path to be used in the HDF5 file.
data (Any): The data to be stored in the HDF5 file.
archive_path (str): The path of the quantity in the archive.
internal_reference (bool): If True, an internal reference is set to an
existing HDF5 dataset.
validate_path (bool): If True, the path is validated against the
`valid_dataset_paths`.
lazy (bool): If True, the file is not written immediately.
"""
if not path:
self.logger.warning('HDF5 `path` must be provided.')
return

if validate_path and self.valid_dataset_paths:
if path not in self.valid_dataset_paths:
self.logger.warning(f'Invalid dataset path "{path}".')
return
suggestion (code-quality): Merge nested if conditions (merge-nested-ifs)

Suggested change:

-        if validate_path and self.valid_dataset_paths:
-            if path not in self.valid_dataset_paths:
-                self.logger.warning(f'Invalid dataset path "{path}".')
-                return
+        if validate_path and self.valid_dataset_paths and path not in self.valid_dataset_paths:
+            self.logger.warning(f'Invalid dataset path "{path}".')
+            return


Explanation: Too much nesting can make code difficult to understand, and this
is especially true in Python, where there are no brackets to help out with the
delineation of different nesting levels.

Reading deeply nested code is confusing, since you have to keep track of which
conditions relate to which levels. We therefore strive to reduce nesting where
possible, and the situation where two `if` conditions can be combined using
`and` is an easy win.


dataset = dict(
data=data,
attrs={},
hdf5_path=(
f'/uploads/{self.archive.m_context.upload_id}/raw'
f'/{self.data_file}#{path}'
),
archive_path=archive_path,
internal_reference=internal_reference,
)
# handle the pint.Quantity and add data
if isinstance(data, pint.Quantity):
dataset['data'] = data.magnitude
dataset['attrs'].update({'units': str(data.units)})

self.hdf5_datasets[path] = dataset

if not lazy:
self.write_file()
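
# Usage sketch (hypothetical paths; assumes the handler above was created
# with filename 'measurement.h5' in an upload with id `abc123`):
#
#     handler.add_dataset(
#         path='/entry/experiment_result/intensity',
#         data=np.arange(10) * ureg('counts'),
#         archive_path='data.results[0].intensity',
#     )
#
# This buffers the magnitude with attrs {'units': 'count'} and records the
# reference '/uploads/abc123/raw/measurement.h5#/entry/experiment_result/intensity'
# to be set on the archive quantity once the file is written.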

def add_attribute(
self,
path: str,
attrs: dict,
lazy: bool = True,
):
"""
Add an attribute to the dataset or group at the given path. The attribute is
written lazily (default) when either `read_dataset` or `write_file` method is
called.

Args:
path (str): The dataset or group path in the HDF5 file.
attrs (dict): The attributes to be added.
lazy (bool): If True, the file is not written immediately.
"""
self.hdf5_attributes[path] = attrs

if not lazy:
self.write_file()
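
# For instance (hypothetical NXdata-style annotations), attributes can be
# scheduled for a group that already has buffered datasets:
#
#     handler.add_attribute(
#         path='/entry/experiment_result',
#         attrs={'signal': 'intensity', 'axes': 'two_theta'},
#     )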

def read_dataset(self, path: str):
"""
Returns the dataset at the given path. If the dataset has `units` as an
attribute, tries to return a `pint.Quantity`. Before returning the dataset, the
method also writes the file with any pending datasets.

Args:
path (str): The dataset path in the HDF5 file.
"""
if self.hdf5_datasets or self.hdf5_attributes:
self.write_file()
if path is None:
return
file_path, dataset_path = path.split('#')
file_name = file_path.rsplit('/raw/', 1)[1]
with self.archive.m_context.raw_file(file_name, 'r') as h5file:
h5 = h5py.File(h5file.name, 'r')
if dataset_path not in h5:
self.logger.warning(f'Dataset "{dataset_path}" not found.')
h5.close()
return None
value = h5[dataset_path][...]
try:
units = h5[dataset_path].attrs['units']
value *= ureg(units)
except KeyError:
pass
h5.close()
return value
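
# Round-trip sketch (assuming the reference set in the add_dataset example
# above):
#
#     ref = archive.data.results[0].intensity
#     intensity = handler.read_dataset(ref)
#
# `intensity` comes back as a pint.Quantity with units 'count', since the
# dataset was written with a 'units' attribute.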

def write_file(self):
"""
Method for creating an auxiliary file to store big data arrays outside the
main archive file (e.g. HDF5, NeXus).
"""
if self.nexus:
try:
self._write_nx_file()
except Exception as e:
self.nexus = False
self.logger.warning(
f'Encountered "{e}" error while creating nexus file. '
'Creating h5 file instead.'
)
self._write_hdf5_file()
else:
self._write_hdf5_file()

def _write_nx_file(self):
"""
Method for creating a NeXus file. Additional data from the archive is added
to the `hdf5_data_dict` before creating the nexus file. This provides a NeXus
view of the data in addition to storing array data.
"""
if self.data_file.endswith('.h5'):
self.data_file = self.data_file.replace('.h5', '.nxs')
raise NotImplementedError('Method `write_nx_file` is not implemented.')
# TODO add archive data to `hdf5_data_dict` before creating the nexus file. Use
# `populate_hdf5_data_dict` method for each quantity that is needed in .nxs
# file. Create a NeXus file with the data in `hdf5_data_dict`.
# One issue here is as we populate the `hdf5_data_dict` with the archive data,
# we will always have to over write the nexus file

def _write_hdf5_file(self): # noqa: PLR0912
"""
Method for creating an HDF5 file.
"""
if self.data_file.endswith('.nxs'):
self.data_file = self.data_file.replace('.nxs', '.h5')
if not self.hdf5_datasets and not self.hdf5_attributes:
return
# remove the nexus annotations from the dataset paths if any
tmp_dict = {}
for key, value in self.hdf5_datasets.items():
new_key = self._remove_nexus_annotations(key)
tmp_dict[new_key] = value
tmp_dict[new_key]['hdf5_path'] = self._remove_nexus_annotations(
value['hdf5_path']
)
self.hdf5_datasets = tmp_dict
tmp_dict = {}
issue (code-quality): Convert for loop into dictionary comprehension (dict-comprehension)

for key, value in self.hdf5_attributes.items():
tmp_dict[self._remove_nexus_annotations(key)] = value
self.hdf5_attributes = tmp_dict

# create the HDF5 file
with self.archive.m_context.raw_file(self.data_file, 'a') as h5file:
h5 = h5py.File(h5file.name, 'a')
for key, value in self.hdf5_datasets.items():
if value['data'] is None:
self.logger.warning(f'No data found for "{key}". Skipping.')
continue
elif value['internal_reference']:
# resolve the internal reference
try:
data = h5[self._remove_nexus_annotations(value['data'])]
except KeyError:
self.logger.warning(
f'Internal reference "{value["data"]}" not found. Skipping.'
)
continue
else:
data = value['data']

group_name, dataset_name = key.rsplit('/', 1)
group = h5.require_group(group_name)

if key in h5:
group[dataset_name][...] = data
else:
group.create_dataset(
name=dataset_name,
data=data,
)
group[dataset_name].attrs.update(value['attrs'])
if value['archive_path'] is not None:
self._set_hdf5_reference(
self.archive, value['archive_path'], value['hdf5_path']
)
for key, value in self.hdf5_attributes.items():
if key in h5:
h5[key].attrs.update(value)
else:
self.logger.warning(f'Path "{key}" not found to add attribute.')
h5.close()

# reset hdf5 datasets and attributes
self.hdf5_datasets = collections.OrderedDict()
self.hdf5_attributes = collections.OrderedDict()

@staticmethod
def _remove_nexus_annotations(path: str) -> str:
"""
Remove the nexus related annotations from the dataset path.
For e.g.,
'/ENTRY[entry]/experiment_result/intensity' ->
'/entry/experiment_result/intensity'

Args:
path (str): The dataset path with nexus annotations.

Returns:
str: The dataset path without nexus annotations.
"""
if not path:
return path

pattern = r'.*\[.*\]'
new_path = ''
for part in path.split('/')[1:]:
if re.match(pattern, part):
new_path += '/' + part.split('[')[0].strip().lower()
else:
new_path += '/' + part
new_path = new_path.replace('.nxs', '.h5')
return new_path
Comment on lines +443 to +450
suggestion (code-quality): We've found these issues:

Suggested change:

-        new_path = ''
-        for part in path.split('/')[1:]:
-            if re.match(pattern, part):
-                new_path += '/' + part.split('[')[0].strip().lower()
-            else:
-                new_path += '/' + part
-        new_path = new_path.replace('.nxs', '.h5')
-        return new_path
+        new_path = ''.join(
+            (
+                '/' + part.split('[')[0].strip().lower()
+                if re.match(pattern, part)
+                else f'/{part}'
+            )
+            for part in path.split('/')[1:]
+        )
+        return new_path.replace('.nxs', '.h5')


@staticmethod
def _set_hdf5_reference(section: 'ArchiveSection', path: str, ref: str):
"""
Method for setting a HDF5Reference quantity in a section. It can handle
nested quantities and repeatable sections, provided that the quantity itself
is of type `HDF5Reference`.
For example, one can set the reference for a quantity path like
`data.results[0].intensity`.

Args:
section (Section): The NOMAD section containing the quantity.
path (str): The path to the quantity.
ref (str): The reference to the HDF5 dataset.
"""
# TODO handle the case when section in the path is not initialized
attr = section
path = path.split('.')
quantity_name = path.pop()

for subpath in path:
if re.match(r'.*\[.*\]', subpath):
index = int(subpath.split('[')[1].split(']')[0])
attr = attr.m_get(subpath.split('[')[0], index=index)
else:
attr = attr.m_get(subpath)

if isinstance(
attr.m_get_quantity_definition(quantity_name).type, HDF5Reference
):
attr.m_set(quantity_name, ref)
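
Taken together, a minimal end-to-end sketch of how a section's `normalize`
method might drive the handler (the file name and dataset paths are
hypothetical; `archive` and `logger` are the usual arguments of a NOMAD
`normalize` call, and `data.results[0].intensity` is assumed to be a quantity
of type `HDF5Reference`):

    import numpy as np
    from nomad.units import ureg

    from nomad_measurements.utils import HDF5Handler


    def normalize(self, archive, logger):
        # Buffer large arrays in an auxiliary file instead of the archive.
        handler = HDF5Handler(
            filename='measurement.h5',
            archive=archive,
            logger=logger,
        )
        handler.add_dataset(
            path='/entry/experiment_result/intensity',
            data=np.random.rand(1000) * ureg('counts'),
            archive_path='data.results[0].intensity',
        )
        handler.add_attribute(
            path='/entry/experiment_result',
            attrs={'signal': 'intensity'},
        )
        # Flush buffered datasets and attributes to the raw file; recorded
        # archive paths are resolved and set as HDF5Reference strings.
        handler.write_file()

Because writes are lazy, repeated `add_dataset` calls during normalization are
cheap; the single `write_file` at the end creates or updates `measurement.h5`
in the upload's raw directory.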