Use hdf5 references for arrays #118
Conversation
HDF5Handler intends to serve as an interface for writing raw HDF5 files to the NOMAD file system. It also implements methods for special handling of nxs-related data.
For me, building a class is not necessary, as this does not really represent an object; the read and write functions should simply be functions. In any case, I cannot see any major deviation from the HDF5Reference writer, except that it loops over given datasets. To be honest, I am really uncomfortable with the idea of giving the developer-user access to the file system, i.e. the use of context.raw_files should only be done in NOMAD, not in plugins. But this is only my opinion. I am not sure if there are more functionalities that will be added to the class, but as it stands, I am of the opinion that read and write functions suffice.
@ladinesa, the initial idea was to make these functions. But to have lazy writing, states like …
Yes, but you simply provide a list of datasets to write if you do not like to write them individually. It is just a simpler implementation.
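For illustration, the function-based alternative argued for above might look like the following minimal sketch. The names and signatures are hypothetical, and file access is shown with plain h5py rather than NOMAD's context.raw_files:

```python
import h5py
import numpy as np


def write_datasets(file_path: str, datasets: dict[str, np.ndarray]) -> None:
    """Write a mapping of HDF5 paths to arrays into the file in one call."""
    with h5py.File(file_path, 'a') as fh:
        for path, data in datasets.items():
            if path in fh:
                del fh[path]  # overwrite an existing dataset
            fh.create_dataset(path, data=data)


def read_dataset(file_path: str, path: str) -> np.ndarray:
    """Read a single dataset back from the file."""
    with h5py.File(file_path, 'r') as fh:
        return fh[path][()]
```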
Reviewer's Guide by Sourcery

This pull request implements a new HDF5Handler class to manage storage and retrieval of large array data in auxiliary HDF5 files within the ELN schema. It introduces lazy writing for datasets and attributes, handles unit conversion with pint.Quantity, and provides methods for writing to both .h5 and .nxs files. The schema plotting functionality is also adapted for HDF5 integration.

Sequence diagram for HDF5 data handling process:

```mermaid
sequenceDiagram
    participant Client
    participant Handler as HDF5Handler
    participant File as HDF5 File
    participant Archive as NOMAD Archive
    Client->>Handler: Initialize(filename, archive)
    Client->>Handler: add_dataset(path, data)
    Note over Handler: Stores data in memory
    Client->>Handler: add_attribute(path, attrs)
    Note over Handler: Stores attributes in memory
    Client->>Handler: write_file()
    Handler->>File: Create/Open file
    Handler->>File: Write datasets
    Handler->>File: Write attributes
    Handler->>Archive: Set HDF5 references
    Note over Handler: Reset internal storage
```
Class diagram for the new HDF5Handler and related classes:

```mermaid
classDiagram
    class HDF5Handler {
        +str data_file
        +EntryArchive archive
        +BoundLogger logger
        +list valid_dataset_paths
        +bool nexus
        -dict _hdf5_datasets
        -dict _hdf5_attributes
        +add_dataset(path: str, params: dict, validate_path: bool)
        +add_attribute(path: str, params: dict)
        +read_dataset(path: str)
        +write_file()
        -_write_nx_file()
        -_write_hdf5_file()
        -_remove_nexus_annotations(path: str) : str
        -_set_hdf5_reference(section, path: str, ref: str)
    }
    class DatasetModel {
        +Any data
        +Optional[str] archive_path
        +Optional[bool] internal_reference
    }
    class XRDResult {
        +HDF5Reference intensity
        +HDF5Reference two_theta
        +HDF5Reference q_norm
        +HDF5Reference omega
        +HDF5Reference phi
        +HDF5Reference chi
        +HDF5Reference integration_time
    }
    HDF5Handler ..> DatasetModel : uses
    XRDResult ..> HDF5Handler : uses
```
Hey @ka-sarthak - I've reviewed your changes - here's some feedback:
Overall Comments:
- Consider extracting common plotting logic from XRDResult1D and XRDResultRSM into a shared helper function to reduce code duplication
Here's what I looked at during the review
- 🟢 General issues: all looks good
- 🟢 Security: all looks good
- 🟢 Testing: all looks good
- 🟢 Complexity: all looks good
- 🟢 Documentation: all looks good
`src/nomad_measurements/utils.py` (outdated):

```python
if validate_path and self.valid_dataset_paths:
    if path not in self.valid_dataset_paths:
        self.logger.warning(f'Invalid dataset path "{path}".')
        return
```
suggestion (code-quality): Merge nested if conditions (`merge-nested-ifs`)

```python
if validate_path and self.valid_dataset_paths and path not in self.valid_dataset_paths:
    self.logger.warning(f'Invalid dataset path "{path}".')
    return
```
Explanation: Too much nesting can make code difficult to understand, and this is especially true in Python, where there are no brackets to help out with the delineation of different nesting levels. Reading deeply nested code is confusing, since you have to keep track of which conditions relate to which levels. We therefore strive to reduce nesting where possible, and the situation where two `if` conditions can be combined using `and` is an easy win.
`src/nomad_measurements/xrd/schema.py` (outdated):

```python
result = XRDResultRSM()

if result is not None:
    result.scan_axis = metadata_dict.get('scan_axis', None)
```
suggestion (code-quality): Replace `dict.get(x, None)` with `dict.get(x)` (`remove-none-from-default-get`)

```python
result.scan_axis = metadata_dict.get('scan_axis')
```
Explanation: When using a dictionary's `get` method you can specify a default to return if the key is not found. This defaults to `None`, so it is unnecessary to specify `None` if this is the required behaviour. Removing the unnecessary argument makes the code slightly shorter and clearer.
`src/nomad_measurements/xrd/schema.py` (outdated):

```python
self.hdf5_handler.add_dataset(
    path='/ENTRY[entry]/experiment_result/intensity',
    params=dict(
        data=xrd_dict.get('intensity', None),
```
suggestion (code-quality): Replace `dict.get(x, None)` with `dict.get(x)` (`remove-none-from-default-get`)

```python
data=xrd_dict.get('intensity'),
```
Explanation: same as the previous `remove-none-from-default-get` suggestion.
`src/nomad_measurements/xrd/schema.py` (outdated):

```python
self.hdf5_handler.add_dataset(
    path='/ENTRY[entry]/experiment_result/two_theta',
    params=dict(
        data=xrd_dict.get('2Theta', None),
```
suggestion (code-quality): Replace `dict.get(x, None)` with `dict.get(x)` (`remove-none-from-default-get`)

```python
data=xrd_dict.get('2Theta'),
```
Explanation: same as the previous `remove-none-from-default-get` suggestion.
`src/nomad_measurements/xrd/schema.py` (outdated):

```python
self.hdf5_handler.add_dataset(
    path='/ENTRY[entry]/experiment_result/omega',
    params=dict(
        data=xrd_dict.get('Omega', None),
```
suggestion (code-quality): Replace `dict.get(x, None)` with `dict.get(x)` (`remove-none-from-default-get`)

```python
data=xrd_dict.get('Omega'),
```
Explanation: same as the previous `remove-none-from-default-get` suggestion.
```python
if isinstance(parsed_archive.data.results[0], XRDResult1D):
    assert parsed_archive.results.properties.structural.diffraction_pattern[
        0
    ].incident_beam_wavelength.magnitude * 1e10 == pytest.approx(1.540598, 1e-2)
```
issue (code-quality): Avoid conditionals in tests (`no-conditionals-in-tests`)
Explanation: Avoid complex code, like conditionals, in test functions. Google's software engineering guidelines say: "Clear tests are trivially correct upon inspection". To reach that, avoid complex code in tests:

- loops
- conditionals

Some ways to fix this:

- Use parametrized tests to get rid of the loop.
- Move the complex logic into helpers.
- Move the complex part into pytest fixtures.

Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off the screen. It doesn't take much logic to make a test more difficult to reason about. (Software Engineering at Google / Don't Put Logic in Tests)
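As a hedged illustration of the parametrization advice applied to the snippet above (the test file name, the `parse_file` helper, and the parameter entries are assumptions, not taken from the repository's test suite):

```python
import pytest


@pytest.mark.parametrize(
    'test_file, expected_wavelength_angstrom',
    [
        # Hypothetical entries: one per result type, replacing the
        # isinstance() branch with an explicit expectation per input.
        ('example_1d.xrdml', 1.540598),
    ],
)
def test_incident_beam_wavelength(test_file, expected_wavelength_angstrom):
    parsed_archive = parse_file(test_file)  # hypothetical parsing helper
    pattern = parsed_archive.results.properties.structural.diffraction_pattern[0]
    assert pattern.incident_beam_wavelength.magnitude * 1e10 == pytest.approx(
        expected_wavelength_angstrom, 1e-2
    )
```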
`src/nomad_measurements/utils.py` (outdated):

```python
units = self._hdf5_attributes[dataset_path].get('units', None)
if units:
```
suggestion (code-quality): Use named expression to simplify assignment and conditional (`use-named-expression`)

```python
if units := self._hdf5_attributes[dataset_path].get('units', None):
```
```python
new_key = self._remove_nexus_annotations(key)
tmp_dict[new_key] = value
self._hdf5_datasets = tmp_dict
tmp_dict = {}
```
issue (code-quality): Convert for loop into dictionary comprehension (`dict-comprehension`)
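The loop header is not visible in the fragment above; assuming it iterates over `self._hdf5_datasets.items()`, a comprehension version could read:

```python
# Rebuild the dataset mapping with nexus annotations stripped from the keys.
self._hdf5_datasets = {
    self._remove_nexus_annotations(key): value
    for key, value in self._hdf5_datasets.items()
}
```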
```python
new_path = ''
for part in path.split('/')[1:]:
    if re.match(pattern, part):
        new_path += '/' + part.split('[')[0].strip().lower()
    else:
        new_path += '/' + part
new_path = new_path.replace('.nxs', '.h5')
return new_path
```
suggestion (code-quality): We've found these issues:

- Use str.join() instead of for loop (`use-join`)
- Replace if statement with if expression (`assign-if-exp`)
- Inline variable that is immediately returned (`inline-immediately-returned-variable`)
- Use f-string instead of string concatenation (`use-fstring-for-concatenation`)

```python
new_path = ''.join(
    (
        '/' + part.split('[')[0].strip().lower()
        if re.match(pattern, part)
        else f'/{part}'
    )
    for part in path.split('/')[1:]
)
return new_path.replace('.nxs', '.h5')
```
```python
    ),
)
elif self.q_parallel is not None and self.q_perpendicular is not None:
    intensity = hdf5_handler.read_dataset(self.intensity)
```
issue (code-quality): Extract code out into method (`extract-method`)
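As a hedged sketch of what the extraction could look like (the helper name and the exact set of arrays are assumptions based on the visible fragment, not code from the PR):

```python
def _read_rsm_arrays(self, hdf5_handler):
    # Hypothetical helper: gather the arrays needed for RSM plotting in one
    # place, so each branch of the conditional only decides how to plot them.
    intensity = hdf5_handler.read_dataset(self.intensity)
    q_parallel = hdf5_handler.read_dataset(self.q_parallel)
    q_perpendicular = hdf5_handler.read_dataset(self.q_perpendicular)
    return intensity, q_parallel, q_perpendicular
```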
Hi @ka-sarthak, after checking with you, as mentioned, I created another PR for the nexus part against the branch use-hdf5-references-for-arrays: #147. I put you and @hampusnasstrom as reviewers for that.
LGTM!
Merging this branch into the branch associated with #113.
* updated plugin structure
* added pynxtools dependency
* Apply suggestions from code review (Co-authored-by: Sarthak Kapoor <[email protected]>, Co-authored-by: Hampus Näsström <[email protected]>)
* Add sections for RSM and 1D which uses HDF5 references
* Abstract out data interaction using setter and getter; allows to use same methods for classes with hdf5 refs
* Use arrays, not references, in the `archive.results` section
* Lock the state for using nexus file and corresponding references
* Populate results without references
* Make a general reader for raw files
* Remove nexus flags
* Add quantity for auxialiary file
* Fix rebase
* Make integration_time as hdf5reference
* Reset results (refactor)
* Add backward compatibility
* Refactor reader
* add missing imports
* AttrDict class
* Make concept map global
* Add function to remove nexus annotations in concept map
* Move try block inside walk_through_object
* Fix imports
* Add methods for generating hdf5 file
* Rename auxiliary file
* Expect aux file to be .nxs in the beginning
* Add attributes for hdf5: data_dict, dataset_paths
* Method for adding a quantity to hdf5_data_dict
* Abstract out methods for creating files based on hdf5_data_dict
* Add dataset_paths for nexus
* Some reverting back
* Minor fixes
* Refactor populate_hdf5_data_dict: store a reference to be made later
* Handle shift from nxs to hdf5
* Set hdf5 references after aux file is created
* Cleaning
* Fixing
* Redefine result sections instead of extending
* Remove plotly plots from ELN
* Read util for hdf5 ref
* Fixing
* Move hdf5 handling into a util class
* Refactor instance variables
* Reset data dicts and reference after each writing
* Fixing
* Overwrite dataset if it already exists
* Refactor add_dataset
* Reorganize and doctrings
* Rename variable
* Add read_dataset method
* Cleaning
* Adapting schema with hdf5 handler
* Cooments, minor refactoring
* Fixing; add `hdf5_handler` as an attribute for archive
* Reorganization
* Fixing
* Refactoring
* Cleaning
* Try block for using hdf5 handler: dont fail early, as later normalization steps will have the handler!
* Extract units from dataset attrs when reading
* Fixing
* Linting
* Make archive_path optional in add_dataset
* Rename class
* attrs for add_dataset; use it for units
* Add add_attribute method
* Refactor add_attribute
* Add plot attributes: 1D
* Refactor hdf5 states
* Add back plotly figures
* rename auxiliary file name if changed by handler
* Add referenced plots
* Allow hard link using internel reference
* Add sections for plots
* Comment out validation
* Add archive paths for the plot subsections
* Add back validation with flag
* Use nexus flag
* Add interpolated intensity data into h5 for qspace plots
* Use prefix to reduce len of string
* Store regularized linespace of q vectors; revise descriptions
* Remove plotly plots
* Bring plots to overview
* Fix tests
* Linting; remove attr arg from add_dataset
* Review: move none check into method
* Review: use 'with' for opening h5 file
* Review: make internal states as private vars
* Add pydantic basemodel for dataset
* Use data from variables if available for reading
* Review: remove lazy arg
* Move DatasetModel outside Handler class
* Remove None from get, as it is already a default
* Merge if conditions

---------

Co-authored-by: Andrea Albino <[email protected]>
Co-authored-by: Andrea Albino <[email protected]>
Co-authored-by: Hampus Näsström <[email protected]>
Adds a class `HDF5Handler` which can be used to create and manipulate an auxiliary HDF5 file from within an ELN schema. This file is intended to offload large arrays of data into the HDF5 file and store references in the archive. The following methods are defined for this class:

- `add_dataset`: adds a dataset to a specified path in the HDF5 file. Uses lazy writing and stores the data in instance variables; only when the write file methods are triggered is the data written into the file. Populates the instance variable `hdf5_datasets: dict`.
- `add_attribute`: populates the `attrs` dictionary of the HDF5 datasets and groups. Also uses lazy writing. Attributes are used to store information about units and the generation of plots. Populates the instance variable `hdf5_attributes: dict`.
- `read_dataset`: reads the dataset from the HDF5 file at the specified path; if units are available in the `attrs` dictionary, a corresponding `pint.Quantity` is returned. The method ensures that, before reading, all the data added using the `add_dataset` method is written to the file.
- `_write_hdf5_file`: goes through the instance variables `hdf5_datasets` and `hdf5_attributes` and populates a `.h5` file with data and attributes. It also automatically creates groups as required by the dataset paths. Once the data from the instance variables is written into the file, they are reset to empty dictionaries. Additionally, if the corresponding archive paths to the `HDF5Reference` quantities are available for the datasets, the archive quantities are populated with references.
- `_write_nx_file`: a similar method that generates a `.nxs` file instead of a `.h5` file.
- `write_file`: a wrapper for toggling between the above two methods.
- `_set_hdf5_reference`: a static method to set the value of the `HDF5Reference` quantity for a given section, section path, and HDF5 dataset path.
- `_remove_nx_annotations`: a static method that removes nexus-related annotations from dataset paths. With this, one can add nexus-specific paths in the `add_dataset` method and the handler can convert them into general paths if required (see the sketch below).
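Based on the implementation fragment quoted earlier in this review, the annotation removal behaves roughly as below; the standalone-function form and the regex `pattern` are assumptions for illustration:

```python
import re


def remove_nexus_annotations(path: str) -> str:
    """Strip nexus class annotations such as 'ENTRY[entry]' from a path."""
    pattern = r'.*\[.*\]'  # assumed: matches annotated parts like 'ENTRY[entry]'
    new_path = ''
    for part in path.split('/')[1:]:
        if re.match(pattern, part):
            new_path += '/' + part.split('[')[0].strip().lower()
        else:
            new_path += '/' + part
    return new_path.replace('.nxs', '.h5')


# '/ENTRY[entry]/experiment_result/intensity' -> '/entry/experiment_result/intensity'
```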
The handler is intended to be used in the following way; this is also the general outline of the implementation in the `schema.py` file:

1. Initialize the handler with a file name (with `.nxs` or `.h5` extension), the entry archive, a logger, and an optional list of valid dataset paths.
2. Add datasets; if the data is a `pint.Quantity`, it is stored as a numpy array and the unit is stored as an attribute of the dataset.
3. Add attributes such as `NXclass`, `axes`, and `signal` to generate plots using H5Web.
4. Read datasets back in the `normalize` method of sub-sections and parent sections to compute other properties.

A usage sketch follows below. Some additional classes and code have been added to adapt the schema's plotting functionality to the HDF5 implementation.
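Putting these steps together, usage inside a schema's `normalize` method might look like the following sketch. The constructor and method signatures mirror the class diagram and the snippets quoted in this review; the concrete paths, file name, and attribute values are illustrative assumptions:

```python
import numpy as np
import pint

from nomad_measurements.utils import HDF5Handler  # module path per this PR

ureg = pint.UnitRegistry()


def normalize(self, archive, logger):
    # 1. Initialize with a file name, the entry archive, and a logger.
    handler = HDF5Handler(filename='measurement.nxs', archive=archive, logger=logger)

    # 2. Lazily add a dataset; a pint.Quantity carries its unit into attrs.
    handler.add_dataset(
        path='/ENTRY[entry]/experiment_result/two_theta',
        params=dict(data=np.linspace(20, 80, 1000) * ureg.degree),
    )

    # 3. Attributes such as 'axes' and 'signal' drive H5Web plotting.
    handler.add_attribute(
        path='/ENTRY[entry]/experiment_result',
        params=dict(axes='two_theta', signal='intensity'),
    )

    # 4. Write everything out (this also sets HDF5 references in the archive),
    #    then read back as a pint.Quantity in later normalization steps.
    handler.write_file()
    two_theta = handler.read_dataset('/ENTRY[entry]/experiment_result/two_theta')
```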
Summary by Sourcery

Implement lazy loading of large arrays using `HDF5Reference`s and an auxiliary HDF5 file. Add an `HDF5Handler` class to manage the creation and manipulation of the HDF5 file, including adding datasets, attributes, and reading datasets. Adapt the schema's plotting functionality to work with the HDF5 implementation.

New Features:

- `HDF5Handler` class to manage large arrays in an auxiliary HDF5 file, which supports lazy loading and storing references in the archive.

Tests: