
Use hdf5 or nexus file in XRD #113

Draft · wants to merge 26 commits into main from write-nexus-section

Conversation

@ka-sarthak (Collaborator) commented Aug 28, 2024

When array data from XRD measurements is added to the archives, the loading time increases as the archives become heavier (especially in the case of RSM, which stores multiple 2D arrays). One solution is to offload the heavy data to an auxiliary file and only save references to the auxiliary file in the archives.

To implement this, we can use .h5 files to store the data and make references to the offloaded datasets using HDF5Reference. Alternatively, we can generate a NeXus .nxs file instead of a plain .h5 file. A NeXus file uses HDF5 as its base file format and validates the data against the data models built by the NeXus community.

The current plots are generated using Plotly, and the .json files containing the plot data are also stored in the archive. These also need to be offloaded to make the archives lighter. Using NOMAD's H5WebAnnotation, we can leverage H5Web to generate plots directly from the .h5 or .nxs files.
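For orientation, a minimal sketch of the NeXus/H5Web plotting convention referred to here, assuming a hypothetical auxiliary file xrd_auxiliary.h5 and placeholder group/dataset names; the default, signal, and axes attributes are what H5Web uses to render a default plot:

```python
import h5py
import numpy as np

# Sketch only: annotate a group in the auxiliary file so that H5Web can find
# and plot it. File, group, and dataset names are placeholders.
with h5py.File('xrd_auxiliary.h5', 'w') as f:
    f.attrs['default'] = 'entry'
    entry = f.create_group('entry')
    entry.attrs['NX_class'] = 'NXentry'
    entry.attrs['default'] = 'experiment_result'
    data = entry.create_group('experiment_result')
    data.attrs['NX_class'] = 'NXdata'
    data.attrs['signal'] = 'intensity'   # dataset to plot
    data.attrs['axes'] = ['two_theta']   # axis dataset(s)
    data.create_dataset('intensity', data=np.random.rand(100))
    two_theta = data.create_dataset('two_theta', data=np.linspace(20.0, 80.0, 100))
    two_theta.attrs['units'] = 'deg'
```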

To this end, the following steps are needed:

  • Use HDF5Reference as the type of the Quantity for array data: intensity, two_theta, q_parallel, q_perpendicular, q_norm, omega, phi, chi (see the sketch after this list).
  • Implement a utility class HDF5Handler (or functions) to create the auxiliary files from the normalizers of the schema.
  • Generate a .h5 file to store the data and save references to its datasets in HDF5Reference quantities.
  • Generate a .nxs file based on the archive. This happens in the HDF5Handler and uses pynxtools.
  • Add annotations in the auxiliary files to generate plots for the H5Web viewer.
  • Add backward compatibility.
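As a rough, hedged sketch of the first step: the section and quantity names below are placeholders, and the import path for HDF5Reference reflects recent NOMAD versions (treat it as an assumption):

```python
from nomad.datamodel.data import ArchiveSection
from nomad.datamodel.hdf5 import HDF5Reference  # assumed import path
from nomad.metainfo import Quantity


class XRDResultSketch(ArchiveSection):
    """Illustrative result section: arrays live in the auxiliary file,
    the archive only stores references to the datasets."""

    intensity = Quantity(
        type=HDF5Reference,
        description='Reference to the intensity dataset in the auxiliary .h5/.nxs file.',
    )
    two_theta = Quantity(
        type=HDF5Reference,
        description='Reference to the 2θ dataset in the auxiliary .h5/.nxs file.',
    )
```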

@ka-sarthak self-assigned this Aug 28, 2024
@ka-sarthak force-pushed the write-nexus-section branch 2 times, most recently from f2bef40 to d583974 on September 3, 2024 08:55
@ka-sarthak changed the title Use nexus section in XRD → Use hdf5/nexus file in XRD on Dec 19, 2024
@ka-sarthak changed the title Use hdf5/nexus file in XRD → Use hdf5 or nexus file in XRD on Dec 19, 2024
@ka-sarthak marked this pull request as draft on December 19, 2024 15:16
@ka-sarthak (Collaborator, Author) commented Dec 19, 2024

@hampusnasstrom @aalbino2 I merged the implementation of the HDF5Handler and support for .h5 files as auxiliary files.

The Plotly plots are removed in favor of the plots from H5Web. @budschi's current viewpoint is that Plotly plots have better visualizations, and it might be a good idea to preserve them for 1D scans. This can be a point of discussion when we review this PR after the vacations.

@RubelMozumder will soon merge his implementation from #147, which will allow using a .nxs file as an auxiliary file.

@ka-sarthak (Collaborator, Author)

@RubelMozumder I have combined the common functionality from walk_through_object and _set_hdf5_ref into one utility function, resolve_path.
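For context, a hedged sketch of what such a resolve_path utility might look like; the signature and the dotted/indexed path syntax here are illustrative, not the actual implementation in this PR:

```python
from typing import Any


def resolve_path(section: Any, path: str, default: Any = None) -> Any:
    """Resolve paths like 'data.results[0].intensity' on a section object."""
    current = section
    for part in path.replace(']', '').split('.'):
        name, _, index = part.partition('[')
        current = getattr(current, name, None)
        if current is None:
            return default
        if index:
            try:
                current = current[int(index)]
            except (IndexError, TypeError, ValueError):
                return default
    return current
```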

@ka-sarthak (Collaborator, Author)

TODO

  • Combine the mapping in nx.py, which is ingested by the Handler as an argument (illustrated below).
  • Try to overwrite the .nxs file without deleting the mainfile. As per @TLCFEM, we should avoid deleting the mainfile.
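A hedged illustration of what such a combined mapping could look like: pynxtools-style NeXus template paths mapped to paths of archive quantities. The concrete keys and paths below are placeholders, not the ones used in nx.py:

```python
# Placeholder mapping: NeXus concept paths -> paths of quantities in the archive.
CONCEPT_MAP = {
    '/ENTRY[entry]/experiment_result/intensity': 'data.results[0].intensity',
    '/ENTRY[entry]/experiment_result/two_theta': 'data.results[0].two_theta',
    '/ENTRY[entry]/experiment_config/count_time': 'data.results[0].integration_time',
}
```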

@TLCFEM commented Dec 20, 2024

Have you checked what the root cause of the issue is?
Is the file still occupied when it is read by something else?

@ka-sarthak (Collaborator, Author)

@TLCFEM I haven't been able to investigate it yet, but this will be among the first things I do in the new year, and I will reach out to you with my findings. Happy Holidays!

@TLCFEM commented Dec 20, 2024

If that is not the case, then the discussion so far is no longer valid, so please check the access pattern first.
HDF5 has quite a few caveats and requires some knowledge of how things work internally.

@RubelMozumder (Contributor)

If that is not the case, then the discussion so far is no longer valid, so please check the access pattern first. HDF5 has quite a few caveats and requires some knowledge of how things work internally.

Let me explain the situation to lay bare the scenario.
We have an ELN that takes an input file. While processing the ELN object (archive.json), it generates an output file of type .h5 or .nxs. If it is a .nxs file, NOMAD indexes it as an entry.
So, on the first attempt of the ELN processing, there is no error and everything looks good.

Issue: On the second attempt, when the entire upload (archive.json, .nxs, and so on) is reprocessed, NOMAD starts processing both archive.json and the .nxs file (a NOMAD entry). Reprocessing archive.json also recreates the .nxs file, and that is where the issue arises: as far as I understand, two worker processes then work on the same .nxs file object concurrently.

Temporary solution:
In each processing of archive.json, we delete the .nxs file (the NOMAD entry) if it exists and regenerate it. This might not be the right approach to handle this case.
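A minimal sketch of that temporary workaround, assuming the normalizer already knows the absolute path of the auxiliary file in the upload's raw directory (obtaining that path is NOMAD-specific and omitted here); the function and parameter names are illustrative:

```python
import os


def regenerate_nexus_file(nxs_path: str, write_nexus) -> None:
    """Delete a stale .nxs file, if any, then write a fresh one."""
    if os.path.exists(nxs_path):
        os.remove(nxs_path)  # note: deleting the mainfile is itself debated below
    write_nexus(nxs_path)   # callable that writes the new NeXus file
```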

@aalbino2 (Contributor) commented Jan 7, 2025

Temporary solution: In each processing of archive.json, we delete the .nxs file (the NOMAD entry) if it exists and regenerate it. This might not be the right approach to handle this case.

@RubelMozumder what prevents you from checking for the existence of the .nxs file and creating a new one only if it doesn't exist yet?

RubelMozumder and others added 7 commits January 14, 2025 10:58
* Implement write nexus section based on the populated nomad archive

* app def missing.

* mapping nomad_measurement.

* All concepts are connected; creates nexus file and subsection.

* adding links in hdf5 file.

* Remove the nxs file.

* back to the previous design.

* Include pynxtools plugins in nomad.yaml and extend dependencies, including pynxtools and pynxtools-xrd.

* PR review correction.

* Remove the entry_type overwrite.

* Remove comments.

* Replace __str__ function.

* RUFF

* Update pyproject.toml

Co-authored-by: Sarthak Kapoor <[email protected]>

* Update src/nomad_measurements/xrd/schema.py

Co-authored-by: Sarthak Kapoor <[email protected]>

* Update src/nomad_measurements/xrd/nx.py

* Replace Try-block.

---------

Co-authored-by: Sarthak Kapoor <[email protected]>
Co-authored-by: Sarthak Kapoor <[email protected]>
* updated plugin structure

* added pynxtools dependency

* Apply suggestions from code review

Co-authored-by: Sarthak Kapoor <[email protected]>
Co-authored-by: Hampus Näsström <[email protected]>

* Add sections for RSM and 1D which uses HDF5 references

* Abstract out data interaction using setter and getter; allows to use same methods for classes with hdf5 refs

* Use arrays, not references, in the `archive.results` section

* Lock the state for using nexus file and corresponding references

* Populate results without references

* Make a general reader for raw files

* Remove nexus flags

* Add quantity for auxiliary file

* Fix rebase

* Make integration_time an HDF5Reference

* Reset results (refactor)

* Add backward compatibility

* Refactor reader

* add missing imports

* AttrDict class

* Make concept map global

* Add function to remove nexus annotations in concept map

* Move try block inside walk_through_object

* Fix imports

* Add methods for generating hdf5 file

* Rename auxiliary file

* Expect aux file to be .nxs in the beginning

* Add attributes for hdf5: data_dict, dataset_paths

* Method for adding a quantity to hdf5_data_dict

* Abstract out methods for creating files based on hdf5_data_dict

* Add dataset_paths for nexus

* Some reverting back

* Minor fixes

* Refactor populate_hdf5_data_dict: store a reference to be made later

* Handle shift from nxs to hdf5

* Set hdf5 references after aux file is created

* Cleaning

* Fixing

* Redefine result sections instead of extending

* Remove plotly plots from ELN

* Read util for hdf5 ref

* Fixing

* Move hdf5 handling into a util class

* Refactor instance variables

* Reset data dicts and reference after each writing

* Fixing

* Overwrite dataset if it already exists

* Refactor add_dataset

* Reorganize and add docstrings

* Rename variable

* Add read_dataset method

* Cleaning

* Adapting schema with hdf5 handler

* Comments, minor refactoring

* Fixing; add `hdf5_handler` as an attribute for archive

* Reorganization

* Fixing

* Refactoring

* Cleaning

* Try block for using hdf5 handler: don't fail early, as later normalization steps will have the handler!

* Extract units from dataset attrs when reading

* Fixing

* Linting

* Make archive_path optional in add_dataset

* Rename class

* attrs for add_dataset; use it for units

* Add add_attribute method

* Refactor add_attribute

* Add plot attributes: 1D

* Refactor hdf5 states

* Add back plotly figures

* rename auxiliary file name if changed by handler

* Add referenced plots

* Allow hard link using internal reference

* Add sections for plots

* Comment out validation

* Add archive paths for the plot subsections

* Add back validation with flag

* Use nexus flag

* Add interpolated intensity data into h5 for qspace plots

* Use prefix to reduce len of string

* Store regularized linspace of q vectors; revise descriptions

* Remove plotly plots

* Bring plots to overview

* Fix tests

* Linting; remove attr arg from add_dataset

* Review: move none check into method

* Review: use 'with' for opening h5 file

* Review: make internal states as private vars

* Add pydantic basemodel for dataset

* Use data from variables if available for reading

* Review: remove lazy arg

* Move DatasetModel outside Handler class

* Remove None from get, as it is already a default

* Merge if conditions

---------

Co-authored-by: Andrea Albino <[email protected]>
Co-authored-by: Andrea Albino <[email protected]>
Co-authored-by: Hampus Näsström <[email protected]>
* Remove the Nexus file before regenerating it.


* Reference to the NeXus entry.

* PR review comments.
@ka-sarthak (Collaborator, Author)

After discussing with @TLCFEM, we found the following things:

  • There is a resource contention issue, where multiple processes try to access the generated nexus file in different modes (read and write). Generating the nexus file is not the problem, but triggering a reprocess using m_context.process_updated_raw_file(filename, allow_modify=True) from the ELN normalizer can lead to resource contention, because a new worker is assigned to this reprocess in parallel to the worker handling the normalization. The ELN normalization worker might still have the nexus file open in write mode while the reprocess worker tries to open it in read mode to process the nexus entry.
  • The behavior is unpredictable: sometimes the entry normalization completes without the resource contention error, and other times it does not.

Some directions for resolving this:

  • Use sleep timers in the nexus processing that is triggered by the nexus parser. This allows the ELN process to complete (and the file to be closed) before the processing of the nexus entry is triggered. However, this isn't a real solution, as one cannot know a timer value that fits all cases.
  • Delete the nexus file, if it exists, before triggering the nexus file writing from the ELN. This makes sure that no nexus entry is being processed while the nexus file is being written.
  • Do not trigger the reprocess m_context.process_updated_raw_file(filename, allow_modify=True) from the normalizer. This avoids the resource contention situation altogether; instead, the user can trigger a reprocess of the upload from the GUI. Drawback: user inconvenience.
  • Enforce that the reprocess m_context.process_updated_raw_file(filename, allow_modify=True) uses the current process rather than creating a new worker for it. This can be done by using entry.process_entry_local() instead of entry.process_entry() (look here; see the sketch after this list).
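For orientation, a hedged sketch of where the reprocess trigger sits in the ELN normalizer. Only m_context.process_updated_raw_file(filename, allow_modify=True) is taken from this thread; the class name, file name, and surrounding structure are illustrative:

```python
from nomad.datamodel.data import EntryData


class XRDMeasurementSketch(EntryData):
    def normalize(self, archive, logger):
        super().normalize(archive, logger)
        nexus_file = 'measurement.nxs'  # hypothetical auxiliary file name
        # ... write the NeXus file into the upload's raw directory here ...
        try:
            # Per the discussion above, this reprocess should run on the current
            # process (process_entry_local) so that a second worker does not open
            # the file while this normalization may still hold it open.
            archive.m_context.process_updated_raw_file(nexus_file, allow_modify=True)
        except Exception:
            logger.warning(f'could not trigger reprocessing of {nexus_file}')
```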

@ka-sarthak (Collaborator, Author)

Currently, the handler exposes the write_file method, which can be used multiple times at any point during the normalization. We should limit this to one file write per normalization so that the resource contention problems become more tractable; it also ensures that the nexus entry contains the latest changes to the nexus file.
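One possible shape for that restriction, as a hedged sketch (the class and method names are illustrative, not the handler in this PR): the handler tracks whether the auxiliary file has already been written during the current normalization pass.

```python
class HDF5HandlerSketch:
    """Illustrative handler that writes the auxiliary file at most once per pass."""

    def __init__(self):
        self._written_this_pass = False

    def begin_normalization(self):
        # call at the start of each normalization pass
        self._written_this_pass = False

    def write_file(self):
        if self._written_this_pass:
            return  # a single write per normalization keeps contention tractable
        # ... create or overwrite the auxiliary .h5/.nxs file here ...
        self._written_this_pass = True
```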

@RubelMozumder (Contributor)

Enforce that the reprocess m_context.process_updated_raw_file(filename, allow_modify=True) uses the current process rather than creating a new worker for it. This can be done by using entry.process_entry_local() instead of entry.process_entry().

This may resolve the race condition between reading and writing the same file, but there is another issue.
Suppose that, on the first processing of the raw file, the nexus writer succeeds and creates a nexus entry. On the second attempt, the nexus process fails for some reason, but the entry from the first process is still there. In that case we need to delete the nexus file and its entry as well, and write an hdf5 file instead.

I think that needs to be addressed by area-D: deleting a (corrupted) entry and its related file from within the single process/thread running the normalizer.

The PR #157 can help; there you can see that the test fails completely.

@RubelMozumder (Contributor)

@lauri-codes, is there any functionality that deletes an entry, its associated mainfile, and any residue (if there is something, e.g. ES data) of that deleted entry? This deletion must happen inside the ELN normalization process.

Just a quick overview of the implementation:

# pseudocode: the helper names are placeholders for the actions described
try:
    # create a nexus file, which ends up as a nexus entry
    create_nexus_file()
except Exception:
    # delete the nexus mainfile, its entry, and residual metadata
    delete_nexus_mainfile_and_entry()
    # create an hdf5 file instead (an hdf5 file is not a NOMAD entry)
    create_hdf5_file()

Then we make references to concepts in the nexus or hdf5 file from the entry quantities.

Currently, we are using os.remove to delete the mainfile (which we believe is not the correct way to do it); still, deleting the mainfile does not delete the entry and its metadata.

You may want to take a quick look at the code in the function write_file here:

I have created a small function to delete the mainfile, the entry, and the ES data (here):

def delete_entry_file(archive, mainfile, delete_entry=False):

This raises an error from a different process that I cannot trace back to my code, and it also causes the ELN entry normalization process to fail.

Could you please suggest any functionality that is available in NOMAD for this?

@lauri-codes

@RubelMozumder: There is no such functionality, and I doubt there ever will be. Deleting entries during processing is not something we can really endorse in any way: there are too many ways to screw this up. (What happens if the entry is deleted and then an exception happens before the new data is stored? What happens when some other processed entry tries to read the deleted entry simultaneously? What happens if the file is opened by another process and there is a lock on it when someone tries to delete it?)

I would instead like to try to understand what goal you are trying to achieve with this normalizer. It is reasonable to create temporary files during normalization, and also reasonable to create new entries at the end of normalization (assuming there are no circular processing steps or parallel processes that might cause issues).
