XRD Hackathon: using a nexus subsection #135

Open
budschi opened this issue Nov 8, 2024 · 8 comments

@budschi
Contributor

budschi commented Nov 8, 2024

Since the nexus section was recently resolved into data, we now have to use a subsection, which we could call nexus, instead of the nexus section we had before. Otherwise there will be an overwriting conflict.

@sanbrock @markus1978

@RubelMozumder
Contributor

@budschi FYI: PR #116 has not yet been merged into the main branch.

Creating a nexus sub-section in the EntryArchive or in EntryArchive/data is not an issue. Currently, in PR #116, data comes from the NeXus file using the hdf5 reference recipe. There are no nexus subsections anywhere. I tested the XRD measurement plugin and found no issues. @ka-sarthak would you please confirm it?

@budschi
Contributor Author

budschi commented Nov 14, 2024

Sandor and I discussed the following today:

  • workflow
    NOMAD Measurement Plugin = NMP
    3rdparty = following an App Def, may also contain alien information (not in NMP)
      1. myxrd.xrdml --> NMP --> ELNXRD view of myxrd.archive.json
      2. myxrd.xrdml --> NMP+App. Def. --> ELNXRD view of myxrd.archive.json + myxrd.nxs (ELN has references to nxs)
        missing: "Nexus" Section --> to be replaced by a Reference to a Nexus entry
      3. 3rdpartyxrd.nxs --> NMP + App.Def. --> 3rdpartyxrd.nxs (same as the one referenced in 2.)
        missing: NMP would need to create an ELN with 3rdpartyxrd.archive.json and establish a reference
  • questions, comments to workflow:
    • does 1. + 2. result in the same "ELN" in NOMAD?
    • missing in 2:
    • does 3. exist? --> should be there!
  • Implementation workflow
    • scenario:

      1. custom plugin for XRD --> branch of NM with this method
      2. NOMAD Measurement plugin with structure according to [nomad-plugin-template](https://github.com/FAIRmat-NFDI/nomad-plugin-template)
      3. Nxs app def available --> additional support for Nexus standard
    • what consequences does 3 have on implementation? Need for pynxtools, etc.?

      • requires use of pynxtools --> generates *.nxs
      • file readers in pynxtools:
        • may exist --> nothing to do
        • may not exist as we want (missing custom ELNXRD fields): extend reader to also populate extra fields in Nexus file
        • no reader for input format: build new reader from an example in pynxtools --> pynxtools plugin on GitHub
        • when you write a pynxtools reader you can reuse reader functionalities from elsewhere by importing!
      • once *.nxs exists:
        • change NM parser to support references --> ideally to nexus entry not a nxs/h5 file
    • Pynxtool plugin:

  • Next milestone
    • establishing relationships between the concepts and making the references in the ideal way
    • how to reuse the NeXus description in the background and not in the ELN view

@ka-sarthak
Collaborator

ka-sarthak commented Nov 19, 2024

@budschi thanks for sharing the input.

After discussions with @aalbino2, it's still not clear if we should create an entry for .nxs file or have nexus section composed in the data section of the ELN. Contrary to what I shared before, I think having a nexus section inside the ELN is a better option, especially from the parsing point of view. Jump to the reason in "Parser Modification" section below.

To share the line of thought, I assume for the First Attempt that we are creating a nexus entry for .nxs file.

First attempt

I am rewriting the 3 situations (or workflows) @budschi mentioned above just to be on the same page:

  1. Drop .xrdml file and get an ELN view of the data (current situation)
  2. Drop .xrdml file and get an ELN along with .nxs file. ELN will reference some data from the .nxs file. Additionally, .nxs file will have its nexus entry.
  3. Drop .nxs file and get a nexus entry. Additionally, get an ELN based on the data from .nxs file.

Situations 2 and 3 have similar end states, but different starting points. In situation 2, the ELN can be populated with data from .xrdml. In situation 3, we don't have a .xrdml file, so the ELN needs to read all the data from .nxs.

Situation 2

What data do we outsource and what do we reference from .nxs:

The pain point here is that *.archive.json files can get heavy when array data is present, esp. if arrays are multi-dimensional. Most of these quantities for XRD are in data.results. We outsource this data into .nxs and save references in the *.archive.json. Quantities in data.settings are mostly of primitive type and therefore have a low memory/storage footprint. Also, these are useful to populate filters and widgets in apps. So we can keep them as-is. Nevertheless, we can still outsource this data to the .nxs file to facilitate a nexus view of the entire available data. However, this leads to data duplication.

The ELN normalization will do the following:

  • Parser matches .xrdml
    • creates an ELN entry
    • set data_file as .xrdml
  • Reader
    • read the .xrdml file into data_dict
    • generate a .nxs file based on data_dict
    • modify the array quantities in data_dict: replace the actual data with reference (.nxs path to the data)
    • return the data_dict to the ELN
  • Writer
    • Add a reference to the processed nexus entry based on the generated .nxs.
    • Populates the schema as-is.
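
The Reader step that swaps array data for references can be sketched as below. This is only an illustration under assumptions: the function names, the `/entry/results` group and the `file#path` reference format are hypothetical and not taken from the plugin or from pynxtools, and plain Python lists stand in for array data.

```python
from typing import Any


def is_array(value: Any) -> bool:
    """Treat lists and tuples as array data; everything else stays inline."""
    return isinstance(value, (list, tuple))


def outsource_arrays(
    data_dict: dict[str, Any],
    nxs_file: str,
    group: str = "/entry/results",  # hypothetical location inside the .nxs file
) -> tuple[dict[str, Any], dict[str, Any]]:
    """Split data_dict into a slim dict for the archive and the arrays for the .nxs file."""
    slim: dict[str, Any] = {}
    arrays: dict[str, Any] = {}
    for key, value in data_dict.items():
        if is_array(value):
            nxs_path = f"{group}/{key}"
            arrays[nxs_path] = value  # to be written into the .nxs file
            slim[key] = f"{nxs_file}#{nxs_path}"  # hypothetical reference format
        else:
            slim[key] = value  # primitive quantities are kept as-is
    return slim, arrays


data_dict = {
    "two_theta": [10.0, 10.1, 10.2],
    "intensity": [5, 9, 4],
    "kalpha_one": 1.5406,
}
slim, arrays = outsource_arrays(data_dict, "myxrd.nxs")
```

Here `slim` is what would be returned to the ELN, while `arrays` holds what still needs to be written to the .nxs file.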

Situation 3

The goal is to get all data from the .nxs file that can be mapped to the ELN schema.

The ELN normalization will do the following:

  • set data_file as .nxs
  • Reader
    • initializes data_dict
    • for the array quantities add the .nxs paths to the data_dict
    • for the remaining quantities, add the data itself to data_dict
  • Writer
    • same as for situation 2.

Parser modification

For situation 3, we will have to add a matching parser for .nxs and condition it on the content of the file (to only match NxXRD). This parser will trigger the creation and population of ELN. But there's an issue here. In situation 2, we are creating the .nxs file in the ELN normalization. If a matching parser for this XRD nexus file exists, it will create a second copy of the ELN.
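
A content-conditioned match could look roughly like the sketch below. It assumes the application definition name has already been read from the file and is passed in as a string; `XRD_APP_DEF` and `matches_xrd_nexus` are hypothetical names, and the constant is a placeholder for whatever the XRD application definition is actually called.

```python
from pathlib import Path
from typing import Optional

# Placeholder value; replace with the real name of the XRD application definition.
XRD_APP_DEF = "NXxrd"


def matches_xrd_nexus(mainfile: str, definition: Optional[str]) -> bool:
    """Match only .nxs files whose application definition is the XRD one."""
    if Path(mainfile).suffix.lower() != ".nxs":
        return False
    return definition == XRD_APP_DEF


# A third-party XRD file (situation 3) matches, other NeXus files do not.
third_party = matches_xrd_nexus("3rdpartyxrd.nxs", XRD_APP_DEF)
other_nexus = matches_xrd_nexus("other.nxs", "NXmx")

# The file generated during ELN normalization (situation 2) matches as well.
generated = matches_xrd_nexus("myxrd.nxs", XRD_APP_DEF)
```

The last call shows the collision: the matcher cannot tell a generated myxrd.nxs from a third-party one based on file content alone.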

This brings me to the Second Attempt where we do not create an entry for the .nxs XRD file. Rather, .nxs XRD file is matched by a parser which creates an ELN and adds a data.nexus sub-section in it.

Second Attempt

Rather than going directly into the normalization, we use the parser to handle different types of file. The matching parser for raw XRD files like .xrdml generates a .nxs file (or a .h5 file as a failsafe). A matching parser for .nxs file creates ELN. This assumes that there is no parser for .nxs file from pynxtools to create a nexus entry. So no collisions.

Situation 2

  • Parser 1 matches .xrdml file
    • generates a .nxs file based on the data
    • if the creation of .nxs file runs into errors, generates a .h5 file with the same tree structure.
    • All the data from .xrdml is transferred into the .nxs or .h5 file.
  • Parser 2 matches .nxs file
    • creates an ELN entry
    • sets the data_file as .nxs
  • (optionally) Parser 3 matches .h5 file
    • creates an ELN entry
    • sets the data_file as .h5

From here on, we have a common interface for the ELN which either deals with a .nxs or .h5 file.

  • Reader
    • read the .nxs or .h5 file in the same way into data_dict
      • for the array quantities (results) add the .nxs path to the data_dict
      • for the remaining quantities, add the data itself to data_dict
  • Writer
    • Populates the schema as-is.

Situation 3

  • Parser 2 matches .nxs file
    • creates an ELN entry
    • sets the data_file as .nxs
  • Reader
    • Same as workflow 2
  • Writer
    • Same as workflow 2
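
The division of labour between the parsers in the Second Attempt can be sketched as a simple dispatch on the file extension. The function and the returned labels are hypothetical and only illustrate the routing, not how NOMAD parser matching is actually configured.

```python
from pathlib import Path

# Raw instrument formats mentioned in this thread.
RAW_SUFFIXES = {".xrdml", ".rasx", ".brml"}
# Formats from which the ELN is created.
ELN_SUFFIXES = {".nxs", ".h5"}


def route(mainfile: str) -> str:
    """Return which (hypothetical) parser action applies to a file."""
    suffix = Path(mainfile).suffix.lower()
    if suffix in RAW_SUFFIXES:
        return "convert_to_nexus"  # Parser 1: generate .nxs (or .h5 as failsafe)
    if suffix in ELN_SUFFIXES:
        return "create_eln"  # Parser 2 / Parser 3: create the ELN entry
    return "no_match"


actions = [route(name) for name in ("myxrd.xrdml", "myxrd.nxs", "myxrd.h5", "notes.txt")]
```

With this split, situations 2 and 3 share the `create_eln` branch and differ only in whether the conversion step runs first.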

Sorry for the long albeit non-circumventable comment!

@ka-sarthak
Collaborator

@hampusnasstrom @aalbino2 I am leaning towards "Second Attempt". It will be great to have your input before I start implementing.

@ka-sarthak
Collaborator

Also, in this thread, we are using .xrdml as a placeholder for raw XRD files coming from the instrument. This could also be .rasx or .brml

@hampusnasstrom
Collaborator

I think I need this described in person.

@RubelMozumder
Contributor

RubelMozumder commented Nov 20, 2024

I'm sorry. Last week I was busy with SPM paper writing; it is still not completely done and is planned to continue until next month, but I think I can give my input on this project alongside that one.
@ka-sarthak, at some point we need to merge solutions 1 (current stage), 2, and 3 (if all three come together) to keep the NMP backward compatible.
A comment on solution 3: if the .nxs or .h5 is a human-generated file, it might contain some fundamental error (e.g. the file is properly written but the app def is missing, which is very easy to forget, or the units for an entity are missing) and the parser may fail on it (I have had such an experience before). Such a failure would give a bad perception of the tool.

Regarding solution-2, the PR #116 has most of the work done.

For solution-3,
I can write a parser for .nxs or .h5, if you like.

I can join the meeting you arrange for this issue.

@ka-sarthak
Collaborator

We met in Area A today and decided to tackle situation 2 for now.

Essentially having the following workflow implemented:

  • drop a .xrdml file
  • matching parser creates an ELN
  • In the ELN normalization
    • Reader creates a data_dict from .xrdml
      • data_dict is used to generate .nxs file
      • if the above step fails, a .h5 file is generated instead
      • array data in data_dict is overwritten with str: paths in the .nxs or .h5 file
    • Writer uses the data_dict to populate the schema
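
The .nxs to .h5 fallback in the Reader can be sketched as follows. The two writer callables are stand-ins supplied by the caller (hypothetical, not pynxtools or h5py functions), so the sketch only shows the control flow; catching a broad `Exception` is also an assumption about how failures would surface.

```python
from typing import Any, Callable

Writer = Callable[[str, dict[str, Any]], None]


def write_with_fallback(
    stem: str,
    arrays: dict[str, Any],
    write_nxs: Writer,
    write_h5: Writer,
) -> str:
    """Try to generate <stem>.nxs; if that fails, generate <stem>.h5 instead."""
    nxs_file = f"{stem}.nxs"
    try:
        write_nxs(nxs_file, arrays)
        return nxs_file
    except Exception:  # any failure in NeXus generation triggers the fallback
        h5_file = f"{stem}.h5"
        write_h5(h5_file, arrays)
        return h5_file


def failing_nxs_writer(path: str, arrays: dict[str, Any]) -> None:
    """Stand-in that simulates a failing NeXus conversion."""
    raise ValueError("simulated NeXus generation failure")


written: list[str] = []


def plain_h5_writer(path: str, arrays: dict[str, Any]) -> None:
    """Stand-in that only records which file would be written."""
    written.append(path)


target = write_with_fallback(
    "myxrd", {"intensity": [5, 9, 4]}, failing_nxs_writer, plain_h5_writer
)
```

The returned file name is the one the str paths in data_dict would then point into, regardless of which of the two files was produced.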

As @RubelMozumder mentioned, the nexus side of this is available in #116. We will need to check if it's still compatible with the current state of pynxtools.

The modifications in the ELN schema will be done in #118.

4 participants