Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft: formula parsing in multiformatreader #524

Draft
wants to merge 3 commits into
base: master
Choose a base branch
from
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
35 changes: 30 additions & 5 deletions docs/how-tos/use-multi-format-reader.md
Original file line number Diff line number Diff line change
@@ -1,4 +1,5 @@
# How to use the built-in MultiFormatReader

While building on the ```BaseReader``` allows for the most flexibility, in most cases it is desirable to implement a reader that can read in multiple file formats and then populate the template based on the read data. For this purpose, `pynxtools` has the [**`MultiFormatReader`**](https://github.com/FAIRmat-NFDI/pynxtools/blob/master/src/pynxtools/dataconverter/readers/multi/reader.py), which can be readily extended for your own data. In this how-to guide, we will focus on an implementation using a concrete example. If you are also interested in the general structure of the `MultiFormatReader`, you can find more information [here](../learn/multi-format-reader.md).

## Getting started
Expand Down Expand Up @@ -69,8 +70,8 @@ The NXDL requires a user, some sample information, some instrument metadata, and

Note that in order to be recognized as a valid application definition, this file should be copied to the `definitions` submodule at `pynxtools.definitions`.


We first start by implementing the class and its ``__init__`` call:

```python title="reader.py"
"""MyDataReader implementation for the DataConverter to convert mydata to NeXus."""
from typing import Tuple, Any
Expand All @@ -97,19 +98,23 @@ class MyDataReader(MultiFormatReader):

READER = MyDataReader
```

Note that here we are adding handlers for three types of data file extensions:

1. `".hdf5"`, `".h5"`: This will be used to parse in the (meta)data from the instrument's HDF5 file.
2. `".yml"`, `".yaml"`: This will be used to parse in the (meta)data from the ELN file.
3. `".json"`: This will be used to read in the **config file**, which is used to map from the (meta)data concepts from the instrument and ELN data to the concepts in the NXDL file.

## Reading in the instrument's data and metadata

First, we will have a look at the HDF5 file. This mock HDF5 file was generated with `h5py` using a [simple script](https://github.com/FAIRmat-NFDI/pynxtools/tree/master/examples/mock-data-reader/create_mock_data.py).

<img src="media/mock_data.png" style="width: 50vw; min-width: 330px;" />
<img src="media/mock_data.png" alt="Mock Data" style="width: 50%; min-width: 330px;" />

Here, we see that we have a `data` group with x and y values, as well as some additional metadata for the instrument.

Here is one way to implement the method to read in the data:

```python title="reader.py"
import h5py

Expand All @@ -132,9 +137,11 @@ def handle_hdf5_file(filepath):

return {}
```

Note that here we are returning an empty dictionary because we don't want to fill the template just yet, but only read in the HDF5 data for now. We will use the config file later to fill the template with the read-in data. Note that it is also possible to return a dictionary here to update the template directly.

`self.hdf5_data` will look like this:

```python
{
"data/x_values": array([-10. , -9.7979798 , -9.5959596 , ..., 10. ]),
Expand All @@ -147,9 +154,12 @@ Note that here we are returning an empty dictionary because we don't want to fil
"metadata/instrument/detector/count_time_units": s",
}
```

## Reading in ELN data

As we can see in the application definition `NXsimple` above, there are some concepts defined for which there is no equivalent metadata in the HDF5 file. We are therefore using a YAML ELN file to add additional metadata.
The ELN file `eln_data.yaml` looks like this:

```yaml title="eln_data.yaml"
title: My experiment
user:
Expand Down Expand Up @@ -188,7 +198,9 @@ def handle_eln_file(self, file_path: str) -> Dict[str, Any]:

return {}
```

When this method is called, `self.eln_data` will look like this:

```python
{
"/ENTRY[entry]/title": "My experiment",
Expand All @@ -200,17 +212,19 @@ When this method is called, `self.eln_data` will look like this:
"/ENTRY[entry]/SAMPLE[sample]/temperature/@units": "K"
}
```

Note that here we are using `parent_key="/ENTRY[entry]"` as well as a `CONVERT_DICT`, meaning that each key in `self.eln_data` will start with `"/ENTRY[entry]"` and some of the paths will be converted to match the template notation. This will be important later.

## Parsing the config file

Next up, we can make use of the config file, which is a JSON file that tells the reader how to map the concepts from the HDF5 and ELN files in order to populate the template designed to match `NXsimple`. The choices made in the config file define how semantics from the source (data file) and target (NeXus application definition) side are mapped. Essentially, the config file should contain all keys that are present in the NXDL. In our case, the config file looks like this:

```json title="config_file.json"
{
"/ENTRY/title": "@eln",
"/ENTRY/USER[user]": {
"name":"@eln",
"address":@eln:"/ENTRY/USER[user]/address",
"address":"@eln:/ENTRY/USER[user]/address",
},
"/ENTRY/INSTRUMENT[instrument]": {
"@version":"@attrs:metadata/instrument/version",
Expand All @@ -235,9 +249,11 @@ Next up, we can make use of the config file, which is a JSON file that tells the
}
}
```

Note that here we are using `@`-prefixes which are used to fill the template from the different data sources. We dicuss this below in more detail.

We also implement a method for setting the config file in the reader:

```python title="reader.py"
def set_config_file(self, file_path: str) -> Dict[str, Any]:
if self.config_file is not None:
Expand All @@ -247,12 +263,14 @@ def set_config_file(self, file_path: str) -> Dict[str, Any]:
self.config_file = file_path

return {}
```
```

## Filling the template from the read-in data

Finally, after reading in all of the data and metadata as well as designing the config file, we can start filling the template. For this, we must implement functions that are called using the reader's **callbacks**.

We will start with the `@attrs` prefix, associated with the `attrs_callback`. We must implement the `get_attr` method:

```python title="reader.py"
def get_attr(self, key: str, path: str) -> Any:
"""
Expand All @@ -263,13 +281,16 @@ def get_attr(self, key: str, path: str) -> Any:

return self.hdf5_data.get(path)
```

This method (and all similar callbacks methods) have two inputs:

1. **`key`**, which is a key in the config file. Note that here, the generic `"/ENTRY/"` gets replaced by `f"/ENTRY[{entry_name}]/"`, where `entry_name` is the one of the entries of the `self.get_entry_names` method.
2. **`path`**, which is the part of the config value that comes after the `@attrs:` prefix. For example, for the config value `"@attrs:my-metadata"`, the extracted path is `my-metadata`.

For the `get_attr` method, we are making use of the `path`. For example, for the config value `"@attrs:metadata/instrument/version"`, the extracted path is `metadata/instrument/version`, which is also one of the keys of the `self.hdf5_data` dictionary.

For the ELN data, we must implement the `get_eln_data` function that gets called from the `eln_callback` when using the `@eln` prefix:

```python title="reader.py"
def get_eln_data(self, key: str, path: str) -> Any:
"""Returns data from the given eln path."""
Expand All @@ -278,9 +299,11 @@ def get_eln_data(self, key: str, path: str) -> Any:

return self.eln_data.get(key)
```

Here, we are making use of the fact that we have used `CONVERT_DICT` in the `parse_yml` function above. Thus, the keys of the `self.eln_data` dictionary are exactly the same as those in the config file (for example, the config key `"/ENTRY[entry]/USER[user]/address"` also exists in `self.eln_data`). Therefore, we can just get this data using the `key` coming from the config file.

Finally, we also need to address the `@data` prefix, which gets used in the `data_callback` to populate the NXdata group in the template. Note that here we use the same `@data` prefix to fill the `x_values` as well as the `data` (from `y_values`) fields. We achieve this by using the path that follows `@data:` in the config file:

```python title="reader.py"
def get_data(self, key: str, path: str) -> Any:
"""Returns measurement data from the given hdf5 path."""
Expand All @@ -291,6 +314,7 @@ def get_data(self, key: str, path: str) -> Any:
```

## Bringing it all together

Et voilà! That's all we need to read in our data and populate the `NXsimple` template. Our final reader looks like this:

```python title="reader.py"
Expand Down Expand Up @@ -394,6 +418,7 @@ READER = MyDataReader
```

## Using the reader

We can call our reader using the following command

```console
Expand All @@ -402,4 +427,4 @@ user@box:~$ dataconverter mock_data.h5 eln_data.yaml -c config_file --reader myd

The final `output.nxs` file gets automatically validated against `NXsimple`, so we can be sure that it is compliant with that application definition. Here is a look at our final NeXus file:

<img src="media/resulting_file.png" style="width: 50vw; min-width: 330px;" />
<img src="media/resulting_file.png" alt="Resulting File" style="width: 50vw; min-width: 330px;" />
Loading
Loading