Fix metadata parsing #120

Merged
merged 12 commits into from
Feb 27, 2025

Conversation

thorbjoernl
Collaborator

@thorbjoernl thorbjoernl commented Feb 20, 2025

Change Summary

Due to some unfortunate choices of separation characters in file names, it is difficult to properly derive the metadata of a file from the file path alone in some instances. Here I implement an ugly hack to make the parsing work for old templates, while changing the template (and encoding scheme) to prevent this issue in future experiment runs.
The hack relies on some assumptions:

  • network must not contain _ but can contain -
  • region can contain _
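To illustrate why these two assumptions are enough to disambiguate, here is a minimal sketch. The template `{region}_{network}-{obsvar}_{layer}.json` and the function name `parse_legacy_name` are hypothetical and chosen only for illustration; they are not taken from aerovaldb. The sketch additionally assumes that `obsvar` and `layer` contain neither `_` nor `-`.

```python
def parse_legacy_name(filename: str) -> dict[str, str]:
    """Split a file name following the HYPOTHETICAL template
    '{region}_{network}-{obsvar}_{layer}.json'.

    Assumptions (the first two are from the PR description, the rest
    are added here for the sake of the example):
      - network contains no '_' but may contain '-'
      - region may contain '_'
      - obsvar and layer contain neither '_' nor '-'
    """
    stem = filename.removesuffix(".json")
    # layer has no '_', so it is everything after the last '_'.
    head, layer = stem.rsplit("_", 1)
    # obsvar has no '-', so it is everything after the last '-'.
    head, obsvar = head.rsplit("-", 1)
    # network has no '_', so the last '_' separates region from network;
    # any earlier '_' belongs to region.
    region, network = head.rsplit("_", 1)
    return {
        "region": region,
        "network": network,
        "obsvar": obsvar,
        "layer": layer,
    }


meta = parse_legacy_name("North_America_EBAS-m-concpm10_Surface.json")
print(meta)
```

Splitting from the right is what makes this work: the fields that cannot contain a separator are peeled off first, and whatever remains on the left is assigned to the field that may contain it.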

This makes some changes which may break stuff, but I think it is necessary to be able to implement proper querying (see #117) which will allow us to remove a lot of the filename parsing in pyaerocom (which leads to breakage when aerovaldb changes storage location).

For future runs, a new template string will be used that consistently uses _ as the separator character in filenames. If an argument itself contains _, it will be encoded.
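As a rough sketch of the idea of encoding the separator inside arguments: the percent-style scheme and the helper names below (`encode_arg`, `decode_arg`, `build_name`, `parse_name`) are assumptions made for illustration, and the scheme actually used by aerovaldb may differ.

```python
def encode_arg(value: str) -> str:
    # Escape the escape character first, then the separator, so that
    # decoding is unambiguous.
    return value.replace("%", "%25").replace("_", "%5F")


def decode_arg(value: str) -> str:
    # Reverse order of encode_arg: separator first, then the escape character.
    return value.replace("%5F", "_").replace("%25", "%")


def build_name(*args: str) -> str:
    # Every argument is encoded, so '_' only ever appears as a separator.
    return "_".join(encode_arg(a) for a in args) + ".json"


def parse_name(filename: str) -> list[str]:
    stem = filename.removesuffix(".json")
    # A plain split is safe because no encoded argument contains '_'.
    return [decode_arg(part) for part in stem.split("_")]


name = build_name("North_America", "EBAS-m", "concpm10", "Surface")
print(name)
print(parse_name(name))
```

With a scheme like this, parsing no longer depends on which fields are allowed to contain which characters, since a literal `_` in the file name is always a separator.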

For context, this goes together with the following PRs, but I am trying to break it into smaller chunks for easier review:

Related issue number

closes #119

Checklist

  • Start with a draft-PR
  • The PR title is a good summary of the changes
  • PR is set to AeroTools and a tentative milestone
  • Documentation reflects the changes where applicable
  • Tests for the changes exist where applicable
  • Tests pass locally
  • Tests pass on CI
  • At least 1 reviewer is selected
  • Make PR ready to review

@thorbjoernl thorbjoernl added this to the m2025-03 milestone Feb 21, 2025
@thorbjoernl thorbjoernl marked this pull request as ready for review February 26, 2025 09:06
Member

@heikoklein heikoklein left a comment

Looks good.

Will you add the new character-constraints to pyaerocom, too, to get a proper feedback if wrong project/experiment name is chosen?

@thorbjoernl
Collaborator Author

> Will you add the new character-constraints to pyaerocom, too, to get a proper feedback if wrong project/experiment name is chosen?

I don't think I am introducing any new character constraints, but rather following the unwritten constraints as they currently exist. The pyaerocom splitting of filenames would already break if these constraints aren't followed.

That being said, as part of #121 I support transparently encoding characters that aerovaldb requires for disambiguating metadata in file names which removes these character constraints moving forward. Maybe we can revisit the discussion on that PR?

@heikoklein heikoklein merged commit 50ea3c3 into main Feb 27, 2025
6 checks passed
@heikoklein heikoklein deleted the fix-metadata-parsing branch February 27, 2025 12:33
Successfully merging this pull request may close these issues.

Aerovaldb is unable to reliably derive metadata from file path