harvest hepdata #594

Closed
michamos opened this issue Oct 11, 2024 · 4 comments · Fixed by inspirehep/inspirehep#3322

michamos commented Oct 11, 2024

We need to harvest hepdata daily to get new and updated records, convert them to our metadata schema and update our records. In the current infrastructure, this encompasses both actual harvesting (as in hepcrawl) and holdingpen logic (as in inspire-next), but how and whether to split those in this case is a technical decision still to be determined.

The logic is as follows:

  1. Daily, get modified recids since the last successful crawl date from https://www.hepdata.net/search/ids as we're doing in https://github.com/inspirehep/inspirehep/blob/master/backend/inspirehep/hepdata/cli.py.
  2. For each of those record IDs, fetch https://www.hepdata.net/record/{recid}?format=json, retrieve the metadata (not attached documents) and convert it to our metadata schema. In particular, this requires deriving the main unversioned DOI by removing the .vN suffix from record.hepdata_doi (not present explicitly in the metadata).
  3. If the version is not equal to 1, we need to retrieve all DOIs associated with previous versions by fetching https://www.hepdata.net/record/{recid}?format=json&version={previous_version} for all 1 <= previous_version < version and add them to the record.
  4. Match the main DOI (main unversioned DOI is sufficient here) against existing Data records (or look in the pidstore).
  5. If there's a match, it means it's an update and we need to replace the existing record with the new version; otherwise we create a new one, containing the metadata determined in steps 2. and 3.
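
As a rough illustration of steps 2. and 3., the sketch below strips the `.vN` suffix from a versioned DOI and builds the URLs for all previous versions. The helper names (`split_versioned_doi`, `previous_version_urls`) and the sample DOI are hypothetical and only for illustration; the URL pattern is the one quoted above.

```python
import re

# URL pattern taken from the steps above.
HEPDATA_RECORD_URL = "https://www.hepdata.net/record/{recid}?format=json"

# Matches a trailing ".vN" version suffix, e.g. ".v3".
_VERSION_SUFFIX = re.compile(r"\.v(\d+)$")


def split_versioned_doi(hepdata_doi):
    """Return (unversioned_doi, version) for a DOI ending in ".vN".

    Hypothetical helper: assumes record.hepdata_doi always carries the suffix.
    """
    match = _VERSION_SUFFIX.search(hepdata_doi)
    if match is None:
        raise ValueError(f"no .vN suffix in {hepdata_doi!r}")
    return hepdata_doi[: match.start()], int(match.group(1))


def previous_version_urls(recid, version):
    """Build the URLs for every previous_version with 1 <= previous_version < version."""
    base = HEPDATA_RECORD_URL.format(recid=recid)
    return [f"{base}&version={v}" for v in range(1, version)]
```

For a record at version 1, `previous_version_urls` returns an empty list, so no extra requests are needed in that case.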
@michamos michamos added this to the data MVP milestone Oct 11, 2024
drjova commented Nov 27, 2024

blocked by #621

drjova commented Nov 27, 2024

Create a DAG to harvest Hepdata in airflow

@DonHaul DonHaul self-assigned this Nov 29, 2024
@DonHaul DonHaul linked a pull request Nov 29, 2024 that will close this issue
michamos commented Nov 29, 2024

Metadata mapping for steps 2. and 3.

| HEPData field | INSPIRE field | details | from which version? |
| --- | --- | --- | --- |
| data_tables.doi | dois.value | with corresponding material set to part | all |
| record.collaborations | collaborations.value | in the future, will need to be normalized like for literature (not for MVP) | latest |
| / | accelerator_experiments | derived from collaborations like for literature (can be postponed until collaboration normalization is implemented) | |
| record.creation_date | creation_date | | latest |
| record.data_abstract | abstracts.value | | latest |
| record.data_keywords | keywords.value | example values: cmenergies: 13000-13000, observables: m_MMC, other values are added directly as keywords | latest |
| record.doi | literature.doi | single entry with both doi and record | latest |
| record.hepdata_doi | dois.value | with corresponding material set to version or data for versioned/unversioned DOI respectively | all |
| record.inspire_id | literature.record.$ref | with correct URL set for a literature record | latest |
| record.resources | urls | url -> value, description -> description, ignore any urls starting with https://www.hepdata.net/record/resource/ | latest |
| record.title | titles.title | | latest |
| resources_with_doi.doi | dois.value | with material set to part | all |
| / | acquisition_source | populate as usual | |
| / | authors | taken from first INSPIRE record linked under literature (at indexing time) | |
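
A minimal sketch of how the DOI and URL rows of this mapping could be applied to a single version's payload. The payload shape (top-level `record`, `data_tables` and `resources_with_doi` keys, each holding dicts with the fields named in the mapping) is an assumption read off the field paths, not a confirmed description of the HEPData JSON, and `map_dois` / `map_urls` are hypothetical names.

```python
import re

# Resource URLs with this prefix are ignored, per the mapping.
RESOURCE_PREFIX = "https://www.hepdata.net/record/resource/"


def map_dois(payload):
    """Collect dois entries with their material for one version's payload.

    Assumed payload shape: {"record": {...}, "data_tables": [...],
    "resources_with_doi": [...]}.
    """
    versioned = payload["record"]["hepdata_doi"]
    # Main unversioned DOI: drop a trailing ".vN" if present.
    unversioned = re.sub(r"\.v\d+$", "", versioned)
    dois = [
        {"value": unversioned, "material": "data"},
        {"value": versioned, "material": "version"},
    ]
    # data_tables.doi and resources_with_doi.doi both map to material "part".
    for key in ("data_tables", "resources_with_doi"):
        for item in payload.get(key, []):
            if item.get("doi"):
                dois.append({"value": item["doi"], "material": "part"})
    return dois


def map_urls(resources):
    """Map record.resources to urls, skipping HEPData-internal resource links."""
    urls = []
    for resource in resources:
        url = resource.get("url", "")
        if not url or url.startswith(RESOURCE_PREFIX):
            continue
        entry = {"value": url}
        if resource.get("description"):
            entry["description"] = resource["description"]
        urls.append(entry)
    return urls
```

Since the DOI rows are marked "all", `map_dois` would be called once per version and the results merged, dropping the repeated unversioned DOI.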

@GraemeWatt could you please check if this makes sense from your side?

DonHaul commented Dec 2, 2024

Note: it's not possible to import inspire-schemas into Airflow, as it uses an incompatible old version of jsonschema (one that only supports draft 4 or lower), which conflicts with the version required by Airflow. Newer versions of jsonschema no longer support Python 2.7, so upgrading it is also a no-go.

Possible solutions:

  • spin up a machine to do this specific task
  • implement this task in a service where there are no conflicts, such as backoffice. This was the option chosen.

inspire-schemas is required to do the harvesting, but it conflicts with opensearch-py, as the former uses inspire-utils, which in turn uses a quite old version of urllib3 (1.26.12). Investigate the issue and bump urllib3 in inspire-utils.
blocked by #625

DonHaul added a commit to DonHaul/inspirehep that referenced this issue Dec 4, 2024
reworked httphooks to make more easy to use

* ref: cern-sis/issues-inspire/issues/594
drjova pushed a commit to inspirehep/inspirehep that referenced this issue Dec 12, 2024
reworked httphooks to make more easy to use

* ref: cern-sis/issues-inspire/issues/594