Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce two citation matching related fixes for issues discovered during the last provision round on BETA #1429

Conversation

marekhorst
Copy link
Member

@marekhorst marekhorst commented Sep 6, 2023

Namely:

Usually I will create two separate PRs but those are pretty small fixes so we could cover them with a single PR (will be merged with the master branch with separate commits though).

…emory limits

Introducing new `citationmatchingDirectSparkExecutorOverhead` parameter set to `2048` by default.
…emory limits

Fixing fuzzy citation matching failure due to an executor exceeding memory limits by propagating `sparkExecutorOverhead` to the `citation-matching-input-transformer` spark action. Until now it was propagated to `citation-matching` action only.
@marekhorst marekhorst requested a review from mpol September 6, 2023 15:49
@marekhorst marekhorst self-assigned this Sep 6, 2023
Copy link
Contributor

@mpol mpol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, with one question.

@@ -276,6 +276,11 @@
<value>${sparkExecutorMemory}</value>
<description>memory for individual executor</description>
</property>
<property>
<name>citationmatchingDirectSparkExecutorOverhead</name>
<value>2048</value>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why this value has to be specified both here and in the config-default.template. And to they have to be always kept in sync?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is in line with other *SparkExecutorOverhead properties defined in workflow.xml file and it does not require keeping in sync values declared in workflow.xml file with the ones from the default-config.xml files.

I think this duplication was mostly kept as safety measure: if the value is not declared in default-config.xml it will be taken from the workflow.xml as a default one. It is quite useful when performing development runs (no default-config.xml involved) but I agree it does not look well having some properties defined in two places therefore I have created a dedicated issue to consider duplicates removal: #1430.

@marekhorst marekhorst merged commit c4ff3fd into master Sep 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants