Make catalog harvesting optional #69

Markus92 · 2024-08-05T14:03:34Z

🚀 Pull Request Checklist

Title:
- [X] A brief, descriptive title for the changes.

Make catalog harvesting optional

Description:
- [X] Provide a clear and concise description of your pull request, including the purpose of the changes and the approach you've taken.

This PR adds a harvester setting that makes harvesting catalogs as datasets in the back-end optional. Currently, catalogs appear in the front-end exactly like datasets, which can be undesirable.

The behavior is fully configurable, both globally and per harvester.
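To illustrate how the two levels could interact, here is a minimal sketch, not the code from this PR. It assumes the per-harvester value overrides the global one and that the default is `False` when neither is set. The setting names come from this PR; `as_bool`, `resolve_harvest_catalogs` and the `example_profile` value are hypothetical, and `as_bool` is only a simplified stand-in for CKAN's `toolkit.asbool` so the snippet runs without CKAN.

```python
import json

# Setting names taken from this PR; the helper names below are hypothetical.
GLOBAL_KEY = "ckanext.fairdatapoint.harvest_catalogs"
SOURCE_KEY = "harvest_catalogs"


def as_bool(value):
    """Simplified stand-in for CKAN's toolkit.asbool (assumption, not the real helper)."""
    if isinstance(value, str):
        return value.strip().lower() in ("true", "yes", "on", "y", "t", "1")
    return bool(value)


def resolve_harvest_catalogs(global_config, source_config_json):
    """Per-harvester value wins; otherwise use the global value, then False."""
    source_config = json.loads(source_config_json) if source_config_json else {}
    if SOURCE_KEY in source_config:
        return as_bool(source_config[SOURCE_KEY])
    return as_bool(global_config.get(GLOBAL_KEY, False))


# The global setting is enabled, but the first harvest source opts out.
ckan_config = {GLOBAL_KEY: "true"}
print(resolve_harvest_catalogs(
    ckan_config, '{"profile": "example_profile", "harvest_catalogs": false}'
))  # False
print(resolve_harvest_catalogs(ckan_config, '{"profile": "example_profile"}'))  # True
```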

Context:
- [X] Why are these changes necessary? What problem do they solve? Link any related issues.

Closes GenomicDataInfrastructure/gdi-userportal-ckanext-gdi-userportal#49

Changes:
- [ ] List the major changes you've made, ideally organized by commit or feature.

Small change only.

Testing:
- [ ] Describe how the changes have been tested. Include any relevant details about the testing environment and the test cases.

The test cases are extremely hard to set up, and I have not managed to replicate the full unit-testing environment. Instead, I tested the harvesters in a development environment with all combinations of settings (global true/false/undefined and, in the harvester JSON, true/false/undefined).
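The nine combinations described above could also be checked without a CKAN environment if the decision logic is isolated in a pure function. The following is a sketch under that assumption: `resolve`, `UNSET` and the expected outcomes (per-harvester value first, then global, then `False`) are illustrative and are not taken from the PR's code.

```python
from itertools import product

UNSET = object()  # marks a setting that is absent from the configuration


def resolve(global_value, source_value):
    """Hypothetical pure resolver: harvester JSON first, then global, then False."""
    if source_value is not UNSET:
        return source_value
    if global_value is not UNSET:
        return global_value
    return False


# Expected result for each (global, per-harvester) pair, written out by hand
# so the check does not simply mirror the resolver's own logic.
EXPECTED = {
    (True, True): True,
    (True, False): False,
    (True, UNSET): True,
    (False, True): True,
    (False, False): False,
    (False, UNSET): False,
    (UNSET, True): True,
    (UNSET, False): False,
    (UNSET, UNSET): False,
}


def check_all_combinations():
    """Assert every combination and return how many were checked."""
    values = (True, False, UNSET)
    checked = 0
    for global_value, source_value in product(values, values):
        result = resolve(global_value, source_value)
        assert result is EXPECTED[(global_value, source_value)]
        checked += 1
    return checked


print(check_all_combinations())
```

A table like `EXPECTED` doubles as documentation of the intended precedence, which may help if the unit-test environment is set up later.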

Screenshots (if applicable):
N/A

Additional Information:
N/A

Checklist:
- [X] I have checked that my code adheres to the project's style guidelines and that my code is well-commented.
- [X] I have performed self-review of my own code and corrected any misspellings.
- [X] I have made corresponding changes to the documentation (if applicable).
- [X] My changes generate no new warnings or errors.
- [?] I have added tests that prove my fix is effective or that my feature works. (N/A: please help)
- [?] New and existing unit tests pass locally with my changes. (N/A: please help)

Summary by Sourcery

Add a configurable setting to make catalog harvesting optional, with the ability to configure it globally or per-harvester. Update documentation to reflect the new setting.

New Features:

Introduce a configurable setting to make catalog harvesting optional in the harvester, applicable both globally and per-harvester.

Enhancements:

Add logging to indicate the source of the harvest_catalogs setting and its value.

Documentation:

Update README.md to document the new ckanext.fairdatapoint.harvest_catalogs setting and its usage.

sourcery-ai · 2024-08-05T14:03:41Z

Reviewer's Guide by Sourcery

This pull request introduces a new configurable setting to make catalog harvesting optional in the harvester. The changes include updates to the FairDataPointRecordProvider and FairDataPointCivityHarvester classes to support this setting, as well as documentation updates in the README.md file. The 'harvest_catalogs' setting can be configured globally or overridden per harvester.

File-Level Changes

Files	Changes
`ckanext/fairdatapoint/harvesters/domain/fair_data_point_record_provider.py` `ckanext/fairdatapoint/harvesters/fair_data_point_civity_harvester.py`	Introduced and integrated a new 'harvest_catalogs' setting to make catalog harvesting optional, with support for both global and per-harvester configurations.
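As a rough illustration of what "optional catalog harvesting" could mean on the record-provider side, the sketch below yields catalog identifiers only when the flag is enabled, while datasets are always yielded. Everything here is hypothetical: the function name, the `catalog=`/`dataset=` identifier format and the input shape are assumptions for the example and do not describe the actual `FairDataPointRecordProvider`.

```python
from typing import Dict, Iterator, List


def iter_record_ids(catalogs: Dict[str, List[str]], harvest_catalogs: bool) -> Iterator[str]:
    """Yield dataset identifiers always, and catalog identifiers only when enabled.

    `catalogs` maps a catalog URI to the URIs of the datasets it contains
    (an assumed input shape, chosen only for this sketch).
    """
    for catalog_uri, dataset_uris in catalogs.items():
        if harvest_catalogs:
            yield f"catalog={catalog_uri}"
        for dataset_uri in dataset_uris:
            yield f"dataset={dataset_uri}"


example = {"https://example.org/catalog/1": ["https://example.org/dataset/a"]}
print(list(iter_record_ids(example, harvest_catalogs=False)))
print(list(iter_record_ids(example, harvest_catalogs=True)))
```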

sourcery-ai

Hey @Markus92 - I've reviewed your changes - here's some feedback:

Overall Comments:

Consider improving the testability of the code or providing more detailed documentation on the manual testing process. The lack of unit tests for these changes is a concern.
Ensure that the code style changes (e.g., switching from single to double quotes) align with the project's style guide, if one exists. Consistency in style is important for long-term maintainability.

Here's what I looked at during the review

🟢 General issues: all looks good
🟢 Security: all looks good
🟢 Testing: all looks good
🟡 Complexity: 1 issue found
🟡 Documentation: 1 issue found


README.md

sourcery-ai · 2024-08-05T14:05:00Z

ckanext/fairdatapoint/harvesters/fair_data_point_civity_harvester.py

@@ -2,30 +2,54 @@
 # SPDX-FileContributor: 2024 Stichting Health-RI
 #
 # SPDX-License-Identifier: AGPL-3.0-only
-
+import logging


issue (complexity): Consider simplifying the code while maintaining the new functionality.

The new code introduces useful functionality but also adds complexity. Here are some points to consider:

Increased Complexity: The new code has more lines and nested conditions, making it harder to read and understand at a glance. The original code was more straightforward.

Logging and Configuration Handling: While logging is useful, it adds complexity by introducing more logic and potential points of failure. The original code did not have this additional layer of configuration handling.

Use of toolkit: The new code uses toolkit to fetch configuration values, adding another dependency and layer of abstraction. This makes the code harder to maintain.

Conditional Logic: The new code has more conditional logic to handle different sources of configuration (global CKAN level vs harvest config). This increases the cognitive load required to understand the flow of the program.

Consider simplifying the code while maintaining the new functionality. Here is a suggestion:

    import logging

    from ckanext.fairdatapoint.harvesters.civity_harvester import CivityHarvester
    from ckanext.fairdatapoint.harvesters.domain.fair_data_point_record_provider import (
        FairDataPointRecordProvider,
    )
    from ckanext.fairdatapoint.harvesters.domain.fair_data_point_record_to_package_converter import (
        FairDataPointRecordToPackageConverter,
    )
    from ckan.plugins import toolkit

    PROFILE = "profile"
    HARVEST_CATALOG = "harvest_catalogs"
    HARVEST_CATALOG_CONFIG = "ckanext.fairdatapoint.harvest_catalogs"

    log = logging.getLogger(__name__)


    class FairDataPointCivityHarvester(CivityHarvester):

        def setup_record_provider(self, harvest_url, harvest_config_dict):
            # Determine harvest_catalogs from config or default to False
            harvest_catalogs = toolkit.asbool(
                harvest_config_dict.get(HARVEST_CATALOG, toolkit.config.get(HARVEST_CATALOG_CONFIG, False))
            )
            log.debug("harvest_catalogs: %s", harvest_catalogs)
            self.record_provider = FairDataPointRecordProvider(harvest_url, harvest_catalogs)

        def setup_record_to_package_converter(self, harvest_url, harvest_config_dict):
            if PROFILE in harvest_config_dict:
                self.record_to_package_converter = FairDataPointRecordToPackageConverter(
                    harvest_config_dict.get(PROFILE)
                )
            else:
                raise Exception("[{0}] not found in harvester config JSON".format(PROFILE))

        def info(self):
            return {
                "name": "fair_data_point_harvester",
                "title": "FAIR data point harvester",
                "description": "Harvester for end points implementing the FAIR data point protocol",
            }

This approach reduces complexity while maintaining the new functionality.

Markus92 · 2024-08-05

Hi bot, I wouldn't exactly call nested get statements more readable. The entire plugin is a mess when it comes to the style guide, but to keep the diffs small I'm not running a full linter over it yet.

sourcery-ai · 2024-08-05

Hi Markus92,

I understand your concern about nested get statements potentially reducing readability. The goal of my suggestion was to streamline the logic and reduce the cognitive load by consolidating the configuration fetching into a single line. However, I see your point about the overall style and the importance of keeping the diffs small.

If running a full linter over the entire plugin isn't feasible right now, perhaps we can find a middle ground. We could refactor the configuration fetching into a separate method to improve readability without introducing too many changes at once. Here's an example:

    import logging

    from ckanext.fairdatapoint.harvesters.civity_harvester import CivityHarvester
    from ckanext.fairdatapoint.harvesters.domain.fair_data_point_record_provider import (
        FairDataPointRecordProvider,
    )
    from ckanext.fairdatapoint.harvesters.domain.fair_data_point_record_to_package_converter import (
        FairDataPointRecordToPackageConverter,
    )
    from ckan.plugins import toolkit

    PROFILE = "profile"
    HARVEST_CATALOG = "harvest_catalogs"
    HARVEST_CATALOG_CONFIG = "ckanext.fairdatapoint.harvest_catalogs"

    log = logging.getLogger(__name__)


    class FairDataPointCivityHarvester(CivityHarvester):

        def setup_record_provider(self, harvest_url, harvest_config_dict):
            harvest_catalogs = self._get_harvest_catalogs(harvest_config_dict)
            log.debug("harvest_catalogs: %s", harvest_catalogs)
            self.record_provider = FairDataPointRecordProvider(harvest_url, harvest_catalogs)

        def setup_record_to_package_converter(self, harvest_url, harvest_config_dict):
            if PROFILE in harvest_config_dict:
                self.record_to_package_converter = FairDataPointRecordToPackageConverter(
                    harvest_config_dict.get(PROFILE)
                )
            else:
                raise Exception("[{0}] not found in harvester config JSON".format(PROFILE))

        def info(self):
            return {
                "name": "fair_data_point_harvester",
                "title": "FAIR data point harvester",
                "description": "Harvester for end points implementing the FAIR data point protocol",
            }

        def _get_harvest_catalogs(self, harvest_config_dict):
            return toolkit.asbool(
                harvest_config_dict.get(HARVEST_CATALOG, toolkit.config.get(HARVEST_CATALOG_CONFIG, False))
            )

This way, the main methods remain clean, and the configuration logic is encapsulated in a separate method. Let me know what you think!

Markus92 · 2024-08-05

But now you removed the logging. How can I see where the configuration came from? That would be extremely useful for debugging issues when the wrong behavior is shown.

sourcery-ai · 2024-08-05

Hi Markus92,

I understand the importance of logging for debugging purposes. Let's incorporate the logging back into the refactored method to ensure we can trace where the configuration values are coming from. Here's an updated version that includes detailed logging:

    import logging

    from ckanext.fairdatapoint.harvesters.civity_harvester import CivityHarvester
    from ckanext.fairdatapoint.harvesters.domain.fair_data_point_record_provider import (
        FairDataPointRecordProvider,
    )
    from ckanext.fairdatapoint.harvesters.domain.fair_data_point_record_to_package_converter import (
        FairDataPointRecordToPackageConverter,
    )
    from ckan.plugins import toolkit

    PROFILE = "profile"
    HARVEST_CATALOG = "harvest_catalogs"
    HARVEST_CATALOG_CONFIG = "ckanext.fairdatapoint.harvest_catalogs"

    log = logging.getLogger(__name__)


    class FairDataPointCivityHarvester(CivityHarvester):

        def setup_record_provider(self, harvest_url, harvest_config_dict):
            harvest_catalogs = self._get_harvest_catalogs(harvest_config_dict)
            log.debug("harvest_catalogs: %s", harvest_catalogs)
            self.record_provider = FairDataPointRecordProvider(harvest_url, harvest_catalogs)

        def setup_record_to_package_converter(self, harvest_url, harvest_config_dict):
            if PROFILE in harvest_config_dict:
                self.record_to_package_converter = FairDataPointRecordToPackageConverter(
                    harvest_config_dict.get(PROFILE)
                )
            else:
                raise Exception("[{0}] not found in harvester config JSON".format(PROFILE))

        def info(self):
            return {
                "name": "fair_data_point_harvester",
                "title": "FAIR data point harvester",
                "description": "Harvester for end points implementing the FAIR data point protocol",
            }

        def _get_harvest_catalogs(self, harvest_config_dict):
            if HARVEST_CATALOG in harvest_config_dict:
                log.debug("Using harvest_catalogs from harvest_config_dict")
                return toolkit.asbool(harvest_config_dict[HARVEST_CATALOG])
            else:
                log.debug("Using harvest_catalogs from global CKAN config")
                return toolkit.asbool(toolkit.config.get(HARVEST_CATALOG_CONFIG, False))

This way, we maintain the readability improvements while ensuring that the logging provides the necessary information for debugging. Let me know if this approach works for you!

Markus92 · 2024-08-06T13:39:05Z

@sourcery-ai review

sourcery-ai

Hey @Markus92 - I've reviewed your changes - here's some feedback:

Overall Comments:

The changes look good overall, but we're concerned about the lack of comprehensive testing. Could you provide more details on the testing you've done in the development environment? Additionally, it would be beneficial to add some unit tests for this new feature if possible.
Consider adding more detailed logging around the harvest_catalogs setting, especially when it's being overridden by the harvester-specific config. This will make it easier to debug configuration issues in production.
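One way to make an override visible in the logs is to look up both values first and log at a higher level only when the harvester-specific value differs from the global one. This is a sketch of that idea, not the merged implementation: `resolve_with_logging`, the logger name and the returned `(value, source)` tuple are hypothetical, and the values are assumed to be booleans already.

```python
import logging

log = logging.getLogger("harvest_catalogs_sketch")  # hypothetical logger name

_MISSING = object()  # distinguishes "not configured" from an explicit False


def resolve_with_logging(global_config, source_config):
    """Return (value, source) and log where the value came from."""
    global_value = global_config.get("ckanext.fairdatapoint.harvest_catalogs", _MISSING)
    source_value = source_config.get("harvest_catalogs", _MISSING)

    if source_value is not _MISSING:
        if global_value is not _MISSING and global_value != source_value:
            # The override case is the one most worth surfacing in production logs.
            log.info(
                "harvest_catalogs=%s from harvester config overrides global value %s",
                source_value,
                global_value,
            )
        else:
            log.debug("harvest_catalogs=%s from harvester config", source_value)
        return bool(source_value), "harvester"

    if global_value is not _MISSING:
        log.debug("harvest_catalogs=%s from global CKAN config", global_value)
        return bool(global_value), "global"

    log.debug("harvest_catalogs not configured; defaulting to False")
    return False, "default"


logging.basicConfig(level=logging.DEBUG)
print(resolve_with_logging(
    {"ckanext.fairdatapoint.harvest_catalogs": True},
    {"harvest_catalogs": False},
))
```

Returning the source alongside the value keeps the function easy to assert on in tests, independent of the log output.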

Here's what I looked at during the review

🟢 General issues: all looks good
🟢 Security: all looks good
🟢 Testing: all looks good
🟡 Complexity: 1 issue found
🟢 Documentation: all looks good

sourcery-ai · 2024-08-06T13:40:15Z

ckanext/fairdatapoint/harvesters/fair_data_point_civity_harvester.py



 class FairDataPointCivityHarvester(CivityHarvester):

+    def _get_harvest_catalog_setting(self, harvest_config_dict):


issue (complexity): Consider simplifying the configuration handling and removing unnecessary logging.

The new code introduces additional complexity due to extra logging, a new method for configuration handling, and more verbose import statements. While these changes add functionality, they also make the code harder to read and maintain. Consider simplifying the configuration handling by doing it inline within the setup_record_provider method and removing unnecessary logging. Here's a simplified version that maintains the new functionality:

    from ckanext.fairdatapoint.harvesters.civity_harvester import CivityHarvester
    from ckanext.fairdatapoint.harvesters.domain.fair_data_point_record_provider import FairDataPointRecordProvider
    from ckanext.fairdatapoint.harvesters.domain.fair_data_point_record_to_package_converter import FairDataPointRecordToPackageConverter
    from ckan.plugins import toolkit

    PROFILE = "profile"
    HARVEST_CATALOG = "harvest_catalogs"
    HARVEST_CATALOG_CONFIG = "ckanext.fairdatapoint.harvest_catalogs"


    class FairDataPointCivityHarvester(CivityHarvester):

        def setup_record_provider(self, harvest_url, harvest_config_dict):
            harvest_catalogs = toolkit.asbool(
                harvest_config_dict.get(HARVEST_CATALOG, toolkit.config.get(HARVEST_CATALOG_CONFIG, False))
            )
            self.record_provider = FairDataPointRecordProvider(harvest_url, harvest_catalogs)

        def setup_record_to_package_converter(self, harvest_url, harvest_config_dict):
            if PROFILE in harvest_config_dict:
                self.record_to_package_converter = FairDataPointRecordToPackageConverter(
                    harvest_config_dict.get(PROFILE)
                )
            else:
                raise Exception(f"[{PROFILE}] not found in harvester config JSON")

        def info(self):
            return {
                "name": "fair_data_point_harvester",
                "title": "FAIR data point harvester",
                "description": "Harvester for end points implementing the FAIR data point protocol",
            }

This version reduces complexity while keeping the new features intact.

Markus92 added 3 commits August 5, 2024 14:30

feat: add option for harvesting catalogs

d694918

Added logging option

7972a13

Documentation update for harvest_catalogs option

c9b6625

sourcery-ai bot reviewed Aug 5, 2024

View reviewed changes

Markus92 added 2 commits August 5, 2024 16:08

Fix capitalization to make bot happy

367f562

Split setting to different function

8e68b25

sourcery-ai bot reviewed Aug 6, 2024

View reviewed changes

hcvdwerf approved these changes Aug 7, 2024

View reviewed changes

Merge branch 'main' into WP4-132_no_catalog_harvest

20f829b

hcvdwerf changed the base branch from main to 72-user-story-as-user-i-want-to-configure-per-harvest-source-i-have-want-to-harvest-the-catalog-as-well August 7, 2024 09:42

hcvdwerf merged commit 0f1fc2c into GenomicDataInfrastructure:72-user-story-as-user-i-want-to-configure-per-harvest-source-i-have-want-to-harvest-the-catalog-as-well Aug 7, 2024
