Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Validation Extension Support #200

Closed

Conversation

JVickery-TBS
Copy link
Contributor

Adds support for the ckanext-validation plugin. Behind the new config option ckanext.xloader.requires_validation, Resources will be required to be validated before being xloadered.

- Add config option for resources to require validation.
- Only submit to xloader if validation is successful.
- Added debug logging.
- Do not use 2.10+ helper.
ckanext/xloader/plugin.py Outdated Show resolved Hide resolved
- Check toolkit action instead of plugin name.
- Added method docs to explain new method.
ckanext/xloader/plugin.py Outdated Show resolved Hide resolved
ckanext/xloader/plugin.py Outdated Show resolved Hide resolved
# Conflicts:
#	ckanext/xloader/plugin.py
### RESOLVED.
- Fixed typo.
- Fixed try/catch.
@JVickery-TBS JVickery-TBS requested a review from ThrawnCA January 29, 2024 20:11
ckanext/xloader/plugin.py Outdated Show resolved Hide resolved
ckanext/xloader/plugin.py Outdated Show resolved Hide resolved
- Prevent POST to upload to DS if validation is required.
- flake8 if wrap indents.
@JVickery-TBS
Copy link
Contributor Author

@ThrawnCA I added the flash message to the resource_data POST request here too now. So will prevent users from using the "Upload to DataStore" button if the resource has not passed validation yet.

@JVickery-TBS JVickery-TBS requested a review from ThrawnCA January 29, 2024 21:09
ckanext/xloader/plugin.py Outdated Show resolved Hide resolved
- Syntax fixes from flake8.
@JVickery-TBS
Copy link
Contributor Author

@ThrawnCA fixed flake8 check here now as well.

- Moved logic into util method.
- Renamed config options.
- Added new config option to enforce validation schema existance.
# Conflicts:
#	ckanext/xloader/plugin.py
#	ckanext/xloader/utils.py
### RESOLVED.
- Added automated tests.
- Changed the logic to be better and more clear.
- Fixed a logic case.
@JVickery-TBS JVickery-TBS requested a review from ThrawnCA February 2, 2024 18:22
@duttonw
Copy link
Collaborator

duttonw commented Feb 3, 2024

Love the work you have done on this @JVickery-TBS . One thing that I think would make this perfect is to see if xloader can hook into validation to run the xloader load in sync mode right after we have a success, or maybe another priority queue so its not delayed again. Since validation success ensure you don't have blank lines or duplicate headers, the fast load should always run and not defer back to messytables slow python orm style commits.

Just a bit of background:
Reason being is that in data.qld.gov.au we defer validation to the batch queue so that the user does not have a stalled process on large datasets or slow url collections. We sometimes get alot of 1million row updated csv's loaded and if they fail the fast load take up to 1hour to be ingested. If we had a validation job occur and then had the update datastore put behind such a resource, then the author would get impatient and lodge a support ticket on why their resource is not updated.

Something to think about with this integration:
Another problem I forsee, mostly in the realm of the validation extension, is that when you save the dataset or re order said dataset, it may then re-validate ALL resources. It would be awesome if we could utilise part of xloader skip checks if the filehash is the same to save on duplicate runs.

Possible other extension enhancement for performance:
Unsure if your system uses the archiver extension. data.qld.gov.au does. This means we have downloaded from url or off disk/blob store the file 3 times. or more due to validation loops. Thankfully the archiver is quite fast while the validation and xloader is quite slow.

# Conflicts:
#	ckanext/xloader/config_declaration.yaml
#	ckanext/xloader/plugin.py
#	ckanext/xloader/tests/test_plugin.py
### RESOLVED.
- Better conditional syntax.
@JVickery-TBS
Copy link
Contributor Author

@duttonw Yeah we do the same thing with our setup as well in which we have to do async mode for validation and xloader as most of the CSVs that get upload are gigantic.

As for the UX for users waiting for these things, I am currently working on some UI things (open-data/ckanext-canada#1430) to show the status of Xloader for a Resource, e.g. "Pending" or "Running" etc. Still kind of a work in progress, but let me know if this would be a good thing to put back into the Xloader repo.

Regarding the archiver, we actually use the cloud storage plugin, so things have to be gathered from there every time. But with the loop that happens, we kind of actually found out that this was an issue with the Core CKAN code and how the Resource plugin hooks work. Things tend to always get run multiple times because the Resource actions always call the Package actions. Still discussions to go on how to fix that.

open-data/ckanext-validation#9 <- I put in a temporary fix on our Validation fork to prevent the looping. Along with a temporary fix in our Xloader one (open-data@df5a4c0). These two things prevent any job looping. This was for sure a major issue as the resource_patch from Xloader would cause another Validation run, which would then cause another Xloader run and stop because file hash did not change, moreover, it would do that to ALL the resources in a dataset. So yeah, updating a single Resource should have enqueued 1 or 2 jobs, but could end up enqueuing like 70.

YES the validation and xloader were slow, so because of this loop, some datasets with 200+ resources in them backed up the job queue to a couple thousand. It was terrible and that is when I looked to fix this job looping thing. I will try and get my job looping prevention into the ckanext-validation repo this week. Even if this gets fixed sometime in upstream, should not really have a major impact with the work-around in Validation.

But as for your request to have validation do resource_patch and have Xloader ALWAYS do sync instead of async in that specific case, I shall look into that this week as well. Because, yeah, I feel like it could be annoying for a user to have to wait for Validation, and then after that passes, wait again for their Xloader job (that is now at the end of the queue) to run. Especially on busier sites or at busier times when the queue might be larger. (Especially with the job loop stuff happening, this is basically always hahaha)

- Fixed inline comments to make more sense.
@ThrawnCA
Copy link
Collaborator

ThrawnCA commented Feb 5, 2024

Another issue with asynchronous processing is that the logic in plugin.py to check whether XLoader can proceed will always see the previous validation_status.

- Continued sync mode chaining from validation.
@JVickery-TBS
Copy link
Contributor Author

@ThrawnCA @duttonw thanks for sending over that commit, I will check it out to see if I can use some of the code.

The issue that I am seeing here is now how RQ runs the jobs in workers. Firstly, you would have to have a worker running for that queue, and have it in the top priority. I would like to not have users need to run a separate queue. Have not read Thrawn's commit yet, but maybe it just executes the burst of that custom queue right away?

I would also like to support other devs in which they call the submit to xloader action, it should probably always enqueue by default.

I am currently having difficulties preventing loops when running the xloader directly from the validation job. So just trying to figure that one out.

- Continued sync mode chaining from validation.
@JVickery-TBS
Copy link
Contributor Author

@ThrawnCA @duttonw its not exactly where I want it, but am at a bit of a block right now.

I got it to a point in which if it is chaining from the Validation job, then it will enqueue a job and push it to the start of the queue.

Bad things with this are:

  • the condition to check if we are inside of a validation job kinda sucks;
  • validation can do sync mode itself, so does not necessarily even have a job;
  • if a user has another queue other than default with a higher priority, then the push to top of default queue might not actually work great.

So I am not really sure at this point how to move forward, I feel like the best/easiest thing would be the add some feature to ckanext-validation. The extension DOES have a context variable called _performed_validation. This WOULD solve all the issues with the bad job check, and the sync validation style. But as it is inside of the context variable, the notify method from the IDomainObjectModification implementation does not have the context passed to it. So I am also trying to figure out a way to get that variable

ckanext/xloader/action.py Outdated Show resolved Hide resolved
@ThrawnCA
Copy link
Collaborator

ThrawnCA commented Feb 7, 2024

@JVickery-TBS Have you looked at the interaction between the archiver and QA plugins? Archiver defines an IPipe interface to receive callbacks, QA implements that interface, so when Archiver has finished uploading a resource, QA gets nudged to go and examine it.

A similar pattern could be used here, but would require changes to the Validation plugin to add callback support.

# Conflicts:
#	ckanext/xloader/action.py
### RESOLVED.
- Implement experimental `IPipeValidation` implement.
@JVickery-TBS
Copy link
Contributor Author

requires: ckan/ckanext-validation#97

- Cannot do tests without `IPipeValidation`.
ckanext/xloader/plugin.py Outdated Show resolved Hide resolved
- Clearer comments.
- Clearer log messages.
- Added `ignore_not_sysadmin` validator to the sync key.
ckanext/xloader/action.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@duttonw duttonw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than @ThrawnCA debug wording tweak. It looks good.

Is these commits used anywhere else (i.e. is it a candidate for squashing for clean history).

Can you also update the README.md and maybe the Change log of what you added.

Since I've recently become a mantainer of the pypi module for this so I'd like to push this out to that soon after. (Doing it correctly with a github action pipeline)

- Added to changelog.
- Added to readme.
- Fixed typo.
# Conflicts:
#	CHANGELOG
#	README.rst
#	ckanext/xloader/config_declaration.yaml
#	ckanext/xloader/plugin.py
#	ckanext/xloader/utils.py
### RESOLVED.
JVickery-TBS added a commit to JVickery-TBS/ckanext-xloader that referenced this pull request Dec 9, 2024
- Squashed version of ckan#200.
@JVickery-TBS
Copy link
Contributor Author

@duttonw @ThrawnCA squashed version: #237

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants