Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Converting runs from old harvest_ID_runs to harvest_runs fails if duplicate time stamps #4287

Open
swirtSJW opened this issue Sep 16, 2024 · 4 comments · May be fixed by #4346
Open

Converting runs from old harvest_ID_runs to harvest_runs fails if duplicate time stamps #4287

swirtSJW opened this issue Sep 16, 2024 · 4 comments · May be fixed by #4346
Assignees
Labels

Comments

@swirtSJW
Copy link
Contributor

swirtSJW commented Sep 16, 2024

Current Behavior

When updating to DKAN 2.19 from a lower version, if two different harvest_ID_runs tables happen to have the same timestamp a sql error is thrown because the timestamp is treated as the unique identifier. This is unlikely since it is only a span of 1 second, but it is possible to encounter in the wild.

> >  [notice] Converting runs for home_health__data
> >  [error]  Drupal\Core\Database\IntegrityConstraintViolationException: SQLSTATE[23000]: Integrity constraint violation: 1062 Duplicate entry '1599570670' for key 'PRIMARY': INSERT INTO "harvest_runs" ("id", "harvest_plan_id", "data", "extract_status") VALUES (:db_insert_placeholder_0, :db_insert_placeholder_1, :db_insert_placeholder_2, :db_insert_placeholder_3); Array
> > (
> >     [:db_insert_placeholder_0] => 1599570670
> >     [:db_insert_placeholder_1] => home_health__data
> >     [:db_insert_placeholder_2] => {"plan":"{\"identifier\":\"home_health__data\",\"extract\":{\"type\":\"\\\\Drupal\\\\pqdc\\\\Harvest\\\\ETL\\\\Extract\\\\DataJson\",\"uri\":\"file:\\\/\\\/\\\/mnt\\\/tmp\\\/data.json\"},\"transforms\":[],\"load\":{\"type\":\"\\\\Drupal\\\\harvest\\\\Load\\\\Dataset\"}}","status":[],"errors":{"extract":"Error decoding JSON."}}
> >     [:db_insert_placeholder_3] => FAILURE
> > )
> >  in Drupal\mysql\Driver\Database\mysql\ExceptionHandler->handleExecutionException() (line 45 of /var/www/html/docroot/core/modules/mysql/src/Driver/Database/mysql/ExceptionHandler.php). 
> >  [error]  SQLSTATE[23000]: Integrity constraint violation: 1062 Duplicate entry '1599570670' for key 'PRIMARY': INSERT INTO "harvest_runs" ("id", "harvest_plan_id", "data", "extract_status") VALUES (:db_insert_placeholder_0, :db_insert_placeholder_1, :db_insert_placeholder_2, :db_insert_placeholder_3); Array
> > (
> >     [:db_insert_placeholder_0] => 1599570670
> >     [:db_insert_placeholder_1] => home_health__data
> >     [:db_insert_placeholder_2] => {"plan":"{\"identifier\":\"home_health__data\",\"extract\":{\"type\":\"\\\\Drupal\\\\pqdc\\\\Harvest\\\\ETL\\\\Extract\\\\DataJson\",\"uri\":\"file:\\\/\\\/\\\/mnt\\\/tmp\\\/data.json\"},\"transforms\":[],\"load\":{\"type\":\"\\\\Drupal\\\\harvest\\\\Load\\\\Dataset\"}}","status":[],"errors":{"extract":"Error decoding JSON."}}
> >     [:db_insert_placeholder_3] => FAILURE
> > )
> >  
> >  [error]  Update failed: harvest_update_8008 
> >  [notice] Update started: metastore_update_8009
> >  [notice] Updated 0 dictionaries. If you have overridden DKAN's core schemas,
> >     you must update your site's data dictionary schema after this update. Copy
> >     modules/contrib/dkan/schema/collections/data-dictionary.json over you local
> >     site version before attempting to read or write any data dictionaries.
> >  [notice] Update completed: metastore_update_8009
> >  [notice] Update started: metastore_admin_update_8012
> >  [notice] Update completed: metastore_admin_update_8012
>  [error]  Update aborted by: harvest_update_8008 
>  [error]  Finished performing updates. 

Expected Behavior

The migration of data from one table to another should happen without error.

Steps To Reproduce

  1. Have a data setup where you have harvest_ID_runs that contain the same timestamps.
  2. run update.php, or drush upddb or drush dkan:harvest:update
  3. See errors

Relevant log output (optional)

No response

Anything else?

This may be too unlikely a scenerio to add a try catch block HarvestUtility::convertRunTable() but I will at least provide a drush sqlc command to undo any duplicated IDs.

Discussion in CA Slack

@swirtSJW swirtSJW added the bug label Sep 16, 2024
@swirtSJW
Copy link
Contributor Author

swirtSJW commented Sep 16, 2024

Is there any issue with just incrementing the timestamp by 1 second (hoping it would make it then be unique)? Or is that timestamp used as a unique ID other places in the system?

@swirtSJW
Copy link
Contributor Author

swirtSJW commented Sep 16, 2024

It turns out that bumping the timestamp will likely disconnect the harvest run from all other things. This also means it could not be addressed with a try catch.

The new hope is that some variation of this might work:

  • Change the harvest_runs schema to NOT have the id be a key.
    image

The HarvestRunRepository()::loadEntity() function treats the id AND the harvest_plan_id as the combined key to look up the harvest run entity.

So if the id were not the key for the table, then there would be no issue with the id needing to be unique. The only risk would be if two harvest runs from the same plan took place at the same time.

@paul-m
Copy link
Contributor

paul-m commented Nov 1, 2024

I think the easiest solution to implement would be to add another column which is the actual ID (maybe a uuid), and use that as the unique key. Leave everything else in place, and provide an update path to the new entity schema for both old style harvest_id_run tables and the newer entity tables.

@swirtSJW swirtSJW linked a pull request Nov 20, 2024 that will close this issue
3 tasks
@swirtSJW
Copy link
Contributor Author

swirtSJW commented Dec 6, 2024

Picking this up after a failed attempt to just make id not a key. An entity must have a key.
slack thread

The new direction is to overhaul the whole thing by to make is more standard. Here is the change list:

  • add a 'timestamp' basefield
  • move 'id' content to 'timestamp'
  • replace 'id' content with 1, 2, 3 ...
  • add 'uuid' basefield (not needed for this, but brining into alignment with other dkan entities)
  • convert 'id' basefield from string to auto-increment int
  • alter the harvest_run->load function to look for 'timestamp' instead of 'id' ,
  • Alter function that converts the harvest run from old pre-2.19 (lookup the name of funciton)
  • Make tests pass.

@swirtSJW swirtSJW self-assigned this Dec 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants