Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor Post Import Service #4370

Open
wants to merge 2 commits into
base: 2.x
Choose a base branch
from
Open

Refactor Post Import Service #4370

wants to merge 2 commits into from

Conversation

kaise-lafrai
Copy link
Contributor

@kaise-lafrai kaise-lafrai commented Dec 23, 2024

QA Steps

Run the following tests.

  • ddev dkan-phpunit --filter PostImportTest
  • ddev dkan-phpunit --filter dictionaryenforcertest

Successful Post Import

  • Set up vanilla site
  • ddev drush en sample_content
  • ddev drush dkan-sample-content:create
  • Lets work with the 'Florida Bike Lanes' dataset otherwise known as node::8
  • Create a data dictionary at/node/add/data?schema=data-dictionary
  • Title OBJECTID
  • Select Add field
  • Name OBJECTID
  • Title OBJECTID
  • Data type Integer
  • Description OBJECTID
  • Grab the Identifier for later and select Save
  • Navigate to DD settings : /admin/dkan/data-dictionary/settings
  • Select Sitewide
  • Add the identifier from previous step into Sitewide Dictionary ID
  • Save configuration
  • Run Cron a few times
  • Confirm DD applied with no issues.

Unsuccessful Post Import (Drop Resource)

  • Navigate here and select Last Update and Save /admin/dkan/datastore
  • Update the florida bike lane node last update date time and save
  • Navigate here /admin/dkan/resources and select Drop the datastore table and save.
  • Navigate to the data dictionary created before
  • Alter the field "Data type" to something like "Date" and save
  • Run cron a few times
  • Navigate to dashboard /admin/dkan/datastore/status
  • Confirm that the dataset cedcd327-4e5d-43f9-8eb1-c11850fa7c55 has a Store value of waiting(0%) with an error in the Post Import column.
  • Confirm that the dataset was dropped by trying to access via the api and getting status of 400 /api/1/datastore/query/cedcd327-4e5d-43f9-8eb1-c11850fa7c55/0

Data Dictionary Disabled

  • Navigate to data dictionary settings here admin/dkan/data-dictionary/settings and disable dictionary mode
  • Update the florida bike lane node last update date time and save
  • Run cron a few times
  • Navigate to dashboard /admin/dkan/datastore/status
  • Confirm that the dataset cedcd327-4e5d-43f9-8eb1-c11850fa7c55 has a Post Import status of N/A Data-Dictionary Disabled

Data Dictionary via Distribution Reference

  • Visit admin/dkan/data-dictionary/settings and set the data dictionary mode to "distribution reference."
  • Post the following data dictionary to your DKAN instance:
{
    "title": "Bike lanes data dictionary",
    "data": {
        "fields": [
            {
                "name": "objectid",
                "title": "OBJECTID",
                "type": "integer",
                "description": "Internal feature number."
            },
            {
                "name": "roadway",
                "title": "ROADWAY",
                "type": "string",
                "description": "A unique 8-character identification number assigned to a roadway or section of a roadway either On or Off the State Highway System for which information is maintained in the Department's Roadway Characteristics Inventory (RCI)."
            },
            {
                "name": "road_side",
                "title": "ROAD_SIDE",
                "type": "string",
                "constraints": {
                    "maxLength": 1,
                    "minLength": 1,
                    "enum": ["R", "L", "C"]
                },
                "description": "Side of the road. C = Composite; L = Left side; R = Right side"
            },
            {
                "name": "lncd",
                "title": "LNCD",
                "type": "integer",
                "constraints": {
                    "maxLength": 1,
                    "minLength": 1,
                    "maximum": 5,
                    "minimum": 0
                },
                "description": "Codes 0 = UNDESIGNATED; 1 = DESIGNATED; 2 = BUFFERED; 3 = COLORED; 4 = BOTH 2 AND 3; 5 = SHARROW"
            },
            {
                "name": "descr",
                "title": "DESCR",
                "type": "string",
                "constraints": {
                  "maxLength": 30,
                  "enum": ["UNDESIGNATED", "DESIGNATED"]
                },
                "description": "Designation description."
            },
            {
                "name": "begin_post",
                "title": "BEGIN_POST",
                "type": "number",
                "description": "Denotes the lowest milepoint for the record."
            },
            {
                "name": "end_post",
                "title": "END_POST",
                "type": "number",
                "description": "Denotes the highest milepoint for the record."
            },
            {
                "name": "shape_len",
                "title": "Shape_Leng",
                "type": "number",
                "description": "Length in meters"
            }
        ]
    }
}
  • Post the following dataset to your DKAN instance.
{
    "@type": "dcat:Dataset",
    "accessLevel": "public",
    "contactPoint": {
        "fn": "Jane Doe",
        "hasEmail": "mailto:[email protected]"
    },
    "description": "Data on bike lanes in Florida.",
    "distribution": [
    {
        "@type": "dcat:Distribution",
        "downloadURL": "https://demo.getdkan.org/sites/default/files/distribution/cedcd327-4e5d-43f9-8eb1-c11850fa7c55/Bike_Lane.csv",
        "mediaType": "text\/csv",
        "format": "csv",
        "title": "Florida Bike Lanes",
        "describedBy": "dkan://metastore/schemas/data-dictionary/items/b063524c-cf33-5d06-a161-770fc688d53b",
        "describedByType": "application/vnd.tableschema+json"
    }
    ],
    "identifier": "cedcd327-4e5d-43f9-8eb1-c11850fa7c55",
    "issued": "2016-06-22",
    "license": "http://opendatacommons.org/licenses/by/1.0/",
    "modified": "2016-06-22",
    "publisher": {
        "@type": "org:Organization",
        "name": "State Economic Council"
    },
    "spatial": "Florida",
    "theme": [
        "Transportation",
        "City Planning"
    ],
    "title": "Florida Bike Lanes ",
    "keyword":["bike-lanes", "streets", "infrastructure"]
}
  • Run the import and post_import queues. The dictionary should be applied.

Resource does not have dictionary

QA:

  • Set the DKAN site to use 'Distribution reference' data dictionaries.
  • Import datasets with no associated data dictionaries.
  • Verify that no errors are thrown during the post_import phase, but instead an informative message is displayed.

Helpful script:

ddev dkan-site-install
ddev drush pm-enable sample_content
ddev drush dkan:sample-content:create
ddev drush config:set metastore.settings data_dictionary_mode reference
ddev drush cron
ddev drush cron

You should see messages like this:

 [notice] No data-dictionary found for resource with id "0310f6c2d28bbee5de1211e5b5bb0b1c" and version "1721843317".

You should then also see a message like this in the dataset import status dashboard:

done Resource 0310f6c2d28bbee5de1211e5b5bb0b1c does not have a data dictionary.

@dafeder
Copy link
Member

dafeder commented Dec 24, 2024

Going through the QA, I hit an issue at Unsuccessful Post Import (Drop Resource) > Confirm that the dataset cedcd327-4e5d-43f9-8eb1-c11850fa7c55 has a Store value of waiting(0%) with an error in the Post Import column.

When I ran cron I got the following error:

$ ddev drush cron
 [notice] Localization of resource 3a187a87dc6cd47c48b6b4c4785224b7: 
 [notice] ImportService for 3a187a87dc6cd47c48b6b4c4785224b7__1735074165 completed.
 [error]  SQLSTATE[22007]: Invalid datetime format: 1292 Incorrect date value: '2049' for column 'objectid' at row 1 
 [error]  Failed to drop datastore for aba79c7a62fec6c2e49bc1cebd0d7233. No datastore storage found for 3a187a87dc6cd47c48b6b4c4785224b7:1735068191. 

On the dashboard, I do see the right error in the postImport column but the store says done:

image

I can confirm the datastore was not dropped and the objectId has reverted back to text (it was int after the previous QA steps):

GET https://kaise.ddev.site/api/1/datastore/query/cedcd327-4e5d-43f9-8eb1-c11850fa7c55/0?results=false&count=false

returns

{
  "schema": {
    "699890c9-480c-5f2e-89c5-06e9723303fb": {
      "fields": {
        "objectid": {
          "type": "text",
          "mysql_type": "text",
          "description": "OBJECTID"
        },
        "roadway": {
          "type": "text",
          "mysql_type": "text",
          "description": "ROADWAY"
        },
        "road_side": {
          "type": "text",
          "mysql_type": "text",
          "description": "ROAD_SIDE"
        },
        "lncd": {
          "type": "text",
          "mysql_type": "text",
          "description": "LNCD"
        },
        "descr": {
          "type": "text",
          "mysql_type": "text",
          "description": "DESCR"
        },
        "begin_post": {
          "type": "text",
          "mysql_type": "text",
          "description": "BEGIN_POST"
        },
        "end_post": {
          "type": "text",
          "mysql_type": "text",
          "description": "END_POST"
        },
        "shape_leng": {
          "type": "text",
          "mysql_type": "text",
          "description": "Shape_Leng"
        }
      }
    }
  },
  "query": {
    "results": false,
    "count": false,
    "resources": [
      {
        "id": "699890c9-480c-5f2e-89c5-06e9723303fb",
        "alias": "t"
      }
    ],
    "limit": 500,
    "offset": 0,
    "schema": true,
    "keys": true,
    "format": "json",
    "rowIds": false,
    "properties": [
      "objectid",
      "roadway",
      "road_side",
      "lncd",
      "descr",
      "begin_post",
      "end_post",
      "shape_leng"
    ]
  }
}

Copy link
Member

@dafeder dafeder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good! See the question on QA in the PR conversation.

Also, reviewing how the code works there is something off to me in the back and forth between PostImport and PostImportResult. It's just strange to me that PostImport puts this data into an object, which has a storeResult() method, but the storeResult() method just calls another method that's back in PostImport(). It's essentially a circular dependency, though Drupal won't notice because PostImportResult is not a service. I think the correct way to do this would be one of two things:

  1. Make the entire storeResult() method on PostImportResult be essentially self-contained, so it contains the DB query and just needs to be passed a Connection object (ie the database service). Remove PostImport::storeJobStatus(). This would also remove Connection as a dependency for PostImport.
  2. Remove the storeResult() method from PostImportResult, and simply have PostImport::storeJobStatus take a PostImportResult object as its argument, rather than all those individual arguments. This means that rather than rather than calling $result->storeResult(), PostImportResourceProcessor::processItem() would call $this->postImport->storeJobStatus($result).

I'd like to see this changed but if it feels out of scope we can make a new ticket for it, tell me what you think.

* @param mixed $resourceId
* A resource ID.
*/
public function invalidateCacheTags($resourceId): void {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a redundant method. It's identical to DatastoreService::invalidateCacheTags, and we are already injecting DatastoreService. Remove this method, remove ReferenceLookup as a dependency, and call the function as $this->datastoreService->invalidateReferencerCacheTags().

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(need to make the DatastoreService method public, obviously)

) {
$this->configFactory = $configFactory;
$this->logger = $logger;
$this->resourceMapper = $resourceMapper;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trying to think about how to get the number of dependencies down here. ResourceMapper is also part of DatastoreService. I think it would be safe to just make $datastoreService->resourceMapper public, since there are no public properties or setter methods on resourceMapper. Or we could add a getResourceMapper() method to $datastoreService.

$this->resourceMapper = $resourceMapper;
$this->resourceProcessorCollector = $resourceProcessorCollector;
$this->dataDictionaryDiscovery = $dataDictionaryDiscovery;
$this->referenceLookup = $referenceLookup;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See commnent below for a suggestion how we can eliminate the referenceLookup dependency.

$timeout = (int) $plugin_definition['cron']['lease_time'];
$alter_table_query_builder->setConnectionTimeout($timeout);
$this->referenceLookup = $referenceLookup;
$this->configFactory = $config_factory;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could be wrong but I don't think this is used for anything?

@kaise-lafrai
Copy link
Contributor Author

kaise-lafrai commented Dec 31, 2024

@dafeder Can you confirm you selected Drop the datastore table if the post import queue reports an error. on the Resource settings page (admin/dkan/resources) before attempting to run a failing post import?

I attempted the QA steps and was able to see "waiting (0%)" under the store column.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants