Crawl web to produce singer.io tap compatible output

anelendata/tap-webcrawl

tap-webcrawl

Crawls the web with Selenium to download a data file containing tables, then outputs the data in a singer.io tap compatible format.

This is a Singer tap that produces JSON-formatted data following the Singer spec.

This tap:

  • Crawls the web to download a data file
  • Outputs the schema for each resource
  • Currently supports only Excel (.xls) files for conversion to Singer tap output
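
For orientation, the sketch below shows roughly what Singer-formatted output looks like: one JSON message per line, with a SCHEMA message for a stream followed by RECORD messages. The helper names (write_message, emit_stream), the stream name, and the sample row are hypothetical and are not taken from this tap's code.

```python
import json
import sys


def write_message(message, out=sys.stdout):
    """Write one Singer message as a single line of JSON."""
    out.write(json.dumps(message) + "\n")


def emit_stream(stream, schema, records, key_properties=None, out=sys.stdout):
    """Emit a SCHEMA message, then one RECORD message per row."""
    write_message(
        {
            "type": "SCHEMA",
            "stream": stream,
            "schema": schema,
            "key_properties": key_properties or [],
        },
        out,
    )
    for record in records:
        write_message({"type": "RECORD", "stream": stream, "record": record}, out)


if __name__ == "__main__":
    # Hypothetical stream, schema, and row, for illustration only.
    example_schema = {
        "type": "object",
        "properties": {
            "id": {"type": ["null", "string"]},
            "last_modified_at": {"type": ["null", "string"], "format": "date-time"},
        },
    }
    rows = [{"id": "01", "last_modified_at": "2019-01-01T00:00:00Z"}]
    emit_stream("some_resource", example_schema, rows, key_properties=["id"])
```

A Singer target reads these lines from standard input, which is why each message is written on its own line.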

Usage:

It's rather complicated right now. Here is an example project that you can learn from.

  1. Create the spec for the config file:

Example:

{
    "application": "Crawl some ancient web server to get the data files",
    "args": {
        "id":
        {
            "type": "string",
            "default": "01",
            "help": "This will fill the URL as https://www.example.com/some_resource/{id}"
        },
        "format":
        {
            "type": "string",
            "default": "xls",
            "help": "This will fill the URL as https://www.example.com/some_resource/01?format={format}"
        }
    }
}
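
To make the role of the spec more concrete, here is a minimal sketch of how a spec of this shape could be mapped onto command-line arguments with argparse. This is an assumption about one possible design, not the tap's actual implementation; build_parser and SPEC_TYPES are hypothetical names.

```python
import argparse

# Hypothetical mapping from spec type names to Python types.
SPEC_TYPES = {"string": str, "integer": int, "number": float}


def build_parser(spec):
    """Build an argument parser with one --option per entry in spec["args"]."""
    parser = argparse.ArgumentParser(description=spec.get("application"))
    for name, arg in spec.get("args", {}).items():
        parser.add_argument(
            "--" + name,
            type=SPEC_TYPES.get(arg.get("type", "string"), str),
            default=arg.get("default"),
            help=arg.get("help"),
        )
    return parser


if __name__ == "__main__":
    spec = {
        "application": "Crawl some ancient web server to get the data files",
        "args": {
            "id": {"type": "string", "default": "01", "help": "Resource ID"},
            "format": {"type": "string", "default": "xls", "help": "File format"},
        },
    }
    # Override "id" and fall back to the spec default for "format".
    args = build_parser(spec).parse_args(["--id", "02"])
    print(args.id, args.format)
```

With an empty spec ({}), the same function simply returns a parser with no extra options.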

Note: Currently, you need to create this file even if you don't want to modify the default config specs. In such cases, please provide an empty object:

{}

The reserved default args can be found in default_spec.json.

  2. Generate a Selenium IDE Python script:

      • Install Selenium IDE as a browser plugin.
      • Browse the web manually to complete one task.
      • Export the Python test script and save it somewhere (keep everything untitled so that the class name is the default "TestDefaultSuit" and the test function is "test_untitled").
      • Edit the script to parameterize it so that the tap can replay it with different parameters.

Hint: Refer to the example Python script. It passes URL parameters via kwargs and adds a with_retry() method that patiently waits for a slow web server.
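
The two ideas in the hint can be sketched as follows. This is only an illustration under stated assumptions: the signature of with_retry here is made up and may differ from the one in the example script, and build_url with its URL template is hypothetical.

```python
import time


def with_retry(action, retries=3, wait_seconds=1.0, exceptions=(Exception,)):
    """Call action() until it succeeds or the attempts run out."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(retries):
        try:
            return action()
        except exceptions as error:
            last_error = error
            # Sleep only if another attempt will follow.
            if attempt < retries - 1:
                time.sleep(wait_seconds)
    raise last_error


def build_url(**kwargs):
    """Fill a hypothetical URL template with values passed via kwargs."""
    template = "https://www.example.com/some_resource/{id}?format={format}"
    return template.format(**kwargs)


if __name__ == "__main__":
    print(build_url(id="01", format="xls"))
```

In an exported Selenium script, the action passed to with_retry would typically be a step that can fail while the page is still loading, such as locating or clicking an element.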

  3. Create the config file based on the spec:

Example:

{
  "datetime_key": "last_modified_at",
  "schema_dir": "<path_to_schema_dir>",
  "selenium_ide_script": "./selenium_ide_export/default.py",
  "id": "01"
}

Here, selenium_ide_script points to the Python script from the previous step, and id is a user-defined config value that is passed as a parameter to the Selenium script.

  4. Create schema and catalog files:

$ tap_webcrawl spec.json --infer_schema --config config.json --schema_dir ./schema
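
As a rough picture of what schema inference involves, the sketch below derives a JSON schema from a list of rows by collecting the value types seen in each column. It is a simplified illustration, not the logic behind --infer_schema; infer_type and infer_schema are hypothetical names, and every column is assumed to be nullable.

```python
def infer_type(value):
    """Map a Python value to a JSON schema type name."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # Check bool before int: bool is a subclass of int.
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "number"
    return "string"


def infer_schema(rows):
    """Infer an object schema from a list of dict rows."""
    seen = {}
    for row in rows:
        for key, value in row.items():
            seen.setdefault(key, set()).add(infer_type(value))
    properties = {}
    for key, types in seen.items():
        # A column mixing integers and floats is treated as "number".
        if "number" in types:
            types = types - {"integer"}
        properties[key] = {"type": sorted(types | {"null"})}
    return {"type": "object", "properties": properties}


if __name__ == "__main__":
    rows = [{"id": "01", "amount": 1}, {"id": "02", "amount": 2.5}]
    print(infer_schema(rows))
```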

  5. Run the tap:

$ tap_webcrawl spec.json --config config.json --catalog catalog.json

Copyright © 2019~ Anelen Co., LLC
