Crawl the web with Selenium to download a data file containing tables, then output the data in a Singer tap-compatible format.
This is a Singer tap that produces JSON-formatted data following the Singer spec.
This tap:
- Crawls the web to download a data file
- Outputs the schema for each resource
- Currently supports only Excel (.xls) files for conversion to Singer tap output
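As a rough illustration of the output described above, the sketch below turns one table into Singer messages: a SCHEMA message followed by one RECORD message per row, each written as a single JSON line. The stream name, column names, and rows are made up for the example, and every column is assumed to be a nullable string; this is not the tap's own conversion code.

```python
import json
import sys


def to_singer_messages(stream, header, rows):
    """Yield a SCHEMA message followed by one RECORD message per row."""
    # Assumption: every column is treated as a nullable string for simplicity.
    schema = {
        "type": "object",
        "properties": {name: {"type": ["null", "string"]} for name in header},
    }
    yield {"type": "SCHEMA", "stream": stream, "schema": schema, "key_properties": []}
    for row in rows:
        yield {"type": "RECORD", "stream": stream, "record": dict(zip(header, row))}


def write_messages(messages, out=sys.stdout):
    """Write each message as one JSON document per line."""
    for message in messages:
        out.write(json.dumps(message) + "\n")


# Hypothetical table, as it might look after reading one sheet of an .xls file.
header = ["id", "name"]
rows = [["1", "alpha"], ["2", "beta"]]
messages = list(to_singer_messages("sheet1", header, rows))
write_messages(messages)
```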
It's rather complicated right now. Here is an example project that you can learn from.
- Create the spec for the config file:
  {
    "application": "Crawl some ancient web server to get the data files",
    "args": {
      "id": {
        "type": "string",
        "default": "01",
        "help": "This will fill the URL as{id}"
      },
      "format": {
        "type": "string",
        "default": "xls",
        "help": "This will fill the URL as{format}"
      }
    }
  }
Note: Currently, you need to create this file even if you don't want to change the default config spec. In that case, provide an empty object: {}
The reserved default args are listed in default_spec.json.
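To make the role of the spec more concrete, here is a minimal sketch of how arg defaults from a spec could be combined with config values and then used to fill a URL. It assumes the args are keyed by name (id and format, taken from the help strings); the resolve_args helper and the example.com URL template are hypothetical and not part of tap_webcrawl.

```python
def resolve_args(spec, config):
    """Start from each arg's default in the spec, then overlay config values."""
    resolved = {name: arg.get("default") for name, arg in spec.get("args", {}).items()}
    resolved.update(config)
    return resolved


# Hypothetical spec mirroring the two args described above.
spec = {
    "application": "Crawl some ancient web server to get the data files",
    "args": {
        "id": {"type": "string", "default": "01", "help": "Fills {id} in the URL"},
        "format": {"type": "string", "default": "xls", "help": "Fills {format} in the URL"},
    },
}

# Hypothetical URL template; the real one lives in the Selenium script.
url_template = "https://example.com/files/{id}.{format}"

# The config overrides id; format falls back to the spec default.
args = resolve_args(spec, {"id": "02"})
url = url_template.format(**args)
print(url)

# An empty spec object contributes no defaults at all.
only_config = resolve_args({}, {"id": "02"})
```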
- Generate a Selenium IDE Python script:
  - Install Selenium IDE as a browser plugin.
  - Browse the web manually to complete one task.
  - Export the Python test script and save it somewhere (keep everything untitled so the class name is the default "TestDefaultSuit" and the test function is "test_untitled").
  - Edit the script to parameterize it so the tap can replay it with different params.
Hint: Refer to the example Python script. It passes the URL parameters via kwargs and adds a with_retry() method that waits patiently for the slow web server.
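The example script's with_retry() is not reproduced here, so the following is only a guess at what such a helper could look like: it calls an action repeatedly, sleeping between attempts, until the action succeeds or the retries run out. The signature and parameter names are assumptions. In a real script the action would typically be a Selenium call such as looking up an element; a deliberately flaky function stands in for it so the sketch runs without a browser.

```python
import time


def with_retry(action, retries=5, wait_seconds=1.0, exceptions=(Exception,)):
    """Call action() until it succeeds or the retries are used up."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(retries):
        try:
            return action()
        except exceptions as error:
            last_error = error
            # Sleep only if another attempt will follow.
            if attempt < retries - 1:
                time.sleep(wait_seconds)
    raise last_error


# Stand-in for a slow page: fails twice, then succeeds.
calls = {"count": 0}


def flaky_lookup():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("element not ready yet")
    return "ready"


result = with_retry(flaky_lookup, retries=5, wait_seconds=0.01)
print(result, calls["count"])
```

Passing the action as a callable keeps the helper independent of Selenium, so the same wrapper can guard any step that may fail while the server is still responding.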
- Create the config file based on the spec:
  {
    "datetime_key": "last_modified_at",
    "schema_dir": <path_to_schema_dir>,
    "selenium_ide_script": "./selenium_ide_export/",  // This Python script is from the previous step
    "id": "01"  // a user-defined config value passed to the Selenium script as a parameter
  }
- Create the schema and catalog files:
$ tap_webcrawl spec.json --infer_schema --config config.json --schema_dir ./schema
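How --infer_schema derives the schema is not described here. As a rough idea of what schema inference involves, the sketch below scans sample records and picks a JSON Schema type per column. The infer_type and infer_schema helpers and the sample records are hypothetical, and the real tap may handle dates, nesting, and mixed types differently.

```python
def infer_type(values):
    """Pick the narrowest JSON Schema type that fits every non-null value."""
    non_null = [v for v in values if v is not None]
    if non_null and all(isinstance(v, bool) for v in non_null):
        return "boolean"
    # bool is a subclass of int in Python, so exclude it explicitly.
    if non_null and all(isinstance(v, int) and not isinstance(v, bool) for v in non_null):
        return "integer"
    if non_null and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in non_null
    ):
        return "number"
    return "string"


def infer_schema(records):
    """Build an object schema whose properties are all nullable."""
    keys = []
    for record in records:
        for key in record:
            if key not in keys:
                keys.append(key)
    properties = {
        key: {"type": ["null", infer_type([r.get(key) for r in records])]}
        for key in keys
    }
    return {"type": "object", "properties": properties}


# Hypothetical rows read from a downloaded table.
records = [
    {"id": 1, "price": 2.5, "name": "alpha"},
    {"id": 2, "price": 3, "name": None},
]
schema = infer_schema(records)
print(schema)
```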
- Run the tap:
$ tap_webcrawl spec.json --config config.json --catalog catalog.json
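Running the tap replays the Selenium script with values from the config. How tap_webcrawl does this internally is not shown here; the sketch below is one plausible approach, loading the exported script from its path and calling test_untitled with keyword arguments. The load_test_class and replay helpers are hypothetical. A tiny stand-in script replaces a real export so the sketch runs without a browser; it mimics the setup_method and teardown_method hooks that Selenium IDE's pytest export defines.

```python
import importlib.util
import os
import tempfile
import textwrap


def load_test_class(script_path, class_name="TestDefaultSuit"):
    """Import the exported script from a file path and return its test class."""
    module_spec = importlib.util.spec_from_file_location("selenium_ide_script", script_path)
    module = importlib.util.module_from_spec(module_spec)
    module_spec.loader.exec_module(module)
    return getattr(module, class_name)


def replay(script_path, **params):
    """Run test_untitled once, passing config values as keyword arguments."""
    test = load_test_class(script_path)()
    test.setup_method(None)
    try:
        return test.test_untitled(**params)
    finally:
        # Always give the script a chance to close the browser.
        test.teardown_method(None)


# Stand-in for an exported script; a real export would drive Selenium
# inside test_untitled instead of returning a string.
STAND_IN_SCRIPT = textwrap.dedent(
    """
    class TestDefaultSuit:
        def setup_method(self, method):
            self.vars = {}

        def teardown_method(self, method):
            pass

        def test_untitled(self, **kwargs):
            return "visited page for id=" + kwargs["id"]
    """
)

with tempfile.TemporaryDirectory() as tmp_dir:
    script_path = os.path.join(tmp_dir, "stand_in_export.py")
    with open(script_path, "w") as handle:
        handle.write(STAND_IN_SCRIPT)
    outcome = replay(script_path, id="01")

print(outcome)
```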
Copyright © 2019~ Anelen Co., LLC