Added Selenium URL scraper as new datasource; modified column filter to allow detailed matching information #205

Closed
wants to merge 239 commits into from


dale-wahl
Member

  • Added selenium_scraper, a new Search class to be used as the basis for new datasources.
  • Created the url_scraper datasource, which allows a user to scrape a list of URLs and up to 5 subpages on the host.
  • Modified the column-filter processor to provide detailed output showing which matches were found in a given column (this can mimic Tracker Tracker by searching for substrings within HTML and noting which were found or not found).
  • Updated the Search base class to allow an after_search_completed method; this was necessary to ensure the Selenium webdriver and Chrome browser are properly closed, and it also works with get_items generators.
  • Added a validate_url helper function to helpers.py.
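
To illustrate the URL validation and the detailed match output described above, here is a minimal sketch. It is not the code from this PR: the body of `validate_url`, the function name `match_details`, and the `case_sensitive` parameter are assumptions made for illustration, and the real helper and processor may behave differently.

```python
from urllib.parse import urlparse


def validate_url(url):
    """Illustrative check only: accept URLs with an http(s) scheme and a host."""
    try:
        parsed = urlparse(url)
    except ValueError:
        # urlparse raises ValueError for some malformed inputs
        return False
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def match_details(text, needles, case_sensitive=False):
    """Hypothetical helper: report which substrings occur in text and which do not."""
    haystack = text if case_sensitive else text.lower()
    found, missing = [], []
    for needle in needles:
        probe = needle if case_sensitive else needle.lower()
        (found if probe in haystack else missing).append(needle)
    return {"found": found, "missing": missing}
```

Calling `match_details` on a scraped page's HTML with a list of tracker domains would give one list of the domains that appear in the page and one of those that do not, which is the kind of found/not found detail the column filter change is meant to expose.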

@stijn-uva I have tested this and not found any issues, but did want you to particularly review the changes to Search.
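
For context on the Search change, this is roughly the pattern a cleanup hook of this kind follows. It is a simplified sketch, not the actual diff: the `search` wrapper, the `FakeBrowserSearch` class, and the `browser_open` flag are hypothetical stand-ins, and only the names `Search`, `get_items`, and `after_search_completed` come from the description above.

```python
class Search:
    """Stand-in for a search base class; not the real implementation."""

    def get_items(self, query):
        raise NotImplementedError

    def after_search_completed(self):
        """Optional cleanup hook; does nothing by default."""

    def search(self, query):
        try:
            # yield from keeps get_items lazy, so cleanup is deferred until
            # the generator is exhausted, closed, or raises an exception
            yield from self.get_items(query)
        finally:
            self.after_search_completed()


class FakeBrowserSearch(Search):
    """Hypothetical subclass; a flag stands in for a live webdriver."""

    def __init__(self):
        self.browser_open = False

    def get_items(self, query):
        self.browser_open = True  # stands in for starting the browser
        for url in query["urls"]:
            yield {"url": url}

    def after_search_completed(self):
        self.browser_open = False  # stands in for quitting the webdriver
```

Running the hook from a `finally` block, rather than right after `get_items` is called, is what lets it work when `get_items` is a generator: the browser stays available while items are still being consumed and is released once iteration ends, including when it ends early or with an error.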

backend/abstract/selenium_scraper.py (outdated, resolved)
common/lib/helpers.py (resolved)
config.py-example (outdated, resolved)
dale-wahl and others added 28 commits September 19, 2023 13:45
these are more likely due to an issue with the website in question
this is by individual worker class not for all selenium classes... so you can really crank them out if desired
# Conflicts:
#	backend/lib/search.py
#	common/lib/helpers.py
#	common/lib/module_loader.py
#	processors/visualisation/download_videos.py
#	processors/visualisation/image_category_wall.py
#	processors/visualisation/video_hasher.py
#	setup.py
@dale-wahl
Member Author

Closing; this will be obsolete with the extensions PR, and all updates, datasources, and processors have been moved to a new repository.

I will remove the tracker_tracker and app_stores branches when the above is merged.

@dale-wahl dale-wahl closed this Sep 9, 2024
2 participants