Added Selenium URL scraper as new datasource; modified column filter to allow detailed matching information #205

Closed
wants to merge 239 commits into master from tracker_tracker

Commits (239)
1459ea3
Selenium base
dale-wahl Nov 12, 2021
722fd0f
Added a nice little simple search webpages datasource
dale-wahl Nov 12, 2021
120f2e4
Minor edits to ensure 4cat imports url_scraper
dale-wahl Nov 12, 2021
b0b5c36
Merge branch 'master' into tracker_tracker
dale-wahl Nov 15, 2021
3a24204
Allow scraping as generator while still able to close Chrome
dale-wahl Nov 15, 2021
fc269b3
modify column_filter to return updated item w/ T/F by match value
dale-wahl Nov 19, 2021
a8e851a
fix space to tab
dale-wahl Nov 29, 2021
7fb4dac
must have commited a random space
dale-wahl Nov 29, 2021
d548b1d
Add collection of links to selenium_scraper
dale-wahl Nov 29, 2021
1adc5fc
Allow crawling additional subpages within domain
dale-wahl Nov 29, 2021
9457e5e
add docstring to validate_url
dale-wahl Dec 6, 2021
b554fef
remove url_scraper from defaults
dale-wahl Dec 6, 2021
59c6d01
Just leave it commented out so I don't have to memorize it
dale-wahl Dec 8, 2021
90614d9
correct url_scraper description
dale-wahl Dec 8, 2021
751f3f1
add clean_up method to worker to be inheritted/overwritten
dale-wahl Dec 8, 2021
4e7f968
remove after_search_completed from search worker
dale-wahl Dec 8, 2021
ac049e8
Use clean_up method to delete temporary files; make clear that
dale-wahl Dec 8, 2021
7a383ff
use clean_up to quit selenium webdriver and chrome
dale-wahl Dec 8, 2021
9bbaa3a
Merge branch 'master' into tracker_tracker
dale-wahl Jan 25, 2022
4cfc044
new selenium options
dale-wahl Jan 26, 2022
42cd9a7
random link selection in search_webpages
dale-wahl Jan 26, 2022
f48d703
default url_scraper on (at least in this branch!)
dale-wahl Jan 26, 2022
97b8006
Merge branch 'master' into tracker_tracker
dale-wahl Jan 28, 2022
97e5cc2
dammit, missed on merge.
dale-wahl Jan 28, 2022
a039b87
Adding archive specific scraper
dale-wahl Feb 18, 2022
4c32dd7
add missing variable... how did it work before?!
dale-wahl Feb 18, 2022
4649f6e
add to config
dale-wahl Feb 18, 2022
c83c671
uh duh; we ARE the scraper!
dale-wahl Feb 18, 2022
82f2725
added web archive variables
dale-wahl Feb 18, 2022
5135789
more missing 'self'
dale-wahl Feb 18, 2022
06515b6
error handling
dale-wahl Feb 18, 2022
b3d6324
web archive has http/https multiple times; didn't know that was possible
dale-wahl Feb 18, 2022
3c60b91
need more testing... but this will let me continue longer scrapes
dale-wahl Feb 18, 2022
b263bb0
even better error handling
dale-wahl Feb 18, 2022
829965a
provide url for collect results
dale-wahl Feb 18, 2022
91290ba
cast error to string for logging
dale-wahl Feb 21, 2022
7fa5290
don't add 'javascript' links to be scraped
dale-wahl Feb 21, 2022
ac3c317
change error to warning
dale-wahl Feb 21, 2022
1550388
Merge branch 'master' into tracker_tracker
dale-wahl Mar 3, 2022
e9b56c2
fix errors to str
dale-wahl Mar 3, 2022
0fd1e8f
default to firefox (need to work on Docker install)
dale-wahl Mar 3, 2022
4b85862
rename web_archive_scraper search and add new option for http requests
dale-wahl Mar 3, 2022
832b2cb
remove the old one!
dale-wahl Mar 3, 2022
04848a3
handling interruptions is good!
dale-wahl Mar 3, 2022
b2e0fcc
add aggregate column to column_filter
dale-wahl Mar 3, 2022
de30aa5
also return some of the surrounding text from the first match
dale-wahl Mar 3, 2022
ba7ec87
ensure restart if selenium/firefox crashes
dale-wahl Mar 7, 2022
7f755b5
fix error handling, logging, and http parameter
dale-wahl Mar 7, 2022
23494f2
add additional web archive redirect text
dale-wahl Mar 7, 2022
cae8533
try to capture errors
dale-wahl Mar 10, 2022
0e8f33a
refine and fix exclude urls
dale-wahl Mar 10, 2022
5228bda
missing parenthesis
dale-wahl Mar 10, 2022
b30a3b9
missed 'self'
dale-wahl Mar 10, 2022
a4e3229
2 is actually 4 attempts
dale-wahl Mar 11, 2022
e10be3c
add another type of bad response from web archive
dale-wahl Mar 11, 2022
3e06055
alert handling and better error handling
dale-wahl Mar 14, 2022
202bad5
fix some error handling
dale-wahl Mar 16, 2022
e75327b
Merge branch 'master' into tracker_tracker
dale-wahl Mar 18, 2022
4ccff79
move html to new html column and text to body column
dale-wahl Mar 18, 2022
60a6c0a
join text list with newlines for "body" column
dale-wahl Mar 18, 2022
431e941
remove null from csv
dale-wahl Mar 21, 2022
14837aa
add base_url and year to end result, form web archive links, exclude
dale-wahl Mar 28, 2022
a3fb64e
missing parenthesis
dale-wahl Mar 28, 2022
92f5a1d
fix function reference
dale-wahl Mar 28, 2022
08102b7
call preprocessed_urls
dale-wahl Mar 28, 2022
f753007
do not ignore outside domain (need better process to try that)
dale-wahl Mar 28, 2022
8912a6c
domain only
dale-wahl Mar 28, 2022
3fe1b65
fix reference
dale-wahl Mar 29, 2022
4fdf7ba
Merge branch 'master' into tracker_tracker
dale-wahl Apr 1, 2022
88cd26d
try and collect links with selenium
dale-wahl Apr 1, 2022
f4cab46
update column_filter to find multiple matches
dale-wahl Apr 4, 2022
5dfa4e8
fix up the normal url_scraper datasource
dale-wahl Apr 13, 2022
291947c
ensure all selenium links are strings for join
dale-wahl Apr 14, 2022
362a7cf
Merge branch 'master' into tracker_tracker
dale-wahl Apr 19, 2022
70c348b
change output of url_scraper to ndjson with map_items
dale-wahl Apr 19, 2022
1abd3e0
missed key/index change
dale-wahl Apr 19, 2022
bc52415
update web archive to use json and map to 4CAT
dale-wahl Apr 19, 2022
8b4b304
fix no text found
dale-wahl Apr 19, 2022
bc17019
and none on scraped_links
dale-wahl Apr 19, 2022
5ef863b
check key first
dale-wahl Apr 19, 2022
ee3d1e2
fix up web_archive error reporting
dale-wahl Apr 19, 2022
d7da48e
handle None type for error
dale-wahl Apr 19, 2022
55c0c67
record web archive "bad request"
dale-wahl Apr 19, 2022
9b26db7
add wait after redirect movement
dale-wahl Apr 19, 2022
d7af415
increase waittime for redirects
dale-wahl Apr 19, 2022
5da662e
add processor for trackers
dale-wahl Apr 26, 2022
e9b9f91
Merge branch 'master' into tracker_tracker
dale-wahl Apr 29, 2022
8b6f2f6
dict to list for addition
dale-wahl May 10, 2022
d70b974
allow both newline and comma seperated links
dale-wahl Jun 1, 2022
c7120d0
attempt to scrape iframes as seperate pages
dale-wahl Jun 16, 2022
d1cf378
Merge branch 'master' into tracker_tracker
dale-wahl Jul 28, 2022
b5b72b7
Fixes for selenium scraper to work with config database
dale-wahl Aug 2, 2022
d0a9ea4
installation of packages, geckodriver, and firefox if selenium enabled
dale-wahl Aug 2, 2022
c3783f5
update install instructions
dale-wahl Aug 2, 2022
9a25624
Merge branch 'master' into tracker_tracker
dale-wahl Sep 1, 2022
7520991
fix merge error
dale-wahl Sep 1, 2022
15bf4e2
fix dropped function
dale-wahl Sep 1, 2022
595f72c
Merge branch 'tracker_tracker' of https://github.com/digitalmethodsin…
dale-wahl Sep 1, 2022
04eda94
have to be kidding me
dale-wahl Sep 1, 2022
067b3cc
add note; setup requires docker... need to think about IF this will ever
dale-wahl Sep 1, 2022
4cb2106
Merge branch 'master' into tracker_tracker
dale-wahl Sep 6, 2022
5f24394
Merge branch 'master' into tracker_tracker
dale-wahl Sep 8, 2022
04e17bf
Merge branch 'master' into tracker_tracker
dale-wahl Sep 8, 2022
7ef28b7
Merge branch 'master' into tracker_tracker
dale-wahl Sep 14, 2022
6029dcb
Merge branch 'master' into tracker_tracker
dale-wahl Sep 14, 2022
922c03a
seperate selenium class into wrapper and Search class so wrapper can be
dale-wahl Sep 14, 2022
9af88b6
add screenshots; add firefox extension support
dale-wahl Sep 15, 2022
40503d9
update selenium definitions
dale-wahl Sep 15, 2022
11194fb
regex for extracting urls from strings
dale-wahl Sep 15, 2022
ca3e82b
screenshots processor; extract urls from text and takes screenshots
dale-wahl Sep 15, 2022
2f10ff6
Allow producing zip files from data sources
stijn-uva Sep 15, 2022
4248f01
import time
dale-wahl Sep 15, 2022
fbfbe08
pick better default
dale-wahl Sep 15, 2022
7605e2b
Merge branch 'tracker_tracker' of https://github.com/digitalmethodsin…
dale-wahl Sep 15, 2022
4fde2d6
test screenshot datasource
dale-wahl Sep 15, 2022
9e5c89c
validate all params
dale-wahl Sep 15, 2022
8338c77
fix enable extension
dale-wahl Sep 15, 2022
be4c508
haha break out of while loop
dale-wahl Sep 15, 2022
7d086a0
count my items
dale-wahl Sep 15, 2022
b172c64
whoops, len() is important here
dale-wahl Sep 15, 2022
b90cacf
must be getting tired...
dale-wahl Sep 15, 2022
6014030
remove redundant logging
dale-wahl Sep 15, 2022
41bd55b
Eager loading for screenshots, viewport options, etc
stijn-uva Sep 15, 2022
51a3fab
Woops, wrong folder
stijn-uva Sep 15, 2022
a9ab4e5
Fix label shortening
stijn-uva Sep 15, 2022
af1a92a
Just 'queue' instead of 'search queue'
stijn-uva Sep 15, 2022
e746e0d
Yeah, make it headless
stijn-uva Sep 15, 2022
86660d0
README -> DESCRIPTION
stijn-uva Sep 15, 2022
8794ed0
h1 -> h2
stijn-uva Sep 15, 2022
c59e5d5
Actually just have no header
stijn-uva Sep 15, 2022
5b541a0
Use proper filename for downloaded files
stijn-uva Sep 15, 2022
cdc9a60
Configure whether to offer pseudonymisation etc
stijn-uva Sep 15, 2022
66d2463
Tweak descriptions
stijn-uva Sep 15, 2022
8859f16
fix log missing data
dale-wahl Sep 20, 2022
d683fc5
Merge branch 'master' into tracker_tracker
dale-wahl Nov 15, 2022
014d016
add columns to post_topic_matrix
dale-wahl Nov 15, 2022
348869e
fix breadcrumb bug
dale-wahl Nov 15, 2022
41b8440
Add top topics column
dale-wahl Nov 16, 2022
a6790f0
Fix selenium config install parameter (Docker uses this/manual would
dale-wahl Nov 24, 2022
4786f5c
this processor is slow; i thought it was broken long before it updated!
dale-wahl Nov 24, 2022
97204ac
refactor detect_trackers as conversion processor not filter
dale-wahl Nov 24, 2022
bffda55
Merge branch 'master' into tracker_tracker
dale-wahl Jan 17, 2023
6485397
add geckodriver executable to docker install
dale-wahl Jan 17, 2023
b602d06
Merge branch 'master' into tracker_tracker
stijn-uva Feb 1, 2023
dfa8cb6
Auto-configure webdrivers if available in PATH
stijn-uva Feb 1, 2023
3521f8c
Merge branch 'master' into tracker_tracker
dale-wahl Feb 16, 2023
292d51b
use iterate_item to check map_item returns something if not warn admi…
dale-wahl May 23, 2023
deb7791
apparently something with spacy and typing_extensions is broken
dale-wahl May 23, 2023
6b5e31d
remove old debug print
dale-wahl May 24, 2023
635e205
wrap map_item with processor.get_mapped_item as well as check for map…
dale-wahl May 24, 2023
c4d6d41
conform iterate_mapped_items to new method
dale-wahl May 24, 2023
0fb19cc
get_columns needs to detect non CSV and NDJSON
dale-wahl May 30, 2023
3944165
Check Instagram items for ads
dale-wahl Jun 29, 2023
5ffd6dc
warn on instagram ads; add customizable warning message
dale-wahl Jun 29, 2023
869e1de
update screenshots to act as image-downloader and benefit from proces…
dale-wahl Aug 29, 2023
a15bf27
Merge branch 'master' into tracker_tracker
dale-wahl Aug 29, 2023
1f1ae1f
fix is_compatible_with
dale-wahl Aug 29, 2023
a3d3297
Delete helper-scripts/migrate/migrate-1.30-1.31.py
dale-wahl Aug 29, 2023
5c4732c
fix embeddings is_compatible_with
dale-wahl Aug 29, 2023
8b4d71c
Merge branch 'tracker_tracker' of https://github.com/digitalmethodsin…
dale-wahl Aug 29, 2023
99f2991
fix up UI options for hashing and private
dale-wahl Aug 29, 2023
69425f6
abstract was moved to lib
dale-wahl Aug 29, 2023
d94a872
Merge branch 'master' into tracker_tracker
dale-wahl Aug 30, 2023
064c855
various fixes to selenium based datasources
dale-wahl Aug 30, 2023
a438560
processors not compatible with image datasets
dale-wahl Aug 30, 2023
a162bc4
update firefox extension handling
dale-wahl Aug 31, 2023
c5f41ab
screenshots datasource fix get_options
dale-wahl Aug 31, 2023
b786f20
rename screenshots processor to be detected as image dataset
dale-wahl Aug 31, 2023
bae531f
add monthly and weekly frequencies to wayback machine datasource
dale-wahl Sep 5, 2023
c0fc403
wayback ds: fix fail if all attempts do not realize results; addion f…
dale-wahl Sep 5, 2023
986cb11
add scroll down page to allow lazy loading for entire page screenshots
dale-wahl Sep 6, 2023
1522882
screenshots: adjust pause time so it can be used to force a wait for …
dale-wahl Sep 6, 2023
be16bda
hash URLs to create filenames
dale-wahl Sep 12, 2023
f2ce1c3
remove log
dale-wahl Sep 12, 2023
3e93182
add setting to toggle display advanced options
dale-wahl Sep 12, 2023
eea7ba6
add progress bars
dale-wahl Sep 12, 2023
efadeb0
web archive fix query validation
dale-wahl Sep 12, 2023
b18fad6
count subpages in progress
dale-wahl Sep 12, 2023
6c64ad1
remove overwritten function
dale-wahl Sep 12, 2023
2b9b74a
move http response to own column
dale-wahl Sep 12, 2023
59542f9
special filenames
dale-wahl Sep 13, 2023
5729002
add timestamps to all screenshots
dale-wahl Sep 13, 2023
6367e31
add healthcheck from master
dale-wahl Sep 13, 2023
1725a92
Merge branch 'master' into map_item_catch
dale-wahl Sep 13, 2023
b32ddca
do not always warn on map_item error (for example when getting datase…
dale-wahl Sep 13, 2023
2184e2e
ensure processor exists prior to checking map_item (processors may be…
dale-wahl Sep 13, 2023
a1d99d5
no mapping when there is no processor
dale-wahl Sep 13, 2023
4d43be3
restart selenium on failure
dale-wahl Sep 13, 2023
8b098b1
new build have selenium
dale-wahl Sep 13, 2023
fb6ea11
process urls after start (keep original query parameters)
dale-wahl Sep 13, 2023
8ce2f27
undo default firefox
dale-wahl Sep 13, 2023
4c33d6b
quick max
dale-wahl Sep 13, 2023
a96e34a
rename SeleniumScraper to SeleniumSearch
dale-wahl Sep 14, 2023
cbd2d3b
max number screenshots configurable
dale-wahl Sep 14, 2023
e7f55aa
method to get url with error handling
dale-wahl Sep 14, 2023
7870c0e
use get_with_error_handling
dale-wahl Sep 14, 2023
bf4d44f
d'oh, screenshot processor needs to quit selenium
dale-wahl Sep 14, 2023
a4b196e
update log to contain URL
dale-wahl Sep 14, 2023
0549ed4
Update scrolling to use Page down key if necessary
dale-wahl Sep 14, 2023
392c633
improve logs
dale-wahl Sep 14, 2023
13a6e8b
update image_category_wall as screenshot datasource does not have cat…
dale-wahl Sep 14, 2023
621c9e2
no category, no processor
dale-wahl Sep 14, 2023
f45f469
str errors
dale-wahl Sep 15, 2023
4c8f095
screenshots: dismiss alerts when checking ready state is complete
dale-wahl Sep 19, 2023
be790ae
set screenshot timeout to 30 seconds
dale-wahl Sep 19, 2023
3b42b2f
update gensim package
dale-wahl Sep 19, 2023
587185b
screenshots: move processor interrupt into attempts loop
dale-wahl Sep 19, 2023
83a728f
if alert disappears before we can dismiss it...
dale-wahl Sep 19, 2023
f226e67
Merge branch 'master' into tracker_tracker
dale-wahl Sep 19, 2023
f71110e
Merge branch 'master' into map_item_catch
dale-wahl Sep 19, 2023
5d9c5d5
do not warn unmappable for previews or creating CSV extract (warning …
dale-wahl Sep 19, 2023
92440f5
only warn admins once per dataset
dale-wahl Sep 19, 2023
a3e37b0
selenium specific logger
dale-wahl Sep 27, 2023
38441f0
do not switch window when no alert found on dismiss
dale-wahl Sep 27, 2023
91064e7
extract wait for page to load to selenium class
dale-wahl Sep 27, 2023
61dbbb5
improve descriptions of screenshot options
dale-wahl Sep 27, 2023
77fff0d
remove unused line
dale-wahl Sep 27, 2023
314a4bf
treat timeouts differently from other errors
dale-wahl Sep 27, 2023
394a3c5
debug if requested
dale-wahl Sep 27, 2023
70a61ca
increase pause time
dale-wahl Sep 27, 2023
ebfc037
restart browser w/ PID
dale-wahl Sep 27, 2023
ea76070
increase max_workers for selenium
dale-wahl Sep 27, 2023
15358a2
quick fix restart by pid
dale-wahl Sep 27, 2023
ffa5676
avoid bad urls
dale-wahl Sep 27, 2023
df7958b
Merge branch 'map_item_catch' into tracker_tracker
dale-wahl Sep 27, 2023
612b1f9
Merge branch 'master' into tracker_tracker
dale-wahl Nov 17, 2023
90a126b
missing bracket & attempt to fix-missing dependencies in Docker install
dale-wahl Nov 17, 2023
fc7f496
Merge branch 'tracker_tracker' of https://github.com/digitalmethodsin…
dale-wahl Dec 6, 2023
df919e3
fix image_category_wall to be compatible only if two datasets (image …
dale-wahl Dec 6, 2023
3364d55
grumble - need it in is_compatible but don't have the dataset (obvi)
dale-wahl Dec 6, 2023
f253edf
screenshots processor enable firefox extension and warn when cannot use
dale-wahl Dec 19, 2023
b01c77d
add some warnings to track if firefox processes are building up and n…
dale-wahl Feb 20, 2024
809f5be
ensure clean_up is run even if failure in exception
dale-wahl Feb 20, 2024
4fc63be
set up gunicorn logging regardless of Docker
dale-wahl Feb 21, 2024
eab1b96
Merge branch 'master' into tracker_tracker
stijn-uva May 2, 2024
3ed29e1
web_archive fix parameter key
dale-wahl May 2, 2024
1b76f12
Improve max_sites stuff
stijn-uva May 2, 2024
a264832
mappeditem (when did this get updated?)
dale-wahl Jul 11, 2024
8069bb1
add frequency to web archive scraper
dale-wahl Sep 6, 2024
1 change: 0 additions & 1 deletion backend/abstract/processor.py
@@ -201,7 +201,6 @@ def work(self):
self.log.warning("Job %s/%s was queued for a dataset already marked as finished, deleting..." % (self.job.data["jobtype"], self.job.data["remote_id"]))
self.job.finish()


def after_process(self):
"""
Run after processing the dataset
13 changes: 13 additions & 0 deletions backend/abstract/search.py
@@ -87,6 +87,8 @@ def process(self):
elif posts is not None:
self.dataset.update_status("Query finished, no results found.")

self.after_search_completed()

# queue predefined post-processors
if num_posts > 0 and query_parameters.get("next", []):
for next in query_parameters.get("next"):
@@ -151,6 +153,17 @@ def get_items(self, query):
"""
pass

def after_search_completed(self):
"""
Method to use if anything needs to be run after the search method has completely finished.

Descending classes can implement this if desired.

This will allow get_items or search to act as a generator, as it is called after items_to_csv or
items_to_ndjson is called.
"""
pass

def import_from_file(self, path):
"""
Import items from an external file
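The after_search_completed() hook added above is deliberately minimal. As a minimal sketch of the intent, a descending class might use it to tear down a resource only after the result file has been written; the class name, job type, and query shape here are hypothetical, not part of this PR:

    from backend.abstract.search import Search
    import requests


    class ExampleSearch(Search):
        """Hypothetical search worker illustrating the after_search_completed() hook"""
        type = "example-search"  # illustrative job type

        def get_items(self, query):
            # Acting as a generator: the session below must stay open while
            # items_to_csv or items_to_ndjson consumes the yielded items
            self.session = requests.Session()
            for url in query.get("urls", []):
                yield {"url": url, "body": self.session.get(url).text}

        def after_search_completed(self):
            # Called only after the result file has been written, so the
            # session can be closed without truncating the generator
            self.session.close()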
208 changes: 208 additions & 0 deletions backend/abstract/selenium_scraper.py
@@ -0,0 +1,208 @@
import abc
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import WebDriverException, SessionNotCreatedException

from backend.abstract.search import Search
from common.lib.exceptions import ProcessorException


class SeleniumScraper(Search, metaclass=abc.ABCMeta):
"""
Selenium Scraper class

Selenium utilizes a Chrome webdriver and Chrome browser to navigate and scrape the web. This processor can be used
to initialize that browser and navigate it as needed. It replaces search to allow you to utilize the Selenium driver
and ensure the webdriver and browser are properly closed out upon completion.
"""

driver = None
last_scraped_url = None

def simple_scrape_page(self, url, title_404_strings='default'):
"""
Simple helper to scrape a url. Returns a dictionary containing basic results from the scrape, including
final_url, page_title, and page_source; otherwise returns False if the page did not advance
(self.check_for_movement() failed). Does not handle errors from driver.get() (e.g., badly formed URLs,
timeouts, etc.).

Note: calls self.reset_current_page() prior to requesting url to ensure each page is uniquely checked.

You are invited to use this as a template for more complex scraping.

:param str url: url as string; beginning with scheme (e.g., http, https)
:param List title_404_strings: List of strings representing possible 404 text to be compared with driver.title
:return dict: A dictionary containing basic results from the scrape, including final_url, page_title, and page_source.
Returns False if no movement was detected
"""
self.reset_current_page()
# try:
self.driver.get(url)
# except WebDriverException as e:
# # restart selenium

if self.check_for_movement():
detected_404 = self.check_for_404(title_404_strings)
page_title = self.driver.title
current_url = self.driver.current_url
page_source = self.driver.page_source

return {
'original_url': url,
'final_url': current_url,
'page_title': page_title,
'page_source': page_source,
'detected_404': detected_404
}
else:
return False

def collect_links(self):
"""

"""
if self.driver is None:
raise ProcessorException('Selenium Driver not yet started: Cannot collect links')

elems = self.driver.find_elements_by_xpath("//a[@href]")
return [elem.get_attribute("href") for elem in elems]

def search(self, query):
"""
Search for items matching the given query

The real work is done by the get_items() method of the descending
class. This method just provides some scaffolding and post-processing
of results via `after_search()`, if it is defined.

:param dict query: Query parameters
:return: Iterable of matching items, or None if there are no results.
"""
self.start_selenium()
# Returns to default position; i.e., 'data:,'
self.reset_current_page()
# Sets timeout to 60 seconds
self.set_page_load_timeout()

# Normal search function; get_items() is to be implemented by descending classes
try:
posts = self.get_items(query)
except Exception as e:
# Ensure Selenium always quits
self.quit_selenium()
raise e

if not posts:
return None

# search workers may define an 'after_search' hook that is called after
# the query is first completed
if hasattr(self, "after_search") and callable(self.after_search):
posts = self.after_search(posts)

return posts

def start_selenium(self):
"""
Start a headless browser
"""
options = Options()
options.headless = True
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

try:
self.driver = webdriver.Chrome(options=options)
except (SessionNotCreatedException, WebDriverException) as e:
if "binary not found" in str(e):
raise ProcessorException("Chromium binary is not available.")
if "only supports Chrome" in str(e):
raise ProcessorException("Your chromedriver version is incompatible with your Chromium version:\n (%s)" % e)
else:
raise ProcessorException("Could not connect to Chromium (%s)." % e)

def quit_selenium(self):
"""
Always attempt to close the browser, otherwise multiple instances of Chrome will be left running.

And Chrome is a memory hungry monster.
"""
try:
self.driver.quit()
except:
pass

def after_search_completed(self):
"""
Runs after search (and thus the get_items generator) has completed, allowing the Selenium driver to be used
dynamically and items to be yielded and written in turn.
"""
self.quit_selenium()

def set_page_load_timeout(self, timeout=60):
"""
Adjust the time that Selenium will wait for a page to load before failing
"""
self.driver.set_page_load_timeout(timeout)

def check_for_movement(self):
"""
Some driver.get() commands will not result in an error even if they do not result in updating the page source.
This can happen, for example, if a url directs the browser to attempt to download a file. It can therefore be
important to check and ensure a new page was actually obtained before retrieving the page source, as you will
otherwise retrieve the same information from the previous url.

WARNING: It may also be true that a url redirects to the same url as the previously scraped url. This check
would then assume no movement occurred. Use in conjunction with self.reset_current_page() if it is necessary
to check every url's result and identify redirects.
"""
current_url = self.driver.current_url
if current_url == self.last_scraped_url:
return False
else:
return True

def reset_current_page(self):
"""
It may be desirable to "reset" the current page, for example in conjunction with self.check_for_movement(),
to ensure the results are obtained for a specific url provided.

Example: driver.get(url_1) is called and page_source is collected. Then driver.get(url_2) is called, but fails.
Depending on the type of failure (which may not be detected), calling page_source may return the page_source
from url_1 even after driver.get(url_2) is called.
"""
self.driver.get('data:,')
self.last_scraped_url = self.driver.current_url

def check_for_404(self, stop_if_in_title='default'):
"""
Checks page title for references to 404

Selenium does not have a "status code" in the same way the Python requests and other libraries do. This check
can be used to approximate a 404. Alternatively, you could use another library to check for 404 errors, but
that can lead to misleading results (as the new library will necessarily make a separate request).
More information here:
https://www.selenium.dev/documentation/worst_practices/http_response_codes/

Default values: ["page not found", "directory not found", "file not found", "404 not found", "error 404"]

:param list stop_if_in_title: List of strings representing possible 404 text
"""
if stop_if_in_title == 'default':
stop_if_in_title = ["page not found", "directory not found", "file not found", "404 not found", "error 404"]

if any(four_oh_four.lower() in self.driver.title.lower() for four_oh_four in stop_if_in_title):
return True
else:
return False

def enable_download_in_headless_chrome(self, download_dir):
"""
It is possible to allow the web browser to download files.
NOTE: this could introduce security risks.
"""
# add missing support for chrome "send_command" to selenium webdriver
self.driver.command_executor._commands["send_command"] = ("POST", '/session/$sessionId/chromium/send_command')

params = {'cmd': 'Page.setDownloadBehavior', 'params': {'behavior': 'allow', 'downloadPath': download_dir}}
return self.driver.execute("send_command", params)
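Taken together, the scaffolding above can be used roughly as follows. This is a minimal sketch of a hypothetical descending datasource; the job type and query keys are illustrative, while simple_scrape_page() and collect_links() are the helpers defined in this file:

    from backend.abstract.selenium_scraper import SeleniumScraper


    class ExampleUrlSearch(SeleniumScraper):
        """Hypothetical datasource scraping a list of urls from the query"""
        type = "example-url-search"  # illustrative job type

        def get_items(self, query):
            # search() has already started Selenium, reset the current page,
            # and set the page load timeout before this generator is consumed
            for url in query.get("urls", []):
                result = self.simple_scrape_page(url)
                if not result:
                    # no movement detected; e.g., the url triggered a download
                    # or redirected to the previously scraped url
                    continue
                # collect_links() gathers hrefs from whatever page the driver
                # currently has open
                result["links"] = ",".join(self.collect_links())
                yield result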
12 changes: 12 additions & 0 deletions common/lib/helpers.py
@@ -1,6 +1,7 @@
"""
Miscellaneous helper functions for the 4CAT backend
"""
from urllib.parse import urlparse
import subprocess
import datetime
import smtplib
@@ -540,3 +541,14 @@ def send_email(recipient, message):
smtp.sendmail(config.NOREPLY_EMAIL, recipient, message)
else:
smtp.sendmail(config.NOREPLY_EMAIL, recipient, message.as_string())


def validate_url(x):
if isinstance(x, str):
if x.count('http://') > 1 or x.count('https://') > 1:
# check for errors in splitting urls
return False
result = urlparse(x)
return all([result.scheme, result.netloc])
else:
return False
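A quick illustration of the helper's behaviour (a hypothetical snippet, not part of the diff):

    from common.lib.helpers import validate_url

    assert validate_url("https://example.com/page")      # scheme and netloc present
    assert not validate_url("example.com/page")          # no scheme, urlparse check fails
    assert not validate_url("http://a.comhttp://b.org")  # doubled scheme from a bad split
    assert not validate_url(1234)                        # non-string input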
3 changes: 2 additions & 1 deletion config.py-example
@@ -15,7 +15,8 @@ DATASOURCES = {
"boards": "*",
},
"telegram": {},
"twitterv2": {}
"twitterv2": {},
"url_scraper": {},
}

# Configure how the tool is to be named in its web interface. The backend will
12 changes: 12 additions & 0 deletions datasources/url_scraper/__init__.py
@@ -0,0 +1,12 @@
"""
Initialize Selenium Url Scraper data source
"""

# An init_datasource function is expected to be available to initialize this
# data source. A default function that does this is available from the
# backend helpers library.
from common.lib.helpers import init_datasource

# Internal identifier for this data source
DATASOURCE = "url_scraper"
NAME = "Selenium Url Scraper"