Added Selenium URL scraper as new datasource; modified column filter to allow detailed matching information #205

Closed
wants to merge 239 commits into from
239 commits
1459ea3
Selenium base
dale-wahl Nov 12, 2021
722fd0f
Added a nice little simple search webpages datasource
dale-wahl Nov 12, 2021
120f2e4
Minor edits to ensure 4cat imports url_scraper
dale-wahl Nov 12, 2021
b0b5c36
Merge branch 'master' into tracker_tracker
dale-wahl Nov 15, 2021
3a24204
Allow scraping as generator while still able to close Chrome
dale-wahl Nov 15, 2021
fc269b3
modify column_filter to return updated item w/ T/F by match value
dale-wahl Nov 19, 2021
a8e851a
fix space to tab
dale-wahl Nov 29, 2021
7fb4dac
must have commited a random space
dale-wahl Nov 29, 2021
d548b1d
Add collection of links to selenium_scraper
dale-wahl Nov 29, 2021
1adc5fc
Allow crawling additional subpages within domain
dale-wahl Nov 29, 2021
9457e5e
add docstring to validate_url
dale-wahl Dec 6, 2021
b554fef
remove url_scraper from defaults
dale-wahl Dec 6, 2021
59c6d01
Just leave it commented out so I don't have to memorize it
dale-wahl Dec 8, 2021
90614d9
correct url_scraper description
dale-wahl Dec 8, 2021
751f3f1
add clean_up method to worker to be inheritted/overwritten
dale-wahl Dec 8, 2021
4e7f968
remove after_search_completed from search worker
dale-wahl Dec 8, 2021
ac049e8
Use clean_up method to delete temporary files; make clear that
dale-wahl Dec 8, 2021
7a383ff
use clean_up to quit selenium webdriver and chrome
dale-wahl Dec 8, 2021
9bbaa3a
Merge branch 'master' into tracker_tracker
dale-wahl Jan 25, 2022
4cfc044
new selenium options
dale-wahl Jan 26, 2022
42cd9a7
random link selection in search_webpages
dale-wahl Jan 26, 2022
f48d703
default url_scraper on (at least in this branch!)
dale-wahl Jan 26, 2022
97b8006
Merge branch 'master' into tracker_tracker
dale-wahl Jan 28, 2022
97e5cc2
dammit, missed on merge.
dale-wahl Jan 28, 2022
a039b87
Adding archive specific scraper
dale-wahl Feb 18, 2022
4c32dd7
add missing variable... how did it work before?!
dale-wahl Feb 18, 2022
4649f6e
add to config
dale-wahl Feb 18, 2022
c83c671
uh duh; we ARE the scraper!
dale-wahl Feb 18, 2022
82f2725
added web archive variables
dale-wahl Feb 18, 2022
5135789
more missing 'self'
dale-wahl Feb 18, 2022
06515b6
error handling
dale-wahl Feb 18, 2022
b3d6324
web archive has http/https multiple times; didn't know that was possible
dale-wahl Feb 18, 2022
3c60b91
need more testing... but this will let me continue longer scrapes
dale-wahl Feb 18, 2022
b263bb0
even better error handling
dale-wahl Feb 18, 2022
829965a
provide url for collect results
dale-wahl Feb 18, 2022
91290ba
cast error to string for logging
dale-wahl Feb 21, 2022
7fa5290
don't add 'javascript' links to be scraped
dale-wahl Feb 21, 2022
ac3c317
change error to warning
dale-wahl Feb 21, 2022
1550388
Merge branch 'master' into tracker_tracker
dale-wahl Mar 3, 2022
e9b56c2
fix errors to str
dale-wahl Mar 3, 2022
0fd1e8f
default to firefox (need to work on Docker install)
dale-wahl Mar 3, 2022
4b85862
rename web_archive_scraper search and add new option for http requests
dale-wahl Mar 3, 2022
832b2cb
remove the old one!
dale-wahl Mar 3, 2022
04848a3
handling interruptions is good!
dale-wahl Mar 3, 2022
b2e0fcc
add aggregate column to column_filter
dale-wahl Mar 3, 2022
de30aa5
also return some of the surrounding text from the first match
dale-wahl Mar 3, 2022
ba7ec87
ensure restart if selenium/firefox crashes
dale-wahl Mar 7, 2022
7f755b5
fix error handling, logging, and http parameter
dale-wahl Mar 7, 2022
23494f2
add additional web archive redirect text
dale-wahl Mar 7, 2022
cae8533
try to capture errors
dale-wahl Mar 10, 2022
0e8f33a
refine and fix exclude urls
dale-wahl Mar 10, 2022
5228bda
missing parenthesis
dale-wahl Mar 10, 2022
b30a3b9
missed 'self'
dale-wahl Mar 10, 2022
a4e3229
2 is actually 4 attempts
dale-wahl Mar 11, 2022
e10be3c
add another type of bad response from web archive
dale-wahl Mar 11, 2022
3e06055
alert handling and better error handling
dale-wahl Mar 14, 2022
202bad5
fix some error handling
dale-wahl Mar 16, 2022
e75327b
Merge branch 'master' into tracker_tracker
dale-wahl Mar 18, 2022
4ccff79
move html to new html column and text to body column
dale-wahl Mar 18, 2022
60a6c0a
join text list with newlines for "body" column
dale-wahl Mar 18, 2022
431e941
remove null from csv
dale-wahl Mar 21, 2022
14837aa
add base_url and year to end result, form web archive links, exclude
dale-wahl Mar 28, 2022
a3fb64e
missing parenthesis
dale-wahl Mar 28, 2022
92f5a1d
fix function reference
dale-wahl Mar 28, 2022
08102b7
call preprocessed_urls
dale-wahl Mar 28, 2022
f753007
do not ignore outside domain (need better process to try that)
dale-wahl Mar 28, 2022
8912a6c
domain only
dale-wahl Mar 28, 2022
3fe1b65
fix reference
dale-wahl Mar 29, 2022
4fdf7ba
Merge branch 'master' into tracker_tracker
dale-wahl Apr 1, 2022
88cd26d
try and collect links with selenium
dale-wahl Apr 1, 2022
f4cab46
update column_filter to find multiple matches
dale-wahl Apr 4, 2022
5dfa4e8
fix up the normal url_scraper datasource
dale-wahl Apr 13, 2022
291947c
ensure all selenium links are strings for join
dale-wahl Apr 14, 2022
362a7cf
Merge branch 'master' into tracker_tracker
dale-wahl Apr 19, 2022
70c348b
change output of url_scraper to ndjson with map_items
dale-wahl Apr 19, 2022
1abd3e0
missed key/index change
dale-wahl Apr 19, 2022
bc52415
update web archive to use json and map to 4CAT
dale-wahl Apr 19, 2022
8b4b304
fix no text found
dale-wahl Apr 19, 2022
bc17019
and none on scraped_links
dale-wahl Apr 19, 2022
5ef863b
check key first
dale-wahl Apr 19, 2022
ee3d1e2
fix up web_archive error reporting
dale-wahl Apr 19, 2022
d7da48e
handle None type for error
dale-wahl Apr 19, 2022
55c0c67
record web archive "bad request"
dale-wahl Apr 19, 2022
9b26db7
add wait after redirect movement
dale-wahl Apr 19, 2022
d7af415
increase waittime for redirects
dale-wahl Apr 19, 2022
5da662e
add processor for trackers
dale-wahl Apr 26, 2022
e9b9f91
Merge branch 'master' into tracker_tracker
dale-wahl Apr 29, 2022
8b6f2f6
dict to list for addition
dale-wahl May 10, 2022
d70b974
allow both newline and comma seperated links
dale-wahl Jun 1, 2022
c7120d0
attempt to scrape iframes as seperate pages
dale-wahl Jun 16, 2022
d1cf378
Merge branch 'master' into tracker_tracker
dale-wahl Jul 28, 2022
b5b72b7
Fixes for selenium scraper to work with config database
dale-wahl Aug 2, 2022
d0a9ea4
installation of packages, geckodriver, and firefox if selenium enabled
dale-wahl Aug 2, 2022
c3783f5
update install instructions
dale-wahl Aug 2, 2022
9a25624
Merge branch 'master' into tracker_tracker
dale-wahl Sep 1, 2022
7520991
fix merge error
dale-wahl Sep 1, 2022
15bf4e2
fix dropped function
dale-wahl Sep 1, 2022
595f72c
Merge branch 'tracker_tracker' of https://github.com/digitalmethodsin…
dale-wahl Sep 1, 2022
04eda94
have to be kidding me
dale-wahl Sep 1, 2022
067b3cc
add note; setup requires docker... need to think about IF this will ever
dale-wahl Sep 1, 2022
4cb2106
Merge branch 'master' into tracker_tracker
dale-wahl Sep 6, 2022
5f24394
Merge branch 'master' into tracker_tracker
dale-wahl Sep 8, 2022
04e17bf
Merge branch 'master' into tracker_tracker
dale-wahl Sep 8, 2022
7ef28b7
Merge branch 'master' into tracker_tracker
dale-wahl Sep 14, 2022
6029dcb
Merge branch 'master' into tracker_tracker
dale-wahl Sep 14, 2022
922c03a
seperate selenium class into wrapper and Search class so wrapper can be
dale-wahl Sep 14, 2022
9af88b6
add screenshots; add firefox extension support
dale-wahl Sep 15, 2022
40503d9
update selenium definitions
dale-wahl Sep 15, 2022
11194fb
regex for extracting urls from strings
dale-wahl Sep 15, 2022
ca3e82b
screenshots processor; extract urls from text and takes screenshots
dale-wahl Sep 15, 2022
2f10ff6
Allow producing zip files from data sources
stijn-uva Sep 15, 2022
4248f01
import time
dale-wahl Sep 15, 2022
fbfbe08
pick better default
dale-wahl Sep 15, 2022
7605e2b
Merge branch 'tracker_tracker' of https://github.com/digitalmethodsin…
dale-wahl Sep 15, 2022
4fde2d6
test screenshot datasource
dale-wahl Sep 15, 2022
9e5c89c
validate all params
dale-wahl Sep 15, 2022
8338c77
fix enable extension
dale-wahl Sep 15, 2022
be4c508
haha break out of while loop
dale-wahl Sep 15, 2022
7d086a0
count my items
dale-wahl Sep 15, 2022
b172c64
whoops, len() is important here
dale-wahl Sep 15, 2022
b90cacf
must be getting tired...
dale-wahl Sep 15, 2022
6014030
remove redundant logging
dale-wahl Sep 15, 2022
41bd55b
Eager loading for screenshots, viewport options, etc
stijn-uva Sep 15, 2022
51a3fab
Woops, wrong folder
stijn-uva Sep 15, 2022
a9ab4e5
Fix label shortening
stijn-uva Sep 15, 2022
af1a92a
Just 'queue' instead of 'search queue'
stijn-uva Sep 15, 2022
e746e0d
Yeah, make it headless
stijn-uva Sep 15, 2022
86660d0
README -> DESCRIPTION
stijn-uva Sep 15, 2022
8794ed0
h1 -> h2
stijn-uva Sep 15, 2022
c59e5d5
Actually just have no header
stijn-uva Sep 15, 2022
5b541a0
Use proper filename for downloaded files
stijn-uva Sep 15, 2022
cdc9a60
Configure whether to offer pseudonymisation etc
stijn-uva Sep 15, 2022
66d2463
Tweak descriptions
stijn-uva Sep 15, 2022
8859f16
fix log missing data
dale-wahl Sep 20, 2022
d683fc5
Merge branch 'master' into tracker_tracker
dale-wahl Nov 15, 2022
014d016
add columns to post_topic_matrix
dale-wahl Nov 15, 2022
348869e
fix breadcrumb bug
dale-wahl Nov 15, 2022
41b8440
Add top topics column
dale-wahl Nov 16, 2022
a6790f0
Fix selenium config install parameter (Docker uses this/manual would
dale-wahl Nov 24, 2022
4786f5c
this processor is slow; i thought it was broken long before it updated!
dale-wahl Nov 24, 2022
97204ac
refactor detect_trackers as conversion processor not filter
dale-wahl Nov 24, 2022
bffda55
Merge branch 'master' into tracker_tracker
dale-wahl Jan 17, 2023
6485397
add geckodriver executable to docker install
dale-wahl Jan 17, 2023
b602d06
Merge branch 'master' into tracker_tracker
stijn-uva Feb 1, 2023
dfa8cb6
Auto-configure webdrivers if available in PATH
stijn-uva Feb 1, 2023
3521f8c
Merge branch 'master' into tracker_tracker
dale-wahl Feb 16, 2023
292d51b
use iterate_item to check map_item returns something if not warn admi…
dale-wahl May 23, 2023
deb7791
apparently something with spacy and typing_extensions is broken
dale-wahl May 23, 2023
6b5e31d
remove old debug print
dale-wahl May 24, 2023
635e205
wrap map_item with processor.get_mapped_item as well as check for map…
dale-wahl May 24, 2023
c4d6d41
conform iterate_mapped_items to new method
dale-wahl May 24, 2023
0fb19cc
get_columns needs to detect non CSV and NDJSON
dale-wahl May 30, 2023
3944165
Check Instagram items for ads
dale-wahl Jun 29, 2023
5ffd6dc
warn on instagram ads; add customizable warning message
dale-wahl Jun 29, 2023
869e1de
update screenshots to act as image-downloader and benefit from proces…
dale-wahl Aug 29, 2023
a15bf27
Merge branch 'master' into tracker_tracker
dale-wahl Aug 29, 2023
1f1ae1f
fix is_compatible_with
dale-wahl Aug 29, 2023
a3d3297
Delete helper-scripts/migrate/migrate-1.30-1.31.py
dale-wahl Aug 29, 2023
5c4732c
fix embeddings is_compatible_with
dale-wahl Aug 29, 2023
8b4d71c
Merge branch 'tracker_tracker' of https://github.com/digitalmethodsin…
dale-wahl Aug 29, 2023
99f2991
fix up UI options for hashing and private
dale-wahl Aug 29, 2023
69425f6
abstract was moved to lib
dale-wahl Aug 29, 2023
d94a872
Merge branch 'master' into tracker_tracker
dale-wahl Aug 30, 2023
064c855
various fixes to selenium based datasources
dale-wahl Aug 30, 2023
a438560
processors not compatible with image datasets
dale-wahl Aug 30, 2023
a162bc4
update firefox extension handling
dale-wahl Aug 31, 2023
c5f41ab
screenshots datasource fix get_options
dale-wahl Aug 31, 2023
b786f20
rename screenshots processor to be detected as image dataset
dale-wahl Aug 31, 2023
bae531f
add monthly and weekly frequencies to wayback machine datasource
dale-wahl Sep 5, 2023
c0fc403
wayback ds: fix fail if all attempts do not realize results; addion f…
dale-wahl Sep 5, 2023
986cb11
add scroll down page to allow lazy loading for entire page screenshots
dale-wahl Sep 6, 2023
1522882
screenshots: adjust pause time so it can be used to force a wait for …
dale-wahl Sep 6, 2023
be16bda
hash URLs to create filenames
dale-wahl Sep 12, 2023
f2ce1c3
remove log
dale-wahl Sep 12, 2023
3e93182
add setting to toggle display advanced options
dale-wahl Sep 12, 2023
eea7ba6
add progress bars
dale-wahl Sep 12, 2023
efadeb0
web archive fix query validation
dale-wahl Sep 12, 2023
b18fad6
count subpages in progress
dale-wahl Sep 12, 2023
6c64ad1
remove overwritten function
dale-wahl Sep 12, 2023
2b9b74a
move http response to own column
dale-wahl Sep 12, 2023
59542f9
special filenames
dale-wahl Sep 13, 2023
5729002
add timestamps to all screenshots
dale-wahl Sep 13, 2023
6367e31
add healthcheck from master
dale-wahl Sep 13, 2023
1725a92
Merge branch 'master' into map_item_catch
dale-wahl Sep 13, 2023
b32ddca
do not always warn on map_item error (for example when getting datase…
dale-wahl Sep 13, 2023
2184e2e
ensure processor exists prior to checking map_item (processors may be…
dale-wahl Sep 13, 2023
a1d99d5
no mapping when there is no processor
dale-wahl Sep 13, 2023
4d43be3
restart selenium on failure
dale-wahl Sep 13, 2023
8b098b1
new build have selenium
dale-wahl Sep 13, 2023
fb6ea11
process urls after start (keep original query parameters)
dale-wahl Sep 13, 2023
8ce2f27
undo default firefox
dale-wahl Sep 13, 2023
4c33d6b
quick max
dale-wahl Sep 13, 2023
a96e34a
rename SeleniumScraper to SeleniumSearch
dale-wahl Sep 14, 2023
cbd2d3b
max number screenshots configurable
dale-wahl Sep 14, 2023
e7f55aa
method to get url with error handling
dale-wahl Sep 14, 2023
7870c0e
use get_with_error_handling
dale-wahl Sep 14, 2023
bf4d44f
d'oh, screenshot processor needs to quit selenium
dale-wahl Sep 14, 2023
a4b196e
update log to contain URL
dale-wahl Sep 14, 2023
0549ed4
Update scrolling to use Page down key if necessary
dale-wahl Sep 14, 2023
392c633
improve logs
dale-wahl Sep 14, 2023
13a6e8b
update image_category_wall as screenshot datasource does not have cat…
dale-wahl Sep 14, 2023
621c9e2
no category, no processor
dale-wahl Sep 14, 2023
f45f469
str errors
dale-wahl Sep 15, 2023
4c8f095
screenshots: dismiss alerts when checking ready state is complete
dale-wahl Sep 19, 2023
be790ae
set screenshot timeout to 30 seconds
dale-wahl Sep 19, 2023
3b42b2f
update gensim package
dale-wahl Sep 19, 2023
587185b
screenshots: move processor interrupt into attempts loop
dale-wahl Sep 19, 2023
83a728f
if alert disappears before we can dismiss it...
dale-wahl Sep 19, 2023
f226e67
Merge branch 'master' into tracker_tracker
dale-wahl Sep 19, 2023
f71110e
Merge branch 'master' into map_item_catch
dale-wahl Sep 19, 2023
5d9c5d5
do not warn unmappable for previews or creating CSV extract (warning …
dale-wahl Sep 19, 2023
92440f5
only warn admins once per dataset
dale-wahl Sep 19, 2023
a3e37b0
selenium specific logger
dale-wahl Sep 27, 2023
38441f0
do not switch window when no alert found on dismiss
dale-wahl Sep 27, 2023
91064e7
extract wait for page to load to selenium class
dale-wahl Sep 27, 2023
61dbbb5
improve descriptions of screenshot options
dale-wahl Sep 27, 2023
77fff0d
remove unused line
dale-wahl Sep 27, 2023
314a4bf
treat timeouts differently from other errors
dale-wahl Sep 27, 2023
394a3c5
debug if requested
dale-wahl Sep 27, 2023
70a61ca
increase pause time
dale-wahl Sep 27, 2023
ebfc037
restart browser w/ PID
dale-wahl Sep 27, 2023
ea76070
increase max_workers for selenium
dale-wahl Sep 27, 2023
15358a2
quick fix restart by pid
dale-wahl Sep 27, 2023
ffa5676
avoid bad urls
dale-wahl Sep 27, 2023
df7958b
Merge branch 'map_item_catch' into tracker_tracker
dale-wahl Sep 27, 2023
612b1f9
Merge branch 'master' into tracker_tracker
dale-wahl Nov 17, 2023
90a126b
missing bracket & attempt to fix-missing dependencies in Docker install
dale-wahl Nov 17, 2023
fc7f496
Merge branch 'tracker_tracker' of https://github.com/digitalmethodsin…
dale-wahl Dec 6, 2023
df919e3
fix image_category_wall to be compatible only if two datasets (image …
dale-wahl Dec 6, 2023
3364d55
grumble - need it in is_compatible but don't have the dataset (obvi)
dale-wahl Dec 6, 2023
f253edf
screenshots processor enable firefox extension and warn when cannot use
dale-wahl Dec 19, 2023
b01c77d
add some warnings to track if firefox processes are building up and n…
dale-wahl Feb 20, 2024
809f5be
ensure clean_up is run even if failure in exception
dale-wahl Feb 20, 2024
4fc63be
set up gunicorn logging regardless of Docker
dale-wahl Feb 21, 2024
eab1b96
Merge branch 'master' into tracker_tracker
stijn-uva May 2, 2024
3ed29e1
web_archive fix parameter key
dale-wahl May 2, 2024
1b76f12
Improve max_sites stuff
stijn-uva May 2, 2024
a264832
mappeditem (when did this get updated?)
dale-wahl Jul 11, 2024
8069bb1
add frequency to web archive scraper
dale-wahl Sep 6, 2024
6 changes: 4 additions & 2 deletions backend/lib/processor.py
@@ -628,7 +628,7 @@ def write_csv_items_and_finish(self, data):
         self.dataset.update_status("Finished")
         self.dataset.finish(len(data))

-    def write_archive_and_finish(self, files, num_items=None, compression=zipfile.ZIP_STORED):
+    def write_archive_and_finish(self, files, num_items=None, compression=zipfile.ZIP_STORED, finish=True):
         """
         Archive a bunch of files into a zip archive and finish processing

@@ -639,6 +639,7 @@ def write_archive_and_finish(self, files, num_items=None, compression=zipfile.ZI
         files added to the archive will be used.
         :param int compression: Type of compression to use. By default, files
         are not compressed, to speed up unarchiving.
+        :param bool finish: Finish the dataset/job afterwards or not?
         """
         is_folder = False
         if issubclass(type(files), PurePath):
@@ -665,7 +666,8 @@ def write_archive_and_finish(self, files, num_items=None, compression=zipfile.ZI
         if num_items is None:
             num_items = done

-        self.dataset.finish(num_items)
+        if finish:
+            self.dataset.finish(num_items)

     def create_standalone(self):
         """
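For readers skimming the processor.py change: the new `finish` flag lets a caller write the archive without also marking the dataset as finished, so the caller can finalize later. The sketch below illustrates that pattern in isolation. It is not 4CAT's implementation; `FakeDataset`, `write_archive`, and `demo` are hypothetical names made up for this example, and the folder handling is simplified.

```python
import tempfile
import zipfile
from pathlib import Path


class FakeDataset:
    """Hypothetical stand-in for a dataset object; not 4CAT's DataSet class."""

    def __init__(self):
        self.finished = False
        self.num_items = None

    def finish(self, num_items):
        self.finished = True
        self.num_items = num_items


def write_archive(dataset, folder, archive_path, num_items=None,
                  compression=zipfile.ZIP_STORED, finish=True):
    """Zip every file in `folder`; only mark the dataset finished if asked to."""
    done = 0
    with zipfile.ZipFile(archive_path, "w", compression=compression) as archive:
        for path in sorted(Path(folder).iterdir()):
            if path.is_file():
                archive.write(path, arcname=path.name)
                done += 1

    # Fall back to the number of archived files when no count is given
    if num_items is None:
        num_items = done

    # The caller may want to finalize the dataset itself later on
    if finish:
        dataset.finish(num_items)

    return num_items


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        staging = Path(tmp) / "staging"
        staging.mkdir()
        (staging / "a.txt").write_text("one")
        (staging / "b.txt").write_text("two")

        dataset = FakeDataset()
        archive_path = Path(tmp) / "out.zip"
        count = write_archive(dataset, staging, archive_path, finish=False)

        with zipfile.ZipFile(archive_path) as archive:
            names = sorted(archive.namelist())
        return count, dataset.finished, names
```

Calling `demo()` writes a two-file archive with `finish=False` and leaves the stand-in dataset unfinished, which is the situation `items_to_archive` in search.py appears to rely on.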
30 changes: 25 additions & 5 deletions backend/lib/search.py
@@ -1,10 +1,13 @@
import hashlib
import zipfile
import secrets
import shutil
import random
import json
import math
import csv
import os
import copy

from pathlib import Path
from abc import ABC, abstractmethod
@@ -71,18 +74,19 @@ def process(self):
                 items = self.import_from_file(query_parameters.get("file"))
             else:
                 items = self.search(query_parameters)

         except WorkerInterruptedException:
             raise ProcessorInterruptedException("Interrupted while collecting data, trying again later.")

         # Write items to file and update the DataBase status to finished
         num_items = 0
         if items:
             self.dataset.update_status("Writing collected data to dataset file")
-            if results_file.suffix == ".ndjson":
-                num_items = self.items_to_ndjson(items, results_file)
-            elif results_file.suffix == ".csv":
+            if self.extension == "csv":
                 num_items = self.items_to_csv(items, results_file)
+            elif self.extension == "ndjson":
+                num_items = self.items_to_ndjson(items, results_file)
+            elif self.extension == "zip":
+                num_items = self.items_to_archive(items, results_file)
             else:
                 raise NotImplementedError("Datasource query cannot be saved as %s file" % results_file.suffix)

@@ -361,6 +365,22 @@ def items_to_ndjson(self, items, filepath):

         return processed

+    def items_to_archive(self, items, filepath):
+        """
+        Save retrieved items as an archive
+
+        Assumes that items is an iterable with one item, a Path object
+        referring to a folder containing files to be archived. The folder will
+        be removed afterwards.
+
+        :param items:
+        :param filepath: Where to store the archive
+        :return int: Number of items
+        """
+        num_items = len(os.listdir(items))
+        self.write_archive_and_finish(items, None, zipfile.ZIP_STORED, False)
+        return num_items
+

 class SearchWithScope(Search, ABC):
     """
@@ -404,7 +424,7 @@ def search(self, query):
             # proportion of items matches
             # first, get amount of items for all threads in which matching
             # items occur and that are long enough
-            thread_ids = tuple([post["thread_id"] for post in items])
+            thread_ids = tuple([item["thread_id"] for item in items])
             self.dataset.update_status("Retrieving thread metadata for %i threads" % len(thread_ids))
             try:
                 min_length = int(query.get("scope_length", 30))
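The search.py change switches the writer selection from the result file's suffix to the data source's declared extension, and adds a zip branch. The sketch below shows that dispatch on its own, under some assumptions: `save_items` and `demo` are hypothetical names, and the three writer functions are simplified stand-ins rather than 4CAT's methods. Note that the diff's docstring describes `items` as an iterable holding one Path, while the code passes `items` straight to `os.listdir`; this sketch follows the code and treats the argument as the folder path itself.

```python
import csv
import json
import os
import tempfile
import zipfile
from pathlib import Path


def items_to_csv(items, filepath):
    """Write a list of dicts as CSV and return the number of rows."""
    items = list(items)
    if not items:
        return 0
    with open(filepath, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(items[0].keys()))
        writer.writeheader()
        writer.writerows(items)
    return len(items)


def items_to_ndjson(items, filepath):
    """Write one JSON object per line and return the number of lines."""
    count = 0
    with open(filepath, "w", encoding="utf-8") as handle:
        for item in items:
            handle.write(json.dumps(item) + "\n")
            count += 1
    return count


def items_to_archive(folder, filepath):
    """Zip the files in `folder` (a directory path) and return how many there were."""
    names = sorted(os.listdir(folder))
    with zipfile.ZipFile(filepath, "w", compression=zipfile.ZIP_STORED) as archive:
        for name in names:
            archive.write(Path(folder) / name, arcname=name)
    return len(names)


def save_items(items, filepath, extension):
    """Pick a writer based on the declared extension, not the file suffix."""
    if extension == "csv":
        return items_to_csv(items, filepath)
    elif extension == "ndjson":
        return items_to_ndjson(items, filepath)
    elif extension == "zip":
        return items_to_archive(items, filepath)
    raise NotImplementedError("Datasource query cannot be saved as %s file" % extension)


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        tmp = Path(tmp)
        rows = [{"id": 1, "body": "a"}, {"id": 2, "body": "b"}]
        n_csv = save_items(rows, tmp / "out.csv", "csv")
        n_ndjson = save_items(rows, tmp / "out.ndjson", "ndjson")

        staging = tmp / "staging"
        staging.mkdir()
        (staging / "shot.png").write_bytes(b"placeholder bytes, not a real image")
        n_zip = save_items(staging, tmp / "out.zip", "zip")

        try:
            save_items(rows, tmp / "out.xml", "xml")
            unsupported = False
        except NotImplementedError:
            unsupported = True
        return n_csv, n_ndjson, n_zip, unsupported
```

An unknown extension raises `NotImplementedError`, mirroring the `else` branch kept in the diff.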