telegram crawl improvements #444
Merged
commit dd85961696de3d01fa48cfbbac8a31a4374edc83 Author: sal-phd-desktop <[email protected]> Date: Mon Sep 23 14:37:50 2024 +0200 Only import bsky embed JS on front page, make divs wider commit 02f90bd1559d710360324e1dca116e8c5519f9fe Author: sal-phd-desktop <[email protected]> Date: Fri Sep 20 15:03:09 2024 +0200 Link to Bluesky in readme commit e675dd04a9ffb45cc72704763b7553fee6cf59a2 Merge: 070035eb 38418b2e Author: sal-phd-desktop <[email protected]> Date: Fri Sep 20 15:01:45 2024 +0200 Merge branch 'master' of https://github.com/digitalmethodsinitiative/4cat commit 070035ebf0cf4065f32f00e78044bb24a22172bd Author: sal-phd-desktop <[email protected]> Date: Fri Sep 20 15:01:31 2024 +0200 Link to Bluesky in readme commit 38418b2ec1533f5e13c8d3f001903db0bfdab4af Author: Sal Hagen <[email protected]> Date: Thu Sep 19 17:27:00 2024 +0200 Host BlueSky widget ourselves commit e281eb8bdfad3ec4c800bec2a64e6ff3263a2f74 Author: Stijn Peeters <[email protected]> Date: Thu Sep 19 15:32:08 2024 +0200 Refactor module loading (#396) * Refactor module loading * Optionally inject modules when instantiating dataset object * pass modules in a few more places where possible I think that is everywhere in the frontend. Backend is a bit odd as we are passing dataset.modules when it is None and thus creating children that would require individual inits of ModuleCollector. Could be more to look at there. 
* Do not lazy-load modules * modules/all_modules * Squashed commit of the following: commit 3f2a62a124926cfeb840796f104a702878ac10e5 Author: Carsten Schnober <[email protected]> Date: Wed Sep 18 18:18:29 2024 +0200 Update Gensim to >=4.3.3, <4.4.0 (#450) * Update Gensim to >=4.3.3, <4.4.0 * update nltk as well --------- Co-authored-by: Dale Wahl <[email protected]> Co-authored-by: Sal Hagen <[email protected]> commit fee2c8c08617094f28496963da282d2e2dddeab7 Merge: 3d94b666 f8e93eda Author: sal-phd-desktop <[email protected]> Date: Wed Sep 18 18:11:19 2024 +0200 Merge branch 'master' of https://github.com/digitalmethodsinitiative/4cat commit 3d94b666cedd0de4e0bee953cbf1d787fdc38854 Author: sal-phd-desktop <[email protected]> Date: Wed Sep 18 18:11:04 2024 +0200 FINALLY remove 'News' from the front page, replace with 4CAT BlueSky updates and potential information about the specific server (to be set on config page) commit f8e93edabe9013a2c1229caa4c454fab09620125 Author: Stijn Peeters <[email protected]> Date: Wed Sep 18 15:11:21 2024 +0200 Simple extensions page in Control Panel commit b5be128c7b8682fb233d962326d9118a61053165 Author: Stijn Peeters <[email protected]> Date: Wed Sep 18 14:08:13 2024 +0200 Remove 'docs' directory commit 1e2010af44817016c274c9ec9f7f9971deb57f66 Author: Stijn Peeters <[email protected]> Date: Wed Sep 18 14:07:38 2024 +0200 Forgot TikTok and Douyin commit c757dd51884e7ec9cf62ca1726feacab4b2283b7 Author: Stijn Peeters <[email protected]> Date: Wed Sep 18 14:01:31 2024 +0200 Say 'zeeschuimer' instead of 'extension' to avoid confusion with 4CAT extensions commit ee7f4345478f923541536c86a5b06246deae03f6 Author: Stijn Peeters <[email protected]> Date: Wed Sep 18 14:00:40 2024 +0200 RIP Parler data source commit 11300f2430b51887823b280405de4ded4f15ede1 Author: Stijn Peeters <[email protected]> Date: Wed Sep 18 11:21:37 2024 +0200 Tuplestring commit 547265240eba81ca0ad270cd3c536a2b1dcf512d Author: Stijn Peeters <[email protected]> Date: Wed Sep 
18 11:15:29 2024 +0200 Pass user obj instead of str to ConfigWrapper in Processor commit b21866d7900b5d20ed6ce61ee9aff50f3c0df910 Author: Stijn Peeters <[email protected]> Date: Tue Sep 17 17:45:01 2024 +0200 Ensure request-aware config reader in user object when using config wrapper commit bbe79e4b0fe870ccc36cab7bfe7963b28d1948e3 Author: Sal Hagen <[email protected]> Date: Tue Sep 17 15:12:46 2024 +0200 Fix extension path walk for Windows commit d6064beaf31a6a85b0e34ed4f8126eb4c4fc07e3 Author: Stijn Peeters <[email protected]> Date: Mon Sep 16 14:50:45 2024 +0200 Allow tags that have no users Use case: tag-based frontend differentiation using X-4CAT-Config-Via-Proxy commit b542ded6f976809ec88445e7b04f2c81b900188e Author: Stijn Peeters <[email protected]> Date: Mon Sep 16 14:13:14 2024 +0200 Trailing slash in query results list commit a4bddae575b22a009925206a1337bdd89349e567 Author: Dale Wahl <[email protected]> Date: Mon Sep 16 13:57:23 2024 +0200 4CAT Extension - easy(ier) adding of new datasources/processors that can be mainted seperately from 4CAT base code (#451) * domain only * fix reference * try and collect links with selenium * update column_filter to find multiple matches * fix up the normal url_scraper datasource * ensure all selenium links are strings for join * change output of url_scraper to ndjson with map_items * missed key/index change * update web archive to use json and map to 4CAT * fix no text found * and none on scraped_links * check key first * fix up web_archive error reporting * handle None type for error * record web archive "bad request" * add wait after redirect movement * increase waittime for redirects * add processor for trackers * dict to list for addition * allow both newline and comma seperated links * attempt to scrape iframes as seperate pages * Fixes for selenium scraper to work with config database * installation of packages, geckodriver, and firefox if selenium enabled * update install instructions * fix merge error * fix 
dropped function * have to be kidding me * add note; setup requires docker... need to think about IF this will ever be installed without Docker * seperate selenium class into wrapper and Search class so wrapper can be used in processors! * add screenshots; add firefox extension support * update selenium definitions * regex for extracting urls from strings * screenshots processor; extract urls from text and takes screenshots * Allow producing zip files from data sources * import time * pick better default * test screenshot datasource * validate all params * fix enable extension * haha break out of while loop * count my items * whoops, len() is important here * must be getting tired... * remove redundant logging * Eager loading for screenshots, viewport options, etc * Woops, wrong folder * Fix label shortening * Just 'queue' instead of 'search queue' * Yeah, make it headless * README -> DESCRIPTION * h1 -> h2 * Actually just have no header * Use proper filename for downloaded files * Configure whether to offer pseudonymisation etc * Tweak descriptions * fix log missing data * add columns to post_topic_matrix * fix breadcrumb bug * Add top topics column * Fix selenium config install parameter (Docker uses this/manual would need to run install_selenium, well, manually) * this processor is slow; i thought it was broken long before it updated! 
* refactor detect_trackers as conversion processor not filter * add geckodriver executable to docker install * Auto-configure webdrivers if available in PATH * update screenshots to act as image-downloader and benefit from processors * fix is_compatible_with * Delete helper-scripts/migrate/migrate-1.30-1.31.py * fix embeddings is_compatible_with * fix up UI options for hashing and private * abstract was moved to lib * various fixes to selenium based datasources * processors not compatible with image datasets * update firefox extension handling * screenshots datasource fix get_options * rename screenshots processor to be detected as image dataset * add monthly and weekly frequencies to wayback machine datasource * wayback ds: fix fail if all attempts do not realize results; addion frequency options to options; add daily * add scroll down page to allow lazy loading for entire page screenshots * screenshots: adjust pause time so it can be used to force a wait for images to load I have not successfully come up with or found a way to wait for all images to load; document.readyState == 'complete' does not function in this way on certain sites including the wayback machine * hash URLs to create filenames * remove log * add setting to toggle display advanced options * add progress bars * web archive fix query validation * count subpages in progress * remove overwritten function * move http response to own column * special filenames * add timestamps to all screenshots * restart selenium on failure * new build have selenium * process urls after start (keep original query parameters) * undo default firefox * quick max * rename SeleniumScraper to SeleniumSearch todo: build SeleniumProcessor! 
* max number screenshots configurable * method to get url with error handling * use get_with_error_handling * d'oh, screenshot processor needs to quit selenium * update log to contain URL * Update scrolling to use Page down key if necessary * improve logs * update image_category_wall as screenshot datasource does not have category column; this is not ideal and ought to be solved in another way. Also, could I get categories from the metadata? That's... ugh. * no category, no processor * str errors * screenshots: dismiss alerts when checking ready state is complete * set screenshot timeout to 30 seconds * update gensim package * screenshots: move processor interrupt into attempts loop * if alert disappears before we can dismiss it... * selenium specific logger * do not switch window when no alert found on dismiss * extract wait for page to load to selenium class * improve descriptions of screenshot options * remove unused line * treat timeouts differently from other errors these are more likely due to an issue with the website in question * debug if requested * increase pause time * restart browser w/ PID * increase max_workers for selenium this is by individual worker class not for all selenium classes... 
so you can really crank them out if desired * quick fix restart by pid * avoid bad urls * missing bracket & attempt to fix-missing dependencies in Docker install * Allow dynamic form options in processors * Allow 'requires' on data source options as well * Handle list values with requires * basic processor for apple store; setup checks for additional requirements * fix is_4cat_class * show preview when no map_item * add google store datasource * Docker setup.py use extensions * Wider support for file upload in processors * Log file uploads in DMI service manager * add map_item methods and record more data per item need additional item data as map_item is staticmethod * update from master; merge conflicts * fix docker build context (ignore data files) * fix option requirements * apple store fix: list still tries to get query * apple & google stores fix up item mapping * missed merge error * minor fix * remove unused import * fix datasources w/ files frontend error * fix error w/ datasources having file option * better way to name docker volumes * update two other docker compose files * fix docker-compose ymls * minor bug: fix and add warning; fix no results fail * update apple field names to better match interface * update google store fieldnames and order * sneak in jinja logger if needed * fix fourcat.js handling checkboxes for dynamic settings * add new endpoint for app details to apple store * apple_store map new beta app data * add default lang/country * not all apps have advisories * revert so button works * add chart positions to beta map items * basic scheduler To-do - fix up and add options to scheduler view (e.g. delete/change) - add scheduler view to navigator - tie jobs to datasets? (either in scheduler view or, perhaps, filter dataset view) - more testing... * update scheduler view, add functions to update job interval * revert .env * working scheduler! 
* basic scheduler view w/ datasets * fix postgres tag * update job status in scheduled_jobs table * fix timestamp; end_date needed for last run check; add dataset label * improve scheduler view * remove dataset from scheduled_jobs table on delete * scheduler view order by last creation * scheduler views: separate scheduler list from scheduled dataset list * additional update from master fixes * apple_store map_items fix missing locales * add back depth for pagination * correct route * modify pagination to accept args * pagination fun * pagination: i hate testing on live servers... * ok ok need the pagination route * pagination: add route_args * fix up scheduler header * improve app store descriptions * add azure store * fix azure links * azure_store: add category search * azure fix type of config update timestamp OPTION_DATE does not appear correctly in settings and causes it to be written incorrectly * basic aws store * check if selenium available; get correct app_id * aws: implement pagination * add logging; wait for elements to load after next page; attempts to rework filter option collection * apple_store: handle invalid param error * fix filter_options * aws: fix filter option collection! 
* more merge * move new datasources and processors to extensions and modify setup.py and module loader to use the new locations * migrate.py to run extension "fourcat_install.py" files * formatting * remove extensions; add gitignore * excise scheduler merge * some additional cleanup from app_studies branch * allow nested datasources folders; ignore files in extensions main folder * allow extension install scripts to run pip if migrate.py has not * Remove unused URL functions we could use ural for * Take care of git commit hash tracking for extension processors * Get rid of unused path.versionfile config setting * Add extensions README * Squashed commit of the following: commit cd356f7a69d15e8ecc8efffc6d63a16368e62962 Author: Stijn Peeters <[email protected]> Date: Sat Sep 14 17:36:18 2024 +0200 UI setting for 4CAT install ad in login commit 0945d8c0a11803a6bb411f15099d50fea25f10ab Author: Stijn Peeters <[email protected]> Date: Sat Sep 14 17:32:55 2024 +0200 UI setting for anonymisation controls Todo: make per-datasource commit 1a2562c2f9a368dbe0fc03264fb387e44313213b Author: Stijn Peeters <[email protected]> Date: Sat Sep 14 15:53:27 2024 +0200 Debug panel for HTTP headers in control panel commit 203314ec83fb631d985926a0b5c5c440cfaba9aa Author: Stijn Peeters <[email protected]> Date: Sat Sep 14 15:53:17 2024 +0200 Preview for HTML datasets commit 48c20c2ebac382bd41b92da4481ff7d832dc1538 Author: Desktop Sal <[email protected]> Date: Wed Sep 11 13:54:23 2024 +0200 Remove spacy processors (linguistic extractor, get nouns, get entities) and remove dependencies commit 657ffd75a7f48ba4537449127e5fa39debf4fdf3 Author: Dale Wahl <[email protected]> Date: Fri Sep 6 16:29:19 2024 +0200 fix nltk where it matters commit 2ef5c80f2d1a5b5f893c8977d8394740de6d796d Author: Stijn Peeters <[email protected]> Date: Tue Sep 3 12:05:14 2024 +0200 Actually check progress in text annotator commit 693960f41b73e39eda0c2f23eb361c18bde632cd Author: Stijn Peeters <[email protected]> Date: Mon 
Sep 2 18:03:18 2024 +0200 Add processor for stormtrooper DMI service commit 6ae964aad492527bc5d016a00f870145aab6e1af Author: Stijn Peeters <[email protected]> Date: Fri Aug 30 17:31:37 2024 +0200 Fix reference to old stopwords list in neologisms preset * Fix Github links for extensions * Fix commit detection in extensions * Fix extension detection in module loader * Follow symlinks when loading extensions Probably not uncommon to have a checked out repo somewhere to then symlink into the extensions dir * Make queue message on create page more generic * Markdown in datasource option tooltips * Remove Spacy model from requirements * Add software_source to database SQL --------- Co-authored-by: Stijn Peeters <[email protected]> Co-authored-by: Stijn Peeters <[email protected]> commit cd356f7a69d15e8ecc8efffc6d63a16368e62962 Author: Stijn Peeters <[email protected]> Date: Sat Sep 14 17:36:18 2024 +0200 UI setting for 4CAT install ad in login commit 0945d8c0a11803a6bb411f15099d50fea25f10ab Author: Stijn Peeters <[email protected]> Date: Sat Sep 14 17:32:55 2024 +0200 UI setting for anonymisation controls Todo: make per-datasource commit 1a2562c2f9a368dbe0fc03264fb387e44313213b Author: Stijn Peeters <[email protected]> Date: Sat Sep 14 15:53:27 2024 +0200 Debug panel for HTTP headers in control panel commit 203314ec83fb631d985926a0b5c5c440cfaba9aa Author: Stijn Peeters <[email protected]> Date: Sat Sep 14 15:53:17 2024 +0200 Preview for HTML datasets commit 48c20c2ebac382bd41b92da4481ff7d832dc1538 Author: Desktop Sal <[email protected]> Date: Wed Sep 11 13:54:23 2024 +0200 Remove spacy processors (linguistic extractor, get nouns, get entities) and remove dependencies commit 657ffd75a7f48ba4537449127e5fa39debf4fdf3 Author: Dale Wahl <[email protected]> Date: Fri Sep 6 16:29:19 2024 +0200 fix nltk where it matters commit 2ef5c80f2d1a5b5f893c8977d8394740de6d796d Author: Stijn Peeters <[email protected]> Date: Tue Sep 3 12:05:14 2024 +0200 Actually check progress in text 
annotator commit 693960f41b73e39eda0c2f23eb361c18bde632cd Author: Stijn Peeters <[email protected]> Date: Mon Sep 2 18:03:18 2024 +0200 Add processor for stormtrooper DMI service commit 6ae964aad492527bc5d016a00f870145aab6e1af Author: Stijn Peeters <[email protected]> Date: Fri Aug 30 17:31:37 2024 +0200 Fix reference to old stopwords list in neologisms preset commit 4ba872bef2968f7f8bf5831fd3a4f413420b36ed Author: Dale Wahl <[email protected]> Date: Tue Aug 27 13:04:46 2024 +0200 fix hatebase: default column option for OPTION_MULTI_SELECT must be list commit e276033542f2d22e7f614f318a01d65114a21482 Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Date: Wed Aug 21 12:53:10 2024 +0200 Bump nltk from 3.6.7 to 3.9 (#447) Bumps [nltk](https://github.com/nltk/nltk) from 3.6.7 to 3.9. - [Changelog](https://github.com/nltk/nltk/blob/develop/ChangeLog) - [Commits](https://github.com/nltk/nltk/compare/3.6.7...3.9) --- updated-dependencies: - dependency-name: nltk dependency-type: direct:production ... 
Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> commit 1d749c3cf83b130ba70bdb09174f382d6711a14b Author: sal-phd-desktop <[email protected]> Date: Wed Aug 21 12:52:54 2024 +0200 Set UTF-8 encoding when opening stop words (fixes Windows bug) commit a03e5fd4252e7242563c291558606440256eb3d1 Author: Dale Wahl <[email protected]> Date: Mon Aug 19 14:19:21 2024 +0200 remove duplicate line commit aa07e8c13c2d59c6b699f78133036514659ee420 Author: Dale Wahl <[email protected]> Date: Mon Jul 29 09:35:22 2024 +0200 tweet import fix: author banner key missing when author has no banner commit 32dac5d2ffb936210f12f5c725514fd25a0286f1 Author: Dale Wahl <[email protected]> Date: Mon Jul 29 08:52:08 2024 +0200 tell user when dataset is not found we could have a proper 404 page, but at least leave a message commit 2c8c860fc5378113d1352016ac26ca761adecb32 Author: Dale Wahl <[email protected]> Date: Mon Jul 22 17:41:00 2024 +0200 telegram fix: reactions datastructure commit 1c0bf5e580eb16d8a6f9afa415f9febce449a537 Author: Dale Wahl <[email protected]> Date: Mon Jul 22 11:19:52 2024 +0200 fix telegram: crawl_max_depth can be None if it is not enabled for a user commit 3dfe7af292b33574a31630e3a0da10954ed87d0a Author: Dale Wahl <[email protected]> Date: Fri Jul 19 11:52:31 2024 +0200 fix more config.get() magic commit 2453182bcee6e54b396b762ab77b60b8a0893638 Author: Dale Wahl <[email protected]> Date: Fri Jul 19 10:54:23 2024 +0200 config_manager - fix `get_all` w/ one results (super rare edge); fix overwriting self.db in `with_db` commit 6b9cb0b5479e6e64e09a49fa2ca9effe1c5a7415 Author: Dale Wahl <[email protected]> Date: Wed Jul 17 15:20:49 2024 +0200 add surf nginx init file commit 5e984e13a08d9fba7d5806a7ef4e012ce7d57319 Author: Dale Wahl <[email protected]> Date: Wed Jul 17 14:30:34 2024 +0200 change port for surf commit 2ce8c354e90f939a16dad3f0155fd7d79405c79e Author: Dale Wahl <[email protected]> 
Date: Wed Jul 17 12:54:11 2024 +0200 use latest image on surf commit 13ec0fd3f2bed86c3b2dff73014093a6a92fbfb5 Author: Dale Wahl <[email protected]> Date: Wed Jul 17 12:46:59 2024 +0200 update surf docker-compose.yml this may require a new release commit 78698f6ac1b22b1154d31f69543ba7b266d33191 Author: Dale Wahl <[email protected]> Date: Wed Jul 17 10:34:56 2024 +0200 clip: handle new and old format commit eb7693780cb191403f107817ca30d90373929bf0 Author: Dale Wahl <[email protected]> Date: Tue Jul 16 14:27:08 2024 +0200 DMI SM updates to use status endpoint w/ database records; run on CPU if no GPU enabled commit d2a787e2c1559417bb5401f3208c82954052504f Author: Stijn Peeters <[email protected]> Date: Mon Jul 15 15:58:06 2024 +0200 Require most recent Telethon version commit 346150bd9cc96ac099abd4d15fa3de39bd65e9d1 Author: Stijn Peeters <[email protected]> Date: Mon Jul 15 15:57:55 2024 +0200 Catch UPDATE_APP_TO_LOGIN in Telegram commit 04acc06e95098d7e2f9b4af404447c9cfaee5b99 Author: Stijn Peeters <[email protected]> Date: Mon Jul 15 11:27:30 2024 +0200 Unbreak Twitter error handling commit e9b5232a963be02c2e86dabacb607b2315a4e0e6 Author: Stijn Peeters <[email protected]> Date: Fri Jul 12 13:27:15 2024 +0200 Ensure str type when trying to extract video URLs from a field commit d69dd6f337cac05ed31c05334890679976a1e6de Author: Stijn Peeters <[email protected]> Date: Fri Jul 12 12:31:14 2024 +0200 Make CSV column mapping params look nicer on result page commit 9bd9da568f593085a8d54744836e3290a75b51a7 Author: Stijn Peeters <[email protected]> Date: Fri Jul 12 12:22:03 2024 +0200 Add "empty" and "current timestamp" as options to CSV mapping commit 0b574571952a206904440faf8601ddf95ab42b24 Author: Dale Wahl <[email protected]> Date: Thu Jul 11 16:59:56 2024 +0200 image_wall: backup fit method commit eeb1ddeb7ca85b6802dfed3c74d1352062383d50 Merge: 2504c37b 43239467 Author: Stijn Peeters <[email protected]> Date: Thu Jul 11 16:47:45 2024 +0200 Merge branch 'master' of 
https://github.com/digitalmethodsinitiative/4cat commit 43239467db046eea5eb5268f91d1b63a1042238d Author: Dale Wahl <[email protected]> Date: Thu Jul 11 12:08:08 2024 +0200 fix processor more button would only show top level analysis if not logged in commit d6ab2b0783f8e40ecd8fadbc2abccffa6f093e39 Author: Dale Wahl <[email protected]> Date: Tue Jul 9 15:35:25 2024 +0200 search_gab - use MappedItem commit 2504c37b67ff6f19720b44d8bb6054b1c3d5a155 Author: Stijn Peeters <[email protected]> Date: Sat Jul 6 17:51:22 2024 +0200 Fix multiline spacing in multi select list commit fea66ce38be0717da6c1f847e7124f7069c096e2 Author: Dale Wahl <[email protected]> Date: Fri Jul 5 13:15:45 2024 +0200 use processor media_type if dataset does not have media_type; set default media_type for downloaders commit d41fa34514e8177efdac7e64a31f2ee75c7d1652 Author: Dale Wahl <[email protected]> Date: Fri Jul 5 12:57:18 2024 +0200 video_hasher: handle no metadata file commit 2820dcecc36ed4705a2776064d387ff7ed14e84f Author: Dale Wahl <[email protected]> Date: Fri Jul 5 12:50:09 2024 +0200 num_rows not num_items() commit fb09162db902fa22fdf2d7a3ed171ce1489bd92f Author: Dale Wahl <[email protected]> Date: Fri Jul 5 12:44:03 2024 +0200 Google vision API returning 400s; properly log and record processed entries; google networks should not run on empty datasets commit ebf39d8262d199895aedc4f7fa275c5685e58563 Author: Dale Wahl <[email protected]> Date: Fri Jul 5 12:28:13 2024 +0200 fix image_category_wall whoops, cleared categories and post_values after filling them! 
commit 1ad9ec2c2e76604793ec37584c051f116af2fdab Author: Stijn Peeters <[email protected]> Date: Fri Jul 5 12:03:54 2024 +0200 fsdfdsgd sorry commit c7254c08a477c6cdc8497507e8452c3eff7101c9 Author: Stijn Peeters <[email protected]> Date: Fri Jul 5 12:01:21 2024 +0200 Fix razdel versioning commit b9a327abe99f2d9ede4f2747f34f20d1dc6803cb Author: Stijn Peeters <[email protected]> Date: Fri Jul 5 11:57:47 2024 +0200 Reorganise tokeniser, stopwords commit fb13bc483af9ba0d677ee35fd045bf36ab1cddf7 Merge: 0b745692 e3046496 Author: Stijn Peeters <[email protected]> Date: Fri Jul 5 11:56:08 2024 +0200 Merge branch 'master' of https://github.com/digitalmethodsinitiative/4cat commit e30464964262870c54c73f65a3bce630d6576f45 Author: Dale Wahl <[email protected]> Date: Fri Jul 5 10:51:53 2024 +0200 media_upload allow setting for max_form_part and warn users of failure above certain number of files commit e4f982b4550b352a5d1a131abd78d52e6c196e48 Author: Dale Wahl <[email protected]> Date: Fri Jul 5 09:50:49 2024 +0200 Update media_import help text; looks like failure happens somewhere between 600-1000 files due to Flask request size limits commit 0b74569280f8f87376a964a6b160ea1993cb3354 Author: Stijn Peeters <[email protected]> Date: Thu Jul 4 17:55:36 2024 +0200 Add razdel as option for Russian tokenisation commit 9f15a2b8d666c3b6fddeb151b7c424cb44df18a6 Author: Dale Wahl <[email protected]> Date: Thu Jul 4 17:13:15 2024 +0200 remove the log commit ffcb6a4239075ba190fb534b25b89507e09e5f56 Author: Dale Wahl <[email protected]> Date: Thu Jul 4 17:12:43 2024 +0200 Inform user if too many files are uploaded I do not understand why this is appearing. app.config['MAX_CONTENT_LENGTH'] is set to None. Problem persists in Flask alone (i.e., does not appear to be Gunicorn/Nginx/Apache). 
commit 9cad12dd6f64a63c48d3b5b304b5c7d9d1a6ddb7 Author: Stijn Peeters <[email protected]> Date: Thu Jul 4 15:09:42 2024 +0200 Bump version commit aad94f393de77cc9d4f578e1f5be66a3601a4c90 Author: Dale Wahl <[email protected]> Date: Thu Jul 4 10:51:01 2024 +0200 Update setup.py to ensure videohash updates commit d9154a6f9c46a5c793909b88da751bc71d6f759f Author: Dale Wahl <[email protected]> Date: Tue Jul 2 17:45:26 2024 +0200 clip: categorizing requires categories... seriously, guys? commit 0af9a5ec49bd2bcfbb87bda33976c65683f68777 Author: Dale Wahl <[email protected]> Date: Tue Jul 2 17:31:49 2024 +0200 blip2: fix no metadata file found (uploads...) commit d695053f440bd938a57f06adea7b9c732ecf30d7 Author: Dale Wahl <[email protected]> Date: Tue Jul 2 17:25:26 2024 +0200 cat_vis_wall - use str as category type if mixed i.e., use floats as string categories commit bcb914076760ea1fb0e277cdcd1782ffa101b535 Author: Sal Hagen <[email protected]> Date: Tue Jul 2 16:06:43 2024 +0200 Add Twitter author profile pic and banner URLs commit 1b3b02f826578e8f702ea84a27c8ced7b1fab345 Author: Dale Wahl <[email protected]> Date: Tue Jul 2 11:42:50 2024 +0200 add migrate.py log file in Docker commit 2aaa972e6888743fc329d721c37fa626cf2eeae3 Author: Dale Wahl <[email protected]> Date: Tue Jul 2 11:42:22 2024 +0200 add necessary pip packages for upgrade in Docker environment; add error logging and save to file for trouble shooting commit 18b8a53c01b334e0f70610b1305d380b25dbe9c6 Author: Dale Wahl <[email protected]> Date: Tue Jul 2 11:41:36 2024 +0200 update Dockerfile to keep build environment useful for interactive upgrade commit 7b224b9b798c9aaf956b5b618b98d742c4a2e7cd Author: Dale Wahl <[email protected]> Date: Tue Jul 2 11:41:12 2024 +0200 remove docker-compose.yml versions commit acf5de0ed02e144b920a80abfdfa35986dd0ed4c Author: Stijn Peeters <[email protected]> Date: Mon Jul 1 17:38:32 2024 +0200 Better issues.md, footer link commit 1953ca3895656ca9a12d2657e58019795ae64b3a Author: Dale 
Wahl <[email protected]> Date: Mon Jul 1 12:00:07 2024 +0200 FIX: get_key() is more of a creating of a key then general getting of a key... commit 12289bb5c766d1af23799ff11278b46b48fc2841 Author: Dale Wahl <[email protected]> Date: Mon Jul 1 11:37:06 2024 +0200 .metadata.json may not have top_parent via Media Uploader This may exist in other processors if a proper check is not in place; will need to review commit 25f4ed65ec2c32298a90490cf51037a7ea2d0bf9 Author: Dale Wahl <[email protected]> Date: Tue Jun 25 14:43:40 2024 +0200 Media upload datasource! (#419) * basic changes to allow files box * basic imports, yay! * video_scene_timelines to work on video imports! * add is_compatible_with checks to processors that cannot run on new media top_datasets * more is_compatible fixes * necessary function for checking media_types * enable more processors on media datasets * consolidate user_input file type * detect mimetype from filename best I can do without downloading all the files first. * handle zip archives; allow log and metadata files * do not count metadata or log files in num_files * move machine learning processors so they can be imported elsewhere * audio_to_text datasource * When validating zip file uploads, send list of file attributes instead of the first 128K of the zip file * Check type of files in zip when uploading media * Skip useless files when uploading media as zip * check multiple zip types in JS * js !=== python * fix media_type for loose file imports; fix extension for audio_to_text preset; fix merge for some processors w/ media_type --------- Co-authored-by: Stijn Peeters <[email protected]> commit 4ce689bdc3e441a7adf85883ddcda6bae0525ed9 Author: Stijn Peeters <[email protected]> Date: Mon Jun 24 11:58:50 2024 +0200 Avoid KeyError commit 155522d0817d19ac7b6b0b0164242156d6f7443a Author: Dale Wahl <[email protected]> Date: Thu Jun 20 15:58:21 2024 +0200 add generated images to image wall w/ text visual commit eecde519eab1208eeb6ee53c2d8febff7fb8febf 
Author: Dale Wahl <[email protected]> Date: Thu Jun 20 15:57:56 2024 +0200 allow users to NOT generate all images from prompts commit d0b9574093a109997e63b1062b2bdd8e71300a29 Author: Stijn Peeters <[email protected]> Date: Wed Jun 19 16:28:26 2024 +0200 ...don't mangle URLs in preview links commit c105e368a521ec54ae717bb9eb2fe9fae66cf6e8 Merge: 0028a999 8d4f99b2 Author: Dale Wahl <[email protected]> Date: Wed Jun 19 16:25:36 2024 +0200 Merge branch 'master' of https://github.com/digitalmethodsinitiative/4cat commit 0028a9994d698611dd8b546b9b3bccbeec30b74f Author: Dale Wahl <[email protected]> Date: Wed Jun 19 16:25:12 2024 +0200 add followups to processors commit 8d4f99b22e0308606c7f713ef704dfa939e85247 Author: Stijn Peeters <[email protected]> Date: Wed Jun 19 16:17:22 2024 +0200 More flexible URL linking in CSV preview commit f4f8e6621bd6f2504dc3afc2078280bf5edb6444 Author: Dale Wahl <[email protected]> Date: Wed Jun 19 13:54:00 2024 +0200 tokeniser fix: use default lang for word_tokenize if language is 'other' commit 127472e91d8e510f3de2a9cc4a87be6cf2d0deaa Author: Stijn Peeters <[email protected]> Date: Tue Jun 18 16:45:01 2024 +0200 Better log messages for Telegram data source commit e8714b6fba72e00c690a8d643d8dc54d2250c94a Author: Stijn Peeters <[email protected]> Date: Mon Jun 17 17:42:21 2024 +0200 Add 'crawl' feature to Telegram data source Fixes #321 (though might need a bit more testing) commit 25fded7b596097f7916e1793f1841bae2b63d453 Merge: d67cf440 b10e3bb8 Author: sal-phd-desktop <[email protected]> Date: Fri Jun 14 16:23:02 2024 +0200 Merge branch 'master' of https://github.com/digitalmethodsinitiative/4cat commit d67cf440730ea1d4e124c76a4c21d65b56f39c68 Author: sal-phd-desktop <[email protected]> Date: Fri Jun 14 16:22:59 2024 +0200 Fix export 4chan script and remove some unecessary code commit b10e3bb8f0c8a67aa5fdbba1962301d8acdf625c Author: Dale Wahl <[email protected]> Date: Thu Jun 13 15:14:06 2024 +0200 video_hasher prefix: fix extension type 
commit ba565cdaa2ebeecf23fd60889d546c76b9ea5eb1 Author: Dale Wahl <[email protected]> Date: Thu Jun 13 14:53:13 2024 +0200 video_hasher: fix to work with Pillow updates; add max amount videos commit 90da5d231eff6a4249bef5468fcdbf1ebcf9247a Author: Dale Wahl <[email protected]> Date: Thu Jun 13 10:25:24 2024 +0200 image_cat_wall fix the fix commit a8b943d8e2c5471f82ea0442e2659d84fe8d9760 Author: Dale Wahl <[email protected]> Date: Wed Jun 12 13:29:41 2024 +0200 add OCR processor to image w/ text visualization commit e7e636b6b89b6163fa6976e67edba68e7d75b7ac Author: Dale Wahl <[email protected]> Date: Tue Jun 11 15:23:12 2024 +0200 add image_wall_w_text to follow on BLIP captions commit f74b97827f0465baf8483040471a77e4654e70b1 Author: Dale Wahl <[email protected]> Date: Thu Jun 6 11:05:25 2024 +0200 image_category_wall: allow multiple images per item/post commit e3c9ea57d46b32ba47b00a6047a278ddd530adc1 Author: Dale Wahl <[email protected]> Date: Thu May 30 16:27:50 2024 +0200 image_category_wall convert None to str for category commit 00874576c354235f4655f1d433ec4382010e18e3 Author: Dale Wahl <[email protected]> Date: Thu May 30 14:54:51 2024 +0200 image_category_wall fix float categories commit e0c55a8ae132bedef5da27ecbbb9489a094d454c Author: Dale Wahl <[email protected]> Date: Thu May 30 12:51:42 2024 +0200 download_images fix divide by zero when user can download all commit 3580fc9450501262badb8e61ef4b4df4b4c54322 Author: Dale Wahl <[email protected]> Date: Thu May 30 12:51:24 2024 +0200 image_category_wall remove 'max' when user can use all images commit f2145bdeff1d68e46cdd3521ecbb61573f01a2f2 Author: Dale Wahl <[email protected]> Date: Wed May 29 17:59:23 2024 +0200 rank_attributes: option to count missing data or blanks commit 01e7ab9677a75181bbedc62fa00e636ce2b17c18 Author: Dale Wahl <[email protected]> Date: Wed May 29 16:53:57 2024 +0200 fix missing field strategy so default_stategy not overwritten on second loop default_stategy would be set to correctly to 
the callable, but overwritten on second loop (and map_missing is a dictionary at that point). commit 097f838af1f5f2748578dd9072eb9e3a8b3a7057 Author: Dale Wahl <[email protected]> Date: Tue May 28 12:16:08 2024 +0200 add log_level arg to 4cat-daemon.py I've been using this forever and don't know why I haven't commited it commit fd3ac238e60f052889d99c71588170570a384900 Author: Dale Wahl <[email protected]> Date: Tue May 28 10:10:56 2024 +0200 google & clarifai to csv had identical "type" possibly caused issue w/ preset commit 1b9965d40aa33035a73f685c13a1ab50cc877f78 Author: Stijn Peeters <[email protected]> Date: Mon May 27 15:54:20 2024 +0200 Ensure file cleanup worker always exists commit 0e0917f2232e240df3412fd4df51cf0be19248b5 Author: Stijn Peeters <[email protected]> Date: Thu May 23 17:36:22 2024 +0200 Also update Spacy model versions... commit f40128213529d154cfb77afa7aa67a72d5bb640f Author: Stijn Peeters <[email protected]> Date: Thu May 23 17:32:35 2024 +0200 *Actually* remove typing_extensions dependency ??? 
commit ba3d83b824c5fb6fcb0aec5e1c36b35070d6e5d9 Author: Stijn Peeters <[email protected]> Date: Thu May 23 17:30:08 2024 +0200 Update minimum Pillow dependency version commit 1c3485648bf2a911052eeeae4f293f303a944aec Author: Stijn Peeters <[email protected]> Date: Thu May 23 17:27:27 2024 +0200 Do not require typing_extensions explicitly This was required to ensure Spacy could load - looks like Spacy has since been updated to work with newer versions of typing_extensions as well commit 3828de83ba123254463a904392f24daec626c136 Author: Stijn Peeters <[email protected]> Date: Thu May 23 17:02:04 2024 +0200 Bump version commit 8f0d098107a4bbc9d55cc6048f7a38f1d1891a32 Author: Stijn Peeters <[email protected]> Date: Thu May 23 17:01:28 2024 +0200 Require non-broken version of emoji library commit 4b2ad805fcc99a83e46732fc991d98d78ef06c6c Author: Stijn Peeters <[email protected]> Date: Thu May 23 13:11:03 2024 +0200 Show worker progress in control panel if available commit 9144d4503f46108437616d6bc0cf4fde74df3aca Author: Stijn Peeters <[email protected]> Date: Thu May 23 11:07:41 2024 +0200 Bump version commit 807ab77101d197ec897640480a2140439d570c05 Author: Stijn Peeters <[email protected]> Date: Wed May 22 21:57:11 2024 +0200 Fix Instagram upload with missing media URL commit d0b4840fd465b6d21657c3d50f9291ac911b6082 Author: Stijn Peeters <[email protected]> Date: Wed May 22 17:35:04 2024 +0200 Comma comma comma commit 7fd2e14c9505d0ed1ac77dc09c24f766ea61ee6c Author: Stijn Peeters <[email protected]> Date: Wed May 22 17:25:26 2024 +0200 Fix progress indicator for scene extractor commit 661c42c2d083da7004335b0e14910935c3d392f6 Author: Stijn Peeters <[email protected]> Date: Wed May 22 17:12:21 2024 +0200 Don't crash video hasher non non-str item IDs commit 1f280321cdde27a9909885fa2f64dbeffa549fb1 Author: Stijn Peeters <[email protected]> Date: Wed May 22 17:09:53 2024 +0200 Do not crash timelines processor when metadata has unexpected format commit 
572d03f1f368f0ad5f47e705a119b37646148d1d Author: Stijn Peeters <[email protected]> Date: Wed May 22 17:09:30 2024 +0200 More efficient video frame extractor commit 1b51d224ca544d7e2913238adbff2049412bc41e Author: Stijn Peeters <[email protected]> Date: Wed May 22 17:04:27 2024 +0200 Fix crash in video stack processor with ffmpeg < 5.1 commit ddc73cb2e2f0985e64f84ca86bc167fa9e9dc81a Author: Stijn Peeters <[email protected]> Date: Wed May 22 17:03:48 2024 +0200 Helper function for determining ffmpeg version commit ef9dd482b2258c428584997dc661156f63f68b91 Author: Stijn Peeters <[email protected]> Date: Wed May 22 12:14:58 2024 +0200 Allow absence of articleComponent in LinkedIn posts commit 060f2cd7f922e7fae337b0697f7c477442d21ef1 Author: Stijn Peeters <[email protected]> Date: Wed May 22 12:12:54 2024 +0200 Cast post IDs to string when mapping video scenes commit ab34c415c9ada23763b45676639ce3e80a34f594 Author: Stijn Peeters <[email protected]> Date: Wed May 22 11:46:39 2024 +0200 Twitter -> X/Twitter commit de6d97554ccb68375979e5ff09c7e65d8d70a6cd Author: Stijn Peeters <[email protected]> Date: Wed May 22 11:45:19 2024 +0200 Colleges -> Collages commit 30365580dc59b4d95e8a62d1b3c666bef60ce7e8 Author: Stijn Peeters <[email protected]> Date: Tue May 21 15:41:55 2024 +0200 Explicit disconnect after Telegram image download commit 5727ff7230db42463a824f45d63f0b8343caac14 Author: Stijn Peeters <[email protected]> Date: Tue May 21 14:05:50 2024 +0200 Catch TimedOutError while downloading Telegram images commit e0e06686e78976f971aac620267d7e009eaaadff Author: Sal Hagen <[email protected]> Date: Mon May 13 13:01:42 2024 +0200 Typo in LinkedIn search commit 51e58dde6ca21278a80f252a8c22dc83d87ace1f Author: Dale Wahl <[email protected]> Date: Tue May 7 13:10:43 2024 +0200 text_from_image: fix metadata missing (indent issue) commit c1f8ecc1674375bba2b2e38cb29c9d4d44098f0a Author: Dale Wahl <[email protected]> Date: Tue May 7 09:45:25 2024 +0200 text_from_image fix: ensure 
metadata success before attempting to update original commit 72dbf80db71499c59133e1128205b756d240b300 Merge: d7561625 baacc86b Author: Stijn Peeters <[email protected]> Date: Fri May 3 13:14:08 2024 +0200 Merge branch 'master' of https://github.com/digitalmethodsinitiative/4cat commit d7561625b127573fbb0332fbb713be6a3cb3d953 Author: Stijn Peeters <[email protected]> Date: Fri May 3 13:14:03 2024 +0200 Comments without replies don't always have reply_comment_total commit baacc86b269612b4b0956345f8b9fa902df1b61f Author: Dale Wahl <[email protected]> Date: Fri May 3 12:01:22 2024 +0200 DSM fix and simplify GPU mem check commit 9b662e9f9b4f4ce194608c8e20a8fc50bc6d9ae3 Author: Parker-Kasiewicz <[email protected]> Date: Thu May 2 00:53:45 2024 -0700 Adding Gab as a Data Source! (#401) * Can successfully import gab data, although can't tell if formatting is right becuase waiting on queued requests. * Version w/ different item types * Ingest Gab posts from Zeeschuimer * Small fix for merge conflicts (whoops) * Gab processing logic transferred from Zeeschuimer * fixing small errors for Gab data source * basic processing for truth social from Zeeschuimer --------- Co-authored-by: Dale Wahl <[email protected]> commit 3ecb8fd9c27aee4c457f03516794c6c4eac19c09 Author: Stijn Peeters <[email protected]> Date: Wed May 1 17:51:36 2024 +0200 Fix duplicate line in views_admin.py commit 8b66ae7e467913f8e7571cf4b45493f63804266f Author: Stijn Peeters <[email protected]> Date: Wed May 1 17:49:54 2024 +0200 Allow processors to define which fields should be pseudonymised commit c973750c8cabb8698704c5997903e92d1de866d2 Author: Stijn Peeters <[email protected]> Date: Wed May 1 17:15:32 2024 +0200 Allow auto-queue of pseudonymisation after import commit 49ad9f0ff785fd44ae494755b785c7fdf7c9cf15 Author: Stijn Peeters <[email protected]> Date: Wed May 1 17:08:35 2024 +0200 Get rid of redundant and buggy next/copy_to implementation in Search class commit 106d3659e2fda89867d3a4f587c1c1addfaff2f7 
Author: Dale Wahl <[email protected]> Date: Wed May 1 16:14:03 2024 +0200 use current branch in settings commit 60bef4157d807f7c01ef3b425295244e91919f31 Author: Stijn Peeters <[email protected]> Date: Wed May 1 11:04:07 2024 +0200 Nicer code commit 4182c436e4fb5109c5e041dc729f77a58d877889 Author: Stijn Peeters <[email protected]> Date: Tue Apr 30 16:19:36 2024 +0200 Always shut down API worker only after everything else has been shut down commit e685108b3cbe5f005ce2df21906267071ad8118e Author: Stijn Peeters <[email protected]> Date: Tue Apr 30 16:12:42 2024 +0200 Properly interrupt expiration worker when asked commit 27a568eca7f2f3742223fef6285eaf80583e0fc4 Author: Stijn Peeters <[email protected]> Date: Tue Apr 30 13:40:50 2024 +0200 Allow floats-as-strings as timestamps when importing CSV commit 2d2bbb9fdb9b426b8f4a80782f04257721a97f2e Author: Dale Wahl <[email protected]> Date: Tue Apr 30 13:05:07 2024 +0200 douyin: add consistency to map_item stats commit 289aa342c9912aceeca35887c079c72aa6ffbf52 Author: Dale Wahl <[email protected]> Date: Mon Apr 29 15:26:38 2024 +0200 fix collection data in Douyin to handle $undefined commit 5b9b23fb1696bc1b69e1d902c0a2ad4b7d168984 Author: Dale Wahl <[email protected]> Date: Mon Apr 29 13:00:03 2024 +0200 add scipy requirement to make compatible with gensim https://stackoverflow.com/questions/78279136/importerror-cannot-import-name-triu-from-scipy-linalg-gensim commit 7eab746e944f1ababe3dcd6a5d25387a64c2237d Author: Stijn Peeters <[email protected]> Date: Mon Apr 29 12:00:09 2024 +0200 stupid, stupid, stupid commit 90577982ac05019a7ac76818a62f91e84dd65902 Author: Stijn Peeters <[email protected]> Date: Mon Apr 29 11:56:22 2024 +0200 Fix leftover iterate_mapped_items commit 57dbdf74c49c34c05784debb9f7e258da7ae7d54 Author: Stijn Peeters <[email protected]> Date: Fri Apr 26 15:26:39 2024 +0200 Woops commit f11760d2c13e817e23cfa5e26b24f74cf817f65e Author: Stijn Peeters <[email protected]> Date: Fri Apr 26 15:26:04 2024 +0200 
Update list of supported platforms in readme commit 760ff1cdeb006f70acaa00ded82fb3cbc7617c9d Author: Stijn Peeters <[email protected]> Date: Fri Apr 26 12:13:28 2024 +0200 Bump version commit 1fd78b2362840299e80f5540c9fedc1be3b06da1 Author: Stijn Peeters <[email protected]> Date: Thu Apr 25 12:58:24 2024 +0200 Use MissingMappedField for Douyin fields undefined in the source data commit 6918baeabc7a08b6a63495c5d38c86b2c88bca44 Author: Stijn Peeters <[email protected]> Date: Thu Apr 25 12:31:11 2024 +0200 Fix Douyin mapping failure if cellRoom is $undefined commit aad6208167c07686348234daff4dcf9cd036f5a5 Author: Stijn Peeters <[email protected]> Date: Thu Apr 25 12:30:53 2024 +0200 Better error when trying to import data for unknown datasource commit 43c6ed646994111188bde66d5bcfe4ab602e8512 Author: Stijn Peeters <[email protected]> Date: Thu Apr 25 12:30:31 2024 +0200 Fix Twitter mapping on URLs that cannot be expanded commit 91c3da176fad90ba16871fa8892fac5a0df13785 Author: Stijn Peeters <[email protected]> Date: Thu Apr 25 12:12:54 2024 +0200 Safe cast to int in CrowdTangle import commit 765f29e9232afdf284ab1667b0f371951e0bf2f4 Author: Stijn Peeters <[email protected]> Date: Wed Apr 24 12:37:02 2024 +0200 Fix erroneous shell command in front-end restart trigger commit c99fdd9eca8f5925d93375cac846e8b7633194fb Merge: 342a4037 bc1deddf Author: Stijn Peeters <[email protected]> Date: Tue Apr 23 12:29:35 2024 +0200 Merge branch 'master' of https://github.com/digitalmethodsinitiative/4cat commit 342a4037411e7ccaa50b25a4686434bec39e2568 Author: Stijn Peeters <[email protected]> Date: Tue Apr 23 12:29:32 2024 +0200 Enable TikTok comment and Gab import by default commit bc1deddf57aa5049fb79622c4309fb7051d77bdb Merge: 537d7645 3c644f01 Author: Dale Wahl <[email protected]> Date: Tue Apr 23 12:16:37 2024 +0200 Merge branch 'master' of https://github.com/digitalmethodsinitiative/4cat commit 537d76456e2826e8c4dd7026ec5b2d436370fad8 Author: Dale Wahl <[email protected]> Date: Tue 
Apr 23 12:14:46 2024 +0200 do the todo: fix column_filter to match exact/contains with int commit 3c644f01baeca34e712d36efdf5c77ccd3ef7a06 Author: Stijn Peeters <[email protected]> Date: Tue Apr 23 11:16:07 2024 +0200 Don't crash on empty URLs in dataset merge commit f1574c26e2e3bdc40cc04bb8193cf6d3fa14792b Author: Dale Wahl <[email protected]> Date: Thu Apr 18 12:08:55 2024 +0200 fix: do not fail when no processor exists weird! failed on a dataset `type="custom-search"` which was created by an import script w/ no processor. Also likely would make deprecated processors fail. 500 server error: ``` File "/opt/4cat/common/lib/dataset.py", line 800, in get_columns return self.get_item_keys(processor=self.get_own_processor()) File "/opt/4cat/common/lib/dataset.py", line 405, in get_item_keys keys = list(items.__next__().keys()) File "/opt/4cat/common/lib/dataset.py", line 337, in iterate_items if own_processor.map_item_method_available(dataset=self): AttributeError: 'NoneType' object has no attribute 'map_item_method_available' ``` commit 50a4434a37d71af6a9470c7fc4a236b043cbfb4d Author: Stijn Peeters <[email protected]> Date: Wed Apr 17 14:30:58 2024 +0200 Add "TikTok comments" data source commit c43e76daae3c2e6ecdb218ee749315b985eccca4 Author: Stijn Peeters <[email protected]> Date: Tue Apr 16 17:59:25 2024 +0200 Allow notifications per tag commit 36984104e674e8577756bfc3fdd5c72f6569d9e1 Author: Dale Wahl <[email protected]> Date: Tue Apr 16 17:25:38 2024 +0200 fix: pass dataset to get_options when queuing processors commit 59cb19a3c88f7f4a4ac02d0b7a891afde50ea069 Author: Dale Wahl <[email protected]> Date: Tue Apr 16 10:55:29 2024 +0200 fix: dicts are shared in classes & you cannot delete a key more than once randomly found this; probably as no one else has reddit enabled! 
commit 3ec9c6ea471bcdbe9fb1caad1e5fe1502a705444 Author: Dale Wahl <[email protected]> Date: Mon Apr 15 13:22:19 2024 +0200 fix results page error when dataset was being created; do not check for resultspage updates when user not focused on page commit db05ae5e565248e865e67b8ea60e6653357bb1f4 Author: Dale Wahl <[email protected]> Date: Mon Apr 15 11:27:33 2024 +0200 on import file, differentiate between missing field(s) and unable to map item commit 940bac72c7e53bec9e136867c13e2a0a355961a4 Author: Stijn Peeters <[email protected]> Date: Fri Apr 12 12:57:48 2024 +0200 Case-insensitive username/note matching in user list commit d0f34245bd07b5ad2fd3e90754ef0264ffc350a9 Author: Stijn Peeters <[email protected]> Date: Fri Apr 12 12:29:12 2024 +0200 Only determine settings tab name in one place commit 9f69d7bc0bbb657be1e725d5fb3fe350b7205bff Author: Stijn Peeters <[email protected]> Date: Fri Apr 12 12:20:34 2024 +0200 git != github commit 9b4981d8c7358f31ed65d9f161d556e578389801 Author: Stijn Peeters <[email protected]> Date: Fri Apr 12 11:56:04 2024 +0200 Fix issues with user tags Fix number of users in tag overview; allow filtering by user tags on user list; don't delete all user tags when deleting one commit 9e8ccd3a78765acdfd2005eaa215dc0dc07266e0 Author: Stijn Peeters <[email protected]> Date: Fri Apr 12 11:32:45 2024 +0200 Do not hide all non-hidden child processors lol commit 3f15410af3a278f5644f41f49e25498a1fac3c76 Author: Stijn Peeters <[email protected]> Date: Fri Apr 12 11:23:52 2024 +0200 Disable standard video downloader for Telegram commit 94c814b9cab2ae2be10d5c5d3f6cfe20898e349c Author: Stijn Peeters <[email protected]> Date: Fri Apr 12 11:14:16 2024 +0200 Telegram video downloader processor commit d36254a188947fff507e8df59f793e98b3be1570 Author: Stijn Peeters <[email protected]> Date: Fri Apr 12 11:14:04 2024 +0200 Better styling for 4CAT settings, alphabetic order, submenus commit 808300fa109f306a921f2048b2cf4b6dafc4ba5f Author: Stijn Peeters <[email 
protected]> Date: Thu Apr 11 14:44:32 2024 +0200 Fix multiselect in UI commit 131a0eca0ad514b1ee57803e5c560ab0e56de42d Author: Stijn Peeters <[email protected]> Date: Mon Apr 8 18:28:04 2024 +0200 Do not attempt to load crashed file as module in Slack webhook. Fixes #422 (hopefully) commit 6d8cb067bc12f8be68749f74a7291e0849494225 Author: Stijn Peeters <[email protected]> Date: Fri Apr 5 19:43:58 2024 +0200 Allow comma-separated list when adding new dataset owners commit 2612aea49f63c37ac691cc89c553c764ead2344f Author: Stijn Peeters <[email protected]> Date: Fri Apr 5 19:40:04 2024 +0200 Include number of users with tag on tag page commit 39f2ec40faa3b8493bd5525279aeaeb2e4f586e0 Author: Stijn Peeters <[email protected]> Date: Fri Apr 5 19:26:02 2024 +0200 Fix confirmation before deleting user tag commit b00a410a3441e7f2a9d73a9f2dfb0f4ef70ea8a5 Author: Stijn Peeters <[email protected]> Date: Fri Apr 5 19:25:01 2024 +0200 Add link to users with tag on tag admin page commit 3ef3e5ec9adbd8ddd128ce2b3f8fa3b1de1297e3 Author: Stijn Peeters <[email protected]> Date: Fri Apr 5 18:49:25 2024 +0200 Give filtered datasets a more sensible label, based on source dataset commit 0d5870b78fb73cb58231736cc8a2efbb0b3cd88a Author: Dale Wahl <[email protected]> Date: Fri Apr 5 17:40:57 2024 +0200 update iterate methods (#418) * working to make iterate_mapped_item primary method used by processors and elsewhere in 4CAT; iterate_item method only internally (and provide item directly as is from file) with iterate_mapped_object as intermediate method to use map_missing method and handle missing values as well as warn if needed * switch from iterate_items to iterate_mapped_items; careful attention to item_to_yield allowing a choice of the original item, the mapped item, or both * revert some unecessary renaming * fix annotations bug... this fixes the bug, but i noticed that the notations saved in the database do not have the correct post IDs. 
* Introduce DatasetItem class and simplify iterate_items * Don't crash when no item mapper * ...actually commit the DatasetItem class * Fix typos in comment --------- Co-authored-by: Stijn Peeters <[email protected]> Co-authored-by: Sal Hagen <[email protected]> commit 17b77351c51ace21b7057276bbae9da2643a3fc4 Author: Stijn Peeters <[email protected]> Date: Fri Apr 5 16:20:19 2024 +0200 Allow dynamic form options in processors (#397) * Allow dynamic form options in processors * Allow 'requires' on data source options as well * Handle list values with requires * Wider support for file upload in processors * Log file uploads in DMI service manager * fix error w/ datasources having file option * fix fourcat.js use of checkboxes for dynamic settings * Fix faulty toggleButton targeting --------- Co-authored-by: Dale Wahl <[email protected]> commit 693fcedc93ee4476a60d0e0876e688f82a8526fa Author: Dale Wahl <[email protected]> Date: Fri Apr 5 15:59:10 2024 +0200 Add method to processors to toggle display in UI (#411) * add ui_only parameter to DataSet.get_available_processors() and BasicProcessor.display_in_ui() Allow using `display_in_ui` to hide processors from UI but allow them to be queued either via API or presets. This avoids issue of is_compatible_with() having to be used to hide processors with sometimes ill effects. * keep same data structure.... * don't delete twice; it's redundant... 
and raises an error * Rename arguments/properties * Exclude hidden processors in top level view * fix logic * Exclude in child template as well --------- Co-authored-by: Stijn Peeters <[email protected]> commit 3cd146c2908da6b3a06a0c1511bf042c4223af0f Author: Dale Wahl <[email protected]> Date: Thu Apr 4 16:41:39 2024 +0200 fix: whoops remove debug commit daa7291e813e62fed4600a4acb8430004836cb86 Author: Dale Wahl <[email protected]> Date: Thu Apr 4 15:16:30 2024 +0200 CSV preview add hyperlinks if "url" or "link" in column header commit 5f2d6e65bad4f71b2c3cc75d2cdab76f15671d4c Author: Dale Wahl <[email protected]> Date: Thu Apr 4 15:16:01 2024 +0200 blip2 processor to work w/ DMI Service Manager commit fe881dec18778d99ac4a0f60ca40a1f43fdb1689 Author: Dale Wahl <[email protected]> Date: Thu Apr 4 09:53:30 2024 +0200 catch AttributeError on slackhook if unable to read file ever vigilant against a lack of flavour... commit 2808256b1fabf2e6e8a5a94aad98af60c50fb7b0 Merge: 14123847 eb474640 Author: Dale Wahl <[email protected]> Date: Wed Apr 3 17:28:40 2024 +0200 Merge branch 'master' of https://github.com/digitalmethodsinitiative/4cat commit 14123847b5852bf0e7c84fced6c2380165ec93f6 Author: Dale Wahl <[email protected]> Date: Wed Apr 3 17:28:38 2024 +0200 staging_areas should not be made for completed datasets (else they may be deleted prematurely) commit eb474640559ee3e914d9c95adb60be09b906f1d6 Merge: bbdf2ab9 3f8b285c Author: sal-phd-desktop <[email protected]> Date: Wed Apr 3 16:50:54 2024 +0200 Merge branch 'master' of https://github.com/digitalmethodsinitiative/4cat commit bbdf2ab9b4292c14911ac01b481c829defa85e5c Author: sal-phd-desktop <[email protected]> Date: Wed Apr 3 16:50:36 2024 +0200 Helper script to export the 'classic' 4CAT 4chan data commit 3f8b285c44c33a3ce08e885889b311bc454a70ea Merge: 8f40f3f5 f7cc5b8d Author: Sal Hagen <[email protected]> Date: Wed Apr 3 12:12:17 2024 +0200 Merge branch 'master' of https://github.com/digitalmethodsinitiative/4cat 
commit 8f40f3f5222a63e93f46eb3b57791d10060a0cc8 Author: Sal Hagen <[email protected]> Date: Wed Apr 3 12:12:13 2024 +0200 Tumblr search typo commit f7cc5b8d012dec3d8e0c8847ae16c662e82040b5 Author: Stijn Peeters <[email protected]> Date: Tue Apr 2 12:32:51 2024 +0200 More/less flavour in restart worker commit 073587efc581adca0608988573ac83ea8b0c93d0 Author: Dale Wahl <[email protected]> Date: Wed Mar 27 14:15:27 2024 +0100 create favicon.ico (remove from repo) be sure to keep webtool/static/img/favicon/favicon-bw.ico as basis commit 28d733d56204231f4089660ff61282174aac7aed Author: Dale Wahl <[email protected]> Date: Wed Mar 27 09:44:45 2024 +0100 add allow_access_request check to request-password page clicking it would only return the user to the login page anyway, but better not even show it commit 1f2cb77e3cb0fc9b5403da52aaa925b33089d18f Author: Dale Wahl <[email protected]> Date: Wed Mar 27 09:37:51 2024 +0100 fix can_request_access to use 4cat.allow_access_request option commit 0d66f11d3619af798d5acc41dbf4fe118b7ddad8 Merge: 25825383 05b3fc07 Author: Stijn Peeters <[email protected]> Date: Tue Mar 26 17:54:48 2024 +0100 Merge branch 'master' of https://github.com/digitalmethodsinitiative/4cat commit 2582538303e31470ed6bf8a01645f7b45af15e5d Author: Stijn Peeters <[email protected]> Date: Tue Mar 26 17:54:45 2024 +0100 More permissive timeout for pixplot commit 05b3fc0771ded10dc55db799e8f47e42add08d43 Author: Dale Wahl <[email protected]> Date: Tue Mar 26 14:01:59 2024 +0100 remove redundant call of Path commit e4a93442efb84d73d6a4c9af9bc46a8f3e3fdda2 Author: Stijn Peeters <[email protected]> Date: Tue Mar 26 11:52:09 2024 +0100 Include column with link description in Telegram mapping commit 876f4a4b6df51ec4b30a048c32191438b6778f90 Author: Dale Wahl <[email protected]> Date: Mon Mar 25 14:48:47 2024 +0100 douyin handle image posts commit 81ad61baabaf965b1c848f55a80c23bd3e1a9000 Author: Stijn Peeters <[email protected]> Date: Mon Mar 25 08:01:44 2024 +0100 Accept 
non-numeric IDs in Telegram image downloader commit a8b36dc5682df7c16e25474ea8fdbfc4f12f9d46 Author: Stijn Peeters <[email protected]> Date: Sun Mar 24 23:15:51 2024 +0100 Ensure unique IDs for Telegram datasets commit 4a3e9ffee072c4d3efb7bfd8744369b46f19eef2 Merge: 0c119130 d749237e Author: Stijn Peeters <[email protected]> Date: Sun Mar 24 22:56:59 2024 +0100 Merge branch 'master' of https://github.com/digitalmethodsinitiative/4cat commit 0c11913049aabb5a83ffe26d58bdf17affdbc0b9 Author: Stijn Peeters <[email protected]> Date: Sun Mar 24 20:09:10 2024 +0100 Better string formatting in Telegram image downloader commit 8a7da5317defdafb5bdbf74dcbeb68e464fa21f4 Author: Stijn Peeters <[email protected]> Date: Sun Mar 24 20:06:06 2024 +0100 Add 'link thumbnails' op…
Found a couple of issues. The most fun one was that, while we were checking whether discovered entities were already in the to-do queries or among the original queries, we did not update the full list of queries (and kept popping queries out of the to-do list). As a result, a popped query no longer counted as known and could be queued again. I unknowingly left my computer stuck in a loop overnight between two channels referencing each other.
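To illustrate the loop fix, here is a minimal sketch and not the actual 4CAT code. The names `crawl`, `discover`, `to_do` and `seen` are hypothetical. The idea is to keep a separate set of every query ever queued, which is never popped from, so two channels that reference each other are each visited only once.

```python
from collections import deque


def crawl(initial_queries, discover):
    """Visit each entity at most once, even if channels reference each other.

    `discover` is a hypothetical callable returning the entities
    referenced by a given query.
    """
    # Work queue: items are removed from it as they are processed
    to_do = deque(dict.fromkeys(initial_queries))
    # Full list of queries ever queued: nothing is ever removed from it
    seen = set(to_do)
    visited = []

    while to_do:
        query = to_do.popleft()
        visited.append(query)
        for entity in discover(query):
            # Check against `seen`, not `to_do`, so already-processed
            # queries are not queued a second time
            if entity not in seen:
                seen.add(entity)
                to_do.append(entity)

    return visited


# Two channels that reference each other
links = {"channel_a": ["channel_b"], "channel_b": ["channel_a"]}
result = crawl(["channel_a"], lambda query: links.get(query, []))
print(result)
```

Checking against only the to-do list is what allows the loop: once `channel_a` has been popped, `channel_b` referencing it looks like a new discovery.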
Related, we should be yielding posts rather than storing them all in memory; if you were to do some serious crawling (or got stuck in a loop), your computer might crash. 😂 I am not an `asyncio` savant, but it does offer asynchronous generators. I may take a look at addressing this later.

I also fixed the check that identifies whether an entity meets the requirements (it looks like some variable names may have been swapped, so it only worked under certain circumstances). I created a PR, though, because it looks like the data structure may have changed and I could not find `_type` as originally used to select the correct forward types. Right now it captures all types of forwards, but I am not sure whether that is the desired behavior, given the notes in the code.
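On the memory point above, this is roughly what I mean by an asynchronous generator. It is only a sketch: `fetch_posts` and `count_posts` are made-up names standing in for the real collection code, and the fake posts replace actual API calls. The consumer handles one post at a time, so the full result set is never held in memory.

```python
import asyncio


async def fetch_posts(channel, limit):
    """Hypothetical stand-in for a paginated API call; yields posts one by one."""
    for post_id in range(limit):
        # Hand control back to the event loop, as a real network call would
        await asyncio.sleep(0)
        yield {"channel": channel, "id": post_id}


async def count_posts(channel, limit):
    """Consume the generator without building a list of all posts."""
    count = 0
    async for post in fetch_posts(channel, limit):
        # A real consumer would write `post` to disk here
        count += 1
    return count


total = asyncio.run(count_posts("channel_a", 1000))
print(total)
```

Whether this fits the existing collection code is something I have not checked; it would need the calling code to consume posts with `async for` instead of expecting a list.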