telegram crawl improvements #444

Merged
merged 2 commits into from
Sep 23, 2024

@dale-wahl commented Jul 23, 2024

Found a couple of issues. The most fun one: while we were checking whether discovered entities were already in the to-do queries or among the original queries, we never updated the full list of queries, and we kept popping queries off the to-do list. An entity that had already been crawled could therefore be queued again. I unknowingly left my computer stuck in a loop overnight between two channels referencing each other.
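
To illustrate the kind of bookkeeping that avoids this loop, here is a minimal sketch, not the actual 4CAT code. The names `crawl`, `get_references`, and `max_depth` are hypothetical; the point is only that the set of known queries grows whenever something is queued and is never shrunk when a query is popped.

```python
from collections import deque


def crawl(start_queries, get_references, max_depth=2):
    """Breadth-first crawl that queues each entity at most once.

    `get_references` is a hypothetical callable returning the entities
    (e.g. channels) referenced by a query.
    """
    start = list(dict.fromkeys(start_queries))  # drop duplicate start queries
    seen = set(start)                           # every query ever queued
    todo = deque((query, 0) for query in start)
    visited = []

    while todo:
        query, depth = todo.popleft()           # popping does not touch `seen`
        visited.append(query)
        if depth >= max_depth:
            continue
        for entity in get_references(query):
            if entity in seen:
                continue                        # already queued or crawled
            seen.add(entity)
            todo.append((entity, depth + 1))
    return visited


# Two channels that reference each other no longer cause an endless loop.
links = {"channel_a": ["channel_b"], "channel_b": ["channel_a"]}
order = crawl(["channel_a"], lambda query: links.get(query, []))
```

Here `order` contains each channel exactly once, because `channel_a` is still in `seen` when `channel_b` refers back to it.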

Related: we should be yielding posts rather than storing them all in memory; if you were to do some serious crawling (or got stuck in a loop), your computer might crash. 😂 I am not an asyncio savant, but Python does offer asynchronous generators that work with it. I may take a look at addressing this later.
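
As a rough sketch of what yielding could look like, assuming nothing about the real Telegram collector: `fetch_page`, `iter_posts`, and `count_posts` below are hypothetical stand-ins. The asynchronous generator hands posts to the consumer one at a time, so only the current page is held in memory.

```python
import asyncio


async def fetch_page(channel, page):
    """Hypothetical stand-in for a network call returning one page of posts."""
    await asyncio.sleep(0)  # yield control, as a real request would
    if page >= 3:
        return []           # no more pages
    return [f"{channel}-post-{page * 2 + i}" for i in range(2)]


async def iter_posts(channel):
    """Asynchronous generator: yields posts one by one as pages arrive."""
    page = 0
    while True:
        posts = await fetch_page(channel, page)
        if not posts:
            return
        for post in posts:
            yield post
        page += 1


async def count_posts(channel):
    """Consume the generator without accumulating posts in a list."""
    count = 0
    async for _post in iter_posts(channel):
        count += 1  # a real consumer could write each post to disk here
    return count


total = asyncio.run(count_posts("channel_a"))
```

The consumer decides what to do with each post as it arrives, so memory use stays bounded by the page size rather than by the size of the crawl.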

I also fixed the check that identifies whether an entity meets the requirements (it looks like some variable names were swapped, so it only worked under certain circumstances). I created a PR, though, because it looks like the data structure may have changed and I could not find `_type`, which was originally used to select the correct forward types. Right now it captures all types of forwards, but I am not sure from the notes in the code whether that is the desired behavior.

commit dd85961696de3d01fa48cfbbac8a31a4374edc83
Author: sal-phd-desktop <[email protected]>
Date:   Mon Sep 23 14:37:50 2024 +0200

    Only import bsky embed JS on front page, make divs wider

commit 02f90bd1559d710360324e1dca116e8c5519f9fe
Author: sal-phd-desktop <[email protected]>
Date:   Fri Sep 20 15:03:09 2024 +0200

    Link to Bluesky in readme

commit e675dd04a9ffb45cc72704763b7553fee6cf59a2
Merge: 070035eb 38418b2e
Author: sal-phd-desktop <[email protected]>
Date:   Fri Sep 20 15:01:45 2024 +0200

    Merge branch 'master' of https://github.com/digitalmethodsinitiative/4cat

commit 070035ebf0cf4065f32f00e78044bb24a22172bd
Author: sal-phd-desktop <[email protected]>
Date:   Fri Sep 20 15:01:31 2024 +0200

    Link to Bluesky in readme

commit 38418b2ec1533f5e13c8d3f001903db0bfdab4af
Author: Sal Hagen <[email protected]>
Date:   Thu Sep 19 17:27:00 2024 +0200

    Host BlueSky widget ourselves

commit e281eb8bdfad3ec4c800bec2a64e6ff3263a2f74
Author: Stijn Peeters <[email protected]>
Date:   Thu Sep 19 15:32:08 2024 +0200

    Refactor module loading (#396)

    * Refactor module loading

    * Optionally inject modules when instantiating dataset object

    * pass modules in a few more places where possible

    I think that is everywhere in the frontend.

    Backend is a bit odd as we are passing dataset.modules when it is None and thus creating children that would require individual inits of ModuleCollector. Could be more to look at there.

    * Do not lazy-load modules

    * modules/all_modules

    * Squashed commit of the following:

    commit 3f2a62a124926cfeb840796f104a702878ac10e5
    Author: Carsten Schnober <[email protected]>
    Date:   Wed Sep 18 18:18:29 2024 +0200

        Update Gensim to >=4.3.3, <4.4.0 (#450)

        * Update Gensim to >=4.3.3, <4.4.0

        * update nltk as well

        ---------

        Co-authored-by: Dale Wahl <[email protected]>
        Co-authored-by: Sal Hagen <[email protected]>

    commit fee2c8c08617094f28496963da282d2e2dddeab7
    Merge: 3d94b666 f8e93eda
    Author: sal-phd-desktop <[email protected]>
    Date:   Wed Sep 18 18:11:19 2024 +0200

        Merge branch 'master' of https://github.com/digitalmethodsinitiative/4cat

    commit 3d94b666cedd0de4e0bee953cbf1d787fdc38854
    Author: sal-phd-desktop <[email protected]>
    Date:   Wed Sep 18 18:11:04 2024 +0200

        FINALLY remove 'News' from the front page, replace with 4CAT BlueSky updates and potential information about the specific server (to be set on config page)

    commit f8e93edabe9013a2c1229caa4c454fab09620125
    Author: Stijn Peeters <[email protected]>
    Date:   Wed Sep 18 15:11:21 2024 +0200

        Simple extensions page in Control Panel

    commit b5be128c7b8682fb233d962326d9118a61053165
    Author: Stijn Peeters <[email protected]>
    Date:   Wed Sep 18 14:08:13 2024 +0200

        Remove 'docs' directory

    commit 1e2010af44817016c274c9ec9f7f9971deb57f66
    Author: Stijn Peeters <[email protected]>
    Date:   Wed Sep 18 14:07:38 2024 +0200

        Forgot TikTok and Douyin

    commit c757dd51884e7ec9cf62ca1726feacab4b2283b7
    Author: Stijn Peeters <[email protected]>
    Date:   Wed Sep 18 14:01:31 2024 +0200

        Say 'zeeschuimer' instead of 'extension' to avoid confusion with 4CAT extensions

    commit ee7f4345478f923541536c86a5b06246deae03f6
    Author: Stijn Peeters <[email protected]>
    Date:   Wed Sep 18 14:00:40 2024 +0200

        RIP Parler data source

    commit 11300f2430b51887823b280405de4ded4f15ede1
    Author: Stijn Peeters <[email protected]>
    Date:   Wed Sep 18 11:21:37 2024 +0200

        Tuplestring

    commit 547265240eba81ca0ad270cd3c536a2b1dcf512d
    Author: Stijn Peeters <[email protected]>
    Date:   Wed Sep 18 11:15:29 2024 +0200

        Pass user obj instead of str to ConfigWrapper in Processor

    commit b21866d7900b5d20ed6ce61ee9aff50f3c0df910
    Author: Stijn Peeters <[email protected]>
    Date:   Tue Sep 17 17:45:01 2024 +0200

        Ensure request-aware config reader in user object when using config wrapper

    commit bbe79e4b0fe870ccc36cab7bfe7963b28d1948e3
    Author: Sal Hagen <[email protected]>
    Date:   Tue Sep 17 15:12:46 2024 +0200

        Fix extension path walk for Windows

    commit d6064beaf31a6a85b0e34ed4f8126eb4c4fc07e3
    Author: Stijn Peeters <[email protected]>
    Date:   Mon Sep 16 14:50:45 2024 +0200

        Allow tags that have no users

        Use case: tag-based frontend differentiation using X-4CAT-Config-Via-Proxy

    commit b542ded6f976809ec88445e7b04f2c81b900188e
    Author: Stijn Peeters <[email protected]>
    Date:   Mon Sep 16 14:13:14 2024 +0200

        Trailing slash in query results list

    commit a4bddae575b22a009925206a1337bdd89349e567
    Author: Dale Wahl <[email protected]>
    Date:   Mon Sep 16 13:57:23 2024 +0200

        4CAT Extension - easy(ier) adding of new datasources/processors that can be maintained separately from 4CAT base code (#451)

        * domain only

        * fix reference

        * try and collect links with selenium

        * update column_filter to find multiple matches

        * fix up the normal url_scraper datasource

        * ensure all selenium links are strings for join

        * change output of url_scraper to ndjson with map_items

        * missed key/index change

        * update web archive to use json and map to 4CAT

        * fix no text found

        * and none on scraped_links

        * check key first

        * fix up web_archive error reporting

        * handle None type for error

        * record web archive "bad request"

        * add wait after redirect movement

        * increase waittime for redirects

        * add processor for trackers

        * dict to list for addition

        * allow both newline and comma separated links

        * attempt to scrape iframes as separate pages

        * Fixes for selenium scraper to work with config database

        * installation of packages, geckodriver, and firefox if selenium enabled

        * update install instructions

        * fix merge error

        * fix dropped function

        * have to be kidding me

        * add note; setup requires docker... need to think about IF this will ever
        be installed without Docker

        * separate selenium class into wrapper and Search class so wrapper can be
        used in processors!

        * add screenshots; add firefox extension support

        * update selenium definitions

        * regex for extracting urls from strings

        * screenshots processor; extract urls from text and takes screenshots

        * Allow producing zip files from data sources

        * import time

        * pick better default

        * test screenshot datasource

        * validate all params

        * fix enable extension

        * haha break out of while loop

        * count my items

        * whoops, len() is important here

        * must be getting tired...

        * remove redundant logging

        * Eager loading for screenshots, viewport options, etc

        * Woops, wrong folder

        * Fix label shortening

        * Just 'queue' instead of 'search queue'

        * Yeah, make it headless

        * README -> DESCRIPTION

        * h1 -> h2

        * Actually just have no header

        * Use proper filename for downloaded files

        * Configure whether to offer pseudonymisation etc

        * Tweak descriptions

        * fix log missing data

        * add columns to post_topic_matrix

        * fix breadcrumb bug

        * Add top topics column

        * Fix selenium config install parameter (Docker uses this/manual would
        need to run install_selenium, well, manually)

        * this processor is slow; i thought it was broken long before it updated!

        * refactor detect_trackers as conversion processor not filter

        * add geckodriver executable to docker install

        * Auto-configure webdrivers if available in PATH

        * update screenshots to act as image-downloader and benefit from processors

        * fix is_compatible_with

        * Delete helper-scripts/migrate/migrate-1.30-1.31.py

        * fix embeddings is_compatible_with

        * fix up UI options for hashing and private

        * abstract was moved to lib

        * various fixes to selenium based datasources

        * processors not compatible with image datasets

        * update firefox extension handling

        * screenshots datasource fix get_options

        * rename screenshots processor to be detected as image dataset

        * add monthly and weekly frequencies to wayback machine datasource

        * wayback ds: fix fail if all attempts do not realize results; addion frequency options to options; add daily

        * add scroll down page to allow lazy loading for entire page screenshots

        * screenshots: adjust pause time so it can be used to force a wait for images to load

        I have not successfully come up with or found a way to wait for all images to load; document.readyState == 'complete' does not function in this way on certain sites including the wayback machine

        * hash URLs to create filenames

        * remove log

        * add setting to toggle display advanced options

        * add progress bars

        * web archive fix query validation

        * count subpages in progress

        * remove overwritten function

        * move http response to own column

        * special filenames

        * add timestamps to all screenshots

        * restart selenium on failure

        * new build have selenium

        * process urls after start (keep original query parameters)

        * undo default firefox

        * quick max

        * rename SeleniumScraper to SeleniumSearch

        todo: build SeleniumProcessor!

        * max number screenshots configurable

        * method to get url with error handling

        * use get_with_error_handling

        * d'oh, screenshot processor needs to quit selenium

        * update log to contain URL

        * Update scrolling to use Page down key if necessary

        * improve logs

        * update image_category_wall as screenshot datasource does not have category column; this is not ideal and ought to be solved in another way.

        Also, could I get categories from the metadata? That's... ugh.

        * no category, no processor

        * str errors

        * screenshots: dismiss alerts when checking ready state is complete

        * set screenshot timeout to 30 seconds

        * update gensim package

        * screenshots: move processor interrupt into attempts loop

        * if alert disappears before we can dismiss it...

        * selenium specific logger

        * do not switch window when no alert found on dismiss

        * extract wait for page to load to selenium class

        * improve descriptions of screenshot options

        * remove unused line

        * treat timeouts differently from other errors

        these are more likely due to an issue with the website in question

        * debug if requested

        * increase pause time

        * restart browser w/ PID

        * increase max_workers for selenium

        this is by individual worker class not for all selenium classes... so you can really crank them out if desired

        * quick fix restart by pid

        * avoid bad urls

        * missing bracket & attempt to fix-missing dependencies in Docker install

        * Allow dynamic form options in processors

        * Allow 'requires' on data source options as well

        * Handle list values with requires

        * basic processor for apple store; setup checks for additional requirements

        * fix is_4cat_class

        * show preview when no map_item

        * add google store datasource

        * Docker setup.py use extensions

        * Wider support for file upload in processors

        * Log file uploads in DMI service manager

        * add map_item methods and record more data per item

        need additional item data as map_item is staticmethod

        * update from master; merge conflicts

        * fix docker build context (ignore data files)

        * fix option requirements

        * apple store fix: list still tries to get query

        * apple & google stores fix up item mapping

        * missed merge error

        * minor fix

        * remove unused import

        * fix datasources w/ files frontend error

        * fix error w/ datasources having file option

        * better way to name docker volumes

        * update two other docker compose files

        * fix docker-compose ymls

        * minor bug: fix and add warning; fix no results fail

        * update apple field names to better match interface

        * update google store fieldnames and order

        * sneak in jinja logger if needed

        * fix fourcat.js handling checkboxes for dynamic settings

        * add new endpoint for app details to apple store

        * apple_store map new beta app data

        * add default lang/country

        * not all apps have advisories

        * revert so button works

        * add chart positions to beta map items

        * basic scheduler

        To-do
        - fix up and add options to scheduler view (e.g. delete/change)
        - add scheduler view to navigator
        - tie jobs to datasets? (either in scheduler view or, perhaps, filter dataset view)
        - more testing...

        * update scheduler view, add functions to update job interval

        * revert .env

        * working scheduler!

        * basic scheduler view w/ datasets

        * fix postgres tag

        * update job status in scheduled_jobs table

        * fix timestamp; end_date needed for last run check; add dataset label

        * improve scheduler view

        * remove dataset from scheduled_jobs table on delete

        * scheduler view order by last creation

        * scheduler views: separate scheduler list from scheduled dataset list

        * additional update from master fixes

        * apple_store map_items fix missing locales

        * add back depth for pagination

        * correct route

        * modify pagination to accept args

        * pagination fun

        * pagination: i hate testing on live servers...

        * ok ok need the pagination route

        * pagination: add route_args

        * fix up scheduler header

        * improve app store descriptions

        * add azure store

        * fix azure links

        * azure_store: add category search

        * azure fix type of config update timestamp

        OPTION_DATE does not appear correctly in settings and causes it to be written incorrectly

        * basic aws store

        * check if selenium available; get correct app_id

        * aws: implement pagination

        * add logging; wait for elements to load after next page; attempts to rework filter option collection

        * apple_store: handle invalid param error

        * fix filter_options

        * aws: fix filter option collection!

        * more merge

        * move new datasources and processors to extensions and modify setup.py and module loader to use the new locations

        * migrate.py to run extension "fourcat_install.py" files

        * formatting

        * remove extensions; add gitignore

        * excise scheduler merge

        * some additional cleanup from app_studies branch

        * allow nested datasources folders; ignore files in extensions main folder

        * allow extension install scripts to run pip if migrate.py has not

        * Remove unused URL functions we could use ural for

        * Take care of git commit hash tracking for extension processors

        * Get rid of unused path.versionfile config setting

        * Add extensions README

        * Squashed commit of the following:

        commit cd356f7a69d15e8ecc8efffc6d63a16368e62962
        Author: Stijn Peeters <[email protected]>
        Date:   Sat Sep 14 17:36:18 2024 +0200

            UI setting for 4CAT install ad in login

        commit 0945d8c0a11803a6bb411f15099d50fea25f10ab
        Author: Stijn Peeters <[email protected]>
        Date:   Sat Sep 14 17:32:55 2024 +0200

            UI setting for anonymisation controls

            Todo: make per-datasource

        commit 1a2562c2f9a368dbe0fc03264fb387e44313213b
        Author: Stijn Peeters <[email protected]>
        Date:   Sat Sep 14 15:53:27 2024 +0200

            Debug panel for HTTP headers in control panel

        commit 203314ec83fb631d985926a0b5c5c440cfaba9aa
        Author: Stijn Peeters <[email protected]>
        Date:   Sat Sep 14 15:53:17 2024 +0200

            Preview for HTML datasets

        commit 48c20c2ebac382bd41b92da4481ff7d832dc1538
        Author: Desktop Sal <[email protected]>
        Date:   Wed Sep 11 13:54:23 2024 +0200

            Remove spacy processors (linguistic extractor, get nouns, get entities) and remove dependencies

        commit 657ffd75a7f48ba4537449127e5fa39debf4fdf3
        Author: Dale Wahl <[email protected]>
        Date:   Fri Sep 6 16:29:19 2024 +0200

            fix nltk where it matters

        commit 2ef5c80f2d1a5b5f893c8977d8394740de6d796d
        Author: Stijn Peeters <[email protected]>
        Date:   Tue Sep 3 12:05:14 2024 +0200

            Actually check progress in text annotator

        commit 693960f41b73e39eda0c2f23eb361c18bde632cd
        Author: Stijn Peeters <[email protected]>
        Date:   Mon Sep 2 18:03:18 2024 +0200

            Add processor for stormtrooper DMI service

        commit 6ae964aad492527bc5d016a00f870145aab6e1af
        Author: Stijn Peeters <[email protected]>
        Date:   Fri Aug 30 17:31:37 2024 +0200

            Fix reference to old stopwords list in neologisms preset

        * Fix Github links for extensions

        * Fix commit detection in extensions

        * Fix extension detection in module loader

        * Follow symlinks when loading extensions

        Probably not uncommon to have a checked out repo somewhere to then symlink into the extensions dir

        * Make queue message on create page more generic

        * Markdown in datasource option tooltips

        * Remove Spacy model from requirements

        * Add software_source to database SQL

        ---------

        Co-authored-by: Stijn Peeters <[email protected]>
        Co-authored-by: Stijn Peeters <[email protected]>

    commit cd356f7a69d15e8ecc8efffc6d63a16368e62962
    Author: Stijn Peeters <[email protected]>
    Date:   Sat Sep 14 17:36:18 2024 +0200

        UI setting for 4CAT install ad in login

    commit 0945d8c0a11803a6bb411f15099d50fea25f10ab
    Author: Stijn Peeters <[email protected]>
    Date:   Sat Sep 14 17:32:55 2024 +0200

        UI setting for anonymisation controls

        Todo: make per-datasource

    commit 1a2562c2f9a368dbe0fc03264fb387e44313213b
    Author: Stijn Peeters <[email protected]>
    Date:   Sat Sep 14 15:53:27 2024 +0200

        Debug panel for HTTP headers in control panel

    commit 203314ec83fb631d985926a0b5c5c440cfaba9aa
    Author: Stijn Peeters <[email protected]>
    Date:   Sat Sep 14 15:53:17 2024 +0200

        Preview for HTML datasets

    commit 48c20c2ebac382bd41b92da4481ff7d832dc1538
    Author: Desktop Sal <[email protected]>
    Date:   Wed Sep 11 13:54:23 2024 +0200

        Remove spacy processors (linguistic extractor, get nouns, get entities) and remove dependencies

    commit 657ffd75a7f48ba4537449127e5fa39debf4fdf3
    Author: Dale Wahl <[email protected]>
    Date:   Fri Sep 6 16:29:19 2024 +0200

        fix nltk where it matters

    commit 2ef5c80f2d1a5b5f893c8977d8394740de6d796d
    Author: Stijn Peeters <[email protected]>
    Date:   Tue Sep 3 12:05:14 2024 +0200

        Actually check progress in text annotator

    commit 693960f41b73e39eda0c2f23eb361c18bde632cd
    Author: Stijn Peeters <[email protected]>
    Date:   Mon Sep 2 18:03:18 2024 +0200

        Add processor for stormtrooper DMI service

    commit 6ae964aad492527bc5d016a00f870145aab6e1af
    Author: Stijn Peeters <[email protected]>
    Date:   Fri Aug 30 17:31:37 2024 +0200

        Fix reference to old stopwords list in neologisms preset

    commit 4ba872bef2968f7f8bf5831fd3a4f413420b36ed
    Author: Dale Wahl <[email protected]>
    Date:   Tue Aug 27 13:04:46 2024 +0200

        fix hatebase: default column option for OPTION_MULTI_SELECT must be list

    commit e276033542f2d22e7f614f318a01d65114a21482
    Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
    Date:   Wed Aug 21 12:53:10 2024 +0200

        Bump nltk from 3.6.7 to 3.9 (#447)

        Bumps [nltk](https://github.com/nltk/nltk) from 3.6.7 to 3.9.
        - [Changelog](https://github.com/nltk/nltk/blob/develop/ChangeLog)
        - [Commits](https://github.com/nltk/nltk/compare/3.6.7...3.9)

        ---
        updated-dependencies:
        - dependency-name: nltk
          dependency-type: direct:production
        ...

        Signed-off-by: dependabot[bot] <[email protected]>
        Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

    commit 1d749c3cf83b130ba70bdb09174f382d6711a14b
    Author: sal-phd-desktop <[email protected]>
    Date:   Wed Aug 21 12:52:54 2024 +0200

        Set UTF-8 encoding when opening stop words (fixes Windows bug)

    commit a03e5fd4252e7242563c291558606440256eb3d1
    Author: Dale Wahl <[email protected]>
    Date:   Mon Aug 19 14:19:21 2024 +0200

        remove duplicate line

    commit aa07e8c13c2d59c6b699f78133036514659ee420
    Author: Dale Wahl <[email protected]>
    Date:   Mon Jul 29 09:35:22 2024 +0200

        tweet import fix: author banner key missing when author has no banner

    commit 32dac5d2ffb936210f12f5c725514fd25a0286f1
    Author: Dale Wahl <[email protected]>
    Date:   Mon Jul 29 08:52:08 2024 +0200

        tell user when dataset is not found

        we could have a proper 404 page, but at least leave a message

    commit 2c8c860fc5378113d1352016ac26ca761adecb32
    Author: Dale Wahl <[email protected]>
    Date:   Mon Jul 22 17:41:00 2024 +0200

        telegram fix: reactions datastructure

    commit 1c0bf5e580eb16d8a6f9afa415f9febce449a537
    Author: Dale Wahl <[email protected]>
    Date:   Mon Jul 22 11:19:52 2024 +0200

        fix telegram: crawl_max_depth can be None if it is not enabled for a user

    commit 3dfe7af292b33574a31630e3a0da10954ed87d0a
    Author: Dale Wahl <[email protected]>
    Date:   Fri Jul 19 11:52:31 2024 +0200

        fix more config.get() magic

    commit 2453182bcee6e54b396b762ab77b60b8a0893638
    Author: Dale Wahl <[email protected]>
    Date:   Fri Jul 19 10:54:23 2024 +0200

        config_manager - fix `get_all` w/ one results (super rare edge); fix overwriting self.db in `with_db`

    commit 6b9cb0b5479e6e64e09a49fa2ca9effe1c5a7415
    Author: Dale Wahl <[email protected]>
    Date:   Wed Jul 17 15:20:49 2024 +0200

        add surf nginx init file

    commit 5e984e13a08d9fba7d5806a7ef4e012ce7d57319
    Author: Dale Wahl <[email protected]>
    Date:   Wed Jul 17 14:30:34 2024 +0200

        change port for surf

    commit 2ce8c354e90f939a16dad3f0155fd7d79405c79e
    Author: Dale Wahl <[email protected]>
    Date:   Wed Jul 17 12:54:11 2024 +0200

        use latest image on surf

    commit 13ec0fd3f2bed86c3b2dff73014093a6a92fbfb5
    Author: Dale Wahl <[email protected]>
    Date:   Wed Jul 17 12:46:59 2024 +0200

        update surf docker-compose.yml

         this may require a new release

    commit 78698f6ac1b22b1154d31f69543ba7b266d33191
    Author: Dale Wahl <[email protected]>
    Date:   Wed Jul 17 10:34:56 2024 +0200

        clip: handle new and old format

    commit eb7693780cb191403f107817ca30d90373929bf0
    Author: Dale Wahl <[email protected]>
    Date:   Tue Jul 16 14:27:08 2024 +0200

        DMI SM updates to use status endpoint w/ database records; run on CPU if no GPU enabled

    commit d2a787e2c1559417bb5401f3208c82954052504f
    Author: Stijn Peeters <[email protected]>
    Date:   Mon Jul 15 15:58:06 2024 +0200

        Require most recent Telethon version

    commit 346150bd9cc96ac099abd4d15fa3de39bd65e9d1
    Author: Stijn Peeters <[email protected]>
    Date:   Mon Jul 15 15:57:55 2024 +0200

        Catch UPDATE_APP_TO_LOGIN in Telegram

    commit 04acc06e95098d7e2f9b4af404447c9cfaee5b99
    Author: Stijn Peeters <[email protected]>
    Date:   Mon Jul 15 11:27:30 2024 +0200

        Unbreak Twitter error handling

    commit e9b5232a963be02c2e86dabacb607b2315a4e0e6
    Author: Stijn Peeters <[email protected]>
    Date:   Fri Jul 12 13:27:15 2024 +0200

        Ensure str type when trying to extract video URLs from a field

    commit d69dd6f337cac05ed31c05334890679976a1e6de
    Author: Stijn Peeters <[email protected]>
    Date:   Fri Jul 12 12:31:14 2024 +0200

        Make CSV column mapping params look nicer on result page

    commit 9bd9da568f593085a8d54744836e3290a75b51a7
    Author: Stijn Peeters <[email protected]>
    Date:   Fri Jul 12 12:22:03 2024 +0200

        Add "empty" and "current timestamp" as options to CSV mapping

    commit 0b574571952a206904440faf8601ddf95ab42b24
    Author: Dale Wahl <[email protected]>
    Date:   Thu Jul 11 16:59:56 2024 +0200

        image_wall: backup fit method

    commit eeb1ddeb7ca85b6802dfed3c74d1352062383d50
    Merge: 2504c37b 43239467
    Author: Stijn Peeters <[email protected]>
    Date:   Thu Jul 11 16:47:45 2024 +0200

        Merge branch 'master' of https://github.com/digitalmethodsinitiative/4cat

    commit 43239467db046eea5eb5268f91d1b63a1042238d
    Author: Dale Wahl <[email protected]>
    Date:   Thu Jul 11 12:08:08 2024 +0200

        fix processor more button

        would only show top level analysis if not logged in

    commit d6ab2b0783f8e40ecd8fadbc2abccffa6f093e39
    Author: Dale Wahl <[email protected]>
    Date:   Tue Jul 9 15:35:25 2024 +0200

        search_gab - use MappedItem

    commit 2504c37b67ff6f19720b44d8bb6054b1c3d5a155
    Author: Stijn Peeters <[email protected]>
    Date:   Sat Jul 6 17:51:22 2024 +0200

        Fix multiline spacing in multi select list

    commit fea66ce38be0717da6c1f847e7124f7069c096e2
    Author: Dale Wahl <[email protected]>
    Date:   Fri Jul 5 13:15:45 2024 +0200

        use processor media_type if dataset does not have media_type; set default media_type for downloaders

    commit d41fa34514e8177efdac7e64a31f2ee75c7d1652
    Author: Dale Wahl <[email protected]>
    Date:   Fri Jul 5 12:57:18 2024 +0200

        video_hasher: handle no metadata file

    commit 2820dcecc36ed4705a2776064d387ff7ed14e84f
    Author: Dale Wahl <[email protected]>
    Date:   Fri Jul 5 12:50:09 2024 +0200

        num_rows not num_items()

    commit fb09162db902fa22fdf2d7a3ed171ce1489bd92f
    Author: Dale Wahl <[email protected]>
    Date:   Fri Jul 5 12:44:03 2024 +0200

        Google vision API returning 400s; properly log and record processed entries; google networks should not run on empty datasets

    commit ebf39d8262d199895aedc4f7fa275c5685e58563
    Author: Dale Wahl <[email protected]>
    Date:   Fri Jul 5 12:28:13 2024 +0200

        fix image_category_wall

        whoops, cleared categories and post_values after filling them!

    commit 1ad9ec2c2e76604793ec37584c051f116af2fdab
    Author: Stijn Peeters <[email protected]>
    Date:   Fri Jul 5 12:03:54 2024 +0200

        fsdfdsgd sorry

    commit c7254c08a477c6cdc8497507e8452c3eff7101c9
    Author: Stijn Peeters <[email protected]>
    Date:   Fri Jul 5 12:01:21 2024 +0200

        Fix razdel versioning

    commit b9a327abe99f2d9ede4f2747f34f20d1dc6803cb
    Author: Stijn Peeters <[email protected]>
    Date:   Fri Jul 5 11:57:47 2024 +0200

        Reorganise tokeniser, stopwords

    commit fb13bc483af9ba0d677ee35fd045bf36ab1cddf7
    Merge: 0b745692 e3046496
    Author: Stijn Peeters <[email protected]>
    Date:   Fri Jul 5 11:56:08 2024 +0200

        Merge branch 'master' of https://github.com/digitalmethodsinitiative/4cat

    commit e30464964262870c54c73f65a3bce630d6576f45
    Author: Dale Wahl <[email protected]>
    Date:   Fri Jul 5 10:51:53 2024 +0200

        media_upload allow setting for max_form_part and warn users of failure above certain number of files

    commit e4f982b4550b352a5d1a131abd78d52e6c196e48
    Author: Dale Wahl <[email protected]>
    Date:   Fri Jul 5 09:50:49 2024 +0200

        Update media_import help text; looks like failure happens somewhere between 600-1000 files due to Flask request size limits

    commit 0b74569280f8f87376a964a6b160ea1993cb3354
    Author: Stijn Peeters <[email protected]>
    Date:   Thu Jul 4 17:55:36 2024 +0200

        Add razdel as option for Russian tokenisation

    commit 9f15a2b8d666c3b6fddeb151b7c424cb44df18a6
    Author: Dale Wahl <[email protected]>
    Date:   Thu Jul 4 17:13:15 2024 +0200

        remove the log

    commit ffcb6a4239075ba190fb534b25b89507e09e5f56
    Author: Dale Wahl <[email protected]>
    Date:   Thu Jul 4 17:12:43 2024 +0200

        Inform user if too many files are uploaded

        I do not understand why this is appearing. app.config['MAX_CONTENT_LENGTH'] is set to None. Problem persists in Flask alone (i.e., does not appear to be Gunicorn/Nginx/Apache).

    commit 9cad12dd6f64a63c48d3b5b304b5c7d9d1a6ddb7
    Author: Stijn Peeters <[email protected]>
    Date:   Thu Jul 4 15:09:42 2024 +0200

        Bump version

    commit aad94f393de77cc9d4f578e1f5be66a3601a4c90
    Author: Dale Wahl <[email protected]>
    Date:   Thu Jul 4 10:51:01 2024 +0200

        Update setup.py to ensure videohash updates

    commit d9154a6f9c46a5c793909b88da751bc71d6f759f
    Author: Dale Wahl <[email protected]>
    Date:   Tue Jul 2 17:45:26 2024 +0200

        clip: categorizing requires categories...

        seriously, guys?

    commit 0af9a5ec49bd2bcfbb87bda33976c65683f68777
    Author: Dale Wahl <[email protected]>
    Date:   Tue Jul 2 17:31:49 2024 +0200

        blip2: fix no metadata file found (uploads...)

    commit d695053f440bd938a57f06adea7b9c732ecf30d7
    Author: Dale Wahl <[email protected]>
    Date:   Tue Jul 2 17:25:26 2024 +0200

        cat_vis_wall - use str as category type if mixed

        i.e., use floats as string categories

    commit bcb914076760ea1fb0e277cdcd1782ffa101b535
    Author: Sal Hagen <[email protected]>
    Date:   Tue Jul 2 16:06:43 2024 +0200

        Add Twitter author profile pic and banner URLs

    commit 1b3b02f826578e8f702ea84a27c8ced7b1fab345
    Author: Dale Wahl <[email protected]>
    Date:   Tue Jul 2 11:42:50 2024 +0200

        add migrate.py log file in Docker

    commit 2aaa972e6888743fc329d721c37fa626cf2eeae3
    Author: Dale Wahl <[email protected]>
    Date:   Tue Jul 2 11:42:22 2024 +0200

        add necessary pip packages for upgrade in Docker environment; add error logging and save to file for troubleshooting

    commit 18b8a53c01b334e0f70610b1305d380b25dbe9c6
    Author: Dale Wahl <[email protected]>
    Date:   Tue Jul 2 11:41:36 2024 +0200

        update Dockerfile to keep build environment

        useful for interactive upgrade

    commit 7b224b9b798c9aaf956b5b618b98d742c4a2e7cd
    Author: Dale Wahl <[email protected]>
    Date:   Tue Jul 2 11:41:12 2024 +0200

        remove docker-compose.yml versions

    commit acf5de0ed02e144b920a80abfdfa35986dd0ed4c
    Author: Stijn Peeters <[email protected]>
    Date:   Mon Jul 1 17:38:32 2024 +0200

        Better issues.md, footer link

    commit 1953ca3895656ca9a12d2657e58019795ae64b3a
    Author: Dale Wahl <[email protected]>
    Date:   Mon Jul 1 12:00:07 2024 +0200

        FIX: get_key() is more a creating of a key than a general getting of a key...

    commit 12289bb5c766d1af23799ff11278b46b48fc2841
    Author: Dale Wahl <[email protected]>
    Date:   Mon Jul 1 11:37:06 2024 +0200

        .metadata.json may not have top_parent via Media Uploader

        This may exist in other processors if a proper check is not in place; will need to review

    commit 25f4ed65ec2c32298a90490cf51037a7ea2d0bf9
    Author: Dale Wahl <[email protected]>
    Date:   Tue Jun 25 14:43:40 2024 +0200

        Media upload datasource! (#419)

        * basic changes to allow files box

        * basic imports, yay!

        * video_scene_timelines to work on video imports!

        * add is_compatible_with checks to processors that cannot run on new media top_datasets

        * more is_compatible fixes

        * necessary function for checking media_types

        * enable more processors on media datasets

        * consolidate user_input file type

        * detect mimetype from filename

        best I can do without downloading all the files first.

        * handle zip archives; allow log and metadata files

        * do not count metadata or log files in num_files

        * move machine learning processors so they can be imported elsewhere

        * audio_to_text datasource

        * When validating zip file uploads, send list of file attributes instead of the first 128K of the zip file

        * Check type of files in zip when uploading media

        * Skip useless files when uploading media as zip

        * check multiple zip types in JS

        * js !=== python

        * fix media_type for loose file imports; fix extension for audio_to_text preset; fix merge for some processors w/ media_type

        ---------

        Co-authored-by: Stijn Peeters <[email protected]>

    commit 4ce689bdc3e441a7adf85883ddcda6bae0525ed9
    Author: Stijn Peeters <[email protected]>
    Date:   Mon Jun 24 11:58:50 2024 +0200

        Avoid KeyError

    commit 155522d0817d19ac7b6b0b0164242156d6f7443a
    Author: Dale Wahl <[email protected]>
    Date:   Thu Jun 20 15:58:21 2024 +0200

        add generated images to image wall w/ text visual

    commit eecde519eab1208eeb6ee53c2d8febff7fb8febf
    Author: Dale Wahl <[email protected]>
    Date:   Thu Jun 20 15:57:56 2024 +0200

        allow users to NOT generate all images from prompts

    commit d0b9574093a109997e63b1062b2bdd8e71300a29
    Author: Stijn Peeters <[email protected]>
    Date:   Wed Jun 19 16:28:26 2024 +0200

        ...don't mangle URLs in preview links

    commit c105e368a521ec54ae717bb9eb2fe9fae66cf6e8
    Merge: 0028a999 8d4f99b2
    Author: Dale Wahl <[email protected]>
    Date:   Wed Jun 19 16:25:36 2024 +0200

        Merge branch 'master' of https://github.com/digitalmethodsinitiative/4cat

    commit 0028a9994d698611dd8b546b9b3bccbeec30b74f
    Author: Dale Wahl <[email protected]>
    Date:   Wed Jun 19 16:25:12 2024 +0200

        add followups to processors

    commit 8d4f99b22e0308606c7f713ef704dfa939e85247
    Author: Stijn Peeters <[email protected]>
    Date:   Wed Jun 19 16:17:22 2024 +0200

        More flexible URL linking in CSV preview

    commit f4f8e6621bd6f2504dc3afc2078280bf5edb6444
    Author: Dale Wahl <[email protected]>
    Date:   Wed Jun 19 13:54:00 2024 +0200

        tokeniser fix: use default lang for word_tokenize if language is 'other'

    commit 127472e91d8e510f3de2a9cc4a87be6cf2d0deaa
    Author: Stijn Peeters <[email protected]>
    Date:   Tue Jun 18 16:45:01 2024 +0200

        Better log messages for Telegram data source

    commit e8714b6fba72e00c690a8d643d8dc54d2250c94a
    Author: Stijn Peeters <[email protected]>
    Date:   Mon Jun 17 17:42:21 2024 +0200

        Add 'crawl' feature to Telegram data source

        Fixes #321 (though might need a bit more testing)

    commit 25fded7b596097f7916e1793f1841bae2b63d453
    Merge: d67cf440 b10e3bb8
    Author: sal-phd-desktop <[email protected]>
    Date:   Fri Jun 14 16:23:02 2024 +0200

        Merge branch 'master' of https://github.com/digitalmethodsinitiative/4cat

    commit d67cf440730ea1d4e124c76a4c21d65b56f39c68
    Author: sal-phd-desktop <[email protected]>
    Date:   Fri Jun 14 16:22:59 2024 +0200

        Fix export 4chan script and remove some unnecessary code

    commit b10e3bb8f0c8a67aa5fdbba1962301d8acdf625c
    Author: Dale Wahl <[email protected]>
    Date:   Thu Jun 13 15:14:06 2024 +0200

        video_hasher prefix: fix extension type

    commit ba565cdaa2ebeecf23fd60889d546c76b9ea5eb1
    Author: Dale Wahl <[email protected]>
    Date:   Thu Jun 13 14:53:13 2024 +0200

        video_hasher: fix to work with Pillow updates; add max amount videos

    commit 90da5d231eff6a4249bef5468fcdbf1ebcf9247a
    Author: Dale Wahl <[email protected]>
    Date:   Thu Jun 13 10:25:24 2024 +0200

        image_cat_wall fix the fix

    commit a8b943d8e2c5471f82ea0442e2659d84fe8d9760
    Author: Dale Wahl <[email protected]>
    Date:   Wed Jun 12 13:29:41 2024 +0200

        add OCR processor to image w/ text visualization

    commit e7e636b6b89b6163fa6976e67edba68e7d75b7ac
    Author: Dale Wahl <[email protected]>
    Date:   Tue Jun 11 15:23:12 2024 +0200

        add image_wall_w_text to follow on BLIP captions

    commit f74b97827f0465baf8483040471a77e4654e70b1
    Author: Dale Wahl <[email protected]>
    Date:   Thu Jun 6 11:05:25 2024 +0200

        image_category_wall: allow multiple images per item/post

    commit e3c9ea57d46b32ba47b00a6047a278ddd530adc1
    Author: Dale Wahl <[email protected]>
    Date:   Thu May 30 16:27:50 2024 +0200

        image_category_wall convert None to str for category

    commit 00874576c354235f4655f1d433ec4382010e18e3
    Author: Dale Wahl <[email protected]>
    Date:   Thu May 30 14:54:51 2024 +0200

        image_category_wall fix float categories

    commit e0c55a8ae132bedef5da27ecbbb9489a094d454c
    Author: Dale Wahl <[email protected]>
    Date:   Thu May 30 12:51:42 2024 +0200

        download_images fix divide by zero when user can download all

    commit 3580fc9450501262badb8e61ef4b4df4b4c54322
    Author: Dale Wahl <[email protected]>
    Date:   Thu May 30 12:51:24 2024 +0200

        image_category_wall remove 'max' when user can use all images

    commit f2145bdeff1d68e46cdd3521ecbb61573f01a2f2
    Author: Dale Wahl <[email protected]>
    Date:   Wed May 29 17:59:23 2024 +0200

        rank_attributes: option to count missing data or blanks

    commit 01e7ab9677a75181bbedc62fa00e636ce2b17c18
    Author: Dale Wahl <[email protected]>
    Date:   Wed May 29 16:53:57 2024 +0200

        fix missing field strategy so default_stategy not overwritten on second loop

        default_stategy would be set correctly to the callable, but overwritten on second loop (and map_missing is a dictionary at that point).

    commit 097f838af1f5f2748578dd9072eb9e3a8b3a7057
    Author: Dale Wahl <[email protected]>
    Date:   Tue May 28 12:16:08 2024 +0200

        add log_level arg to 4cat-daemon.py

        I've been using this forever and don't know why I haven't committed it

    commit fd3ac238e60f052889d99c71588170570a384900
    Author: Dale Wahl <[email protected]>
    Date:   Tue May 28 10:10:56 2024 +0200

        google & clarifai to csv had identical "type"

        possibly caused issue w/ preset

    commit 1b9965d40aa33035a73f685c13a1ab50cc877f78
    Author: Stijn Peeters <[email protected]>
    Date:   Mon May 27 15:54:20 2024 +0200

        Ensure file cleanup worker always exists

    commit 0e0917f2232e240df3412fd4df51cf0be19248b5
    Author: Stijn Peeters <[email protected]>
    Date:   Thu May 23 17:36:22 2024 +0200

        Also update Spacy model versions...

    commit f40128213529d154cfb77afa7aa67a72d5bb640f
    Author: Stijn Peeters <[email protected]>
    Date:   Thu May 23 17:32:35 2024 +0200

        *Actually* remove typing_extensions dependency

        ???

    commit ba3d83b824c5fb6fcb0aec5e1c36b35070d6e5d9
    Author: Stijn Peeters <[email protected]>
    Date:   Thu May 23 17:30:08 2024 +0200

        Update minimum Pillow dependency version

    commit 1c3485648bf2a911052eeeae4f293f303a944aec
    Author: Stijn Peeters <[email protected]>
    Date:   Thu May 23 17:27:27 2024 +0200

        Do not require typing_extensions explicitly

        This was required to ensure Spacy could load - looks like Spacy has since been updated to work with newer versions of typing_extensions as well

    commit 3828de83ba123254463a904392f24daec626c136
    Author: Stijn Peeters <[email protected]>
    Date:   Thu May 23 17:02:04 2024 +0200

        Bump version

    commit 8f0d098107a4bbc9d55cc6048f7a38f1d1891a32
    Author: Stijn Peeters <[email protected]>
    Date:   Thu May 23 17:01:28 2024 +0200

        Require non-broken version of emoji library

    commit 4b2ad805fcc99a83e46732fc991d98d78ef06c6c
    Author: Stijn Peeters <[email protected]>
    Date:   Thu May 23 13:11:03 2024 +0200

        Show worker progress in control panel if available

    commit 9144d4503f46108437616d6bc0cf4fde74df3aca
    Author: Stijn Peeters <[email protected]>
    Date:   Thu May 23 11:07:41 2024 +0200

        Bump version

    commit 807ab77101d197ec897640480a2140439d570c05
    Author: Stijn Peeters <[email protected]>
    Date:   Wed May 22 21:57:11 2024 +0200

        Fix Instagram upload with missing media URL

    commit d0b4840fd465b6d21657c3d50f9291ac911b6082
    Author: Stijn Peeters <[email protected]>
    Date:   Wed May 22 17:35:04 2024 +0200

        Comma comma comma

    commit 7fd2e14c9505d0ed1ac77dc09c24f766ea61ee6c
    Author: Stijn Peeters <[email protected]>
    Date:   Wed May 22 17:25:26 2024 +0200

        Fix progress indicator for scene extractor

    commit 661c42c2d083da7004335b0e14910935c3d392f6
    Author: Stijn Peeters <[email protected]>
    Date:   Wed May 22 17:12:21 2024 +0200

        Don't crash video hasher on non-str item IDs

    commit 1f280321cdde27a9909885fa2f64dbeffa549fb1
    Author: Stijn Peeters <[email protected]>
    Date:   Wed May 22 17:09:53 2024 +0200

        Do not crash timelines processor when metadata has unexpected format

    commit 572d03f1f368f0ad5f47e705a119b37646148d1d
    Author: Stijn Peeters <[email protected]>
    Date:   Wed May 22 17:09:30 2024 +0200

        More efficient video frame extractor

    commit 1b51d224ca544d7e2913238adbff2049412bc41e
    Author: Stijn Peeters <[email protected]>
    Date:   Wed May 22 17:04:27 2024 +0200

        Fix crash in video stack processor with ffmpeg < 5.1

    commit ddc73cb2e2f0985e64f84ca86bc167fa9e9dc81a
    Author: Stijn Peeters <[email protected]>
    Date:   Wed May 22 17:03:48 2024 +0200

        Helper function for determining ffmpeg version

    commit ef9dd482b2258c428584997dc661156f63f68b91
    Author: Stijn Peeters <[email protected]>
    Date:   Wed May 22 12:14:58 2024 +0200

        Allow absence of articleComponent in LinkedIn posts

    commit 060f2cd7f922e7fae337b0697f7c477442d21ef1
    Author: Stijn Peeters <[email protected]>
    Date:   Wed May 22 12:12:54 2024 +0200

        Cast post IDs to string when mapping video scenes

    commit ab34c415c9ada23763b45676639ce3e80a34f594
    Author: Stijn Peeters <[email protected]>
    Date:   Wed May 22 11:46:39 2024 +0200

        Twitter -> X/Twitter

    commit de6d97554ccb68375979e5ff09c7e65d8d70a6cd
    Author: Stijn Peeters <[email protected]>
    Date:   Wed May 22 11:45:19 2024 +0200

        Colleges -> Collages

    commit 30365580dc59b4d95e8a62d1b3c666bef60ce7e8
    Author: Stijn Peeters <[email protected]>
    Date:   Tue May 21 15:41:55 2024 +0200

        Explicit disconnect after Telegram image download

    commit 5727ff7230db42463a824f45d63f0b8343caac14
    Author: Stijn Peeters <[email protected]>
    Date:   Tue May 21 14:05:50 2024 +0200

        Catch TimedOutError while downloading Telegram images

    commit e0e06686e78976f971aac620267d7e009eaaadff
    Author: Sal Hagen <[email protected]>
    Date:   Mon May 13 13:01:42 2024 +0200

        Typo in LinkedIn search

    commit 51e58dde6ca21278a80f252a8c22dc83d87ace1f
    Author: Dale Wahl <[email protected]>
    Date:   Tue May 7 13:10:43 2024 +0200

        text_from_image: fix metadata missing (indent issue)

    commit c1f8ecc1674375bba2b2e38cb29c9d4d44098f0a
    Author: Dale Wahl <[email protected]>
    Date:   Tue May 7 09:45:25 2024 +0200

        text_from_image fix: ensure metadata success before attempting to update original

    commit 72dbf80db71499c59133e1128205b756d240b300
    Merge: d7561625 baacc86b
    Author: Stijn Peeters <[email protected]>
    Date:   Fri May 3 13:14:08 2024 +0200

        Merge branch 'master' of https://github.com/digitalmethodsinitiative/4cat

    commit d7561625b127573fbb0332fbb713be6a3cb3d953
    Author: Stijn Peeters <[email protected]>
    Date:   Fri May 3 13:14:03 2024 +0200

        Comments without replies don't always have reply_comment_total

    commit baacc86b269612b4b0956345f8b9fa902df1b61f
    Author: Dale Wahl <[email protected]>
    Date:   Fri May 3 12:01:22 2024 +0200

        DSM fix and simplify GPU mem check

    commit 9b662e9f9b4f4ce194608c8e20a8fc50bc6d9ae3
    Author: Parker-Kasiewicz <[email protected]>
    Date:   Thu May 2 00:53:45 2024 -0700

        Adding Gab as a Data Source! (#401)

        * Can successfully import gab data, although
        can't tell if formatting is right because
        waiting on queued requests.

        * Version w/ different item types

        * Ingest Gab posts from Zeeschuimer

        * Small fix for merge conflicts (whoops)

        * Gab processing logic transferred from Zeeschuimer

        * fixing small errors for Gab data source

        * basic processing for truth social from Zeeschuimer

        ---------

        Co-authored-by: Dale Wahl <[email protected]>

    commit 3ecb8fd9c27aee4c457f03516794c6c4eac19c09
    Author: Stijn Peeters <[email protected]>
    Date:   Wed May 1 17:51:36 2024 +0200

        Fix duplicate line in views_admin.py

    commit 8b66ae7e467913f8e7571cf4b45493f63804266f
    Author: Stijn Peeters <[email protected]>
    Date:   Wed May 1 17:49:54 2024 +0200

        Allow processors to define which fields should be pseudonymised

    commit c973750c8cabb8698704c5997903e92d1de866d2
    Author: Stijn Peeters <[email protected]>
    Date:   Wed May 1 17:15:32 2024 +0200

        Allow auto-queue of pseudonymisation after import

    commit 49ad9f0ff785fd44ae494755b785c7fdf7c9cf15
    Author: Stijn Peeters <[email protected]>
    Date:   Wed May 1 17:08:35 2024 +0200

        Get rid of redundant and buggy next/copy_to implementation in Search class

    commit 106d3659e2fda89867d3a4f587c1c1addfaff2f7
    Author: Dale Wahl <[email protected]>
    Date:   Wed May 1 16:14:03 2024 +0200

        use current branch in settings

    commit 60bef4157d807f7c01ef3b425295244e91919f31
    Author: Stijn Peeters <[email protected]>
    Date:   Wed May 1 11:04:07 2024 +0200

        Nicer code

    commit 4182c436e4fb5109c5e041dc729f77a58d877889
    Author: Stijn Peeters <[email protected]>
    Date:   Tue Apr 30 16:19:36 2024 +0200

        Always shut down API worker only after everything else has been shut down

    commit e685108b3cbe5f005ce2df21906267071ad8118e
    Author: Stijn Peeters <[email protected]>
    Date:   Tue Apr 30 16:12:42 2024 +0200

        Properly interrupt expiration worker when asked

    commit 27a568eca7f2f3742223fef6285eaf80583e0fc4
    Author: Stijn Peeters <[email protected]>
    Date:   Tue Apr 30 13:40:50 2024 +0200

        Allow floats-as-strings as timestamps when importing CSV

    commit 2d2bbb9fdb9b426b8f4a80782f04257721a97f2e
    Author: Dale Wahl <[email protected]>
    Date:   Tue Apr 30 13:05:07 2024 +0200

        douyin: add consistency to map_item stats

    commit 289aa342c9912aceeca35887c079c72aa6ffbf52
    Author: Dale Wahl <[email protected]>
    Date:   Mon Apr 29 15:26:38 2024 +0200

        fix collection data in Douyin to handle $undefined

    commit 5b9b23fb1696bc1b69e1d902c0a2ad4b7d168984
    Author: Dale Wahl <[email protected]>
    Date:   Mon Apr 29 13:00:03 2024 +0200

        add scipy requirement to make compatible with gensim

        https://stackoverflow.com/questions/78279136/importerror-cannot-import-name-triu-from-scipy-linalg-gensim

    commit 7eab746e944f1ababe3dcd6a5d25387a64c2237d
    Author: Stijn Peeters <[email protected]>
    Date:   Mon Apr 29 12:00:09 2024 +0200

        stupid, stupid, stupid

    commit 90577982ac05019a7ac76818a62f91e84dd65902
    Author: Stijn Peeters <[email protected]>
    Date:   Mon Apr 29 11:56:22 2024 +0200

        Fix leftover iterate_mapped_items

    commit 57dbdf74c49c34c05784debb9f7e258da7ae7d54
    Author: Stijn Peeters <[email protected]>
    Date:   Fri Apr 26 15:26:39 2024 +0200

        Woops

    commit f11760d2c13e817e23cfa5e26b24f74cf817f65e
    Author: Stijn Peeters <[email protected]>
    Date:   Fri Apr 26 15:26:04 2024 +0200

        Update list of supported platforms in readme

    commit 760ff1cdeb006f70acaa00ded82fb3cbc7617c9d
    Author: Stijn Peeters <[email protected]>
    Date:   Fri Apr 26 12:13:28 2024 +0200

        Bump version

    commit 1fd78b2362840299e80f5540c9fedc1be3b06da1
    Author: Stijn Peeters <[email protected]>
    Date:   Thu Apr 25 12:58:24 2024 +0200

        Use MissingMappedField for Douyin fields undefined in the source data

    commit 6918baeabc7a08b6a63495c5d38c86b2c88bca44
    Author: Stijn Peeters <[email protected]>
    Date:   Thu Apr 25 12:31:11 2024 +0200

        Fix Douyin mapping failure if cellRoom is $undefined

    commit aad6208167c07686348234daff4dcf9cd036f5a5
    Author: Stijn Peeters <[email protected]>
    Date:   Thu Apr 25 12:30:53 2024 +0200

        Better error when trying to import data for unknown datasource

    commit 43c6ed646994111188bde66d5bcfe4ab602e8512
    Author: Stijn Peeters <[email protected]>
    Date:   Thu Apr 25 12:30:31 2024 +0200

        Fix Twitter mapping on URLs that cannot be expanded

    commit 91c3da176fad90ba16871fa8892fac5a0df13785
    Author: Stijn Peeters <[email protected]>
    Date:   Thu Apr 25 12:12:54 2024 +0200

        Safe cast to int in CrowdTangle import

    commit 765f29e9232afdf284ab1667b0f371951e0bf2f4
    Author: Stijn Peeters <[email protected]>
    Date:   Wed Apr 24 12:37:02 2024 +0200

        Fix erroneous shell command in front-end restart trigger

    commit c99fdd9eca8f5925d93375cac846e8b7633194fb
    Merge: 342a4037 bc1deddf
    Author: Stijn Peeters <[email protected]>
    Date:   Tue Apr 23 12:29:35 2024 +0200

        Merge branch 'master' of https://github.com/digitalmethodsinitiative/4cat

    commit 342a4037411e7ccaa50b25a4686434bec39e2568
    Author: Stijn Peeters <[email protected]>
    Date:   Tue Apr 23 12:29:32 2024 +0200

        Enable TikTok comment and Gab import by default

    commit bc1deddf57aa5049fb79622c4309fb7051d77bdb
    Merge: 537d7645 3c644f01
    Author: Dale Wahl <[email protected]>
    Date:   Tue Apr 23 12:16:37 2024 +0200

        Merge branch 'master' of https://github.com/digitalmethodsinitiative/4cat

    commit 537d76456e2826e8c4dd7026ec5b2d436370fad8
    Author: Dale Wahl <[email protected]>
    Date:   Tue Apr 23 12:14:46 2024 +0200

        do the todo: fix column_filter to match exact/contains with int
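
A minimal sketch of the kind of comparison this commit message describes, assuming a hypothetical `matches` helper rather than the actual column_filter code: cell values are cast to `str` before comparing, so integer values can be matched against a query that arrives as text.

```python
def matches(value, query, mode="exact"):
    """Hypothetical helper: compare a cell value to a user-supplied query string."""
    # Cast to str first: a cell may hold an int (e.g. 42), while the
    # query typed into a form always arrives as a string ("42")
    text = str(value)
    if mode == "exact":
        return text == query
    if mode == "contains":
        return query in text
    raise ValueError(f"unknown match mode: {mode}")


# Illustrative rows only; the column name and values are made up
rows = [{"score": 42}, {"score": 142}, {"score": "n/a"}]
exact_hits = [row for row in rows if matches(row["score"], "42", "exact")]
contains_hits = [row for row in rows if matches(row["score"], "42", "contains")]
```

Without the cast, `42 == "42"` is simply `False` in Python and `"42" in 42` raises a `TypeError`, which is presumably the class of problem the commit addresses.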

    commit 3c644f01baeca34e712d36efdf5c77ccd3ef7a06
    Author: Stijn Peeters <[email protected]>
    Date:   Tue Apr 23 11:16:07 2024 +0200

        Don't crash on empty URLs in dataset merge

    commit f1574c26e2e3bdc40cc04bb8193cf6d3fa14792b
    Author: Dale Wahl <[email protected]>
    Date:   Thu Apr 18 12:08:55 2024 +0200

        fix: do not fail when no processor exists

        weird! failed on a dataset `type="custom-search"` which was created by an import script w/ no processor. Also likely would make deprecated processors fail.
        500 server error:
        ```
          File "/opt/4cat/common/lib/dataset.py", line 800, in get_columns
            return self.get_item_keys(processor=self.get_own_processor())
          File "/opt/4cat/common/lib/dataset.py", line 405, in get_item_keys
            keys = list(items.__next__().keys())
          File "/opt/4cat/common/lib/dataset.py", line 337, in iterate_items
            if own_processor.map_item_method_available(dataset=self):
        AttributeError: 'NoneType' object has no attribute 'map_item_method_available'
        ```
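
The traceback above comes from calling a method on a processor that turned out to be `None`. A minimal sketch of the kind of guard that avoids it, assuming hypothetical stand-in classes (`FakeProcessor`, `FakeDataset`) rather than 4CAT's actual `DataSet` implementation:

```python
class FakeProcessor:
    """Hypothetical stand-in for a processor class that can map items."""

    @staticmethod
    def map_item_method_available(dataset):
        return True

    @staticmethod
    def map_item(item):
        return {key: str(value) for key, value in item.items()}


class FakeDataset:
    """Hypothetical stand-in for a dataset; not 4CAT's DataSet class."""

    def __init__(self, items, processor=None):
        self.items = items
        self.processor = processor  # None when no processor exists

    def get_own_processor(self):
        return self.processor

    def iterate_items(self):
        own_processor = self.get_own_processor()
        # Check for None first, so a dataset without a processor
        # yields its raw items instead of raising AttributeError
        can_map = (own_processor is not None
                   and own_processor.map_item_method_available(dataset=self))
        for item in self.items:
            yield own_processor.map_item(item) if can_map else item


orphan = FakeDataset([{"id": 1}])
mapped = FakeDataset([{"id": 1}], processor=FakeProcessor)
orphan_items = list(orphan.iterate_items())
mapped_items = list(mapped.iterate_items())
```

Here the dataset without a processor falls back to its unmapped items; whether the real fix falls back the same way is not stated in the commit message.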

    commit 50a4434a37d71af6a9470c7fc4a236b043cbfb4d
    Author: Stijn Peeters <[email protected]>
    Date:   Wed Apr 17 14:30:58 2024 +0200

        Add "TikTok comments" data source

    commit c43e76daae3c2e6ecdb218ee749315b985eccca4
    Author: Stijn Peeters <[email protected]>
    Date:   Tue Apr 16 17:59:25 2024 +0200

        Allow notifications per tag

    commit 36984104e674e8577756bfc3fdd5c72f6569d9e1
    Author: Dale Wahl <[email protected]>
    Date:   Tue Apr 16 17:25:38 2024 +0200

        fix: pass dataset to get_options when queuing processors

    commit 59cb19a3c88f7f4a4ac02d0b7a891afde50ea069
    Author: Dale Wahl <[email protected]>
    Date:   Tue Apr 16 10:55:29 2024 +0200

        fix: dicts are shared in classes & you cannot delete a key more than once

        randomly found this; probably as no one else has reddit enabled!
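
A minimal sketch of the pitfall this commit message names, using hypothetical names (`BaseSource`, `options`, `api_key`) that are not taken from the 4CAT code: a dict defined on a class is a single object shared by all instances, so deleting a key through one instance removes it for all, and a second `del` raises `KeyError`.

```python
import copy


class BaseSource:
    # Class attribute: one dict object shared by every instance
    options = {"board": "all", "api_key": "secret"}


def get_options_unsafe(source):
    # Mutates the shared dict; a second call raises KeyError
    opts = source.options
    del opts["api_key"]
    return opts


def get_options_safe(source):
    # Work on a copy, and use pop() with a default so a missing key is tolerated
    opts = copy.deepcopy(source.options)
    opts.pop("api_key", None)
    return opts


first, second = BaseSource(), BaseSource()
safe_a = get_options_safe(first)
safe_b = get_options_safe(second)
shared_untouched = "api_key" in BaseSource.options

get_options_unsafe(first)
try:
    get_options_unsafe(second)
    second_delete_failed = False
except KeyError:
    second_delete_failed = True
```

Copying before mutating keeps the class-level definition intact, and `pop(key, None)` makes repeated removal harmless; either change alone would avoid the error in this sketch.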

    commit 3ec9c6ea471bcdbe9fb1caad1e5fe1502a705444
    Author: Dale Wahl <[email protected]>
    Date:   Mon Apr 15 13:22:19 2024 +0200

        fix results page error when dataset was being created; do not check for resultspage updates when user not focused on page

    commit db05ae5e565248e865e67b8ea60e6653357bb1f4
    Author: Dale Wahl <[email protected]>
    Date:   Mon Apr 15 11:27:33 2024 +0200

        on import file, differentiate between missing field(s) and unable to map item

    commit 940bac72c7e53bec9e136867c13e2a0a355961a4
    Author: Stijn Peeters <[email protected]>
    Date:   Fri Apr 12 12:57:48 2024 +0200

        Case-insensitive username/note matching in user list

    commit d0f34245bd07b5ad2fd3e90754ef0264ffc350a9
    Author: Stijn Peeters <[email protected]>
    Date:   Fri Apr 12 12:29:12 2024 +0200

        Only determine settings tab name in one place

    commit 9f69d7bc0bbb657be1e725d5fb3fe350b7205bff
    Author: Stijn Peeters <[email protected]>
    Date:   Fri Apr 12 12:20:34 2024 +0200

        git != github

    commit 9b4981d8c7358f31ed65d9f161d556e578389801
    Author: Stijn Peeters <[email protected]>
    Date:   Fri Apr 12 11:56:04 2024 +0200

        Fix issues with user tags

        Fix number of users in tag overview; allow filtering by user tags on user list; don't delete all user tags when deleting one

    commit 9e8ccd3a78765acdfd2005eaa215dc0dc07266e0
    Author: Stijn Peeters <[email protected]>
    Date:   Fri Apr 12 11:32:45 2024 +0200

        Do not hide all non-hidden child processors

        lol

    commit 3f15410af3a278f5644f41f49e25498a1fac3c76
    Author: Stijn Peeters <[email protected]>
    Date:   Fri Apr 12 11:23:52 2024 +0200

        Disable standard video downloader for Telegram

    commit 94c814b9cab2ae2be10d5c5d3f6cfe20898e349c
    Author: Stijn Peeters <[email protected]>
    Date:   Fri Apr 12 11:14:16 2024 +0200

        Telegram video downloader processor

    commit d36254a188947fff507e8df59f793e98b3be1570
    Author: Stijn Peeters <[email protected]>
    Date:   Fri Apr 12 11:14:04 2024 +0200

        Better styling for 4CAT settings, alphabetic order, submenus

    commit 808300fa109f306a921f2048b2cf4b6dafc4ba5f
    Author: Stijn Peeters <[email protected]>
    Date:   Thu Apr 11 14:44:32 2024 +0200

        Fix multiselect in UI

    commit 131a0eca0ad514b1ee57803e5c560ab0e56de42d
    Author: Stijn Peeters <[email protected]>
    Date:   Mon Apr 8 18:28:04 2024 +0200

        Do not attempt to load crashed file as module in Slack webhook. Fixes #422 (hopefully)

    commit 6d8cb067bc12f8be68749f74a7291e0849494225
    Author: Stijn Peeters <[email protected]>
    Date:   Fri Apr 5 19:43:58 2024 +0200

        Allow comma-separated list when adding new dataset owners

    commit 2612aea49f63c37ac691cc89c553c764ead2344f
    Author: Stijn Peeters <[email protected]>
    Date:   Fri Apr 5 19:40:04 2024 +0200

        Include number of users with tag on tag page

    commit 39f2ec40faa3b8493bd5525279aeaeb2e4f586e0
    Author: Stijn Peeters <[email protected]>
    Date:   Fri Apr 5 19:26:02 2024 +0200

        Fix confirmation before deleting user tag

    commit b00a410a3441e7f2a9d73a9f2dfb0f4ef70ea8a5
    Author: Stijn Peeters <[email protected]>
    Date:   Fri Apr 5 19:25:01 2024 +0200

        Add link to users with tag on tag admin page

    commit 3ef3e5ec9adbd8ddd128ce2b3f8fa3b1de1297e3
    Author: Stijn Peeters <[email protected]>
    Date:   Fri Apr 5 18:49:25 2024 +0200

        Give filtered datasets a more sensible label, based on source dataset

    commit 0d5870b78fb73cb58231736cc8a2efbb0b3cd88a
    Author: Dale Wahl <[email protected]>
    Date:   Fri Apr 5 17:40:57 2024 +0200

        update iterate methods (#418)

        * working to make iterate_mapped_item primary method used by processors and elsewhere in 4CAT; iterate_item method only internally (and provide item directly as is from file) with iterate_mapped_object as intermediate method to use map_missing method and handle missing values as well as warn if needed

        * switch from iterate_items to iterate_mapped_items; careful attention to item_to_yield allowing a choice of the original item, the mapped item, or both

        * revert some unnecessary renaming

        * fix annotations bug...

        this fixes the bug, but I noticed that the annotations saved in the database do not have the correct post IDs.

        * Introduce DatasetItem class and simplify iterate_items

        * Don't crash when no item mapper

        * ...actually commit the DatasetItem class

        * Fix typos in comment

        ---------

        Co-authored-by: Stijn Peeters <[email protected]>
        Co-authored-by: Sal Hagen <[email protected]>

    commit 17b77351c51ace21b7057276bbae9da2643a3fc4
    Author: Stijn Peeters <[email protected]>
    Date:   Fri Apr 5 16:20:19 2024 +0200

        Allow dynamic form options in processors (#397)

        * Allow dynamic form options in processors

        * Allow 'requires' on data source options as well

        * Handle list values with requires

        * Wider support for file upload in processors

        * Log file uploads in DMI service manager

        * fix error w/ datasources having file option

        * fix fourcat.js use of checkboxes for dynamic settings

        * Fix faulty toggleButton targeting

        ---------

        Co-authored-by: Dale Wahl <[email protected]>

    commit 693fcedc93ee4476a60d0e0876e688f82a8526fa
    Author: Dale Wahl <[email protected]>
    Date:   Fri Apr 5 15:59:10 2024 +0200

        Add method to processors to toggle display in UI (#411)

        * add ui_only parameter to DataSet.get_available_processors() and BasicProcessor.display_in_ui()

        Allow using `display_in_ui` to hide processors from UI but allow them to be queued either via API or presets. This avoids issue of is_compatible_with() having to be used to hide processors with sometimes ill effects.

        * keep same data structure....

        * don't delete twice; it's redundant... and raises an error

        * Rename arguments/properties

        * Exclude hidden processors in top level view

        * fix logic

        * Exclude in child template as well

        ---------

        Co-authored-by: Stijn Peeters <[email protected]>

    commit 3cd146c2908da6b3a06a0c1511bf042c4223af0f
    Author: Dale Wahl <[email protected]>
    Date:   Thu Apr 4 16:41:39 2024 +0200

        fix: whoops remove debug

    commit daa7291e813e62fed4600a4acb8430004836cb86
    Author: Dale Wahl <[email protected]>
    Date:   Thu Apr 4 15:16:30 2024 +0200

        CSV preview add hyperlinks if "url" or "link" in column header

    commit 5f2d6e65bad4f71b2c3cc75d2cdab76f15671d4c
    Author: Dale Wahl <[email protected]>
    Date:   Thu Apr 4 15:16:01 2024 +0200

        blip2 processor to work w/ DMI Service Manager

    commit fe881dec18778d99ac4a0f60ca40a1f43fdb1689
    Author: Dale Wahl <[email protected]>
    Date:   Thu Apr 4 09:53:30 2024 +0200

        catch AttributeError on slackhook if unable to read file

        ever vigilant against a lack of flavour...

    commit 2808256b1fabf2e6e8a5a94aad98af60c50fb7b0
    Merge: 14123847 eb474640
    Author: Dale Wahl <[email protected]>
    Date:   Wed Apr 3 17:28:40 2024 +0200

        Merge branch 'master' of https://github.com/digitalmethodsinitiative/4cat

    commit 14123847b5852bf0e7c84fced6c2380165ec93f6
    Author: Dale Wahl <[email protected]>
    Date:   Wed Apr 3 17:28:38 2024 +0200

        staging_areas should not be made for completed datasets (else they may be deleted prematurely)

    commit eb474640559ee3e914d9c95adb60be09b906f1d6
    Merge: bbdf2ab9 3f8b285c
    Author: sal-phd-desktop <[email protected]>
    Date:   Wed Apr 3 16:50:54 2024 +0200

        Merge branch 'master' of https://github.com/digitalmethodsinitiative/4cat

    commit bbdf2ab9b4292c14911ac01b481c829defa85e5c
    Author: sal-phd-desktop <[email protected]>
    Date:   Wed Apr 3 16:50:36 2024 +0200

        Helper script to export the 'classic' 4CAT 4chan data

    commit 3f8b285c44c33a3ce08e885889b311bc454a70ea
    Merge: 8f40f3f5 f7cc5b8d
    Author: Sal Hagen <[email protected]>
    Date:   Wed Apr 3 12:12:17 2024 +0200

        Merge branch 'master' of https://github.com/digitalmethodsinitiative/4cat

    commit 8f40f3f5222a63e93f46eb3b57791d10060a0cc8
    Author: Sal Hagen <[email protected]>
    Date:   Wed Apr 3 12:12:13 2024 +0200

        Tumblr search typo

    commit f7cc5b8d012dec3d8e0c8847ae16c662e82040b5
    Author: Stijn Peeters <[email protected]>
    Date:   Tue Apr 2 12:32:51 2024 +0200

        More/less flavour in restart worker

    commit 073587efc581adca0608988573ac83ea8b0c93d0
    Author: Dale Wahl <[email protected]>
    Date:   Wed Mar 27 14:15:27 2024 +0100

        create favicon.ico (remove from repo)

        be sure to keep webtool/static/img/favicon/favicon-bw.ico as basis

    commit 28d733d56204231f4089660ff61282174aac7aed
    Author: Dale Wahl <[email protected]>
    Date:   Wed Mar 27 09:44:45 2024 +0100

        add allow_access_request check to request-password page

        clicking it would only return the user to the login page anyway, but better not even show it

    commit 1f2cb77e3cb0fc9b5403da52aaa925b33089d18f
    Author: Dale Wahl <[email protected]>
    Date:   Wed Mar 27 09:37:51 2024 +0100

        fix can_request_access to use 4cat.allow_access_request option

    commit 0d66f11d3619af798d5acc41dbf4fe118b7ddad8
    Merge: 25825383 05b3fc07
    Author: Stijn Peeters <[email protected]>
    Date:   Tue Mar 26 17:54:48 2024 +0100

        Merge branch 'master' of https://github.com/digitalmethodsinitiative/4cat

    commit 2582538303e31470ed6bf8a01645f7b45af15e5d
    Author: Stijn Peeters <[email protected]>
    Date:   Tue Mar 26 17:54:45 2024 +0100

        More permissive timeout for pixplot

    commit 05b3fc0771ded10dc55db799e8f47e42add08d43
    Author: Dale Wahl <[email protected]>
    Date:   Tue Mar 26 14:01:59 2024 +0100

        remove redundant call of Path

    commit e4a93442efb84d73d6a4c9af9bc46a8f3e3fdda2
    Author: Stijn Peeters <[email protected]>
    Date:   Tue Mar 26 11:52:09 2024 +0100

        Include column with link description in Telegram mapping

    commit 876f4a4b6df51ec4b30a048c32191438b6778f90
    Author: Dale Wahl <[email protected]>
    Date:   Mon Mar 25 14:48:47 2024 +0100

        douyin handle image posts

    commit 81ad61baabaf965b1c848f55a80c23bd3e1a9000
    Author: Stijn Peeters <[email protected]>
    Date:   Mon Mar 25 08:01:44 2024 +0100

        Accept non-numeric IDs in Telegram image downloader

    commit a8b36dc5682df7c16e25474ea8fdbfc4f12f9d46
    Author: Stijn Peeters <[email protected]>
    Date:   Sun Mar 24 23:15:51 2024 +0100

        Ensure unique IDs for Telegram datasets

    commit 4a3e9ffee072c4d3efb7bfd8744369b46f19eef2
    Merge: 0c119130 d749237e
    Author: Stijn Peeters <[email protected]>
    Date:   Sun Mar 24 22:56:59 2024 +0100

        Merge branch 'master' of https://github.com/digitalmethodsinitiative/4cat

    commit 0c11913049aabb5a83ffe26d58bdf17affdbc0b9
    Author: Stijn Peeters <[email protected]>
    Date:   Sun Mar 24 20:09:10 2024 +0100

        Better string formatting in Telegram image downloader

    commit 8a7da5317defdafb5bdbf74dcbeb68e464fa21f4
    Author: Stijn Peeters <[email protected]>
    Date:   Sun Mar 24 20:06:06 2024 +0100

        Add 'link thumbnails' op…
stijn-uva merged commit 07094f8 into master on Sep 23, 2024
1 check passed
2 participants