Web scrape "automatic refresh" option #39

Open
KastanDay opened this issue Aug 14, 2023 · 1 comment
@KastanDay
Member

Start implementing the ‘re-scrape’ feature.

Note: We want to fully crawl the website again, not just refresh the content of the pages we already managed to crawl. Imagine a news website that publishes new pages all the time; users will expect us to capture those brand-new pages, not just update the old pages we already scraped.

Functions

  1. Delete scrape results from {base_url} on {course_name}. See the existing delete functions for how this should work.
  2. Add a boolean flag for “only scrape this website vs. include other websites that are linked”, where website == base_url. (A rough sketch of both pieces follows below.)
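
A minimal sketch of how these two pieces could fit together, purely as an assumption: the endpoint paths, parameter names, and helper functions below are placeholders, not existing project code.

```ts
// Hypothetical sketch of the re-scrape flow described above.
// All endpoints and names are assumptions; adapt to the real delete/scrape routes.

interface RescrapeOptions {
  courseName: string;
  baseUrl: string;
  /** false = stay on base_url only; true = also follow links to other websites */
  followExternalLinks: boolean;
}

// Assumed delete endpoint, modeled on the existing delete functions mentioned above.
async function deleteScrapeResults(courseName: string, baseUrl: string): Promise<void> {
  await fetch("/api/delete-scrape-results", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ course_name: courseName, base_url: baseUrl }),
  });
}

// Assumed crawl endpoint, with the boolean flag from item 2 passed through.
async function crawlWebsite(opts: RescrapeOptions): Promise<void> {
  await fetch("/api/scrape", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      course_name: opts.courseName,
      url: opts.baseUrl,
      stay_on_base_url: !opts.followExternalLinks,
    }),
  });
}

// Full re-scrape: wipe old results first, then crawl the whole site from scratch
// so brand-new pages are picked up, not just previously seen ones.
export async function rescrapeWebsite(opts: RescrapeOptions): Promise<void> {
  await deleteScrapeResults(opts.courseName, opts.baseUrl);
  await crawlWebsite(opts);
}
```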
@KastanDay
Member Author

KastanDay commented Aug 14, 2023

Maybe use this (on front-end side...): https://upstash.com/blog/qstash-periodic-data-updates

GitHub Actions will NOT work for the cron job, because we would need to create a SEPARATE new cron job for every single website ingested. So we do need Upstash to easily create many cron jobs programmatically, and via its GUI.
OR
We have just ONE cron job that refreshes all websites at the same time every day, with a manual refresh also available to end users via a button in the website GUI.
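
For the per-website option, a hedged sketch of creating one QStash schedule per ingested site. The destination URL, cron time, and payload are assumptions, and the call shape follows the @upstash/qstash Node SDK as I understand it; verify against Upstash's current docs before relying on it.

```ts
// Rough sketch only: one QStash schedule per ingested website, so each site
// gets its own daily re-scrape. SDK method/field names should be double-checked.
import { Client } from "@upstash/qstash";

const qstash = new Client({ token: process.env.QSTASH_TOKEN! });

// "/api/rescrape" is an assumed endpoint that would run the delete + re-crawl flow above.
export async function scheduleDailyRescrape(courseName: string, baseUrl: string) {
  await qstash.schedules.create({
    destination: "https://our-app.example.com/api/rescrape",
    cron: "0 3 * * *", // every day at 03:00 UTC
    body: JSON.stringify({ course_name: courseName, base_url: baseUrl }),
  });
}
```

The simpler single-cron alternative would instead create one schedule pointing at an endpoint that loops over every ingested website and re-scrapes each in turn.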
