Web scrape "automatic refresh" option #39

Open
KastanDay opened this issue Aug 14, 2023 · 1 comment
@KastanDay
Member

Start implementing the ‘re-scrape’ feature.

Note: We want to fully crawl the website again, not just refresh the content of the pages we already managed to crawl. Imagine a news website that publishes new pages all the time; users will expect us to capture those brand-new pages, not just update the old pages we already scraped.

Functions

  1. Delete scrape results from {base_url} on {course_name}. See the existing delete functions for how this should work.
  2. Add a boolean flag for “only scrape this website vs. include other websites that are linked”, where website == base_url. (A rough sketch of both pieces follows below.)
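
A minimal sketch of how these two pieces could fit together, purely as an assumption: the endpoint paths, parameter names, and helper functions below are placeholders, not existing project code.

```ts
// Hypothetical sketch of the re-scrape flow described above.
// All endpoints and names are assumptions; adapt to the real delete/scrape routes.

interface RescrapeOptions {
  courseName: string;
  baseUrl: string;
  /** false = stay on base_url only; true = also follow links to other websites */
  followExternalLinks: boolean;
}

// Assumed delete endpoint, modeled on the existing delete functions mentioned above.
async function deleteScrapeResults(courseName: string, baseUrl: string): Promise<void> {
  await fetch("/api/delete-scrape-results", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ course_name: courseName, base_url: baseUrl }),
  });
}

// Assumed crawl endpoint, with the boolean flag from item 2 passed through.
async function crawlWebsite(opts: RescrapeOptions): Promise<void> {
  await fetch("/api/scrape", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      course_name: opts.courseName,
      url: opts.baseUrl,
      stay_on_base_url: !opts.followExternalLinks,
    }),
  });
}

// Full re-scrape: wipe old results first, then crawl the whole site from scratch
// so brand-new pages are picked up, not just previously seen ones.
export async function rescrapeWebsite(opts: RescrapeOptions): Promise<void> {
  await deleteScrapeResults(opts.courseName, opts.baseUrl);
  await crawlWebsite(opts);
}
```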
@KastanDay
Member Author

KastanDay commented Aug 14, 2023

Maybe use this (on front-end side...): https://upstash.com/blog/qstash-periodic-data-updates

GitHub Actions will NOT work for the cron job, because we would need to create a SEPARATE new cron job for every single website ingested. So we do need Upstash to easily create many cron jobs programmatically, and via its GUI.
OR
We have just ONE cron job that refreshes all websites at the same time every day, with a manual refresh also available to end users via a button in the website GUI.
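
For the per-website option, a hedged sketch of creating one QStash schedule per ingested site. The destination URL, cron time, and payload are assumptions, and the call shape follows the @upstash/qstash Node SDK as I understand it; verify against Upstash's current docs before relying on it.

```ts
// Rough sketch only: one QStash schedule per ingested website, so each site
// gets its own daily re-scrape. SDK method/field names should be double-checked.
import { Client } from "@upstash/qstash";

const qstash = new Client({ token: process.env.QSTASH_TOKEN! });

// "/api/rescrape" is an assumed endpoint that would run the delete + re-crawl flow above.
export async function scheduleDailyRescrape(courseName: string, baseUrl: string) {
  await qstash.schedules.create({
    destination: "https://our-app.example.com/api/rescrape",
    cron: "0 3 * * *", // every day at 03:00 UTC
    body: JSON.stringify({ course_name: courseName, base_url: baseUrl }),
  });
}
```

The simpler single-cron alternative would instead create one schedule pointing at an endpoint that loops over every ingested website and re-scrapes each in turn.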
