Note: We want to fully crawl the website again, not just refresh the content of the pages we already managed to crawl. Imagine a news website that adds new pages all the time: users will expect us to capture those brand-new pages, not just update the old pages we already scraped.
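To make the distinction concrete, here is a minimal sketch of what "fully crawl again" means in code: a BFS from `base_url` over the links found on each page, so pages we have never seen before get discovered. The names (`LinkGraph`, `fullCrawl`) are hypothetical, not from the codebase.

```typescript
// url -> outbound links found on that page (stand-in for live fetching)
type LinkGraph = Map<string, string[]>;

// Full re-crawl: BFS from baseUrl. Because we follow links rather than
// iterating over previously scraped URLs, brand-new pages are discovered.
function fullCrawl(baseUrl: string, graph: LinkGraph): Set<string> {
  const seen = new Set<string>([baseUrl]);
  const queue: string[] = [baseUrl];
  while (queue.length > 0) {
    const url = queue.shift()!;
    for (const link of graph.get(url) ?? []) {
      if (!seen.has(link)) {
        seen.add(link);
        queue.push(link);
      }
    }
  }
  return seen;
}
```

A "refresh only" pass, by contrast, would just re-fetch the keys we already stored and never see new URLs.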
Functions
Delete scrape results from {base_url} on {course_name}. See the existing delete functions for how this should work.
Add a boolean flag for “only scrape this website vs. also include other websites that are linked”, where website == base_url.
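A minimal sketch of how that flag could gate link-following: compare the link's hostname against `base_url`'s. The function name and signature are assumptions for illustration, not the actual API.

```typescript
// When onlyThisSite is true, follow a link only if it resolves to the same
// hostname as baseUrl; when false, any linked website may be scraped.
function shouldFollowLink(link: string, baseUrl: string, onlyThisSite: boolean): boolean {
  if (!onlyThisSite) return true;
  try {
    // Relative links ("/about") resolve against baseUrl before comparing.
    return new URL(link, baseUrl).hostname === new URL(baseUrl).hostname;
  } catch {
    return false; // skip malformed links
  }
}
```

Note this treats subdomains as different sites; whether `docs.example.com` counts as part of `example.com` is a product decision.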
or GitHub Actions for a cron job, equally good.
GitHub Actions will NOT work for a cron job here, because we would need to create a SEPARATE cron job for every single website ingested. So we do need Upstash, which lets us create many cron jobs programmatically as well as via its GUI.
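For the per-website option, Upstash QStash exposes a REST API for creating schedules: the destination URL goes in the path and the cron expression in an `Upstash-Cron` header. The sketch below builds such a request; our refresh endpoint's URL and query shape are assumptions about where the handler would live.

```typescript
// Build the QStash v2 request that would register one schedule per website.
// Kept as a pure builder so the network call stays a one-liner at the call site.
function buildScheduleRequest(qstashToken: string, refreshEndpoint: string, cron: string) {
  return {
    url: `https://qstash.upstash.io/v2/schedules/${encodeURIComponent(refreshEndpoint)}`,
    options: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${qstashToken}`,
        "Upstash-Cron": cron, // e.g. "0 3 * * *" = daily at 03:00
      },
    },
  };
}

// Hypothetical usage, one schedule per ingested base_url:
// const req = buildScheduleRequest(
//   process.env.QSTASH_TOKEN!,
//   "https://our-app.example.com/api/refresh?url=" + encodeURIComponent(baseUrl),
//   "0 3 * * *",
// );
// await fetch(req.url, req.options);
```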
OR
We just have ONE cron job that refreshes all websites at the same time every day. Manual refresh also available to end users on website GUI button click.
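The single-cron alternative can be sketched as one handler that walks every ingested `base_url` and triggers a full re-scrape; the same function can back the manual GUI button. `refreshSite` is a stand-in for whatever the real re-scrape entry point ends up being.

```typescript
// One daily job (or a GUI button click) refreshes every ingested website.
// A failure on one site must not abort the rest, so errors are collected.
async function refreshAllWebsites(
  baseUrls: string[],
  refreshSite: (baseUrl: string) => Promise<void>,
): Promise<{ ok: string[]; failed: string[] }> {
  const ok: string[] = [];
  const failed: string[] = [];
  for (const url of baseUrls) {
    try {
      await refreshSite(url);
      ok.push(url);
    } catch {
      failed.push(url);
    }
  }
  return { ok, failed };
}
```

The trade-off versus per-site schedules: simpler infrastructure, but every site refreshes at the same time and a long tail of slow crawls runs sequentially unless we parallelize.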
Start implementing the ‘re-scrape’ feature.