Replies: 2 comments 5 replies
-
In the spirit of clarity: what's an immediate use case? Having that background information is going to be valuable for a discussion. The main pain point of aggressive scraping is that it can be quite taxing on websites if many people do it, especially at regular intervals, and that has to be factored into a continuous scraping solution.
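To make the load concern concrete, here is a minimal sketch of one way a continuous scraper could be kept from hitting a site too often: clamp the user-configured interval to a floor. The names `MIN_INTERVAL` and `next_run`, and the six-hour value, are assumptions for illustration only and are not part of stash.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical floor: never schedule the same site more often than this.
MIN_INTERVAL = timedelta(hours=6)


def next_run(last_ran: datetime, configured: timedelta) -> datetime:
    """Return the earliest time the next scrape may start.

    The configured interval is honoured unless it is shorter than
    MIN_INTERVAL, in which case the floor is used instead.
    """
    return last_ran + max(configured, MIN_INTERVAL)


if __name__ == "__main__":
    last = datetime(2000, 10, 31, 1, 30, tzinfo=timezone.utc)
    # A 5-minute setting is clamped up to the 6-hour floor.
    print(next_run(last, timedelta(minutes=5)).isoformat())
```

A floor like this only limits a single installation; it does nothing about many users scraping the same site at the same time of day.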
-
You describe the output of continuous scraping as arrays of JSON-encoded fragments, but then what does stash do with that output? It seems like your idea for continuous scraping assumes that stash can aggregate and store metadata for content that you haven't downloaded. However, the way stash is now, that kind of metadata scraping can only apply to performers. Every scene and image in stash is still tied to a file. I think there are plans to decouple scenes and images from files, but that needs to be done before something like continuous scraping can be considered. Even scraping everything from a single site, rather than just a specific scene, would need to be worked out before continuous scraping.
-
As of right now, scrapers are basically a lookup DB. They return scene/performer details based on an existing identifier - a name or a URL. The problem with that is you have to have at least some kind of set of performers/scenes/photos before you can scrape any more data.
What would be great to have is the ability for a scraper to specify actions that are run regularly: one that returns a list of performers, another one returning scenes, and another one for photos. Each action would receive the following input:
performers
{"lastRan": "2000-10-31T01:30:00.000Z"}
scenes
{"lastRan": "2000-10-31T01:30:00.000Z"}
photos
{"lastRan": "2000-10-31T01:30:00.000Z"}
All of those actions would be required to return only items that have been added or changed since the last time the scraper was run. The interval at which they are run should be a configuration option.
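As a rough sketch of what such an action could look like on the scraper side, the function below takes the `{"lastRan": ...}` input shown above and returns only the items that changed after that time. The `updatedAt` field on each item and the function names are assumptions made for this example; stash does not define them.

```python
from datetime import datetime


def parse_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 timestamp such as 2000-10-31T01:30:00.000Z."""
    # Older Python versions do not accept a trailing "Z", so spell out UTC.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def changed_since(items: list, action_input: dict) -> list:
    """Return the items added or changed after action_input["lastRan"].

    Assumes each item carries a hypothetical "updatedAt" timestamp.
    """
    last_ran = parse_timestamp(action_input["lastRan"])
    return [
        item for item in items
        if parse_timestamp(item["updatedAt"]) > last_ran
    ]


if __name__ == "__main__":
    performers = [
        {"name": "A", "updatedAt": "2000-10-30T00:00:00.000Z"},
        {"name": "B", "updatedAt": "2000-11-01T00:00:00.000Z"},
    ]
    new_items = changed_since(performers, {"lastRan": "2000-10-31T01:30:00.000Z"})
    print([p["name"] for p in new_items])
```

In practice the filtering would ideally happen on the remote site (for example via a "sorted by date" listing) so that the scraper does not have to fetch everything on each run.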