Is it imperative that the Bluesky firehose events be strictly processed in sequential order? #2586
Replies: 2 comments 5 replies
-
In short, you can process lots of record operations (aka "ops" or "events") in parallel! But there are some limitations and considerations there. I'll speak a little to what we've learned operating the Bluesky appview:
What we've done (roughly) is keep lots of queues. Each queue has some repositories assigned to it, and each op is placed onto the appropriate queue based on the repo that produced it. Each queue is then processed serially, while separate queues can be processed in parallel, so ops from the same repo always keep their relative order.
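To make the queue-per-repo idea concrete, here is a minimal sketch. It is not the appview's actual code: the op shape (a dict with `seq`, `repo`, and `action` keys), the `NUM_QUEUES` value, and the choice of CRC32 as the hash are all assumptions made for illustration.

```python
import zlib
from collections import defaultdict

NUM_QUEUES = 4  # hypothetical value; tune to the number of workers you run


def queue_for(repo_did: str, num_queues: int = NUM_QUEUES) -> int:
    """Map a repo DID to a stable queue index.

    crc32 is used (as an assumption) because it gives the same result
    across processes and restarts, unlike Python's built-in hash().
    """
    return zlib.crc32(repo_did.encode("utf-8")) % num_queues


def partition(ops, num_queues: int = NUM_QUEUES):
    """Place each op on the queue for its repo, preserving arrival order."""
    queues = defaultdict(list)
    for op in ops:
        queues[queue_for(op["repo"], num_queues)].append(op)
    return queues


# Hypothetical ops in firehose order.
ops = [
    {"seq": 1, "repo": "did:plc:alice", "action": "create"},
    {"seq": 2, "repo": "did:plc:bob", "action": "create"},
    {"seq": 3, "repo": "did:plc:alice", "action": "delete"},
]
queues = partition(ops)
```

Because every op for a given repo lands on the same queue, a worker that drains one queue serially sees that repo's ops in their original order, while other queues can be drained by other workers at the same time.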
-
Unrelated to your question, but this sounds a bit like how https://brid.gy/ (Bridgy classic) can search for links to a given web site and backfeed them as webmentions (see https://brid.gy/about#which). Right now it only does that for Reddit, not Bluesky, but it'd be easy to do it for Bluesky too: snarfed/bridgy#1576
-
My use case is that I'll be building an application that tracks posts containing domain-specific URLs, along with any quote posts or reposts of that 'original' post. For each of these posts, I'll be tracking the user profile (fetched via the API) and then tracking any updates to that profile (handle changes, profile update events, account deactivations, etc.).
The simplest solution would be to process each event according to its sequence, and that might be good enough for now or for a few months. However, I want to build something that can scale from the outset, whether Bluesky traffic increases or the number of domains I need to track increases (the two are not mutually exclusive). I plan to scale this via dedicated event-handling processors (e.g. a processor for new posts, a processor for quote posts, etc.).
Let's assume for now that new posts outnumber quote post events 5:1. That means I'll probably want five times as much processing capacity for new posts as for quote posts.
My question is this: a quote post or repost depends on a specific order of events (the original post must be processed first, hence it's a bounded event), whereas an original post is not bounded (because it exists in isolation). Can I therefore process new posts as soon as they arrive, without waiting for other processors (e.g. reposts) to work through events in strict order? Reposts and quote posts would be bounded by the last processed event cursor of the new-posts processor, which prevents a quote post from existing before the original post is processed (I plan to hydrate quote posts and reposts with data from the original post).
Similarly, the deletion and undo-repost processors would be bounded by the last processed event cursor of the repost and quote-post processors, which prevents the case where a repost is deleted before it has actually been processed.
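To make the scheme I'm describing concrete, here is a rough sketch of the cursor-bounded chain (new posts, then quotes/reposts, then deletions). Everything in it is hypothetical: the `Processor` class, the `kind` and `seq` fields, and the assumption that events arrive sorted by `seq` are made up for illustration and are not taken from any Bluesky library.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Processor:
    name: str
    kinds: set                               # event kinds this processor handles
    upstream: Optional["Processor"] = None   # processor whose cursor bounds this one
    cursor: int = 0                          # seq of the last event this processor has passed
    handled: list = field(default_factory=list)

    def step(self, events) -> int:
        """Advance as far as the upstream cursor allows; return how many events were handled."""
        limit = self.upstream.cursor if self.upstream else float("inf")
        count = 0
        for ev in events:                    # events are assumed sorted by seq
            if ev["seq"] <= self.cursor:
                continue                     # already passed on an earlier step
            if ev["seq"] > limit:
                break                        # upstream has not reached this seq yet
            if ev["kind"] in self.kinds:
                self.handled.append(ev["seq"])
                count += 1
            self.cursor = ev["seq"]          # advance past events of other kinds too
        return count


events = [
    {"seq": 1, "kind": "post"},
    {"seq": 2, "kind": "quote"},
    {"seq": 3, "kind": "post"},
    {"seq": 4, "kind": "quote"},
    {"seq": 5, "kind": "delete"},
]

posts = Processor("posts", {"post"})
quotes = Processor("quotes", {"quote", "repost"}, upstream=posts)
deletes = Processor("deletes", {"delete"}, upstream=quotes)

posts.step(events[:3])   # the posts processor has only seen seq 1..3 so far
quotes.step(events)      # bounded: cannot go past the posts cursor
deletes.step(events)     # bounded: cannot go past the quotes cursor
```

One detail the sketch depends on: each cursor advances past events the processor does not handle. If the new-posts cursor only moved when a post arrived, a quote post could stall until the next unrelated post came through.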
Hopefully my question makes sense. If this is an incorrect approach, is there any other approach I should consider instead?