squash doc history in separate task #58

Open
wants to merge 2 commits into main
Conversation

@dlqqq (Contributor) commented Nov 28, 2022

Squashes document history in a separate scheduled task rather than on write. This will significantly improve performance for document writes in the case where the time since last update exceeds the document TTL.

Open concerns

  • Is there any good way of stubbing out asyncio.sleep() in the unit tests? Testing this is tricky because it's not as simple as doing unittest.mock.patch("asyncio.sleep"), since that causes the document history to be squashed on every write. What I'm looking for is a utility that lets me imperatively advance the event loop clock, e.g. loop.advance(123). I found asynctest, but it doesn't seem to work and is abandoned.
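One possible approach (not from this PR): patch asyncio.sleep with a small virtual clock that supports an imperative advance(). This is a minimal sketch, and it assumes the code under test looks up sleep as asyncio.sleep via the module attribute, which is what unittest.mock.patch can intercept:

```python
import asyncio
import heapq

_real_sleep = asyncio.sleep  # keep a handle on the real sleep before patching


class VirtualClock:
    """A fake asyncio.sleep driven by a manually advanced clock, giving
    tests an imperative advance() without waiting in real time."""

    def __init__(self):
        self.time = 0.0
        self._timers = []  # heap of (deadline, seq, future)
        self._seq = 0

    async def sleep(self, delay, result=None):
        if delay <= 0:
            await _real_sleep(0)
            return result
        fut = asyncio.get_running_loop().create_future()
        heapq.heappush(self._timers, (self.time + delay, self._seq, fut))
        self._seq += 1
        await fut
        return result

    async def advance(self, seconds):
        """Move the clock forward and wake every expired sleeper."""
        self.time += seconds
        while self._timers and self._timers[0][0] <= self.time:
            *_, fut = heapq.heappop(self._timers)
            if not fut.done():
                fut.set_result(None)
            await _real_sleep(0)  # give the woken coroutine a chance to run
```

Usage inside an async test would look roughly like this (ystore and its write are this PR's objects):

```python
from unittest.mock import patch

clock = VirtualClock()
with patch("asyncio.sleep", clock.sleep):
    # ... trigger a write so the squash task starts "sleeping" ...
    await clock.advance(ystore.document_ttl + 1)  # squash fires deterministically
```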

@davidbrochart (Collaborator)

I like that we don't need to query the last timestamp anymore. This should now make it easy to implement document TTL for FileYStore. Do you want to do it?
One thing I'm concerned about is what happens if there is a write at the same time as we're squashing updates? Could you add a test for that?

@dlqqq (Contributor, Author) commented Nov 29, 2022

@davidbrochart

I like that we don't need to query the last timestamp anymore. This should now make it easy to implement document TTL for FileYStore. Do you want to do it?

Oh yeah, totally forgot about that. Let me do so.

One thing I'm concerned about is what happens if there is a write at the same time as we're squashing updates? Could you add a test for that?

It should be exactly the same behavior as a write occurring after a squash, since aiosqlite essentially just stores SQL statements in a queue and pulls the next one whenever it has finished executing the current one. But yes, I can add a test for this.
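As a self-contained illustration of that queueing behavior (not the PR's actual test), two tasks sharing one aiosqlite connection get their statements executed one at a time:

```python
import asyncio
import aiosqlite


async def main():
    async with aiosqlite.connect(":memory:") as db:
        await db.execute("CREATE TABLE updates (data TEXT)")

        # Both coroutines submit statements "at the same time"; aiosqlite
        # runs the connection in a single worker thread fed from a queue,
        # so the two statements execute one after the other.
        async def write():
            await db.execute("INSERT INTO updates VALUES ('new update')")

        async def squash():
            await db.execute("DELETE FROM updates")

        await asyncio.gather(squash(), write())

        async with db.execute("SELECT data FROM updates") as cursor:
            # Task scheduling order, not statement atomicity, decides
            # whether the new update survives the squash.
            print(await cursor.fetchall())


asyncio.run(main())
```

One caveat: if the squash spans several statements, a concurrent write can still land between them unless they run inside a single transaction.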

@dlqqq (Contributor, Author) commented Nov 29, 2022

@davidbrochart Wait, we still need a timestamp if we want to handle the case where the server is not kept alive during a document's TTL. Take this scenario:

  • add a bunch of updates
  • shut down the server
  • repeat

In this scenario, the document history grows without bound. I think we need to handle this case as it's fairly common for users running Jupyter locally on a laptop (in which case disk space is also tight), but I'd like @ellisonbg's opinion on this matter as well.

@dlqqq (Contributor, Author) commented Nov 29, 2022

To handle the above case, we'd need to squash on __init__() and do the same query we were doing before.
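A rough sketch of that init-time check; the yupdates table layout and the injected squash callable are assumptions for illustration, not the store's actual schema or API:

```python
import time


async def squash_on_init(db, path: str, document_ttl: float, squash):
    """At startup, squash a document's history if its newest update is older
    than the TTL. Assumed schema: yupdates(path, yupdate, timestamp);
    `squash` is an async callable that collapses the rows for `path`."""
    cursor = await db.execute(
        "SELECT MAX(timestamp) FROM yupdates WHERE path = ?", (path,)
    )
    (last_updated,) = await cursor.fetchone()
    if last_updated is not None and time.time() - last_updated > document_ttl:
        await squash(path)
```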

@ellisonbg

I agree we do need the timestamp in cases where the server restarts, but why wouldn't we need it otherwise? I am trying to understand the logic here. My idea of how this would work is this:

  • Every so often (configurable), the background task checks how long ago the last update for a document was made.
  • If that interval is longer than the TTL (also configurable), it deletes the history.

This logic is needed to ensure that the cleanup doesn't happen while users are editing the document. A sketch of this design follows.
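A minimal sketch of this periodic-check design; the store methods used here (list_documents, get_last_update_time, squash) are hypothetical:

```python
import asyncio
import time


async def cleanup_loop(store, poll_interval: float, document_ttl: float):
    """Background task: every poll_interval seconds, squash the history of
    any document whose last update is older than document_ttl."""
    while True:
        await asyncio.sleep(poll_interval)
        for path in await store.list_documents():          # hypothetical API
            last = await store.get_last_update_time(path)  # hypothetical API
            if last is not None and time.time() - last > document_ttl:
                await store.squash(path)                   # hypothetical API
```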

@ellisonbg

In the current state of the code, is the idea that you simply count the number of updates to the doc, and if it hasn't changed within the TTL, you collapse?

@dlqqq (Contributor, Author) commented Nov 29, 2022

@ellisonbg The implementation in this PR basically just creates a task that sleeps for the duration of the document TTL and then squashes the document history. Any updates cancel the existing task and create a new task, so there's no need for a process that repeatedly queries the database, and thus no need for the timestamp in this case. This is how cron is typically implemented.
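To make that concrete, the cancel-and-reschedule pattern boils down to something like this (a sketch, not the PR's actual class):

```python
import asyncio


class SquashScheduler:
    """Debounce-style squashing: each write cancels any pending squash
    and schedules a new one document_ttl seconds in the future."""

    def __init__(self, document_ttl: float, squash):
        self.document_ttl = document_ttl
        self.squash = squash  # async callable that squashes the history
        self._task = None

    def on_write(self):
        if self._task is not None and not self._task.done():
            self._task.cancel()
        self._task = asyncio.create_task(self._squash_after_ttl())

    async def _squash_after_ttl(self):
        try:
            await asyncio.sleep(self.document_ttl)
            await self.squash()
        except asyncio.CancelledError:
            pass  # superseded by a newer write; the new task takes over
```

As long as writes keep arriving within the TTL, the pending squash keeps being pushed back, so nothing polls the database while the server stays up.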

@ellisonbg

I think the key question here is this: when is it safe to squash the history?

Here is my understanding (@davidbrochart please add more details or correct any inaccuracies here). When the first user opens a document, the full state of the document is loaded into the memory of the server. As that user makes changes, updates are 1) forwarded to other users, 2) stored in the db, 3) applied to the in-memory document on the server. The server is unable to distinguish between users closing the document and network disruptions (both appear as network disconnections). The database is used when a client disconnects and reconnects to get them any missing updates. When all clients have been disconnected for a certain amount of time (I think 1m by default), the state of the document is flushed from memory. The db is also used the next time a user opens the document after it has been flushed from memory.

The failure mode of squashing history is that we would squash updates that are needed by a user. Let's imagine the following scenario:

  • N users have the document open and are editing it.
  • User 1 closes their laptop and goes to lunch at 12 noon local time. This is registered as a client disconnect.
  • While user 1 is at lunch, other users edit the document and generate new updates.
  • At 12:59, updates are squashed.
  • At 1:00, user 1 returns and opens their laptop. At this time, the updates needed by that user no longer exist and have been squashed.

A few questions:

  • What will actually happen in this case? Is everything smart enough to detect this and still reassemble the current state of the document?
  • To cover situations like this safely, do we need to run the history squashing during periods of long inactivity? We could determine this by waiting some (long) period of time after all clients disconnect.
  • Another way we could mitigate this risk is to do more granular history squashing. For example, rather than squashing all the history, we could squash the history into days, and only for days older than 1 week. That way, users would always have the full history for a week, and then daily beyond that (see the sketch after this list).
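A purely illustrative sketch of that bucketing idea (timestamps assumed to be POSIX seconds; the actual squashing of each bucket is left abstract):

```python
from datetime import datetime, timedelta, timezone


def bucket_updates_by_day(updates, now=None, keep_days=7):
    """Group updates older than keep_days into one bucket per calendar day;
    newer updates are kept untouched. `updates` is a list of (timestamp, data).
    Each day's bucket would then be squashed into a single update."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=keep_days)
    recent, buckets = [], {}
    for ts, data in updates:
        when = datetime.fromtimestamp(ts, timezone.utc)
        if when >= cutoff:
            recent.append((ts, data))
        else:
            buckets.setdefault(when.date(), []).append(data)
    return buckets, recent
```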

@davidbrochart (Collaborator)

As that user makes changes, updates are 1) forwarded to other users, 2) stored in the db, 3) applied to the in-memory document on the server.

That's correct, just mentioning that 3) is the cause of 1) and 2). We apply changes from the clients to the "server peer", and listen to changes made on this peer to broadcast them to clients and store them.

The database is used when a client disconnects and reconnects to get them any missing updates.

Yes, but only if no client has been accessing the document for some time. If there is at least one client accessing the document, the missing updates will be sent to new clients from memory, as the in-memory document would still be present.

  • At 1:00, user 1 returns and opens their laptop. At this time, the updates needed by that user no longer exist and have been squashed.

Unless the document is still in memory, in which case user 1 will get the full history of changes. Otherwise, yes, they will get a document with no history, but I think this is the desired behavior.
The document is kept in memory if at least one user is accessing it, and for some time after all clients disconnect from this document.

If we agree on this, I think your questions don't apply anymore, right?

@davidbrochart (Collaborator) commented Nov 30, 2022

  • At 1:00, user 1 returns and opens their laptop. At this time, the updates needed by that user no longer exist and have been squashed.

Oh I see your point. Let's assume that updates have been flushed from memory to disk, and squashed.

  • What will actually happen in this case. Is everything smart enough to detect this and still reassemble the current state of the document?

No, I don't think it will give user 1 their missing updates. It will give them a big update consisting of the whole document, leading to content duplication.

  • To cover situations like this safely, do we need to run the history squashing during periods of long inactivity. We could determine this by waiting some (long) period of time after all clients disconnect.

Yes, the point being to make sure that all clients have closed their document in the front-end, i.e. willingly disconnected.

  • Another way we could mitigate this risk is to do more granular history squashing. For example, rather than squashing all the history, we could squash the history into days, and only for days older than 1 week. That way, users would always have the full history for a week, and then daily beyond that.

True, although this will never eliminate the risk.

Actually, the real solution would be to do what Kevin once suggested: introduce the notion of a document session. If the user reconnects with a document session that no longer exists in the back-end, invalidate the shared document in the front-end and recreate it from the back-end.
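A sketch of what that reconnect handshake could look like; every name here (get_session_id, reset_document, sync_missing_updates) is hypothetical, since no such API exists yet:

```python
async def on_reconnect(client, server_store, doc_path, client_session_id):
    """Hypothetical handshake for the 'document session' idea."""
    server_session_id = await server_store.get_session_id(doc_path)
    if server_session_id != client_session_id:
        # The back-end no longer has the session the client synced against:
        # throw away the front-end doc and rebuild it from the back-end.
        await client.reset_document(doc_path)
    else:
        await client.sync_missing_updates(doc_path)  # normal incremental sync
```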

@dlqqq (Contributor, Author) commented Dec 1, 2022

@davidbrochart Revision summary:

  • Fixed test flakiness by awaiting the squash task directly in tests rather than doing await asyncio.sleep(ystore.document_ttl + 0.1) (see the sketch after this list)
  • Added test for behavior with a simultaneous write occurring at document TTL
    • Not sure how useful this is, given that this tests the execution order of tasks scheduled to run at the same tick of the event loop, which is likely undefined behavior in Python. Right now, it appears that an update written at the document TTL is included in the squash.
  • Squash updates on init by doing the same timestamp query we were doing before
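For illustration, awaiting the squash task directly looks roughly like this; _squash_task and count_updates() are hypothetical stand-ins for the PR's actual attributes, and the test assumes pytest-asyncio:

```python
import pytest


@pytest.mark.asyncio
async def test_history_squashed_after_ttl(ystore, test_update):
    await ystore.write(test_update)
    await ystore.write(test_update)

    # Await the pending squash task directly instead of sleeping past the
    # TTL in real time, which was the source of the flakiness.
    await ystore._squash_task  # hypothetical attribute

    assert await ystore.count_updates() == 1  # hypothetical helper
```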

@davidbrochart (Collaborator) left a comment

Awesome, thanks a lot David.
I'll leave it open so that others can review.
I guess we can implement TTL for FileYStore later.

@fperez commented Feb 22, 2023

Thanks for tagging this issue in #14031 @davidbrochart - I am not familiar with all the details, but it does seem that some very robust logic for invalidation and full document refresh in the frontend is necessary. Google Docs evidently does that on occasion - I've had it tell me that my document was stale and force a reload, after refusing any more edits into a doc.

I'm spitballing here, not knowing the implementation details, but it seems to me that any squashing should be done "behind a barrier" - a point in the history that defines a fixed state that updates can't traverse at any granularity. Anyone needing to sync state across that barrier simply needs to re-sync to the state at the barrier, and then get the updates after that.

I'll come to the team meeting tomorrow and hopefully we can discuss this a bit. I was willing to dogfood RTC in production for a bit, but at this point I may need to pull the plug unless we find a viable solution path quickly, as I can't really tolerate any data loss/damage in that environment. It would be a shame to lose the opportunity of this environment for real-world feedback on various aspects of UX and performance, so 🤞 that we can figure this out :)
