Persistent Checkpointing PR (#2184) #3406

theIDinside · 2022-12-01T11:23:51Z

This is the first version of a PR that attempts to provide the functionality requested by issue #2184

rr create-checkpoints -i <some_interval> [-s <some_start_event>, -e <some_end_event>] where last two params are optional.
rr replay -g <evt> spawns the session from the most recent PCP before <evt>. It also spaws the most recent PCP, if -f <pid> is used, i.e. it finds when <pid> is created and spawns first PCP before that.
rr replay uses PCP during reverse execution
rr rerun uses PCP as well

Both the replay and rerun command now takes --ignore-pcp to ignore any PCP's and I've made spawning from PCP the default behavior of both commands.

2 commands has also been added to the spawned GDB; write-checkpoints and load-checkpoints.

The last point about persistent checkpoints being created at record time is not provided by this PR, but I'm willing to attempt to add that in a future PR, now that I have a little insight into how this would/could/maybe should work.

At this time, little to no optimizations are performed. Each mapping in the process address space is serialized to disk and it is currently not compressed in any way. Compressing data that goes into anonymous mappings should be fairly simple to implement as this data will get copied into memory during restore of a PCP, while file backed mappings (like executable data for instance) can not be compressed as easily. One wants to map as much file backed as possible, as this is not necessarily committed to physical memory immediately, which is the case with copying data into mappings.

Other optimizations that possibly could be done, is to instead of creating each checkpoints "from scratch", is to during restore of PCP's, reconstitute the first one (at event N), then when reconstituting the following checkpoint, fork the first and make the changes required to that one. As it stands right now, it creates a new session for each checkpoint. Theoretically this consumes more memory. Forking checkpoint N+1 from N and changing the address space where needed, I think would mean that less memory is used, I think.

Also, if anybody has any ideas on how one could possibly write tests for something like this, they would be most welcome to share those thoughts with me.

khuey

This is very much not a complete review, but I do have some early comments and questions that I want to pose to you.

Overall this looks pretty good and more or less like what I would expect.

src/Event.h

src/Event.cc

src/Task.h

src/ReplayTimeline.h

src/ReplayTimeline.cc

src/stream_util.cc

CMakeLists.txt

khuey · 2022-12-30T23:07:11Z

src/ReplaySession.cc

+                              index == std::string::npos ? 0 : index + 1);
+      AutoRestoreMem mem(remote, name.c_str());
+      remote.infallible_syscall(syscall_number_for_prctl(leader->arch()),
+                                PR_SET_NAME, mem.get());


This should be shared with Task::copy_state somehow.

Refactored this out into Task::set_name as per commit b29ff57

Also removed the call to update_prname in Task::copy_state as it's unnecessary - we already know the prname, so the new Task::set_name just sets Task::prname to the parameter passed in

khuey · 2022-12-30T23:26:11Z

src/rr_pcp.capnp

+  offset @7 :UInt64;
+  mapType :union {
+    file :group { # mapping of a file
+      skipMonitoringMappedFd @8 :Bool; # unsure if needed, or how it's used


This is used to communicate the decision about whether or not to monitor an mmaped file during recording (see monitor_this_fd in record_syscall.cc's process_mmap). Whether or not there's a monitor affects the behavior of the syscallbuf so it has to be the same in recording and replay.

I don't think you need this for PCPs because you are separately restoring the state of the FdTable.

Removed it.

khuey · 2022-12-30T23:30:00Z

src/AddressSpace.h

+
+  void set_uses_syscall_buffer(bool uses_syscall_buffer = true) {
+    syscallbuf_enabled_ = uses_syscall_buffer;
+  }


I think we'd benefit from a method on AddressSpace that restores auxv, do_breakpoint_fault_addr_, and syscallbuf_enabled_ all together.

khuey · 2023-04-21T16:27:48Z

This has been languishing for a while due to it's sheer size. To move this forward, I propose we just review the parts of it that integrate with the rest of rr to make sure it won't regress anything, and then take it. We can treat the checkpoint format as a work-in-progress that we can update at any time, and figure out the testing situation as we go. Does that sound reasonable @rocallahan?

rocallahan · 2023-04-21T22:49:05Z

Sure

theIDinside · 2023-04-24T22:55:16Z

This has been languishing for a while due to it's sheer size. To move this forward, I propose we just review the parts of it that integrate with the rest of rr to make sure it won't regress anything, and then take it. We can treat the checkpoint format as a work-in-progress that we can update at any time, and figure out the testing situation as we go. Does that sound reasonable @rocallahan?

I will get to the minor changes you requested previously, as soon as possible. I'm also open to any input about literally anything. Design is not my strongest suit, which will probably become apparent when looking at the code. And though "it seems to work" is not exactly a great metric (it's terrible, maybe?) I have used this feature quite a bit on my own and haven't noticed any issues so far.

Most of what I wrote, was written in a way to not change the original RR code (except the very few parts that required it, like moving internal class declarations of structs). This probably means that there exists code duplication, and some "patch work"/hacks, if that makes any sense.

I'm still at a loss for how one would test this, the best idea I've came up with so far (haven't implemented it though) is to write python scripts that can drive GDB, and do things like set checkpoint, serialize, kill process, deserialize, verify process space, etc. That would be quite a "chunky" end-to-end test though, but it's the only idea I can think of. But it's not just tracee-land that needs to be in a good state, supervisor-land also needs to be in an identical state. So off the top of my head I can think of right now, some of the things to consider are;

Tracee

Memory mappings are all in place
Memory contents are identical (after deserialization)

Supervisor (the "rr process")

Memory mapping meta data is identical
Frame-time meta data is identical, like ticks, event count etc
Last signal seen
Task start events
VM start events
file table stuff and many more things

Some of these things are helped out by the fact that RR asserts if it finds itself in an unexpected state.

khuey · 2023-10-11T04:22:00Z

If you have time, this would be a good time to rebase this PR as we just shipped an rr release and don't have anything substantial planned for a while.

theIDinside · 2023-10-24T14:02:51Z

If you have time, this would be a good time to rebase this PR as we just shipped an rr release and don't have anything substantial planned for a while.

I'm starting to work on that now. I'm not exactly superb at git, but I've got some help with it, so hopefully I'll get some results here in the week of making this PR build & work rebased on this new release.

GitMensch · 2023-12-27T19:42:51Z

Friendly ping @theIDinside - faster replay will definitely be useful :-)

- Moved into util.cc - Added forward_to to skip trace data to some arbitrary point in time

We need to be able to expose this data so it can be serialized.

Digs out original executable image that this task was forked from, or in the case of exec, exec'd on. This is required for persistent checkpointing, so that the names in the proc fs corresponds to a correct name at replay time (i.e. has the same behavior/looks the same in proc fs as a normal replay). The thread name is not what should be showing up in /proc/tid/comm, but the actual executable. So we need to be able to find this "original exe" of the task.

Required for the create checkpoints command, etc. to determine what events in the trace are checkpointable, when not having a live session. In future commits/PRs, remove the static function in ReplaySession.cc` that does the same thing and use this member function on Event instead.

Gets additional proc fs paths for a task, in this case /mem. Required for persistent checkpointing to figure out on how to handle mappings and what to serialize (and what not to serialize).

The function extract_name will also be required for setting up syscall buffer stuff in coming commits.

Need to be able to set this data when restoring an address space.

Added persistent checkpoint schema for capnproto rr_pcp.capnp, as well a compile command for it in CMakeLists.txt, that works like the other one (rr_trace.capnp) CheckpointInfo and MarkData types works as intermediaries between a serialized checkpoint and a deserialized "live" one. MarkData is used for copying the contents of Mark, InternalMark, ProtoMark and it's various data into, for serialization as well when deserializing, to reconstruct those types. The reasoning for adding MarkData is to not intrude in Mark/InternalMark/ProtoMark interface and possibly break some guarantees or invariants they provide. If something goes wrong now, it's constrained only to persistent checkpointing not reconstituting a session properly. GDB spawned by RR now has 2 additional commands, write-checkpoints, which serializes any checkpoints set by the `checkpoint` command and load-checkpoints. Added the rr create-checkpoints command which create persistent checkpoints on a specified interval, which it attempts to honor as closely as possible. RerunCommand and ReplayCommand are now aware of PCPs. Replay sessions get spawned from persistent checkpoints if they exist on disk when using `-g <evt>` or when using `-f <pid>` and that "task" was created some time after a persistent checkpoint. Added the --ignore-pcp flag to these commands, which ignores pcps and spawns sessions normally.

Restored comments, that existed in static function in ReplaySession.cc Change all use of this to Event::can_checkpoint_at Removed static can_checkpoint_at in ReplaySession.cc

Since checkpoints are partially initialized, checking that they are is pointless.

Deserializing and serializing an FdTable is now performed by the class itself instead of in a free function FileMonitor has a public member function that is used for serialization. Each derived type that requires special/additional logic, extends the virtual member function serialize_type.

not necessary for serialization, as FdTable is separately restored.

Task::copy_state sets the OS name of a task in the same fashion that persistent checkpointing sets name. Refactored this functionality into Task::set_name. Also removed the unnecessary `update_prname` from Task::copy_state. update_prname is not a "write to tracee"-operation but a "read from tracee"-operation; and since we already know what name we want to set Task::prname to, we skip this reading from the tracee in Task::copy_state and just set it to the parameter passed in to Task::set_name

Refactor so that marks_with_checkpoints is just changed in one place, not arbitrarily access it. Ref counts had the same changes in a previous commit. Fixes a bug for loaded persistent checkpoints where the re-created checkpoints did not get their reference counting correct. This closes rr-debugger#3678

We don't save the names, since persistent checkpoints rr-debugger/rr#3406 haven't been fully implemented yet. When it gets finished we can think about having persistent names as well.

khuey · 2024-01-29T05:50:48Z

Thanks so much for continuing to work on this. @rocallahan and I took a look at this this weekend. Rather than do a bunch of nitpicking at this point we have a few high level questions/comments

This supports checkpointing at arbitrary marks. How much simpler would things be if we only supported checkpointing at event boundaries? We suspect what you've done is the right thing but we're curious how much you think things could be simplified if we only allowed checkpoints at events.
Do you anticipate any barriers to making the checkpoint format stable at some point in the future? Obviously we'd want it to be tested much more before we committed to a checkpoint format, but at some point stability of the checkpoint format would be valuable to other tools (e.g. Pernosco).
Why is the on-disk format structured so that there's a global index of metadata and separate files for some but not all checkpoint data as opposed to having a single self-contained file for each checkpoint and no global index?
Can we get at least a basic test that the checkpoint feature works? We're not expecting exhaustive tests at this stage but more than zero would be nice?
The current commit structure seems to be of little value. Are we wrong? If not, perhaps organize the PR as follows: FileMonitor changes separately in one commit, UI commands including both rr and gdb commands separately in commits as necessary, separate tests, otherwise squash all the commits.

Our initial read through the patch yielded a lot of relatively minor comments. The general approach seems sound, modulo perhaps the checkpoint data format question above.

theIDinside · 2024-01-30T10:08:03Z

This supports checkpointing at arbitrary marks. How much simpler would things be if we only supported checkpointing at event boundaries? We suspect what you've done is the right thing but we're curious how much you think things could be simplified if we only allowed checkpoints at events.

I think I need more clarification for this question. When rr create-checkpoints executes, we basically just move forward in time and as soon as we've crossed some interval and reached an event boundary where we can set a checkpoint we do so. Or do you mean, removing the ability to make "gdb checkpoints" persistent; i.e. not keeping around the mark-with-clone and the mark-without-clone, which we then seek to at restore? This would indeed simplify things and it would also make any future format cleaner.

Do you anticipate any barriers to making the checkpoint format stable at some point in the future? Obviously we'd want it to be tested much more before we committed to a checkpoint format, but at some point stability of the checkpoint format would be valuable to other tools (e.g. Pernosco).

I think the format can be fairly stabilized. But even if it couldn't - could this not be solved via some indirection in the format, or is this impossible? For instance, with the current PR as a reference, it would mean that CheckpointInfo could be the format that's exposed to external tools, whereas the internal representation of the serialized tracee data is described by CloneCompletionInfo, so in a way having a secondary format for the actual tracee process space data, that's consumed by the "persistent checkpoint" format, if that makes any sense? I'm not sure if that's doable or if that would be considered a stable checkpoint format?

In the future, we probably (maybe) don't want to dump the entire process to disk, every time we serialize. We would need some way to represent this incremental state, something that probably is not as interesting to external tools and I'm guessing it would break the format too.

Why is the on-disk format structured so that there's a global index of metadata and separate files for some but not all checkpoint data as opposed to having a single self-contained file for each checkpoint and no global index?

I have no good explanation here 😛. The format needs to describe 2 things, ultimately

supervisor state (like statistics, ticks, if we're using syscall buffering etc etc etc)
actual tracee data & metadata (memory mappings, the contents)

I'll simplify this so that it doesn't get spread out across different files. If I recall it had something to do with being able to remove individual checkpoints more easily. Or maybe I modelled it such that it would be simpler for me to manage checkpoints from the perspective of GDB, I can't remember really.

Can we get at least a basic test that the checkpoint feature works? We're not expecting exhaustive tests at this stage but more than zero would be nice?

This is a tricky one and one I've thought about a lot. I keep coming back to having GDB be driven by a Python script, which does the following:

Decides a range of events where it wants to "test the world", so events N .. M
Connects to a running RR instance
Set checkpoint at N.
"Record the world" (memory mappings, contents of them, everything)
Continue to M
Delete checkpoint
Load checkpoint from disk, repeat 3-5
Compare outputs from step 4

Since RR is deterministic with it's replays, comparing the output should be "trivial", or am I being too heavy-handed here? Would something like that be considered acceptable?

The current commit structure seems to be of little value. Are we wrong? If not, perhaps organize the PR as follows: FileMonitor changes separately in one commit, UI commands including both rr and gdb commands separately in commits as necessary, separate tests, otherwise squash all the commits.

I'll re-arrange it, like suggested.

rocallahan · 2024-02-02T05:01:11Z

I think I need more clarification for this question. When rr create-checkpoints executes, we basically just move forward in time and as soon as we've crossed some interval and reached an event boundary where we can set a checkpoint we do so. Or do you mean, removing the ability to make "gdb checkpoints" persistent; i.e. not keeping around the mark-with-clone and the mark-without-clone, which we then seek to at restore? This would indeed simplify things and it would also make any future format cleaner.

The latter.

TBH I think it's probably worth having the ability to create checkpoints at arbitrary moments in time, since the amount of work executed between events can be arbitrarily large. But there is a tradeoff here which we wanted to think about.

For instance, with the current PR as a reference, it would mean that CheckpointInfo could be the format that's exposed to external tools, whereas the internal representation of the serialized tracee data is described by CloneCompletionInfo, so in a way having a secondary format for the actual tracee process space data, that's consumed by the "persistent checkpoint" format, if that makes any sense? I'm not sure if that's doable or if that would be considered a stable checkpoint format?

I'm not sure what you mean here. "Stable checkpoint format" would mean that persistent checkpoints created by rr version X can be restored by any rr version >= X.

That would be good, although I suppose it's not as important as just being able to replay the trace, since you can always restore a checkpoint very slowly by just replaying to the right point.

I'll simplify this so that it doesn't get spread out across different files. If I recall it had something to do with being able to remove individual checkpoints more easily. Or maybe I modelled it such that it would be simpler for me to manage checkpoints from the perspective of GDB, I can't remember really.

To be clear, one file per checkpoint seems like a good idea because then it's easy to add and remove checkpoints efficiently. The question is whether that index file is a good idea. One issue with it: what happens if someone tries to concurrently create checkpoints, or adds a checkpoint at the same time as replaying?

Since RR is deterministic with it's replays, comparing the output should be "trivial", or am I being too heavy-handed here? Would something like that be considered acceptable?

I think for now all we really need is a debug script that starts a program, runs it to some point, creates a checkpoint, and exits, plus another debug script that restores the checkpoint and tests that some simple state is as it should be. Doesn't need to be very comprehensive or general. Over time we'll have to add more tests like this that test different parts of the state and we might want some utility functions in util.sh to help with that, but I don't think we need that just yet.

theIDinside changed the title ~~Pcp pr~~ Persistent Checkpointing PR (#2184) Dec 1, 2022

khuey reviewed Dec 30, 2022

View reviewed changes

theIDinside added 7 commits January 10, 2024 12:03

Functionality that will be shared, moved from TraceStream.cc

743c719

- Moved into util.cc - Added forward_to to skip trace data to some arbitrary point in time

Getters required to expose data

dd76fad

We need to be able to expose this data so it can be serialized.

Additional proc fs query paths

cc41c3f

Gets additional proc fs paths for a task, in this case /mem. Required for persistent checkpointing to figure out on how to handle mappings and what to serialize (and what not to serialize).

Lifted CloneCompletion out of Session

9be51d2

The function extract_name will also be required for setting up syscall buffer stuff in coming commits.

Getters/setters required for PCP

9e7e27e

Need to be able to set this data when restoring an address space.

theIDinside mentioned this pull request Jan 12, 2024

GDB Checkpoint Issue #3678

Open

theIDinside added 13 commits January 15, 2024 10:47

fixup for can_checkpoint_at

e9460c7

Restored comments, that existed in static function in ReplaySession.cc Change all use of this to Event::can_checkpoint_at Removed static can_checkpoint_at in ReplaySession.cc

Fix preferred include & unnecessary check for partial init

085d5c3

Since checkpoints are partially initialized, checking that they are is pointless.

Added cmake command looping over trace files per request by @khuey

42f1e29

remove init check of member variables.

45b3abf

Move extract_name from Session into util.h.

2c4b0b2

Removed stream_util, moved contents to util.h

436e241

make ignore-pcp not take up '-i'

71805fc

Remove skipMonitoringMappedFd

6c803d4

not necessary for serialization, as FdTable is separately restored.

Add const qualifier

bc524da

theIDinside force-pushed the pcp-pr branch from b29ff57 to f1d5e90 Compare January 15, 2024 10:11

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Persistent Checkpointing PR (#2184) #3406

Persistent Checkpointing PR (#2184) #3406

theIDinside commented Dec 1, 2022 •

edited

Loading

khuey left a comment

khuey Dec 30, 2022

theIDinside Mar 7, 2023

khuey Dec 30, 2022

theIDinside Mar 7, 2023

khuey Dec 30, 2022

khuey commented Apr 21, 2023

rocallahan commented Apr 21, 2023

theIDinside commented Apr 24, 2023

khuey commented Oct 11, 2023

theIDinside commented Oct 24, 2023

GitMensch commented Dec 27, 2023

khuey commented Jan 29, 2024

theIDinside commented Jan 30, 2024

rocallahan commented Feb 2, 2024

Persistent Checkpointing PR (#2184) #3406

Are you sure you want to change the base?

Persistent Checkpointing PR (#2184) #3406

Conversation

theIDinside commented Dec 1, 2022 • edited Loading

khuey left a comment

Choose a reason for hiding this comment

khuey Dec 30, 2022

Choose a reason for hiding this comment

theIDinside Mar 7, 2023

Choose a reason for hiding this comment

khuey Dec 30, 2022

Choose a reason for hiding this comment

theIDinside Mar 7, 2023

Choose a reason for hiding this comment

khuey Dec 30, 2022

Choose a reason for hiding this comment

khuey commented Apr 21, 2023

rocallahan commented Apr 21, 2023

theIDinside commented Apr 24, 2023

khuey commented Oct 11, 2023

theIDinside commented Oct 24, 2023

GitMensch commented Dec 27, 2023

khuey commented Jan 29, 2024

theIDinside commented Jan 30, 2024

rocallahan commented Feb 2, 2024

theIDinside commented Dec 1, 2022 •

edited

Loading