Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Log progress on network migrations #12259

Open
rjan90 opened this issue Jul 18, 2024 · 11 comments
Open

Log progress on network migrations #12259

rjan90 opened this issue Jul 18, 2024 · 11 comments
Assignees
Labels
good first issue Good for newcomers

Comments

@rjan90
Copy link
Contributor

rjan90 commented Jul 18, 2024

Done Criteria:

  • Implement a progress bar to display the status of pre-migrations and migrations during network upgrades.
  • Ensure the progress bar provides updates on the number of jobs created, completed, and pending.

Why Important:

  • During complex or longer-duration pre-migrations and migrations, it is difficult to estimate the time required to complete the process, leading to user confusion.
  • A progress bar will help users understand the migration progress and estimate the time remaining, improving the overall user experience.
  • This is particularly important for full archival nodes, which take longer to complete migrations. The progress bar will assist in communicating to users when their service will be back online.

User/Customer:

  • All Lotus users, especially full archival nodes.

Example Output from a Migration:

024-07-18T11:06:27.005+0200    INFO    fil-consensus    filcns/upgrades.go:2740    Worker 27 done
2024-07-18T11:06:27.005+0200    INFO    fil-consensus    filcns/upgrades.go:2740    Worker 26 done
2024-07-18T11:06:27.005+0200    INFO    fil-consensus    filcns/upgrades.go:2740    Worker 29 done
2024-07-18T11:06:27.005+0200    INFO    fil-consensus    filcns/upgrades.go:2740    Worker 3 done
2024-07-18T11:06:27.005+0200    INFO    fil-consensus    filcns/upgrades.go:2740    Worker 36 done
2024-07-18T11:06:27.005+0200    INFO    fil-consensus    filcns/upgrades.go:2740    Worker 19 done
2024-07-18T11:06:27.005+0200    INFO    fil-consensus    filcns/upgrades.go:2740    Worker 21 done
2024-07-18T11:06:27.005+0200    INFO    fil-consensus    filcns/upgrades.go:2740    Worker 8 done
2024-07-18T11:06:27.005+0200    INFO    fil-consensus    filcns/upgrades.go:2740    Worker 12 done
2024-07-18T11:06:27.005+0200    INFO    fil-consensus    filcns/upgrades.go:2740    Worker 17 done
2024-07-18T11:06:27.005+0200    INFO    fil-consensus    filcns/upgrades.go:2740    Worker 11 done
2024-07-18T11:06:27.005+0200    INFO    fil-consensus    filcns/upgrades.go:2740    Worker 18 done
@rjan90 rjan90 added this to FilOz Jul 18, 2024
@rjan90 rjan90 moved this to 📌 Triage in FilOz Jul 18, 2024
@rvagg
Copy link
Member

rvagg commented Jul 22, 2024

Progress bar is only going to be useful if you're running Lotus in a terminal; but we could probably couple this with just improving these log messages because they're so opaque. I don't even know what these mean or how to connect them to any sense of "progress". We can do better I think.

@rjan90 rjan90 changed the title Progress Bar for network migrations Log progress on for network migrations Jul 23, 2024
@rjan90 rjan90 changed the title Log progress on for network migrations Log progress on network migrations Jul 23, 2024
@rjan90 rjan90 added the good first issue Good for newcomers label Jul 23, 2024
@rjan90
Copy link
Contributor Author

rjan90 commented Jul 23, 2024

From triage today:

  • This could be a good first issue for one of the PL Dev Guild member to pick up if we spec out what needs to changes/add pointers to the code.

@rjan90 rjan90 moved this from 📌 Triage to 🐱Todo in FilOz Jul 23, 2024
@rvagg
Copy link
Member

rvagg commented Nov 20, 2024

We currently have this in the migration: https://github.com/filecoin-project/go-state-types/blob/9a76026713cdf92be72a085c78a041ed3aa100af/migration/runner.go#L103C9-L115

And we set it to 10s here:

ProgressLogPeriod: 10 * time.Second,
}
(and in other upgrades)

So we should be seeing info logs every 10 seconds with X jobs created, Y done, Z pending after T (xx.yy/s).

Maybe the question for this issue is: why aren't we seeing these?

@rvagg
Copy link
Member

rvagg commented Nov 21, 2024

OK, so now we've had the migration:

$ tail -1000000 /var/log/lotus/daemon.log | grep 'pending after'
$ tail -1000000 /var/log/lotus/daemon.log | grep 'migrat'
{"level":"info","ts":"2024-11-19T15:26:39.959+1100","logger":"fil-consensus","caller":"filcns/upgrades.go:91","msg":"migration worker count: 32"}
{"level":"info","ts":"2024-11-19T16:02:34.496+1100","logger":"fil-consensus","caller":"filcns/upgrades.go:91","msg":"migration worker count: 32"}
{"level":"info","ts":"2024-11-19T18:29:44.860+1100","logger":"fil-consensus","caller":"filcns/upgrades.go:93","msg":"migration worker count: 32"}
{"level":"info","ts":"2024-11-19T18:30:35.638+1100","logger":"fil-consensus","caller":"filcns/upgrades.go:93","msg":"migration worker count: 32"}
{"level":"warn","ts":"2024-11-21T09:00:00.435+1100","logger":"statemgr","caller":"stmgr/forks.go:250","msg":"STARTING pre-migration"}
{"level":"info","ts":"2024-11-21T09:00:00.498+1100","logger":"fil-consensus","caller":"filcns/upgrades.go:2887","msg":"Creating migration jobs"}
{"level":"info","ts":"2024-11-21T09:00:49.803+1100","logger":"fil-consensus","caller":"filcns/upgrades.go:2887","msg":"Done creating 3251932 migration jobs after 49.305722065s"}
{"level":"warn","ts":"2024-11-21T09:00:53.791+1100","logger":"statemgr","caller":"stmgr/forks.go:263","msg":"COMPLETED pre-migration","duration":53.35572384}
{"level":"warn","ts":"2024-11-21T10:00:34.896+1100","logger":"statemgr","caller":"stmgr/forks.go:202","msg":"STARTING migration","height":"4461240","from":"bafy2bzacealnnuali3ftodbzsug4zae66r46pahwt2t4k5gmcmc6vjwarzulc"}
{"level":"info","ts":"2024-11-21T10:00:34.897+1100","logger":"fil-consensus","caller":"filcns/upgrades.go:2887","msg":"Creating migration jobs"}
{"level":"info","ts":"2024-11-21T10:00:43.111+1100","logger":"fil-consensus","caller":"filcns/upgrades.go:2887","msg":"Done creating 3252200 migration jobs after 8.213846102s"}
{"level":"warn","ts":"2024-11-21T10:00:46.250+1100","logger":"statemgr","caller":"stmgr/forks.go:211","msg":"COMPLETED migration","height":"4461240","from":"bafy2bzacealnnuali3ftodbzsug4zae66r46pahwt2t4k5gmcmc6vjwarzulc","to":"bafy2bzacebw2hbs4kvdwr2q342v33arw2dirfjpvrxw3v4ttxry2lck6wsves","duration":11.353472059}

So we don't get any of those log.Log(rt.INFO, "%d jobs created, %d done, %d pending after %v (%.0f/s)", jobsNow, doneNow, pendingNow, elapsed, rate) logs. Why not? That's what we need to figure out.

I've asked @virajbhartiya to have a look at this and see if he can understand what might be the blocker and maybe we can make it better for the next upgrade. Here's some pointers:

upgradeActorsV15Common calls nv24.MigrateStateTree here:

newHamtRoot, err := nv24.MigrateStateTree(
ctx,
adtStore,
manifest,
stateRoot.Actors,
epoch,
// two FIP-0081 constants for this migration only
int64(buildconstants.UpgradeTuktukHeight), // powerRampStartEpoch
buildconstants.UpgradeTuktukPowerRampDurationEpochs, // powerRampDurationEpochs
config,
migrationLogger{},
cache,
)
if err != nil {

The migrationLogger{} is worth looking at, that's where the log output goes so it should show up as logging from filcns/upgrades.go even though this code is in go-state-types.

That nv24.MigrateStateTree code is here: https://github.com/filecoin-project/go-state-types/blob/9a76026713cdf92be72a085c78a041ed3aa100af/builtin/v15/migration/top.go#L27

And the important bit is where it calls migration.RunMigration and goes into here: https://github.com/filecoin-project/go-state-types/blob/9a76026713cdf92be72a085c78a041ed3aa100af/migration/runner.go#L19

That's where the logging happens, and it's the if cfg.ProgressLogPeriod > 0 { block where we want to get progress output from.

So back to lotus, upgradeActorsV15Common got called from 2 places: the pre-migration and the actual migration. ProgressLogPeriod is set in both, but differently:

  1. Pre-migration:
    ProgressLogPeriod: time.Minute * 5,
  2. Actual migration:
    ProgressLogPeriod: 10 * time.Second,

I think that the answer to this might just be that the timing is too low. My duration is 11.353472059 seconds, so it might not have had time to tick over one progress output. The pre-migration has it set to 5 minutes so that probably explains why I didn't see it!

There's a therad in slack with some others posting their greps: https://filecoinproject.slack.com/archives/C027TQMUVJN/p1732143841560979, @TippyFlitsUK's took particularly long because he had a splitstore compaction. So he might have logs that show 'pending after'.

So @virajbhartiya the things we need to do with this I think are:

  1. Investigate whether the explanation is simply that it happened to quick - if we can find evidence of it printing in others' logs then that's a good sign that there's not a code problem here. If you could chase up some logs that would be good - alternatively sync a node from a snapshot just before upgrade and watch it upgrade for yourself.
  2. Let's speed up the printing for future upgrades. 5 minutes is silly for pre-migration, let's lower that quite a bit. Then the actual migration could also be sped up. How about once every 2 seconds. If a migration takes 30 seconds maximum, then that's only 15 log lines and it's worth it for the knowledge that it's still working. For now, since we don't have a v16 in lotus, let's modify the v15 code because it's going to be copied when the next person sets that up (soon, unless there's a draft PR already open for it?).

@rvagg
Copy link
Member

rvagg commented Nov 21, 2024

Oh, and I think the message should be improved. Think about it from a user perspective - they don't really care how many jobs there are, they care about % complete. So we should at least add a % to the output so they don't have to figure it out for themselves.

@virajbhartiya
Copy link
Member

Thanks @rvagg for the logs and for providing the reference of where to start, I'll begin working on it

@virajbhartiya
Copy link
Member

@rvagg If I am correct these are the pending after logs you are referring to right in the comment above

2024-11-20T23:00:44.218Z	INFO	fil-consensus	filcns/upgrades.go:2887	172924 jobs created, 172924 done, 0 pending after 10.00028947s (17292/s)
2024-11-20T23:00:54.218Z	INFO	fil-consensus	filcns/upgrades.go:2887	376092 jobs created, 376091 done, 1 pending after 20.000394023s (18804/s)
2024-11-20T23:01:04.218Z	INFO	fil-consensus	filcns/upgrades.go:2887	548324 jobs created, 548323 done, 1 pending after 30.000615333s (18277/s)
2024-11-20T23:01:14.219Z	INFO	fil-consensus	filcns/upgrades.go:2887	748161 jobs created, 748161 done, 0 pending after 40.000767177s (18704/s)
2024-11-20T23:01:24.219Z	INFO	fil-consensus	filcns/upgrades.go:2887	944014 jobs created, 944013 done, 1 pending after 50.001019858s (18880/s)
2024-11-20T23:01:34.219Z	INFO	fil-consensus	filcns/upgrades.go:2887	1151764 jobs created, 1151764 done, 0 pending after 1m0.001234484s (19196/s)
2024-11-20T23:01:44.219Z	INFO	fil-consensus	filcns/upgrades.go:2887	1348690 jobs created, 1348690 done, 0 pending after 1m10.001301533s (19267/s)
2024-11-20T23:01:54.219Z	INFO	fil-consensus	filcns/upgrades.go:2887	1559801 jobs created, 1559801 done, 0 pending after 1m20.001362138s (19497/s)
2024-11-20T23:02:04.219Z	INFO	fil-consensus	filcns/upgrades.go:2887	1763707 jobs created, 1763707 done, 0 pending after 1m30.001508059s (19596/s)
2024-11-20T23:02:14.219Z	INFO	fil-consensus	filcns/upgrades.go:2887	1957939 jobs created, 1957938 done, 1 pending after 1m40.001551335s (19579/s)
2024-11-20T23:02:24.220Z	INFO	fil-consensus	filcns/upgrades.go:2887	2170918 jobs created, 2170918 done, 0 pending after 1m50.002030258s (19735/s)
2024-11-20T23:02:34.220Z	INFO	fil-consensus	filcns/upgrades.go:2887	2361162 jobs created, 2361162 done, 0 pending after 2m0.002200215s (19676/s)
2024-11-20T23:02:44.220Z	INFO	fil-consensus	filcns/upgrades.go:2887	2568650 jobs created, 2568650 done, 0 pending after 2m10.002248466s (19759/s)
2024-11-20T23:02:54.220Z	INFO	fil-consensus	filcns/upgrades.go:2887	2773370 jobs created, 2773370 done, 0 pending after 2m20.002331222s (19809/s)
2024-11-20T23:03:04.220Z	INFO	fil-consensus	filcns/upgrades.go:2887	2979342 jobs created, 2979342 done, 0 pending after 2m30.002692406s (19862/s)
2024-11-20T23:03:14.221Z	INFO	fil-consensus	filcns/upgrades.go:2887	3163568 jobs created, 3163568 done, 0 pending after 2m40.003382837s (19772/s)

@BigLep BigLep moved this from 🐱 Todo to ⌨️ In Progress in FilOz Nov 21, 2024
@rvagg
Copy link
Member

rvagg commented Nov 22, 2024

Excellent, yes those are the logs @virajbhartiya, thanks for hunting those down. So we know they are showing up. I think the two things remaining here are:

  1. See if we can make the logs more interesting

Looking at them now it's clear that we're creating jobs at the same time as processing them, so we don't get to do a %. But let's see if we can make it more interesting for the user that's wondering "Is my migration still working, and is it going OK??". Some ideas off the top of my head looking at this:

a. We should probably add the word "migration" in there, start with "Performing migration" perhaps?
b. The job numbers aren't that interesting really, but they're the only thing we have, we could ditch the "pending" number, it's not that useful unless it was specific about what was pending which is hard: "X of Y jobs complete (Z/s)".
c. The precision of the time is too much. I don't think we need a decimal for seconds.

  1. Make them come quicker

Once every 10 minutes is far too slow here, we should see it ticking away. I'd be happy with a number between 2 and 5 seconds. For a 30 second migration that would be between 15 and 6 log entries. For a worst case of 10 minutes it'd be between 300 and 120 log entries. Is that bad? I don't think so. @rjan90 you're a log watcher, got an opinion on how much spam is too much for this progress output?

Next, where to do that timing. I think you could start by doing it in the v15 migration, but @kamuik16 has already opened a skeleton PR @ #12707 which would need the changes in the v16 block. So we need to coordinate on getting that done. Either we land v15 changes and get @kamuik16 to merge them in, or we wait till @kamuik16's PR is landed and you just do v16, or you PR to @kamuik16's branch but then you don't get a commit in the git log for your efforts.

@kamuik16
Copy link
Contributor

Excellent, yes those are the logs @virajbhartiya, thanks for hunting those down. So we know they are showing up. I think the two things remaining here are:

  1. See if we can make the logs more interesting

Looking at them now it's clear that we're creating jobs at the same time as processing them, so we don't get to do a %. But let's see if we can make it more interesting for the user that's wondering "Is my migration still working, and is it going OK??". Some ideas off the top of my head looking at this:

a. We should probably add the word "migration" in there, start with "Performing migration" perhaps?
b. The job numbers aren't that interesting really, but they're the only thing we have, we could ditch the "pending" number, it's not that useful unless it was specific about what was pending which is hard: "X of Y jobs complete (Z/s)".
c. The precision of the time is too much. I don't think we need a decimal for seconds.

  1. Make them come quicker

Once every 10 minutes is far too slow here, we should see it ticking away. I'd be happy with a number between 2 and 5 seconds. For a 30 second migration that would be between 15 and 6 log entries. For a worst case of 10 minutes it'd be between 300 and 120 log entries. Is that bad? I don't think so. @rjan90 you're a log watcher, got an opinion on how much spam is too much for this progress output?

Next, where to do that timing. I think you could start by doing it in the v15 migration, but @kamuik16 has already opened a skeleton PR @ #12707 which would need the changes in the v16 block. So we need to coordinate on getting that done. Either we land v15 changes and get @kamuik16 to merge them in, or we wait till @kamuik16's PR is landed and you just do v16, or you PR to @kamuik16's branch but then you don't get a commit in the git log for your efforts.

I am fine with anything. Yours and @virajbhartiya's call.

@kamuik16
Copy link
Contributor

Thou I think it would be best after my PR lands :)

@virajbhartiya
Copy link
Member

We could wait for @kamuik16 PR to land then get started with modifications in v16. Is that okay @rvagg ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Good for newcomers
Projects
Status: ⌨️ In Progress
Development

No branches or pull requests

4 participants