
Create a checkpoint at step 0 for distributed GBT (i.e., after initialization and before training).

PiperOrigin-RevId: 704297003
achoum authored and copybara-github committed Dec 9, 2024
1 parent b43657d commit 50e3ef7
Showing 1 changed file with 6 additions and 0 deletions.
@@ -550,6 +550,8 @@ TrainWithCache(
   if (!resync_iter_idx_status.ok()) {
     LOG(WARNING) << "No existing snapshot. Restart training from start.";
     // TODO: Restart training without rebooting the trainer.
+    return absl::CancelledError(
+        "A worker was restarted before any checkpoint was done.");
   }
   auto resync_iter_idx = resync_iter_idx_status.value();
@@ -938,6 +940,10 @@ absl::Status RestoreManagerCheckpoint(
 bool ShouldCreateCheckpoint(
     int iter_idx, const absl::Time& time_last_checkpoint,
     const proto::DistributedGradientBoostedTreesTrainingConfig& spe_config) {
+  if (iter_idx == 0) {
+    return true;
+  }
+
   if (spe_config.checkpoint_interval_trees() >= 0 &&
       (iter_idx % spe_config.checkpoint_interval_trees()) == 0) {
     return true;
