-
Notifications
You must be signed in to change notification settings - Fork 179
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Async Deletion of Previous Metadata and Statistics Files #312
Changes from 8 commits
d0cd456
26e03ac
0f8e8f4
1b525de
e8b26d2
2ee6dee
806f46d
4f1d3c9
0a77bfa
47dc60a
9d835b3
40c6147
f354d1c
88c6651
af3efab
278ab7e
ed30fb0
05c3dd9
49dbe68
8eea50d
d9804e6
54511de
56ba4f2
e92852e
27ea1b3
4d1b68b
eb533d7
47f760f
988e530
4965d5c
5f81483
097189c
651ece0
16bb5fe
d276ae6
187b47e
ba5c47c
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -23,6 +23,8 @@ | |
import java.util.function.Function; | ||
import java.util.stream.Collectors; | ||
import org.apache.iceberg.ManifestFile; | ||
import org.apache.iceberg.Snapshot; | ||
import org.apache.iceberg.StatisticsFile; | ||
import org.apache.iceberg.TableMetadata; | ||
import org.apache.iceberg.TableMetadataParser; | ||
import org.apache.iceberg.io.FileIO; | ||
|
@@ -158,6 +160,22 @@ public boolean handleTask(TaskEntity cleanupTask) { | |
for (PolarisBaseEntity createdTask : createdTasks) { | ||
taskExecutor.addTaskHandlerContext(createdTask.getId(), CallContext.getCurrentContext()); | ||
} | ||
|
||
tableMetadata.snapshots().stream() | ||
.flatMap(sn -> sn.allManifests(fileIO).stream()) | ||
// remove duplication | ||
.collect(Collectors.toMap(ManifestFile::path, Function.identity(), (mf1, mf2) -> mf1)) | ||
.keySet() | ||
.forEach(fileIO::deleteFile); | ||
tableMetadata.snapshots().stream() | ||
.map(Snapshot::manifestListLocation) | ||
.forEach(fileIO::deleteFile); | ||
tableMetadata.previousFiles().stream() | ||
.map(TableMetadata.MetadataLogEntry::file) | ||
.forEach(fileIO::deleteFile); | ||
tableMetadata.statisticsFiles().stream() | ||
.map(StatisticsFile::path) | ||
.forEach(fileIO::deleteFile); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We should either add retries for these or submit them as separate tasks. As is, if one of these files fails to delete, we'll retry the whole task and resubmit a new task for each manifest. If the manifests are already deleted when we retry, we'll get stuck in a retry loop. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks for the suggestion, Michael! I could refactor the deletion process by adding retry strategy, similar to the Personally, I think using a retry mechanism here might be more effective than creating separate tasks. With separate tasks,, there’s still a risk of issues like task creation failures, which could result in skipping the entire task (which contains multiple files). By using retries within the ManifestFileCleanupTaskHandler, we can manage failure handling at the file level, ensuring that each file is retried independently. This way, if a file deletion fails, we can retry just that file without needing to resubmit or skip the others files. This approach can offers more granular control over handling failures. I’m open to your thoughts on this! Does this seem aligned with what you were suggesting, or do you see potential advantages in the separate task approach that I might be overlooking? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sorry, do you mean adding retry within the TBH, I'm not very familiar with how Iceberg generates the statistic files. Does there tend to be one per snapshot? one per data file? If the latter, we could be talking about a very large number of files. If that's the case, I think submitting separate tasks with batches of files makes sense. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. They are per snapshots, https://iceberg.apache.org/spec/#table-statistics. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is there only one statistics file per snapshot? The spec is not clear:
Unlike the partition statistics file, which is very clear:
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Could be multiple statistics files as the following code shows: There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Thank you for providing the context, @flyrain :)
@collado-mike Yes, if we go for retries, the logic will be within |
||
fileIO.deleteFile(tableEntity.getMetadataLocation()); | ||
|
||
return true; | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The test cases in this file primarily verify theses scenarios:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we've already handled the manifest files and manifest list files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for bringing that up! I’ve just noticed that these are already handled in the ManifestFileCleanupTaskHandler. Have removed theses redundant deletion in the latest commit. Appreciate your input.