-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[HUDI-8543] Fixing Secondary Index Record generation to not rely on WriteStatus #12313
base: master
Are you sure you want to change the base?
[HUDI-8543] Fixing Secondary Index Record generation to not rely on WriteStatus #12313
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code style comments. Will do second pass again
} | ||
updateFunctionalIndexIfPresent(commitMetadata, instantTime, partitionToRecordMap); | ||
updateSecondaryIndexIfPresent(commitMetadata, partitionToRecordMap, writeStatus); | ||
updateSecondaryIndexIfPresent(commitMetadata, partitionToRecordMap, instantTime); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how come we still have the updateFromWriteStatuses
method..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
still rename writeStatus
to writeStatuses
. plural
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I have punted below changes to next patch.
a. Remove HoodieWriteDelegate from WriteStatus
b. Remove the update api in HoodieMetadataWriter totally.
Current patch ensures none of MDT record generation uses the RDD<.WriteStatus>.
I can incorporate the b in this patch if you prefer it.
@@ -159,6 +159,9 @@ public class HoodieWriteStat implements Serializable { | |||
@Nullable | |||
private Long maxEventTime; | |||
|
|||
@Nullable | |||
private String prevBaseFile; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so you just need the basefile? not the entire previous file slice?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HoodieDeltaWriteStat contains the log files. So, HoodieWriteStat only contains info about base files.
@@ -42,16 +43,19 @@ | |||
public class HoodieUnMergedLogRecordScanner extends AbstractHoodieLogRecordScanner { | |||
|
|||
private final LogRecordScannerCallback callback; | |||
private final CallbackForDeletedKeys callbackForDeletedKeys; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rename: DeletionCallback
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure
* @return Iterator of {@link HoodieRecord}s for RLI Metadata partition. | ||
* @throws IOException | ||
*/ | ||
public static Iterator<HoodieRecord> generateRLIMetadataHoodieRecordsForBaseFile(String basePath, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
UT? all methods in file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes. we already have it for the RLI methods. I will be updating the patch w/ more tests for the rest (SI related ones).
|
||
import static java.util.stream.Collectors.toList; | ||
|
||
public class BaseFileRecordParsingUtils { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need a new Utils class? adding lots of short utils classes, makes the code harder to maintain
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not want to keep expanding HoodieMetadataTableUti. Its already close to 3000 lines.
So, kept all record level parsing method to this new class.
HoodieCommitMetadata commitMetadata, | ||
HoodieMetadataConfig metadataConfig, | ||
HoodieTableMetaClient dataTableMetaClient, | ||
int writesFileIdEncoding, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be enum
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
method bit long, IMO. we need small methods that can be easily tested?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure. will split it up into smaller ones
} | ||
}, parallelism).values(); | ||
} catch (Exception e) { | ||
throw new HoodieException("Failed to generate column stats records for metadata table", e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
column stats records -> RLI records?
} | ||
}).collectAsList(); | ||
} catch (Exception e) { | ||
throw new HoodieException("Failed to generate column stats records for metadata table", e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error msg is incorrect.
.withLogRecordScannerCallbackForDeletedKeys(deletedKey -> deletedKeys.add(deletedKey.getRecordKey())) | ||
.build(); | ||
scanner.scan(); | ||
return deletedKeys.stream().collect(Collectors.toSet()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just initialize the deletedKeys
as a set.
Change Logs
SI record generation is of two steps:
a. Find record keys that are updated or deleted and add deleted records to SI index. We will do a lookup of the same in SI to find the SI, record key combo and prepare delete records.
b. For the latest data (inserted or updated), we read the records to find SI value, record key combination to generate new insert records to ingest to SI.
Among the above steps, (a) is the one which was relying on WriteStatus.
In this patch, we are only fixing (a). i.e. Finding the list of records keys that got updated or deleted in the current commit of interest will not rely on WriteStatus, but do on-demand read from data files.
Based on time permitting for the 1.x release, we might have a follow up patch, where we can unify steps a and b and get it done in one step.
This patch definitely has to go in to remove the dependency on WriteStatus. The optimization of merging steps a and b will be followed up based on available bandwidht and timeframe before we wrap up 1.0. It is an optimization step and not really impacts correctness. Since Secondary Index itself is a new feature that we are introducing in 1.x, we wanted to take care of correctness and reliability in the first place.
Impact
Risk level (write none, low medium or high below)
low
Documentation Update
Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".
ticket number here and follow the instruction to make
changes to the website.
Contributor's checklist