Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[HUDI-8543] Fixing Secondary Index Record generation to not rely on WriteStatus #12313

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

nsivabalan
Copy link
Contributor

@nsivabalan nsivabalan commented Nov 21, 2024

Change Logs

SI record generation is of two steps:
a. Find record keys that are updated or deleted and add deleted records to SI index. We will do a lookup of the same in SI to find the SI, record key combo and prepare delete records.
b. For the latest data (inserted or updated), we read the records to find SI value, record key combination to generate new insert records to ingest to SI.

Among the above steps, (a) is the one which was relying on WriteStatus.

In this patch, we are only fixing (a). i.e. Finding the list of records keys that got updated or deleted in the current commit of interest will not rely on WriteStatus, but do on-demand read from data files.
Based on time permitting for the 1.x release, we might have a follow up patch, where we can unify steps a and b and get it done in one step.

This patch definitely has to go in to remove the dependency on WriteStatus. The optimization of merging steps a and b will be followed up based on available bandwidht and timeframe before we wrap up 1.0. It is an optimization step and not really impacts correctness. Since Secondary Index itself is a new feature that we are introducing in 1.x, we wanted to take care of correctness and reliability in the first place.

Impact

  • Fixing Secondary Index Record generation to not rely on WriteStatus.

Risk level (write none, low medium or high below)

low

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".

  • The config description must be updated if new configs are added or the default value of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@github-actions github-actions bot added the size:L PR with lines of changes in (300, 1000] label Nov 21, 2024
@hudi-bot
Copy link

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

Copy link
Member

@vinothchandar vinothchandar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code style comments. Will do second pass again

}
updateFunctionalIndexIfPresent(commitMetadata, instantTime, partitionToRecordMap);
updateSecondaryIndexIfPresent(commitMetadata, partitionToRecordMap, writeStatus);
updateSecondaryIndexIfPresent(commitMetadata, partitionToRecordMap, instantTime);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how come we still have the updateFromWriteStatuses method..

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still rename writeStatus to writeStatuses. plural

Copy link
Contributor Author

@nsivabalan nsivabalan Nov 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I have punted below changes to next patch.
a. Remove HoodieWriteDelegate from WriteStatus
b. Remove the update api in HoodieMetadataWriter totally.

Current patch ensures none of MDT record generation uses the RDD<.WriteStatus>.

I can incorporate the b in this patch if you prefer it.

@@ -159,6 +159,9 @@ public class HoodieWriteStat implements Serializable {
@Nullable
private Long maxEventTime;

@Nullable
private String prevBaseFile;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so you just need the basefile? not the entire previous file slice?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HoodieDeltaWriteStat contains the log files. So, HoodieWriteStat only contains info about base files.

@@ -42,16 +43,19 @@
public class HoodieUnMergedLogRecordScanner extends AbstractHoodieLogRecordScanner {

private final LogRecordScannerCallback callback;
private final CallbackForDeletedKeys callbackForDeletedKeys;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename: DeletionCallback

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

* @return Iterator of {@link HoodieRecord}s for RLI Metadata partition.
* @throws IOException
*/
public static Iterator<HoodieRecord> generateRLIMetadataHoodieRecordsForBaseFile(String basePath,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UT? all methods in file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. we already have it for the RLI methods. I will be updating the patch w/ more tests for the rest (SI related ones).


import static java.util.stream.Collectors.toList;

public class BaseFileRecordParsingUtils {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need a new Utils class? adding lots of short utils classes, makes the code harder to maintain

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not want to keep expanding HoodieMetadataTableUti. Its already close to 3000 lines.
So, kept all record level parsing method to this new class.

HoodieCommitMetadata commitMetadata,
HoodieMetadataConfig metadataConfig,
HoodieTableMetaClient dataTableMetaClient,
int writesFileIdEncoding,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be enum

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

method bit long, IMO. we need small methods that can be easily tested?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure. will split it up into smaller ones

}
}, parallelism).values();
} catch (Exception e) {
throw new HoodieException("Failed to generate column stats records for metadata table", e);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

column stats records -> RLI records?

}
}).collectAsList();
} catch (Exception e) {
throw new HoodieException("Failed to generate column stats records for metadata table", e);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error msg is incorrect.

.withLogRecordScannerCallbackForDeletedKeys(deletedKey -> deletedKeys.add(deletedKey.getRecordKey()))
.build();
scanner.scan();
return deletedKeys.stream().collect(Collectors.toSet());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just initialize the deletedKeys as a set.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
size:L PR with lines of changes in (300, 1000]
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants