Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[HUDI-8533] Bloom functional Index creation without function fails #12290

Merged
merged 10 commits into from
Nov 22, 2024

Conversation

lokeshj1703
Copy link
Contributor

Change Logs

Allow creation of functional index without an input function. The PR uses a default function in such a case.

Impact

NA

Risk level (write none, low medium or high below)

low

Documentation Update

NA

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@lokeshj1703 lokeshj1703 changed the title [HUDI-8533] Bloom Index creation without function fails [HUDI-8533] Bloom functional Index creation without function fails Nov 19, 2024
@github-actions github-actions bot added the size:M PR with lines of changes in (100, 300] label Nov 19, 2024
Copy link
Member

@vinothchandar vinothchandar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basic question here.. Should this case - bloom_filters without a function defined be handled by BloomFiltersIndexSupport instead of adding an identity function as default.. and making everything treated as functional index?

(Code maintainability wise, I like treating everything as having a function, using identity as needed, including secondary_index. but the functional index code path has a bunch of restrictions?)

dataMetaClient.getTableConfig().setMetadataPartitionState(dataMetaClient, partitionPath, false);
if (MetadataPartitionType.isGenericIndex(partitionPath)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this unit tested?

@@ -90,6 +90,7 @@
import static org.apache.hudi.common.util.StringUtils.getUTF8Bytes;
import static org.apache.hudi.common.util.ValidationUtils.checkArgument;
import static org.apache.hudi.common.util.ValidationUtils.checkState;
import static org.apache.hudi.index.functional.HoodieFunctionalIndex.SPARK_IDENTITY;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this called SPARK_IDENTITY .. and in hudi-common.. this module must be strictly engine free

@github-actions github-actions bot added size:L PR with lines of changes in (300, 1000] and removed size:M PR with lines of changes in (100, 300] labels Nov 19, 2024
Copy link
Member

@vinothchandar vinothchandar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs some work. I suggest we first focus on addressing #12266 and land that, to make this smaller.. and fix remaining issues.

@@ -150,21 +151,22 @@ public static HoodieData<HoodieRecord> getFunctionalIndexRecordsUsingColumnStats
}

public static HoodieData<HoodieRecord> getFunctionalIndexRecordsUsingBloomFilter(Dataset<Row> dataset, String columnToIndex,
HoodieWriteConfig metadataWriteConfig, String instantTime) {
HoodieWriteConfig metadataWriteConfig, String instantTime, String indexName) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we UT this method

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed

*
* @param indexName Name of the index
*/
public void deleteIndexDefinition(String indexName) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same. UT please

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed

if (!dataMetaClient.getTableConfig().isMetadataPartitionAvailable(MetadataPartitionType.BLOOM_FILTERS)) {
LOG.error("Metadata bloom filter index is disabled!");
if (!dataMetaClient.getTableConfig().getMetadataPartitions().contains(metadataPartitionName)) {
LOG.error("Metadata partition not found {}", metadataPartitionName);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why make the error message more generic vs stating specifically. that bloom filter index is what was disabled.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed

@@ -312,6 +312,18 @@ private static void constructColumnStatsMetadataPayload(HoodieMetadataPayload pa
}
}

public static boolean isGenericIndex(String metadataPartitionPath) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually find this confusing. a generic index is sth that is either using a function to index or a secondary index?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not fixable in this PR i think

Some(getCandidateFiles(indexDf, queryFilters, prunedFileNames))
} else if (indexDefinition.getIndexType.equals(HoodieTableMetadataUtil.PARTITION_NAME_BLOOM_FILTERS)) {
val prunedPartitionAndFileNames = getPrunedPartitionsAndFileNamesMap(prunedPartitionsAndFileSlices, includeLogFiles = true)
// TODO: we should get the list of keys to pass from queryFilters
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't we need this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed

@@ -198,6 +209,44 @@ class FunctionalIndexSupport(spark: SparkSession,

columnStatsRecords
}

private def getPrunedPartitionsAndFileNamesMap(prunedPartitionsAndFileSlices: Seq[(Option[BaseHoodieTableFileIndex.PartitionPath], Seq[FileSlice])],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we make these package private/protected to get some UTs on them?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed

Seq("trip2", "rider-C")
)

if (true) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lokeshj1703 this was from my debugging.. :) please remove and cleanup.. Add calls to verifyQueryPredicate to verify query correctness

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed

}

@ParameterizedTest
@EnumSource(value = classOf[HoodieTableType])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same for this test

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed

@lokeshj1703
Copy link
Contributor Author

I have manually verified that file pruning is working

Copy link
Member

@codope codope left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. One minor test comment.

lokeshj1703 and others added 8 commits November 22, 2024 12:31
 - Take the new tests in TestSecondaryIndexSupport, need to add query predicate verification asserts
 - See changes in both production files, around fixing one-off errors and wrong filepath/filename to build bloom index.
   - This needs to be done in a clean way
 - See changes in TestFunctionalIndex to unignore it and have it passing. take these changes.
 - Ignore my comments on memory management for now.
@hudi-bot
Copy link

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@codope codope merged commit ae5833e into apache:master Nov 22, 2024
45 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
size:L PR with lines of changes in (300, 1000]
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants