-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[HUDI-8533] Bloom functional Index creation without function fails #12290
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basic question here.. Should this case - bloom_filters without a function defined be handled by BloomFiltersIndexSupport
instead of adding an identity function as default.. and making everything treated as functional index?
(Code maintainability wise, I like treating everything as having a function, using identity as needed, including secondary_index. but the functional index code path has a bunch of restrictions?)
dataMetaClient.getTableConfig().setMetadataPartitionState(dataMetaClient, partitionPath, false); | ||
if (MetadataPartitionType.isGenericIndex(partitionPath)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this unit tested?
@@ -90,6 +90,7 @@ | |||
import static org.apache.hudi.common.util.StringUtils.getUTF8Bytes; | |||
import static org.apache.hudi.common.util.ValidationUtils.checkArgument; | |||
import static org.apache.hudi.common.util.ValidationUtils.checkState; | |||
import static org.apache.hudi.index.functional.HoodieFunctionalIndex.SPARK_IDENTITY; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this called SPARK_IDENTITY
.. and in hudi-common
.. this module must be strictly engine free
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs some work. I suggest we first focus on addressing #12266 and land that, to make this smaller.. and fix remaining issues.
@@ -150,21 +151,22 @@ public static HoodieData<HoodieRecord> getFunctionalIndexRecordsUsingColumnStats | |||
} | |||
|
|||
public static HoodieData<HoodieRecord> getFunctionalIndexRecordsUsingBloomFilter(Dataset<Row> dataset, String columnToIndex, | |||
HoodieWriteConfig metadataWriteConfig, String instantTime) { | |||
HoodieWriteConfig metadataWriteConfig, String instantTime, String indexName) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we UT this method
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed
* | ||
* @param indexName Name of the index | ||
*/ | ||
public void deleteIndexDefinition(String indexName) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same. UT please
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed
if (!dataMetaClient.getTableConfig().isMetadataPartitionAvailable(MetadataPartitionType.BLOOM_FILTERS)) { | ||
LOG.error("Metadata bloom filter index is disabled!"); | ||
if (!dataMetaClient.getTableConfig().getMetadataPartitions().contains(metadataPartitionName)) { | ||
LOG.error("Metadata partition not found {}", metadataPartitionName); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why make the error message more generic vs stating specifically. that bloom filter index is what was disabled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed
@@ -312,6 +312,18 @@ private static void constructColumnStatsMetadataPayload(HoodieMetadataPayload pa | |||
} | |||
} | |||
|
|||
public static boolean isGenericIndex(String metadataPartitionPath) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually find this confusing. a generic index is sth that is either using a function to index or a secondary index?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not fixable in this PR i think
Some(getCandidateFiles(indexDf, queryFilters, prunedFileNames)) | ||
} else if (indexDefinition.getIndexType.equals(HoodieTableMetadataUtil.PARTITION_NAME_BLOOM_FILTERS)) { | ||
val prunedPartitionAndFileNames = getPrunedPartitionsAndFileNamesMap(prunedPartitionsAndFileSlices, includeLogFiles = true) | ||
// TODO: we should get the list of keys to pass from queryFilters |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't we need this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed
@@ -198,6 +209,44 @@ class FunctionalIndexSupport(spark: SparkSession, | |||
|
|||
columnStatsRecords | |||
} | |||
|
|||
private def getPrunedPartitionsAndFileNamesMap(prunedPartitionsAndFileSlices: Seq[(Option[BaseHoodieTableFileIndex.PartitionPath], Seq[FileSlice])], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we make these package private/protected to get some UTs on them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed
Seq("trip2", "rider-C") | ||
) | ||
|
||
if (true) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lokeshj1703 this was from my debugging.. :) please remove and cleanup.. Add calls to verifyQueryPredicate
to verify query correctness
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed
} | ||
|
||
@ParameterizedTest | ||
@EnumSource(value = classOf[HoodieTableType]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same for this test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed
9d0de0c
to
9a7c0e0
Compare
I have manually verified that file pruning is working |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. One minor test comment.
.../hudi-spark/src/test/scala/org/apache/spark/sql/hudi/command/index/TestFunctionalIndex.scala
Show resolved
Hide resolved
- Take the new tests in TestSecondaryIndexSupport, need to add query predicate verification asserts - See changes in both production files, around fixing one-off errors and wrong filepath/filename to build bloom index. - This needs to be done in a clean way - See changes in TestFunctionalIndex to unignore it and have it passing. take these changes. - Ignore my comments on memory management for now.
Change Logs
Allow creation of functional index without an input function. The PR uses a default function in such a case.
Impact
NA
Risk level (write none, low medium or high below)
low
Documentation Update
NA
Contributor's checklist