Skip to content

Commit

Permalink
apacheGH-39562: [C++][Parquet] Fix crash in test_parquet_dataset_lazy…
Browse files Browse the repository at this point in the history
…_filtering (apache#39632)

### Rationale for this change

`ParquetFileFragment` stores a `SchemaManifest` that has a raw pointer to a `SchemaDescriptor`. The `SchemaDescriptor` is originally provided by a `FileMetadata` instance but, in some cases, the `FileMetadata` instance can be destroyed while the `ParquetFileFragment` is still in use. This can typically lead to bugs or crashes.

### What changes are included in this PR?

Ensure that `ParquetFileFragment` keeps an owning pointer to the `FileMetadata` instance that provides its `SchemaManifest`'s schema descriptor.

### Are these changes tested?

An assertion is added that would fail deterministically in the Python test suite.

### Are there any user-facing changes?

No.

* Closes: apache#39562

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
  • Loading branch information
pitrou authored Jan 16, 2024
1 parent a2be302 commit 4b3de81
Show file tree
Hide file tree
Showing 2 changed files with 15 additions and 4 deletions.
14 changes: 11 additions & 3 deletions cpp/src/arrow/dataset/file_parquet.cc
Original file line number Diff line number Diff line change
Expand Up @@ -813,11 +813,17 @@ Status ParquetFileFragment::EnsureCompleteMetadata(parquet::arrow::FileReader* r

Status ParquetFileFragment::SetMetadata(
std::shared_ptr<parquet::FileMetaData> metadata,
std::shared_ptr<parquet::arrow::SchemaManifest> manifest) {
std::shared_ptr<parquet::arrow::SchemaManifest> manifest,
std::shared_ptr<parquet::FileMetaData> original_metadata) {
DCHECK(row_groups_.has_value());

metadata_ = std::move(metadata);
manifest_ = std::move(manifest);
original_metadata_ = original_metadata ? std::move(original_metadata) : metadata_;
// The SchemaDescriptor needs to be owned by a FileMetaData instance,
// because SchemaManifest only stores a raw pointer (GH-39562).
DCHECK_EQ(manifest_->descr, original_metadata_->schema())
<< "SchemaDescriptor should be owned by the original FileMetaData";

statistics_expressions_.resize(row_groups_->size(), compute::literal(true));
statistics_expressions_complete_.resize(manifest_->descr->num_columns(), false);
Expand Down Expand Up @@ -846,7 +852,8 @@ Result<FragmentVector> ParquetFileFragment::SplitByRowGroup(
parquet_format_.MakeFragment(source_, partition_expression(),
physical_schema_, {row_group}));

RETURN_NOT_OK(fragment->SetMetadata(metadata_, manifest_));
RETURN_NOT_OK(fragment->SetMetadata(metadata_, manifest_,
/*original_metadata=*/original_metadata_));
fragments[i++] = std::move(fragment);
}

Expand Down Expand Up @@ -1106,7 +1113,8 @@ ParquetDatasetFactory::CollectParquetFragments(const Partitioning& partitioning)
format_->MakeFragment({path, filesystem_}, std::move(partition_expression),
physical_schema_, std::move(row_groups)));

RETURN_NOT_OK(fragment->SetMetadata(metadata_subset, manifest_));
RETURN_NOT_OK(fragment->SetMetadata(metadata_subset, manifest_,
/*original_metadata=*/metadata_));
fragments[i++] = std::move(fragment);
}

Expand Down
5 changes: 4 additions & 1 deletion cpp/src/arrow/dataset/file_parquet.h
Original file line number Diff line number Diff line change
Expand Up @@ -188,7 +188,8 @@ class ARROW_DS_EXPORT ParquetFileFragment : public FileFragment {
std::optional<std::vector<int>> row_groups);

Status SetMetadata(std::shared_ptr<parquet::FileMetaData> metadata,
std::shared_ptr<parquet::arrow::SchemaManifest> manifest);
std::shared_ptr<parquet::arrow::SchemaManifest> manifest,
std::shared_ptr<parquet::FileMetaData> original_metadata = {});

// Overridden to opportunistically set metadata since a reader must be opened anyway.
Result<std::shared_ptr<Schema>> ReadPhysicalSchemaImpl() override {
Expand Down Expand Up @@ -219,6 +220,8 @@ class ARROW_DS_EXPORT ParquetFileFragment : public FileFragment {
std::vector<bool> statistics_expressions_complete_;
std::shared_ptr<parquet::FileMetaData> metadata_;
std::shared_ptr<parquet::arrow::SchemaManifest> manifest_;
// The FileMetaData that owns the SchemaDescriptor pointed by SchemaManifest.
std::shared_ptr<parquet::FileMetaData> original_metadata_;

friend class ParquetFileFormat;
friend class ParquetDatasetFactory;
Expand Down

0 comments on commit 4b3de81

Please sign in to comment.