
[FEA] GpuInsertIntoHiveTable supports parquet format #9939

Closed · nvliyuan opened this issue Dec 4, 2023 · 8 comments
Labels: feature request

nvliyuan (Collaborator) commented Dec 4, 2023

We observed the following fallback info in a customer driver log:

!Exec <DataWritingCommandExec> cannot run on GPU because not all data writing commands can be replaced
  !Output <InsertIntoHiveTable> cannot run on GPU because unsupported output-format found: Some(org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat), only org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat is currently supported; writing Hive delimited text tables has been disabled, to enable this, set spark.rapids.sql.format.hive.text.write.enabled to true; unsupported serde found: Some(org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe), only org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe is currently supported
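
For context, here is a minimal reproduction sketch of the kind of write that ends up as an InsertIntoHiveTable node when Spark does not convert it to a native parquet write. The table and column names are hypothetical, and `spark` is assumed to be a Hive-enabled SparkSession:

```scala
// Hypothetical sketch: a Hive-serde parquet table plus a partitioned INSERT.
// When Spark does not convert this to a native parquet write, the plan keeps
// an InsertIntoHiveTable node using MapredParquetOutputFormat/ParquetHiveSerDe,
// which the plugin cannot replace, producing the fallback message above.
spark.sql("""
  CREATE TABLE IF NOT EXISTS hive_parquet_tbl (id BIGINT, name STRING)
  PARTITIONED BY (dt STRING)
  STORED AS PARQUET
""")

spark.sql("""
  INSERT INTO hive_parquet_tbl PARTITION (dt = '2023-12-04')
  SELECT id, name FROM source_tbl
""")
```
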
revans2 (Collaborator) commented Dec 4, 2023

Spark will already take most Hive parquet writes and translate them into regular parquet writes before we ever see them.

https://github.com/apache/spark/blob/2423a4517ace1d76938ee3fd285594900f272552/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala#L225-L284

I am not 100% sure what the conditions are that make it fall back to using Hive directly for the insert, and they have changed a bit over time. It looks like it primarily has to do with a partitioned write/insert into, but I am not 100% sure.

@nvliyuan do you have more exact information about the write, so we can reproduce this and know why it is falling back? Happy to get that info offline if we need to.
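
One way to see which path Spark chose (a sketch, not from the thread, reusing the hypothetical table above) is to EXPLAIN the insert and look for the write command in the plan:

```scala
// Sketch: EXPLAIN the insert without executing it. If Spark's conversion rule
// fired, the plan shows InsertIntoHadoopFsRelationCommand (native parquet);
// if not, it shows InsertIntoHiveTable, which is what falls back to the CPU.
spark.sql("""
  EXPLAIN EXTENDED
  INSERT INTO hive_parquet_tbl PARTITION (dt = '2023-12-04')
  SELECT id, name FROM source_tbl
""").show(truncate = false)
```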

firestarman (Collaborator) commented Dec 7, 2023

Unfortunately this translation appears not to exist in earlier Spark versions, e.g. the 3.2.x releases used by some customers:
https://github.com/apache/spark/blob/v3.2.1/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala#L207C1-L233C2

I am not sure whether we can support more output formats and serdes for InsertIntoHiveTable. It looks like we are missing parquet support for it.

firestarman (Collaborator) commented:

Hi @revans2, could you help confirm whether we are missing parquet support for GpuInsertIntoHiveTable, and whether we can treat this as a new feature request? Thanks in advance.

revans2 (Collaborator) commented Dec 12, 2023

@firestarman We have NO parquet support in GpuInsertIntoHiveTable. It is completely and totally new work which we have not done before. We have always relied on Spark converting them to something that we could support.

I am a little confused by your previous comment. In 3.2.1 there is no translation for CreateHiveTableAsSelectCommand, but the link you posted still shows it supporting InsertIntoStatement, which is a common intermediate class that can become an InsertIntoHiveTable instance if it is not converted.

The configs for Hive are in https://github.com/apache/spark/blob/v3.2.1/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala or similar. The last config change I see in the newest version of that file was for 3.1.0, and some of the configs go back to Spark 1.1.1, so all of the versions we support should deal with more or less the same set of features.

Some of the relevant configs here are:

- spark.sql.hive.convertMetastoreParquet: "When set to true, the built-in Parquet reader and writer are used to process parquet tables created by using the HiveQL syntax, instead of Hive serde."

- spark.sql.hive.convertMetastoreParquet.mergeSchema: "When true, also tries to merge possibly different but compatible Parquet schemas in different Parquet data files. This configuration is only effective when spark.sql.hive.convertMetastoreParquet is true."

- spark.sql.hive.convertInsertingPartitionedTable (CONVERT_INSERTING_PARTITIONED_TABLE): "When set to true, and spark.sql.hive.convertMetastoreParquet or spark.sql.hive.convertMetastoreOrc is true, the built-in ORC/Parquet writer is used to process inserting into partitioned ORC/Parquet tables created by using the HiveSQL syntax."

There are also a few others for CTAS and INSERT_DIR. In all of these cases the configs are on by default, except for schema merging. So we need to understand first of all why an InsertIntoHiveTable is being put into the plan we are trying to run on the GPU instead of being converted by Spark.

Supporting GpuInsertIntoHiveTable for a ParquetSerDe is potentially really big, and I want to fully understand the problem we are getting into. Did the customer turn off the conversion? If so, why did they turn it off? If it is on, what is making Spark think that it cannot do the conversion? If Spark thinks it would get the conversion wrong, I really want to be sure that we are testing the corner cases involved so we don't get it wrong.
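
For reference, a sketch of setting those configs explicitly in a session (they are runtime SQL configs; the names follow the descriptions above):

```scala
// Sketch: set the Hive-conversion configs explicitly. All default to true
// except schema merging, per the descriptions above.
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "true")
spark.conf.set("spark.sql.hive.convertMetastoreParquet.mergeSchema", "false")
spark.conf.set("spark.sql.hive.convertInsertingPartitionedTable", "true")
```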

winningsix (Collaborator) commented:

  • CONVERT_INSERTING_PARTITIONED_TABLE

Just checked the configurations: spark.sql.hive.convertMetastoreParquet is explicitly turned on, and the others are presumably at their default values.
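
A quick sketch for double-checking the effective values in a session:

```scala
// Sketch: print the effective values of the conversion-related configs,
// falling back to a placeholder when a config is unset in the session.
Seq(
  "spark.sql.hive.convertMetastoreParquet",
  "spark.sql.hive.convertMetastoreParquet.mergeSchema",
  "spark.sql.hive.convertInsertingPartitionedTable"
).foreach { key =>
  println(s"$key = ${spark.conf.get(key, "<unset/default>")}")
}
```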

firestarman (Collaborator) commented Dec 13, 2023

Thanks a lot for the details. Regarding "I am a little confused by your previous comment": I misunderstood the translation you mentioned in #9939 (comment). I thought it might be something like converting the InsertIntoHiveTable into an InsertIntoHadoopFsRelationCommand, as shown in the CTAS path:
https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala#L243C11-L243C19

firestarman (Collaborator) commented Jan 17, 2024

We have to support parquet in GpuInsertIntoHiveTable according to a customer's requirements. The Hive version is 1.2.2.

@firestarman changed the title from "[FEA] it would be nice if we support Some(org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe) serde" to "[FEA] GpuInsertIntoHiveTable supports parquet format" on Jan 24, 2024
firestarman (Collaborator) commented:

The customer also asks for bucketing support, which is tracked by #10366.
