
[FEA] GpuInsertIntoHiveTable supports parquet format #9939

Closed · nvliyuan opened this issue Dec 4, 2023 · 8 comments
Labels: feature request

nvliyuan (Collaborator) commented Dec 4, 2023

We observed the following fallback info in a customer driver log:

!Exec <DataWritingCommandExec> cannot run on GPU because not all data writing commands can be replaced
  !Output <InsertIntoHiveTable> cannot run on GPU because unsupported output-format found: Some(org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat), only org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat is currently supported; writing Hive delimited text tables has been disabled, to enable this, set spark.rapids.sql.format.hive.text.write.enabled to true; unsupported serde found: Some(org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe), only org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe is currently supported
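
For context, here is a minimal reproduction sketch of the kind of write that ends up as an InsertIntoHiveTable node when Spark does not convert it to a native parquet write. The table and column names are hypothetical, and `spark` is assumed to be a Hive-enabled SparkSession:

```scala
// Hypothetical sketch: a Hive-serde parquet table plus a partitioned INSERT.
// When Spark does not convert this to a native parquet write, the plan keeps
// an InsertIntoHiveTable node using MapredParquetOutputFormat/ParquetHiveSerDe,
// which the plugin cannot replace, producing the fallback message above.
spark.sql("""
  CREATE TABLE IF NOT EXISTS hive_parquet_tbl (id BIGINT, name STRING)
  PARTITIONED BY (dt STRING)
  STORED AS PARQUET
""")

spark.sql("""
  INSERT INTO hive_parquet_tbl PARTITION (dt = '2023-12-04')
  SELECT id, name FROM source_tbl
""")
```
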
revans2 (Collaborator) commented Dec 4, 2023

Spark will already take most Hive parquet writes and translate them into regular parquet writes before we ever see them.

https://github.com/apache/spark/blob/2423a4517ace1d76938ee3fd285594900f272552/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala#L225-L284

I am not 100% sure what the conditions are that make it fall back to using Hive directly for the insert, and they have changed a bit over time. It looks like it primarily has to do with a partitioned write/insert into, but I am not 100% sure.

@nvliyuan do you have more exact information about the write, so we can reproduce this and know why it is falling back? Happy to get that info offline if we need to.
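
One way to see which path Spark chose (a sketch, not from the thread, reusing the hypothetical table above) is to EXPLAIN the insert and look for the write command in the plan:

```scala
// Sketch: EXPLAIN the insert without executing it. If Spark's conversion rule
// fired, the plan shows InsertIntoHadoopFsRelationCommand (native parquet);
// if not, it shows InsertIntoHiveTable, which is what falls back to the CPU.
spark.sql("""
  EXPLAIN EXTENDED
  INSERT INTO hive_parquet_tbl PARTITION (dt = '2023-12-04')
  SELECT id, name FROM source_tbl
""").show(truncate = false)
```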

firestarman (Collaborator) commented Dec 7, 2023

Unfortunately this translation appears not to exist in earlier Spark versions, e.g. the 3.2.x releases used by some customers:
https://github.com/apache/spark/blob/v3.2.1/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala#L207C1-L233C2

I am not sure whether we can support more output formats and serdes for InsertIntoHiveTable. It looks like we are missing parquet support for it.

firestarman (Collaborator) commented:

Hi @revans2, could you help confirm whether we are missing parquet support for GpuInsertIntoHiveTable, and whether we can treat this as a new feature request? Thanks in advance.

revans2 (Collaborator) commented Dec 12, 2023

@firestarman We have NO parquet support in GpuInsertIntoHiveTable. It is completely and totally new work which we have not done before. We have always relied on Spark converting them to something that we could support.

I am a little confused by your previous comment. In 3.2.1 there is no translation for CreateHiveTableAsSelectCommand, but the link you posted still shows it supporting InsertIntoStatement, which is a common intermediate class that can become an InsertIntoHiveTable instance if it is not converted.

The configs for Hive are in https://github.com/apache/spark/blob/v3.2.1/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala or similar. The last config change I see in the newest version of that file was for 3.1.0, and some of the configs go back to Spark 1.1.1, so all of the versions we support should deal with more or less the same set of features.

Some of the relevant configs here are:

- spark.sql.hive.convertMetastoreParquet: "When set to true, the built-in Parquet reader and writer are used to process parquet tables created by using the HiveQL syntax, instead of Hive serde."

- spark.sql.hive.convertMetastoreParquet.mergeSchema: "When true, also tries to merge possibly different but compatible Parquet schemas in different Parquet data files. This configuration is only effective when spark.sql.hive.convertMetastoreParquet is true."

- spark.sql.hive.convertInsertingPartitionedTable (CONVERT_INSERTING_PARTITIONED_TABLE): "When set to true, and spark.sql.hive.convertMetastoreParquet or spark.sql.hive.convertMetastoreOrc is true, the built-in ORC/Parquet writer is used to process inserting into partitioned ORC/Parquet tables created by using the HiveSQL syntax."

There are also a few others for CTAS and INSERT_DIR. In all of these cases the configs are on by default, except for schema merging. So we need to understand first of all why an InsertIntoHiveTable is being put into the plan we are trying to run on the GPU instead of being converted by Spark.

Supporting GpuInsertIntoHiveTable for a ParquetSerDe is potentially really big, and I want to fully understand the problem we are getting into. Did the customer turn off the conversion? If so, why did they turn it off? If it is on, what is making Spark think that it cannot do the conversion? If Spark thinks it would get the conversion wrong, I really want to be sure that we are testing the corner cases involved so we don't get it wrong.
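
For reference, a sketch of setting those configs explicitly in a session (they are runtime SQL configs; the names follow the descriptions above):

```scala
// Sketch: set the Hive-conversion configs explicitly. All default to true
// except schema merging, per the descriptions above.
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "true")
spark.conf.set("spark.sql.hive.convertMetastoreParquet.mergeSchema", "false")
spark.conf.set("spark.sql.hive.convertInsertingPartitionedTable", "true")
```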

winningsix (Collaborator) commented:

  • CONVERT_INSERTING_PARTITIONED_TABLE

Just checked the configurations: spark.sql.hive.convertMetastoreParquet is explicitly turned on, and the others are presumably at their default values.
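
A quick sketch for double-checking the effective values in a session:

```scala
// Sketch: print the effective values of the conversion-related configs,
// falling back to a placeholder when a config is unset in the session.
Seq(
  "spark.sql.hive.convertMetastoreParquet",
  "spark.sql.hive.convertMetastoreParquet.mergeSchema",
  "spark.sql.hive.convertInsertingPartitionedTable"
).foreach { key =>
  println(s"$key = ${spark.conf.get(key, "<unset/default>")}")
}
```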

firestarman (Collaborator) commented Dec 13, 2023

Thanks a lot for the details. Regarding "I am a little confused by your previous comment": I misunderstood the translation you mentioned in #9939 (comment). I thought it might be something like converting the InsertIntoHiveTable into an InsertIntoHadoopFsRelationCommand, as shown in the CTAS path:
https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala#L243C11-L243C19

firestarman (Collaborator) commented Jan 17, 2024

We have to support parquet in GpuInsertIntoHiveTable according to a customer's requirements. The Hive version is 1.2.2.

@firestarman changed the title from "[FEA] it would be nice if we support Some(org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe) serde" to "[FEA] GpuInsertIntoHiveTable supports parquet format" on Jan 24, 2024
firestarman (Collaborator) commented:

The customer also asks for bucketing support, which is tracked by #10366.
