RLI Spark Hudi Error occurs when executing map #10609
Had a discussion with @maheshguptags. The issue can be related either to deserializer configs or to some bug in RLI. He is trying without RLI and will let us know his findings. Thanks a lot for your contribution @maheshguptags.
@ad1happy2go I tried without RLI and it is working fine. However, when I add the RLI config back, the error returns.
Thanks @maheshguptags. As discussed, are you getting the same error with Hudi Streamer?
@ad1happy2go As discussed, I have tried Hudi DeltaStreamer, but unfortunately I could not execute it due to heap space issues, even without sending any data. Command:
Stacktrace for the same:
@maheshguptags I tried to reproduce the issue but couldn't. The following are the artifacts. Kafka-source.props:
Command -
Had a working session with @maheshguptags. We were able to consistently reproduce with a composite key in his setup, although I couldn't reproduce it in mine, so this issue is intermittent. @yihua Can you please check .hoodie (attached) as you requested?
@ad1happy2go and @yihua, any update on this?
Facing the same issue; waiting for updates.
@michael1991 Just to check, are you also using a composite key? Can you post the table configuration?
@ad1happy2go please check below:
@michael1991 the above one is
Thanks for the reminder; I'm using Dataproc 2.1 with Spark 3.3.2 and Hudi 0.14.1.
@maheshguptags I noticed in your timeline there is a multi-writer kind of scenario. We will connect tomorrow to review why that is happening; I was under the impression we were using just one writer.
Sure, let me schedule some time and we will discuss it.
Any conclusion on this issue? I am facing the same issue too.
10:29:32.481 [qtp264384338-719] ERROR org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader - Got exception when reading log file
@bksrepo Which version did you use to load the data? Is it an upgraded table? The original issue here is different from your stack trace. Can you share all the table/writer configs, or a reproducible snippet if possible?
@ad1happy2go I am using Spark 3.4.1 with the Hudi bundle 'hudi-spark3.4-bundle_2.12-0.14.0.jar', Hadoop 3.3.6, and the source database is MySQL 8.0.36. The reported ERROR comes at the time of saving the DataFrame; up to df.show() the code works fine. Thank you for your help.

```python
from pyspark.sql import SparkSession, functions
from pyspark.sql.types import StructType

# SparkSession
spark = SparkSession.builder

# Define MySQL connection properties along with selective columns with a where clause.
mysql_props = {

# Read data from MySQL
df = spark.read.format("jdbc").options(**mysql_props).load()

# Define Hudi table schema to avoid any auto FieldType conversion and casting issues.
hoodie_schema = StructType([

hudi_options = {
}

df.show()

# Write data to Hudi COW table in Parquet format
(df

spark.stop()
```
Hey @bksrepo: can you file a new issue?
And @ad1happy2go: if you encounter any bugs w.r.t. MDT or RLI, do keep me posted.
@nsivabalan We haven't resolved the original issue and it is still open.
@nsivabalan We were not able to reproduce this error in our setup. I went on multiple calls with @maheshguptags and set up the exact same environment locally, but he can consistently reproduce this issue. I also discussed this with @yihua before. Can you or @yihua also review the hoodie.properties (attached here - #10609 (comment)) and see if you have any insights?
I hit the same error when I try to use record indexing:
Are there additional configs/jars that are needed?
Hey @jayakasadev, I've resolved this issue by adding a config on the Spark side.
@michael1991 can you add the value that you pass?
Sure @maheshguptags. Since I'm using GCP Dataproc, I just set:
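For readers landing here: the setting being referred to is the Spark extraClassPath pointing at the Hudi bundle jar. A hedged sketch of what that can look like at Dataproc cluster creation time follows; the /usr/lib/hudi location is an assumption based on the Dataproc Hudi component, not a confirmed path:

```
# Hypothetical cluster-creation flags; verify the Hudi jar location on your image.
gcloud dataproc clusters create my-cluster \
  --optional-components=HUDI \
  --properties='spark:spark.driver.extraClassPath=/usr/lib/hudi/*,spark:spark.executor.extraClassPath=/usr/lib/hudi/*'
```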
Hi @michael1991, thank you for solving this; I can run the DeltaStreamer with RLI now. Out of curiosity, how did you figure out we need to pass the jar in extraClassPath?
@ad1happy2go I will need some help with memory tuning for DeltaStreamer. Please let me know if there is any doc for it.
Hey @maheshguptags, I just got inspired by the GCP Dataproc doc here: https://cloud.google.com/dataproc/docs/concepts/components/hudi
Thank you very much @michael1991!
Another strange thing I noticed in the log is:
Now I'm not sure if this is helping or not.
@maheshguptags ConsumerConfig is not a Spark or Hudi class; it comes from Kafka, right? So these configurations don't apply to Kafka. It looks like a configuration error where Spark configs are being passed into the Kafka consumer.
I am not sure how it is executed/called.
@maheshguptags Then that warning message is produced by Kafka; you can just ignore it.
@michael1991 @maheshguptags Thanks for all the effort on this to find the solution. Do you know how to pass these configs using Dataproc Serverless?
I'm seeing the same issue on EMR 7.10 when enabling the RECORD_INDEX. It seems on EMR it's more painful to add the extraClassPath configuration.
Is there an easy workaround for EMR? We're currently just providing the unversioned jars, which are symlinked on EMR:
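For context, the shape of that is roughly as follows (a sketch; /usr/lib/hudi/hudi-spark-bundle.jar is the unversioned symlink EMR ships, but exact paths can vary by release, and the application jar is a placeholder):

```
# Hypothetical spark-submit using EMR's unversioned Hudi symlink.
spark-submit \
  --jars /usr/lib/hudi/hudi-spark-bundle.jar \
  your-app.jar
```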
@Limess Did you find a way to resolve it in EMR?
No, I disabled record indices for now.
I went with the less pleasant option of adding the extraClassPath as part of the spark-submit step itself rather than at the cluster level, as I couldn't find a way to do the latter. Example for the driver class path (note: I got the default from spark-defaults.conf):
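The elided example presumably has this shape (a sketch, not the poster's exact command; `<default-from-spark-defaults.conf>` stands for the cluster's existing default class path, which must be preserved rather than replaced, and the application jar is a placeholder):

```
# Hypothetical EMR step: append the Hudi bundle to the default class paths.
spark-submit \
  --conf "spark.driver.extraClassPath=<default-from-spark-defaults.conf>:/usr/lib/hudi/hudi-spark-bundle.jar" \
  --conf "spark.executor.extraClassPath=<default-from-spark-defaults.conf>:/usr/lib/hudi/hudi-spark-bundle.jar" \
  your-app.jar
```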
I can confirm the same workaround fixes this on EMR (with a slightly different config; I assume the default varies slightly by EMR version).
Greetings @Limess @subash-metica, can you please share the exact command you used when executing the EMR step? I'm not sure if I am missing something in the step command or if this fix is not working for me.
Adding the following to spark.driver.extraClassPath
and spark.executor.extraClassPath:
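The elided values presumably resemble the following spark-defaults entries (a sketch, not the poster's exact configuration; append the Hudi symlink to the existing defaults rather than replacing them):

```
# Hypothetical spark-defaults.conf entries for a cluster-level fix.
spark.driver.extraClassPath    <existing-default>:/usr/lib/hudi/hudi-spark-bundle.jar
spark.executor.extraClassPath  <existing-default>:/usr/lib/hudi/hudi-spark-bundle.jar
```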
Thanks, this worked for me too, though I'm still facing an issue. I'm also overriding the method; my full command is:
Certainly the class not found is the class which is extending the
I tried with
Added to the troubleshooting guide - #11716.
I am trying to ingest data using Spark + Kafka streaming into a Hudi table with the RLI index, but unfortunately ingesting even 5-10 records throws the below issue.
Steps to reproduce the behavior:
Expected behavior
It should work end to end with the RLI index enabled.
Environment Description
Hudi version : 0.14
Spark version : 3.4.0
Hive version : NA
Hadoop version : 3.3.4
Storage (HDFS/S3/GCS..) : S3
Running on Docker? (yes/no) : Yes
Additional context
Hudi Configuration:
```scala
val hudiOptions = Map(
  "hoodie.table.name" -> "customer_profile",
  "hoodie.datasource.write.recordkey.field" -> "x,y",
  "hoodie.datasource.write.partitionpath.field" -> "x",
  "hoodie.datasource.write.precombine.field" -> "ts",
  "hoodie.table.type" -> "COPY_ON_WRITE",
  "hoodie.clean.max.commits" -> "6",
  "hoodie.clean.trigger.strategy" -> "NUM_COMMITS",
  "hoodie.cleaner.commits.retained" -> "4",
  "hoodie.cleaner.parallelism" -> "50",
  "hoodie.clean.automatic" -> "true",
  "hoodie.clean.async" -> "true",
  "hoodie.parquet.compression.codec" -> "snappy",
  "hoodie.index.type" -> "RECORD_INDEX",
  "hoodie.metadata.record.index.enable" -> "true",
  "hoodie.metadata.record.index.min.filegroup.count" -> "20", // in trial
  "hoodie.metadata.record.index.max.filegroup.count" -> "5000"
)
```
Stacktrace
Spark UI log