[SEDONA-723] Add write format for (Geo)Arrow #1863
Conversation
import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.catalyst.encoders.AgnosticEncoder;
import org.apache.spark.sql.catalyst.encoders.RowEncoder;
import org.apache.spark.sql.connect.client.arrow.ArrowSerializer;
I can't seem to change this to org.apache.sedona.arrow.ArrowSerializer. Do I need to do anything special to make Scala classes accessible to Java here? (Or should the whole thing be done in Scala?)
It is OK to use Scala classes in Java code within the same project. However, it seems that javadoc runs before the Scala compilation finishes, which makes the javadoc step fail. We can set failOnError to false, since we are ignoring doclint anyway in the maven-javadoc-plugin. E.g.,
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-javadoc-plugin</artifactId>
  <version>2.10.4</version>
  <executions>
    <execution>
      <id>attach-javadocs</id>
      <goals>
        <goal>jar</goal>
      </goals>
    </execution>
  </executions>
  <configuration>
    <additionalparam>-Xdoclint:none</additionalparam>
    <failOnError>false</failOnError>
  </configuration>
</plugin>
Also, some of the classes created in this PR use Spark classes (e.g., SparkIntervalUtils) introduced in 3.5, so the PR won't build on 3.3 and 3.4 unless that code is moved into version-specific source folders.
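For illustration, version-specific sources are usually wired up with one Maven profile per Spark version; the profile id, plugin choice, and directory layout below are assumptions for this sketch, not Sedona's actual configuration:

<profile>
  <id>spark-3.5</id>
  <build>
    <plugins>
      <plugin>
        <!-- build-helper-maven-plugin grafts an extra source directory onto
             the module, so 3.5-only code stays out of the 3.3/3.4 builds -->
        <groupId>org.codehaus.mojo</groupId>
        <artifactId>build-helper-maven-plugin</artifactId>
        <executions>
          <execution>
            <id>add-spark-3.5-sources</id>
            <phase>generate-sources</phase>
            <goals>
              <goal>add-source</goal>
            </goals>
            <configuration>
              <sources>
                <source>src/main/spark-3.5/scala</source>
              </sources>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</profile>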
spark/common/pom.xml
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-connect-common_${scala.compat.version}</artifactId>
  <version>${spark.version}</version>
</dependency>
We can remove this once the internal ArrowSerializer can be used.
For Spark common, we want to avoid any Java code. In particular, the Arrow data source, if we need to implement it, must be in Scala because it needs to mix with Spark's DataSource V2 API.
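To make that concrete, here is a minimal sketch of what a Scala DataSource V2 skeleton for an "arrows" format might look like; the class names and the unimplemented write path are assumptions for illustration, not code from this PR:

import java.util

import org.apache.spark.sql.connector.catalog.{SupportsWrite, Table, TableCapability, TableProvider}
import org.apache.spark.sql.connector.expressions.Transform
import org.apache.spark.sql.connector.write.{LogicalWriteInfo, WriteBuilder}
import org.apache.spark.sql.sources.DataSourceRegister
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.util.CaseInsensitiveStringMap

// Registered through META-INF/services/org.apache.spark.sql.sources.DataSourceRegister
// so that df.write.format("arrows") resolves to this class.
class ArrowDataSource extends TableProvider with DataSourceRegister {
  override def shortName(): String = "arrows"

  // A write-only source takes its schema from the incoming DataFrame,
  // so there is nothing to infer here.
  override def inferSchema(options: CaseInsensitiveStringMap): StructType = new StructType()

  override def getTable(
      schema: StructType,
      partitioning: Array[Transform],
      properties: util.Map[String, String]): Table = new ArrowTable(schema)
}

class ArrowTable(writeSchema: StructType) extends Table with SupportsWrite {
  override def name(): String = "arrows"
  override def schema(): StructType = writeSchema
  override def capabilities(): util.Set[TableCapability] =
    util.EnumSet.of(TableCapability.BATCH_WRITE)

  // The real implementation would return a WriteBuilder whose BatchWrite
  // streams each partition's InternalRows through the (Geo)Arrow serializer.
  override def newWriteBuilder(info: LogicalWriteInfo): WriteBuilder = ???
}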
Did you read the Contributor Guide?
Is this PR related to a ticket?
[SEDONA-723] Add write format for (Geo)Arrow
What changes were proposed in this PR?
This PR is intended to add df.write.format("arrows") when complete (but is currently just an exploration of this idea).
How was this patch tested?
It will be tested with tests in Java (if this change seems worth it!)
Did this PR include necessary documentation updates?
In SEDONA-660, SEDONA-714, and SEDONA-717, we wired up the ArrowSerializer from Spark Connect to accelerate transfer between the JVM and Python on the driver. For queries whose results are arbitrarily large or unknown at the time the query is issued, this can result in out-of-memory errors, so it would be helpful to have an escape hatch. This is also a useful way for Sedona users to build services on top of Sedona (e.g., by returning the URLs of the written Arrow files, as described in https://arrow.apache.org/blog/2025/01/10/arrow-result-transfer/).
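For context, a sketch of how that escape hatch might look from the user's side once the format is registered; the table, query, and output path are illustrative, not part of this PR:

// Write an arbitrarily large query result to Arrow files on shared storage
// instead of collecting it through the driver, then hand out the location.
val result = sedona.sql("SELECT id, geometry FROM parcels WHERE ST_Area(geometry) > 100")
result.write
  .format("arrows")
  .mode("overwrite")
  .save("s3://my-bucket/results/query-123/")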