[SEDONA-723] Add write format for (Geo)Arrow #1863
Conversation
import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.catalyst.encoders.AgnosticEncoder;
import org.apache.spark.sql.catalyst.encoders.RowEncoder;
import org.apache.spark.sql.connect.client.arrow.ArrowSerializer;
I can't seem to change this to org.apache.sedona.arrow.ArrowSerializer. Do I need to do anything special to make Scala classes accessible to Java here? (Or should the whole thing be done in Scala?)
It is OK to use Scala classes in Java code within the same project. However, it seems that javadoc runs before the Scala compilation finishes, which makes the javadoc step fail. We can set failOnError to false, since we are ignoring doclint anyway in the maven-javadoc-plugin. E.g.,
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-javadoc-plugin</artifactId>
  <version>2.10.4</version>
  <executions>
    <execution>
      <id>attach-javadocs</id>
      <goals>
        <goal>jar</goal>
      </goals>
    </execution>
  </executions>
  <configuration>
    <additionalparam>-Xdoclint:none</additionalparam>
    <failOnError>false</failOnError>
  </configuration>
</plugin>
Also, some of the classes created in this PR use Spark classes (e.g., SparkIntervalUtils) introduced in 3.5, so the PR won't build on 3.3 and 3.4 unless that code is moved into version-specific source folders.
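For illustration, version-specific sources are usually wired up with one Maven profile per Spark version; the profile id, plugin choice, and directory layout below are assumptions for this sketch, not Sedona's actual configuration:

<profile>
  <id>spark-3.5</id>
  <build>
    <plugins>
      <plugin>
        <!-- build-helper-maven-plugin grafts an extra source directory onto
             the module, so 3.5-only code stays out of the 3.3/3.4 builds -->
        <groupId>org.codehaus.mojo</groupId>
        <artifactId>build-helper-maven-plugin</artifactId>
        <executions>
          <execution>
            <id>add-spark-3.5-sources</id>
            <phase>generate-sources</phase>
            <goals>
              <goal>add-source</goal>
            </goals>
            <configuration>
              <sources>
                <source>src/main/spark-3.5/scala</source>
              </sources>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</profile>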
spark/common/pom.xml
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-connect-common_${scala.compat.version}</artifactId>
  <version>${spark.version}</version>
</dependency>
We can remove this once the internal ArrowSerializer can be used.
For Spark common, we want to avoid any Java code. In particular, the Arrow data source, if we need to implement it, must be in Scala because it needs to mix with Spark's DataSource V2 API.
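To make that concrete, here is a minimal sketch of what a Scala DataSource V2 skeleton for an "arrows" format might look like; the class names and the unimplemented write path are assumptions for illustration, not code from this PR:

import java.util

import org.apache.spark.sql.connector.catalog.{SupportsWrite, Table, TableCapability, TableProvider}
import org.apache.spark.sql.connector.expressions.Transform
import org.apache.spark.sql.connector.write.{LogicalWriteInfo, WriteBuilder}
import org.apache.spark.sql.sources.DataSourceRegister
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.util.CaseInsensitiveStringMap

// Registered through META-INF/services/org.apache.spark.sql.sources.DataSourceRegister
// so that df.write.format("arrows") resolves to this class.
class ArrowDataSource extends TableProvider with DataSourceRegister {
  override def shortName(): String = "arrows"

  // A write-only source takes its schema from the incoming DataFrame,
  // so there is nothing to infer here.
  override def inferSchema(options: CaseInsensitiveStringMap): StructType = new StructType()

  override def getTable(
      schema: StructType,
      partitioning: Array[Transform],
      properties: util.Map[String, String]): Table = new ArrowTable(schema)
}

class ArrowTable(writeSchema: StructType) extends Table with SupportsWrite {
  override def name(): String = "arrows"
  override def schema(): StructType = writeSchema
  override def capabilities(): util.Set[TableCapability] =
    util.EnumSet.of(TableCapability.BATCH_WRITE)

  // The real implementation would return a WriteBuilder whose BatchWrite
  // streams each partition's InternalRows through the (Geo)Arrow serializer.
  override def newWriteBuilder(info: LogicalWriteInfo): WriteBuilder = ???
}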
Did you read the Contributor Guide?
Is this PR related to a ticket?
[SEDONA-723] Add write format for (Geo)Arrow
What changes were proposed in this PR?
This PR is intended to add df.write.format("arrows") when complete (but is currently just an exploration of this idea).
How was this patch tested?
It will be tested with tests in Java (if this change seems worth it!)
Did this PR include necessary documentation updates?
In SEDONA-660, SEDONA-714, and SEDONA-717, we wired up the ArrowSerializer from Spark Connect to accelerate transfer between the JVM and Python on the driver. For queries whose results are arbitrarily large or unknown at the time the query is issued, this can result in out-of-memory errors, so it would be helpful to have an escape hatch. This is also a useful way for Sedona users to build services on top of Sedona (e.g., by returning the URLs of the written Arrow files, as described in https://arrow.apache.org/blog/2025/01/10/arrow-result-transfer/).
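For context, a sketch of how that escape hatch might look from the user's side once the format is registered; the table, query, and output path are illustrative, not part of this PR:

// Write an arbitrarily large query result to Arrow files on shared storage
// instead of collecting it through the driver, then hand out the location.
val result = sedona.sql("SELECT id, geometry FROM parcels WHERE ST_Area(geometry) > 100")
result.write
  .format("arrows")
  .mode("overwrite")
  .save("s3://my-bucket/results/query-123/")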