3.4+ #218

Jolanrensen · 2024-03-17T15:14:09Z

Fixes #195, which is a fun read if you're interested in the process :)

This is a work-in-progress overhaul of the core parts of the library to support Spark 3.4+.

Why

Too much has changed in Spark 3.4+ due to Spark decoupling their encoding/decoding system with Spark Connect in mind.
Our previous method was hacky and made us publish exact versions of Spark to maintain bytecode-level compatibility.
We too would like Spark Connect support in the future :)
We need to keep supporting newer Spark versions.

What has changed

Removed the :core module entirely. No more spark-package injected code that can break at the bytecode level.
Instead, there is just a :scala-helpers module, which doesn't even depend on Spark at the moment. We only need it for the VarargUnwrapper class.
Rewrote Encoding from the ground up, in pure Kotlin this time, using the power of Kotlin reflection. I took inspiration from JavaTypeInference and ScalaReflection, which, since 3.4, build an AgnosticEncoder as a sort-of intermediate step in building an Encoder for the data. This non-implementation-specific encoder can be turned into an actual encoder by passing it to ExpressionEncoder(), or into something entirely different (which is what makes Spark Connect possible).
Our KotlinTypeInference.encoderFor implementation mixes the Java and Scala approaches, supporting Scala and Java lists, primitives, Scala tuples, and, most importantly, Kotlin data classes.
One downside of having to create an AgnosticEncoder is that we are limited to the AgnosticEncoders offered to us by Spark. We cannot write our own (de)serializers anymore if we want to support Spark Connect. So, in order to support data classes, we need to hijack ProductEncoder.
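To make the "type in, encoder description out" idea concrete, here is a minimal, self-contained sketch. It is not the PR's KotlinTypeInference and it does not use Spark: ToyEncoder and toyEncoderFor are hypothetical names standing in for AgnosticEncoder and encoderFor. It only covers Int, String, List and Map, and leaves out the data class branch, which would need full Kotlin reflection.

```kotlin
import kotlin.reflect.KClass
import kotlin.reflect.KType
import kotlin.reflect.typeOf

// Hypothetical stand-ins for an implementation-agnostic encoder description.
// These are NOT Spark classes.
sealed interface ToyEncoder
object ToyInt : ToyEncoder
object ToyString : ToyEncoder
data class ToyList(val element: ToyEncoder) : ToyEncoder
data class ToyMap(val key: ToyEncoder, val value: ToyEncoder) : ToyEncoder

// Walks a KType recursively and builds the matching encoder description.
fun toyEncoderFor(type: KType): ToyEncoder {
    val args = type.arguments.map {
        it.type ?: error("Star projections are not supported: $type")
    }
    return when (type.classifier as? KClass<*>) {
        Int::class -> ToyInt
        String::class -> ToyString
        List::class -> ToyList(toyEncoderFor(args[0]))
        Map::class -> ToyMap(toyEncoderFor(args[0]), toyEncoderFor(args[1]))
        else -> error("No toy encoder for $type")
    }
}

// Reified entry point, so callers can write toyEncoderFor<List<Int>>().
inline fun <reified T> toyEncoderFor(): ToyEncoder = toyEncoderFor(typeOf<T>())

fun main() {
    println(toyEncoderFor<Map<String, List<Int>>>())
}
```

The result is plain data describing the shape of the type. A separate step could turn that description into a concrete (de)serializer, or into something else entirely.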
Deserializing data classes using ProductEncoder works fine, but for serializing we hit a snag. In Scala, case classes have a function with the same name as each property. This assumption is used under the hood, so we need to make sure those functions exist in our data classes.
Plus, later I found this function to do an actual instance check to see if the value is a scala.Product... It's compiler plugin time!
I created a Kotlin compiler plugin which, when applied to your project, can convert:

@Sparkify
data class User(
    val name: String,
    @ColumnName("test") val age: Int,
)

to

@Sparkify
data class User(
    @get:JvmName("name") val name: String,
    @get:JvmName("test") @ColumnName("test") val age: Int,
): scala.Product, Serializable {
  override fun canEqual(that: Any?): Boolean = that is User
  override fun productArity(): Int = 2
  override fun productElement(n: Int): Any? =
    if (n == 0) this.name
    else if (n == 1) this.age
    else throw IndexOutOfBoundsException()
}

satisfying both needs from Spark. One downside of this approach is that now you need to annotate each data class you want to encode with @Sparkify (else the column names will be getName and getAge). And you cannot annotate external data classes like Pair :/ So I recommend working with tuples from now on (or make your own @Sparkify Pair).
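The accessor naming difference can be seen with plain Java reflection, without Spark or the compiler plugin. This is a small sketch in which the @get:JvmName annotations are written by hand to mimic what the plugin is described as adding. PlainUser, RenamedUser and accessorNames are names made up for this illustration.

```kotlin
// A regular data class: the JVM getters follow the Java bean convention.
data class PlainUser(val name: String, val age: Int)

// The same class with getters renamed by hand, mimicking the plugin's output.
data class RenamedUser(
    @get:JvmName("name") val name: String,
    @get:JvmName("age") val age: Int,
)

// Collects the zero-argument property accessors declared on the class,
// skipping the other zero-argument members a data class generates.
fun accessorNames(clazz: Class<*>): Set<String> =
    clazz.declaredMethods
        .filter { it.parameterCount == 0 }
        .map { it.name }
        .filter { !it.startsWith("component") && it != "toString" && it != "hashCode" }
        .toSet()

fun main() {
    println(accessorNames(PlainUser::class.java))   // getName and getAge
    println(accessorNames(RenamedUser::class.java)) // name and age
}
```

Anything that looks up a method named after the property, as Scala case classes allow, finds it only on the second class.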

The compiler plugin (:compiler-plugin) will be applied to your Gradle project through the Gradle plugin (:gradle-plugin) with id("org.jetbrains.kotlinx.spark.api") version X, or in Maven with the <compilerPlugins> tag (probably).
The :kotlin-spark-api and :examples modules also depend on these two plugins for their tests. This is done with a gradle trick that updates bootstrap jars and adds them to the classpath/repository.
Updated to Kotlin 2.0 Beta 5. You should still be able to use 1.9.23 with the compiler plugin, since it just uses IR. It does not require K2.
For Kotlin 2.0, just make sure you set freeCompilerArgs.add("-Xlambdas=class"), since Spark cannot serialize lambdas otherwise. If you use the Gradle plugin, this is done for you.
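If you are not using the Gradle plugin, setting the flag by hand in a build.gradle.kts could look roughly like the fragment below. It is a sketch that assumes the standard Kotlin Gradle plugin DSL; the Kotlin version shown is the one mentioned above, and the plugin id line stays commented out because its version is still the placeholder X.

```kotlin
// build.gradle.kts
import org.jetbrains.kotlin.gradle.tasks.KotlinCompile

plugins {
    kotlin("jvm") version "2.0.0-Beta5"
    // Alternative: apply the Gradle plugin from this PR, which sets the flag for you.
    // id("org.jetbrains.kotlinx.spark.api") version "X"
}

// Compile lambdas to classes so Spark can serialize them.
tasks.withType<KotlinCompile>().configureEach {
    compilerOptions {
        freeCompilerArgs.add("-Xlambdas=class")
    }
}
```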

TODO

Provide warnings for non-Sparkified classes, especially for Pair/Triple
Java bean as fallback encoder
Jupyter support
Finalize Jupyter support
UDTs for non-generic Kotlin types like Instant, LocalDateTime etc.
Spark Connect
Docs
Fix RddTest "Work with any number"
Remove streaming in favor of structured streaming, update examples


Jolanrensen · 2024-04-07T13:56:04Z

Added encoding for KotlinX: DatePeriod, DateTimePeriod, Instant, LocalDateTime, and LocalDate. kotlin.time.Duration is sadly not working, as it's a value class. (I think that's the reason.)
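One way to see what the value class does at the JVM level is to check how a Duration property is actually stored. This is only an illustration of the suspected cause, not a confirmed diagnosis. Event is a made-up class for this sketch.

```kotlin
import kotlin.time.Duration

// Hypothetical data class with a Duration property.
data class Event(val label: String, val took: Duration)

fun main() {
    // Duration is an inline value class wrapping a Long, so a non-null
    // Duration property is stored unboxed.
    val field = Event::class.java.getDeclaredField("took")
    println(field.type)
}
```

The backing field is a primitive long rather than a Duration object, so anything that inspects the class at the JVM level does not see a Duration there.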


Jolanrensen added 5 commits March 17, 2024 16:07

updating version range to 3.4.2 and 3.5.1

cda9548

(re)moving some stuff from the decoupled :core module

42ba339

adding initial new KotlinTypeInference Encoder generator, this time purely in Kotlin. Not complete yet

4896354

updating all references to the old encoder<>() function in favor of the new kotlinEncoderFor<>()

e234f40

updating tests tests

66a42ac

Jolanrensen added the enhancement New feature or request label Mar 17, 2024

Jolanrensen mentioned this pull request Mar 17, 2024

Spark 3.4+ / -Connect support #195

Open

Jolanrensen added 4 commits March 17, 2024 16:36

made name hack optional as it is not waterproof

3e9261f

added tests and fixed name hack

0d061b6

Map support

0c8f4b1

disable name hack by default again, added JCP case for auto-applying the expression encoder without spark-connect

9149607

Jolanrensen force-pushed the 3.4+ branch from a27a789 to 9149607 Compare March 18, 2024 12:48

Jolanrensen added 13 commits March 18, 2024 15:32

udt fix

4364022

enabled core as scala-helpers with VarargUnwrapper. Removed name hack in favor of upcoming IR compiler plugin

2c875ff

enabled core as scala-helpers with VarargUnwrapper. Removed name hack in favor of upcoming IR compiler plugin. Removed spark dependency in scala-helpers

d60e4dc

updating some build parameters

4f8d874

bumped to kotlin 2.0.0-Beta5, included the compiler plugin with tests as separate module

2a878f1

Added gradle plugin and fixed package naming errors

c0a3140

enabling the compiler plugin on modules, sparkifying data classes

1b0b316

bumping down to kotlin 1.9.23 to make lambda's serializable again

b7c1711

fixing tests

7069a9a

fixing more tests. Can now remain at Kotlin 2.0 if we set -Xlambdas=class, which can be done with gradle plugin

df021c0

added conversion for @sparkify'ed classes to scala.Product with tests.

4c17859

started WIP fir plugin, disabled for now

2f62d07

disabled RddTest "Work with any number" for now: Caused by: java.lang.NoSuchMethodException: org.jetbrains.kotlinx.spark.api.RddTest$1$1$1$2$2$1.$deserializeLambda$(java.lang.invoke.SerializedLambda)

770698b

Jolanrensen changed the base branch from release to main March 24, 2024 16:10

Jolanrensen and others added 2 commits March 25, 2024 12:22

Merge branch 'main' into 3.4+

5d5fe84

added publishToMavenLocal for the plugins in gh actions build.yml

24952b7

Jolanrensen force-pushed the 3.4+ branch from 76f5b05 to 6e0a33d Compare March 25, 2024 11:54

Merge remote-tracking branch 'origin/3.4+' into 3.4+

ffedb29

Jolanrensen force-pushed the 3.4+ branch from 6e0a33d to ffedb29 Compare March 25, 2024 11:59

Jolanrensen added 4 commits March 25, 2024 21:02

editing build so that a) it can build gradle-plugin and compiler-plugin with bootstrap jars without mavenLocal. b) bootstrap jars are updated before the actual build

6b63c96

fixing compiler-plugin tests

4e5438b

fixed conversions for scala 2.12

723152b

small fixes, disabled jupyter module for now

0db290a

Jolanrensen force-pushed the 3.4+ branch from ffe14ef to 0db290a Compare March 25, 2024 22:06

Jolanrensen added 6 commits March 25, 2024 23:20

spark 3.4/3.5 compatibility issue with ProductEncoder

37eec64

fixed streaming test scala 2.12

0ab212b

added checkIsSparkified warnings when building an encoder

1c34429

updated compiler plugin to make @sparkify classes Serializable too, plus added test to see if the plugin is enabled in the project

68e830a

added java bean class fallback support

48db819

added encoding for DatePeriod, DateTimePeriod, Instant, LocalDateTime, and LocalDate, Duration not working

ab4c455

Jolanrensen added 4 commits April 10, 2024 12:30

kotlin 2.0.0-RC1, enabled jupyter module. Added basic Sparkify conversion for jupyter. Disabled html renderes in favor of just outputting them as text. Notebooks can render them however they like. RDDs are converted to ds before rendering

e05feac

make render in jupyter use show instead

7fd77be

disabled jupyter tests relying on DISPLAY(), update jupyter to latest java 8 version

92b699d

disabled unsupported rddtests

eae0196

Jolanrensen mentioned this pull request Jun 10, 2024

The future of this project #220

Open

added spark connect example which does not yet work with kotlin-spark-api

22fa5ae
