Hello, question for the audience. How should we handle the calculation of fields in a dataset map function? Do we need a manual A -> B class mapping, or can we use something generic like mapping Row -> Row and then to<B>() to handle the field mappings? I can see it becoming tedious to pass in every single property from A -> B; some of the classes I need to process have 20+ properties.
For example, I would like to accept fields as String?, but in later data frames I want to convert them to Int?. Maybe it would look something like what is below. I know that this example is small, but keep in mind I don't want to pass all 20+ properties to another class constructor. Maybe in the example below we should use sealed classes for possible errors? Not too opinionated about how this should be handled. I know the following code cannot work since RDDs are immutable, but it would be nice to have some kind of convenience like this to work around that.
data class Client(val age: String?)
data class ClientCalculated(val age: Int?, val errorMessage: String?)

fun litNullAsString() = functions.lit(null).cast(DataTypes.StringType)

listOf(Client("30"), Client("thirty"))
    .toDS()
    .withColumn("errorMessage", litNullAsString())
    .map {
        val age = it.getAs<String?>("age")?.toIntOrNull()
        it["age"] = age
        it["errorMessage"] = if (age == null) {
            "age is invalid"
        } else {
            null
        }
        it
    }
    .to<ClientCalculated>()
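To clarify the sealed-class idea for possible errors, I imagine something roughly like the sketch below, outside of Spark. The names AgeResult and parseAge are made up, and I'm not sure Spark encoders could handle a sealed hierarchy directly, so it would probably only live inside the map.

sealed class AgeResult {
    data class Valid(val age: Int) : AgeResult()
    data class Invalid(val raw: String?, val reason: String) : AgeResult()
}

// parse the raw string into either a valid age or an error description
fun parseAge(raw: String?): AgeResult =
    raw?.toIntOrNull()
        ?.let { AgeResult.Valid(it) }
        ?: AgeResult.Invalid(raw, "age is invalid")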
How would you convert a data class to another data class outside of Spark? I think manually passing the 20+ properties ensures you always have the correct names and types. Of course, you could provide a helper as a factory function on the type you want to convert to, which is probably how I would handle it:
data class Client(val extraProp1: Int, val extraProp2: Double, val age: String?)

data class ClientCalculated(val extraProp1: Int, val extraProp2: Double, val age: Int?, val errorMessage: String?) {
    companion object {
        fun fromClient(client: Client): ClientCalculated {
            val age = client.age?.toIntOrNull()
            val errorMessage = if (age == null) "Age is not a number" else null
            return ClientCalculated(client.extraProp1, client.extraProp2, age, errorMessage)
        }
    }
}

withSpark {
    listOf(Client(1, 1.0, "30"), Client(2, 2.0, "thirty"))
        .toDS()
        .map { ClientCalculated.fromClient(it) }
}
Not sure how sealed classes would help here. Also, creating a new Row with named fields is unfortunately not possible in Spark as far as I know. Curious what other people think too!
One more idea: at JetBrains, in our Spark workflows, we usually use raw data frames and only require functions to accept and return typed datasets. Just letting you know that you can do the same; this way, in many parts of your workload, you could skip the tedious mapping work :)
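As a rough sketch of that pattern (reusing the Client(age: String?) and ClientCalculated(age: Int?, errorMessage: String?) from the first post; the column expressions are from memory, so treat them as approximate):

withSpark {
    // do the bulk of the work on an untyped DataFrame / Dataset<Row>
    val raw = listOf(Client("30"), Client("thirty"))
        .toDS()
        .withColumn("age", functions.col("age").cast(DataTypes.IntegerType))
        .withColumn(
            "errorMessage",
            functions.`when`(functions.col("age").isNull(), "age is invalid")
        )

    // only convert back to a typed dataset at the function boundary
    val typed = raw.to<ClientCalculated>()
    typed.show()
}

The cast to IntegerType yields null for non-numeric strings like "thirty", so the when(...) column flags exactly those rows.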