Write protocol should be Apache Arrow #49

faucct · 2024-11-12T11:09:49Z

Summary:

YTGetters<Row, List, Dict> is a new interface that is going to be moved into the YT client. It works as a typeclass: implementations of its inner abstract classes adapt rows to the YT types. The typeclass approach did not work out well, so I will probably rewrite it by pushing the generic parameters down into each abstract class and making those classes static.
ArrowTableRowsSerializer writes YT data using the Apache Arrow protocol. It reads the data through the YTGetters. It is also supposed to be moved into the YT client at some point.
TableWriterBaseImpl is copied from the YT client to shadow the original class and is patched with the new functionality. It is also supposed to be moved into the YT client.
The YtLogicalType interfaces are extended with methods that provide YTGetters implementations mapping Spark data to YT. This part is supposed to stay in our repository.
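
For readers unfamiliar with the getter idea, here is a minimal Java sketch of the direction described above: each nested getter interface carries its own type parameter, and the writer only talks to getters, never to the row representation. It is an illustration under assumptions, not the YT-client API. `GettersSketch`, `FromStructToLong`, `FromStructToString`, `typeName`, `describe` and `demo` are hypothetical names loosely modeled on the `FromStruct`/`FromList` naming visible in this PR, and the row is a plain `Map` rather than a Spark `InternalRow`.

```java
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for the getters idea; not the real YTGetters.
final class GettersSketch {
    private GettersSketch() {}

    // Base getter over a struct-like row; nested interfaces are implicitly static.
    interface FromStruct<Row> {
        String typeName();
    }

    // Each getter kind declares its own type parameter instead of inheriting it
    // from an enclosing generic class.
    interface FromStructToLong<Row> extends FromStruct<Row> {
        long getLong(Row row);
        @Override default String typeName() { return "int64"; }
    }

    interface FromStructToString<Row> extends FromStruct<Row> {
        String getString(Row row);
        @Override default String typeName() { return "string"; }
    }

    // A toy "serializer": it dispatches on the getter kind and never inspects
    // the row type itself.
    static <Row> String describe(Row row, List<FromStruct<Row>> getters) {
        StringBuilder sb = new StringBuilder();
        for (FromStruct<Row> getter : getters) {
            if (sb.length() > 0) sb.append(", ");
            sb.append(getter.typeName()).append('=');
            if (getter instanceof FromStructToLong) {
                sb.append(((FromStructToLong<Row>) getter).getLong(row));
            } else if (getter instanceof FromStructToString) {
                sb.append(((FromStructToString<Row>) getter).getString(row));
            } else {
                throw new IllegalArgumentException("unsupported getter: " + getter.typeName());
            }
        }
        return sb.toString();
    }

    // Usage: adapt a Map-based row (an assumption made for this sketch).
    static String demo() {
        Map<String, Object> row = Map.of("id", 7L, "name", "a");
        FromStructToLong<Map<String, Object>> id = r -> (Long) r.get("id");
        FromStructToString<Map<String, Object>> name = r -> (String) r.get("name");
        List<FromStruct<Map<String, Object>>> getters = List.of(id, name);
        return describe(row, getters);
    }
}
```

Supporting another row representation in this sketch only requires new getter implementations; `describe` stays unchanged.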

All tests in ComplexTypeV3Test pass, except the ones about nested dates. Those have to wait for a fix in YT storage: https://github.com/ytsaurus/ytsaurus/pull/942/files

alextokarew · 2024-11-14T05:43:58Z

data-source/src/main/scala/tech/ytsaurus/client/ArrowTableRowsSerializer.java

@@ -0,0 +1,1171 @@
+package tech.ytsaurus.client;


Will this class be moved to ytsaurus-client? If not, it should be moved to data-source/src/main/java.

faucct · 2024-11-19

It will be moved, yes.

alextokarew · 2024-11-14T05:48:00Z

data-source/src/main/scala/tech/ytsaurus/spyt/serializers/YtLogicalType.scala

 import tech.ytsaurus.core.tables.ColumnValueType
 import tech.ytsaurus.spyt.serializers.SchemaConverter.MetadataFields
+import tech.ytsaurus.spyt.serializers.YsonRowConverter.{isNull, serializeValue}
+import tech.ytsaurus.spyt.serializers.YtLogicalType.Binary.tiType


Many separate imports of tiType can be confusing. I think we can import tech.ytsaurus.spyt.serializers.YtLogicalType._ and refer to tiType qualified by its type.

alextokarew · 2024-11-14T05:55:39Z

data-source/src/main/scala/tech/ytsaurus/spyt/serializers/InternalRowSerializer.scala

 import scala.concurrent.duration.Duration
 import scala.concurrent.{ExecutionContext, Future}

-class InternalRowSerializer(schema: StructType, writeSchemaConverter: WriteSchemaConverter) extends WireRowSerializer[InternalRow] with LogLazy {


I think we should keep the old implementation and add a configuration flag that switches between it and the new write implementation, to preserve backward compatibility. Moreover, I think that in the next release the old implementation should be used by default, and the Arrow implementation should be considered experimental and turned on explicitly.
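
A minimal Java sketch of such a switch, under stated assumptions: the option name `arrow_write_enabled` is borrowed from a commit title in this PR, but its full key path and the way SPYT reads configuration are not shown here, so `SerializerFactory`, `RowSerializer`, `WireSerializer` and `ArrowSerializer` are hypothetical stand-ins rather than the real classes.

```java
import java.util.Map;

// Hypothetical common interface for both write paths.
interface RowSerializer {
    String protocol();
}

// Stand-in for the existing wire-protocol implementation.
final class WireSerializer implements RowSerializer {
    @Override public String protocol() { return "wire"; }
}

// Stand-in for the new Arrow-based implementation.
final class ArrowSerializer implements RowSerializer {
    @Override public String protocol() { return "arrow"; }
}

final class SerializerFactory {
    // Assumed option name; the real key may have a different prefix.
    static final String ARROW_WRITE_ENABLED = "arrow_write_enabled";

    private SerializerFactory() {}

    // The old implementation is the default; Arrow has to be enabled explicitly.
    static RowSerializer create(Map<String, String> conf) {
        boolean arrow = Boolean.parseBoolean(conf.getOrDefault(ARROW_WRITE_ENABLED, "false"));
        return arrow ? new ArrowSerializer() : new WireSerializer();
    }
}
```

With an empty configuration `SerializerFactory.create` returns the wire serializer, so existing jobs keep their behavior unless the flag is set to `true`.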

alextokarew · 2024-11-14T06:00:52Z

data-source/src/main/scala/tech/ytsaurus/spyt/serializers/YtLogicalType.scala

  import tech.ytsaurus.spyt.types.YTsaurusTypes.instance.sparkTypeFor

-  case object Null extends AtomicYtLogicalType("null", 0x02, ColumnValueType.NULL, TiType.nullType(), NullType)
+  case object Null extends AtomicYtLogicalType("null", 0x02, ColumnValueType.NULL, TiType.nullType(), NullType) {
+    override def ytGettersFromList(ytGetter: InternalRowYTGetters): ytGetter.FromList = new ytGetter.FromListToNull {


Can we have abstract implementations of YTGetters so that we don't need to implement them in every case object? Consider adding extra parameters to AtomicYtLogicalType.
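
One way to read this suggestion, sketched in Java with made-up names: the base type takes its getter as a constructor parameter, so each atomic type becomes a one-line declaration instead of overriding a method. `AtomicTypeSketch`, `Getter` and `AtomicType` are hypothetical, and the `Object[]` row is an assumption standing in for the real row classes; the actual code uses Scala case objects and the YTGetters interfaces.

```java
// Hypothetical illustration of passing the getter as a constructor parameter.
final class AtomicTypeSketch {
    private AtomicTypeSketch() {}

    // Reads the value at the given position of an array-backed row.
    interface Getter {
        Object get(Object[] row, int ordinal);
    }

    // The base class stores the getter, so subtypes do not override anything.
    static class AtomicType {
        final String name;
        final Getter getter;

        AtomicType(String name, Getter getter) {
            this.name = name;
            this.getter = getter;
        }
    }

    // Each type is declared once, together with its getter.
    static final AtomicType NULL = new AtomicType("null", (row, i) -> null);
    static final AtomicType INT64 =
        new AtomicType("int64", (row, i) -> ((Number) row[i]).longValue());
    static final AtomicType STRING =
        new AtomicType("string", (row, i) -> String.valueOf(row[i]));
}
```

The trade-off is that every type's getter lives next to its declaration, at the cost of one more constructor parameter on the base class.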

robot-magpie · 2024-11-14T06:02:46Z

@alextokarew has imported your pull request. If you are a Yandex employee, you can view this diff.

clean up imports arrow_write_enabled config

ArrowTableRowsSerializer

8927ca5

faucct requested a review from alextokarew November 12, 2024 11:12

alextokarew approved these changes Nov 14, 2024

faucct added 11 commits November 19, 2024 15:48

PR comments:

161536e

clean up imports arrow_write_enabled config

fix SchemaConverterTest

898a56e

fix YtInputSplitTest

089017c

fix UInt64DecimalTest

3552200

clean-up imports

23a19a6

disable arrow writing by default

56d9770

fix YtFileFormatTest

1e6d021

YTGetters interfaces should be static

dbd6cd9

YtLogicalType FromList and FromStruct

d230beb

ArrowTableRowsSerializer should close stuff

f742569

groom

b221450
