
# Spark Adaptive File Connector

The library provides a Spark source for efficiently streaming files from an object store when they are saved under dynamically changing paths, e.g. `s3://my_bucket/my-data/2022/06/09/`, which can be described by the pattern `s3://my_bucket/my-data/{YYYY}/{MM}/{DD}/`.

## Usage

### Dependency

The library is available on Maven Central. If you use sbt, you can add it to your `build.sbt`:

```scala
libraryDependencies += "ai.hunters" %% "spark-adaptive-file-connector" % "1.0.0"
```

For Maven:

```xml
<dependency>
    <groupId>ai.hunters</groupId>
    <artifactId>spark-adaptive-file-connector_2.12</artifactId>
    <version>1.0.0</version>
</dependency>
```

or use the same coordinates with any other dependency manager.

### Spark Application

This is the simplest example of using the connector in a Scala application:

```scala
val readFileJsonStream = spark.readStream
  .format("dynamic-paths-file")
  .option("fileFormat", "json")
  .schema(yourSchema)
  .load("s3://my-bucket/prefix/{YYYY}/{MM}/{DD}")
```

## Build

```
sbt "clean;compile"
```

## Testing

```
sbt "clean;test"
```

## Tutorial - quick run

The easiest way to start playing with this library is the integration test `DynamicPathsFileStreamSourceProviderITest`; you can run it on its own with sbt, as shown after the list below. It shows how to:

- pass parameters to Spark so that this source connector can be used
- check which paths have been read, and how fast
- recover from a checkpoint
- see the output
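To run only that test, the standard sbt test filter should work (assuming the test sits in the default `Test` configuration):

```
sbt "testOnly *DynamicPathsFileStreamSourceProviderITest"
```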

When you use this library in your application, remember to register `DynamicPathsFileStreamSourceProvider` through Spark's `DataSourceRegister` service loader. You can see an example of what was needed for the tests in `src/test/resources/META-INF/services/org.apache.spark.sql.sources.DataSourceRegister`.
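For reference, such a service-loader file is plain text listing one fully qualified provider class name per line. A sketch of its contents, with the package name assumed (check the test resources above for the real one):

```
# assumed package -- see the test resources for the actual fully qualified name
ai.hunters.spark.DynamicPathsFileStreamSourceProvider
```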

## Video

An Advanced S3 Connector for Spark to Hunt for Cyber Attacks (Data&AI Summit, San Francisco 2022)