Post-Processing for Datalake Cloud Source Connectors (#161)
After processing files from the S3/GCP Storage source, this enables the feature of deleting or moving the files after they've been committed.

# New KCQL Configuration Options for Datalake Cloud Connectors

The following configuration options introduce post-processing capabilities for the AWS S3, GCP Storage, and (coming soon) Azure Datalake Gen 2 **source connectors**. These options allow the connector to manage source files after they are successfully processed, either by deleting the file or moving it to a new location in cloud storage.

In Kafka Connect, post-processing is triggered when the framework calls the `commitRecord` method after a source record is successfully processed. The configured action then determines how the source file is handled.

If no `post.process.action` is configured, **no post-processing will occur**, and the file will remain in its original location.

---

## KCQL Configuration Options

### 1. `post.process.action`

- **Description**: Defines the action to perform on a file after it has been processed.
- **Options**:
  - `DELETE` – Removes the file after processing.
  - `MOVE` – Relocates the file to a new location after processing.

### 2. `post.process.action.bucket`

- **Description**: Specifies the target bucket for files when using the `MOVE` action.
- **Applicability**: Only applies to the `MOVE` action.
- **Notes**: This field is **mandatory** when `post.process.action` is set to `MOVE`.

### 3. `post.process.action.prefix`

- **Description**: Specifies a new prefix to replace the existing one for the file’s location when using the `MOVE` action. The file's path will remain unchanged except for the prefix.
- **Applicability**: Only applies to the `MOVE` action.
- **Notes**: This field is **mandatory** when `post.process.action` is set to `MOVE`.

---

## Key Use Cases

- **DELETE**: Automatically removes source files to free up storage space and prevent redundant data from remaining in the bucket.
- **MOVE**: Organizes processed source files by relocating them to a different bucket or prefix, which is useful for archiving, categorizing, or preparing files for other workflows.

---

## Examples

### Example 1: Deleting Files After Processing

To configure the source connector to delete files after processing, use the following KCQL:

```kcql
INSERT INTO `my-bucket`
SELECT * FROM `my-topic`
PROPERTIES (
    'post.process.action'=`DELETE`
)
```

### Example 2: Moving Files After Processing

To configure the source connector to move files to a different bucket and prefix, use the following KCQL:

```kcql
INSERT INTO `my-bucket:archive/`
SELECT * FROM `my-topic`
PROPERTIES (
    'post.process.action'=`MOVE`,
    'post.process.action.bucket'=`archive-bucket`,
    'post.process.action.prefix'=`archive/`
)
```

In this example:

* The file is moved to `archive-bucket`.
* The prefix `archive/` is applied to the file’s path while keeping the rest of the path unchanged.

## Important Considerations

* Both `post.process.action.bucket` and `post.process.action.prefix` are mandatory when using the `MOVE` action.
* For the `DELETE` action, no additional configuration is required.
* If no `post.process.action` is configured, no post-processing will be applied, and the file will remain in its original location.

* Configuration for Burn-After-Reading
* Implementing actions and storage interfaces. Needs add tests. The file move logic needs testing where it resolves the path - is this even the best configuration?
* Storage interface tests
* Address comment from review referencing this page on moving items in GCP: https://cloud.google.com/storage/docs/samples/storage-move-file
* Adding temporary logging, fixing a bug with the Map equality not enabling prefixes to map to each other
* Fix Move action
* Fix prefix replace behaviour
* Changes to ensure error handling approach is correct
* Review fixes - remove S3 references
* Avoid variable shadowing
* Avoid variable shadowing
* add documentation
* CopyObjectResponse
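The rules stated in the commit message (no action means no post-processing, `DELETE` needs nothing else, `MOVE` requires both a bucket and a prefix, and the prefix is swapped while the rest of the path stays the same) could be modelled roughly as follows. This is only a sketch under stated assumptions: `PostProcessAction`, `Delete`, `Move`, `PostProcessSketch`, `parse` and `targetPath` are hypothetical names made up for illustration and are not the connector's actual classes. Only the three property keys come from the commit message.

```scala
// Hypothetical model of the post-processing options. The type and method
// names are illustrative assumptions, not the connector's real API.
sealed trait PostProcessAction
case object Delete extends PostProcessAction
final case class Move(bucket: String, prefix: String) extends PostProcessAction

object PostProcessSketch {

  private val ActionKey = "post.process.action"
  private val BucketKey = "post.process.action.bucket"
  private val PrefixKey = "post.process.action.prefix"

  /** Reads the KCQL properties. Right(None) means no post-processing is configured. */
  def parse(props: Map[String, String]): Either[String, Option[PostProcessAction]] =
    props.get(ActionKey) match {
      case None           => Right(None)
      case Some("DELETE") => Right(Some(Delete))
      case Some("MOVE") =>
        // Both extra properties are mandatory for MOVE.
        for {
          bucket <- props.get(BucketKey).toRight(s"$BucketKey is mandatory for MOVE")
          prefix <- props.get(PrefixKey).toRight(s"$PrefixKey is mandatory for MOVE")
        } yield Some(Move(bucket, prefix))
      case Some(other) => Left(s"Unsupported $ActionKey: $other")
    }

  /** Replaces the source prefix with the target prefix and keeps the rest of the path. */
  def targetPath(path: String, sourcePrefix: String, move: Move): String =
    move.prefix + path.stripPrefix(sourcePrefix)
}
```

With these assumptions, a file read from `raw/2024/file.json` under the source prefix `raw/` and a `Move("archive-bucket", "archive/")` action would resolve to `archive/2024/file.json` in `archive-bucket`.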
1 parent 4a16747, commit c272cd5
Showing 44 changed files with 1,139 additions and 149 deletions.
105 changes: 105 additions & 0 deletions
...test/scala/io/lenses/streamreactor/connect/aws/s3/storage/AwsS3StorageInterfaceTest.scala
@@ -0,0 +1,105 @@

```scala
/*
 * Copyright 2017-2024 Lenses.io Ltd
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package io.lenses.streamreactor.connect.aws.s3.storage

import io.lenses.streamreactor.connect.cloud.common.config.ConnectorTaskId
import io.lenses.streamreactor.connect.cloud.common.storage.FileMoveError
import org.mockito.ArgumentMatchersSugar
import org.mockito.MockitoSugar
import org.scalatest.EitherValues
import org.scalatest.flatspec.AnyFlatSpecLike
import org.scalatest.matchers.should.Matchers
import software.amazon.awssdk.services.s3.S3Client
import software.amazon.awssdk.services.s3.model.CopyObjectRequest
import software.amazon.awssdk.services.s3.model.CopyObjectResponse
import software.amazon.awssdk.services.s3.model.DeleteObjectRequest
import software.amazon.awssdk.services.s3.model.DeleteObjectResponse
import software.amazon.awssdk.services.s3.model.HeadObjectRequest
import software.amazon.awssdk.services.s3.model.HeadObjectResponse
import software.amazon.awssdk.services.s3.model.NoSuchKeyException

class AwsS3StorageInterfaceTest
    extends AnyFlatSpecLike
    with Matchers
    with MockitoSugar
    with ArgumentMatchersSugar
    with EitherValues {

  "mvFile" should "move a file from one bucket to another successfully" in {
    val s3Client         = mock[S3Client]
    val storageInterface = new AwsS3StorageInterface(mock[ConnectorTaskId], s3Client, batchDelete = false, None)

    val copyObjectResponse: CopyObjectResponse = CopyObjectResponse.builder().build()
    when(s3Client.copyObject(any[CopyObjectRequest])).thenReturn(copyObjectResponse)
    val deleteObjectResponse: DeleteObjectResponse = DeleteObjectResponse.builder().build()
    when(s3Client.deleteObject(any[DeleteObjectRequest])).thenReturn(deleteObjectResponse)

    val result = storageInterface.mvFile("oldBucket", "oldPath", "newBucket", "newPath")

    result shouldBe Right(())
    verify(s3Client).copyObject(any[CopyObjectRequest])
    verify(s3Client).deleteObject(any[DeleteObjectRequest])
  }

  it should "return a FileMoveError if copyObject fails" in {
    val s3Client         = mock[S3Client]
    val storageInterface = new AwsS3StorageInterface(mock[ConnectorTaskId], s3Client, batchDelete = false, None)

    when(s3Client.copyObject(any[CopyObjectRequest])).thenThrow(new RuntimeException("Copy failed"))

    val result = storageInterface.mvFile("oldBucket", "oldPath", "newBucket", "newPath")

    result.isLeft shouldBe true
    result.left.value shouldBe a[FileMoveError]
    verify(s3Client).copyObject(any[CopyObjectRequest])
    verify(s3Client, never).deleteObject(any[DeleteObjectRequest])
  }

  it should "return a FileMoveError if deleteObject fails" in {
    val s3Client         = mock[S3Client]
    val storageInterface = new AwsS3StorageInterface(mock[ConnectorTaskId], s3Client, batchDelete = false, None)

    val headObjectResponse: HeadObjectResponse = HeadObjectResponse.builder().build()
    when(s3Client.headObject(any[HeadObjectRequest])).thenReturn(headObjectResponse)
    val copyObjectResponse: CopyObjectResponse = CopyObjectResponse.builder().build()
    when(s3Client.copyObject(any[CopyObjectRequest])).thenReturn(copyObjectResponse)
    when(s3Client.deleteObject(any[DeleteObjectRequest])).thenThrow(new RuntimeException("Delete failed"))

    val result = storageInterface.mvFile("oldBucket", "oldPath", "newBucket", "newPath")

    result.isLeft shouldBe true
    result.left.value shouldBe a[FileMoveError]
    verify(s3Client).copyObject(any[CopyObjectRequest])
    verify(s3Client).deleteObject(any[DeleteObjectRequest])
  }

  it should "pass if no source object exists" in {
    val s3Client         = mock[S3Client]
    val storageInterface = new AwsS3StorageInterface(mock[ConnectorTaskId], s3Client, batchDelete = false, None)

    when(s3Client.headObject(any[HeadObjectRequest])).thenThrow(NoSuchKeyException.builder().build())
    val copyObjectResponse: CopyObjectResponse = CopyObjectResponse.builder().build()
    when(s3Client.copyObject(any[CopyObjectRequest])).thenReturn(copyObjectResponse)
    when(s3Client.deleteObject(any[DeleteObjectRequest])).thenThrow(new RuntimeException("Delete failed"))

    val result = storageInterface.mvFile("oldBucket", "oldPath", "newBucket", "newPath")

    result.isRight shouldBe true
    verify(s3Client).headObject(any[HeadObjectRequest])
    verify(s3Client, never).copyObject(any[CopyObjectRequest])
    verify(s3Client, never).deleteObject(any[DeleteObjectRequest])
  }
}
```
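The four tests above pin down a contract for `mvFile`: check that the source exists, copy, then delete, and report any failure as a `Left`. The `AwsS3StorageInterface` implementation itself is not shown in this part of the diff, so the following is only a sketch of logic that would satisfy the same contract. `ObjectStore`, `MoveError`, `MoveSketch` and `InMemoryStore` are hypothetical stand-ins, introduced here so the example runs without the AWS SDK.

```scala
import scala.collection.mutable
import scala.util.Try

// Hypothetical stand-ins for the SDK client and the connector's error type.
trait ObjectStore {
  def exists(bucket: String, key: String): Boolean
  def copy(srcBucket: String, srcKey: String, dstBucket: String, dstKey: String): Unit
  def delete(bucket: String, key: String): Unit
}
final case class MoveError(cause: Throwable)

object MoveSketch {

  /** Copy-then-delete move. A missing source is treated as success, as in the last test. */
  def mvFile(
    store:     ObjectStore,
    oldBucket: String,
    oldPath:   String,
    newBucket: String,
    newPath:   String,
  ): Either[MoveError, Unit] =
    Try(store.exists(oldBucket, oldPath)).toEither.left.map(MoveError(_)).flatMap {
      case false => Right(()) // nothing to move, so copy and delete are never called
      case true =>
        Try {
          store.copy(oldBucket, oldPath, newBucket, newPath)
          // Only reached when the copy succeeded, so a failed copy never deletes the source.
          store.delete(oldBucket, oldPath)
        }.toEither.left.map(MoveError(_))
    }
}

// Minimal in-memory store used only to exercise the sketch.
final class InMemoryStore extends ObjectStore {
  private val objects = mutable.Map.empty[(String, String), String]

  def put(bucket: String, key: String, body: String): Unit =
    objects.update((bucket, key), body)

  def exists(bucket: String, key: String): Boolean =
    objects.contains((bucket, key))

  def copy(srcBucket: String, srcKey: String, dstBucket: String, dstKey: String): Unit =
    objects.update((dstBucket, dstKey), objects((srcBucket, srcKey)))

  def delete(bucket: String, key: String): Unit = {
    objects.remove((bucket, key))
    ()
  }
}
```

In this sketch the move is not atomic: if the copy succeeds and the delete fails, the object exists in both locations and the caller receives a `Left`. That corresponds to the "return a FileMoveError if deleteObject fails" case above.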