
The sandboxed testing environment cannot use AWS #113

Closed
julienrf opened this issue Mar 12, 2024 · 7 comments · Fixed by #208
Assignees
Labels
enhancement New feature or request

Comments

@julienrf
Collaborator

julienrf commented Mar 12, 2024

In #107 we introduced a testing infrastructure that allows us to test several migration scenarios. Unfortunately, the streamChanges feature uses the spark-kinesis module under the hood and this module performs calls to the real AWS servers instead of using the containerized service.

Solutions to this problem could be to either fix the code of spark-kinesis so that it stays in the sandbox environment (this is a known issue; see localstack/localstack#677 and https://issues.apache.org/jira/browse/SPARK-27950), or to use something other than spark-kinesis.
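For reference, keeping the stream in the sandbox would mean running a containerized Kinesis (e.g. LocalStack) and having every client target its endpoint. A minimal sketch of such a service, assuming LocalStack's default edge port — this is not this repository's actual compose file:

```yaml
# Sketch only: service name and ports follow LocalStack defaults,
# not this repository's actual docker-compose configuration.
services:
  localstack:
    image: localstack/localstack
    environment:
      - SERVICES=kinesis,dynamodb
    ports:
      - "4566:4566"   # clients must target http://localhost:4566 instead of AWS
```

The blocker described above is that spark-kinesis does not let us point its internal clients at such an endpoint.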

Related:

julienrf added a commit to julienrf/scylla-migrator that referenced this issue Mar 13, 2024
Saving the initial data to the target database works, but saving additional changes fails with an exception “org.apache.hadoop.io.Text is not Serializable”.

I do not fully understand what difference between the initial data and the streamed changes causes the error. The only difference is that the initial data is read by the connector, whereas the streamed changes are created by us. I changed the way we create them to mix in the `Serializable` interface.

This change allowed me to successfully run a data migration with stream changes enabled. Such a scenario cannot be added to our test suite, though, because the KCL only works with the real AWS servers (see scylladb#113).
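The failure can be reproduced in miniature on the JVM (shown here in Java rather than the project's Scala; `PlainText` is a hypothetical stand-in for `org.apache.hadoop.io.Text`): a value whose class does not implement `Serializable` is rejected by Java serialization, while a record that declares it round-trips fine — which is what mixing in the `Serializable` trait achieves in Scala:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializableDemo {
    // Stand-in for org.apache.hadoop.io.Text: a value type that does NOT
    // implement Serializable, so Java serialization rejects it.
    static class PlainText {
        final String value;
        PlainText(String value) { this.value = value; }
    }

    // The records we build ourselves can simply declare Serializable
    // (in Scala this was done by mixing in the Serializable trait).
    static class SerializableText implements Serializable {
        final String value;
        SerializableText(String value) { this.value = value; }
    }

    static boolean canSerialize(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (IOException e) {  // NotSerializableException for PlainText
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(canSerialize(new PlainText("a")));        // false
        System.out.println(canSerialize(new SerializableText("a"))); // true
    }
}
```

This is only an illustration of the mechanism; the actual fix lives in the migrator's Scala code.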
julienrf added a commit to julienrf/scylla-migrator that referenced this issue Mar 13, 2024
The RDD keys are not serializable, which can fail some RDD operations.

We create the RDD element keys _after_ repartitioning to avoid them being serialized across partitions.

This change allowed me to successfully run a data migration with stream changes enabled. Such a scenario cannot be added to our test suite, though, because the KCL only works with the real AWS servers (see scylladb#113).
julienrf added a commit to julienrf/scylla-migrator that referenced this issue Apr 16, 2024
…nt classes.

We adapted the `KinesisReceiver` and its related classes to work with DynamoDB Streams, and we renamed it to `KinesisDynamoDBReceiver`. These classes are based on the code from the original `spark-kinesis-asl` module, with slight modifications based on the following resources:

- https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.KCLAdapter.Walkthrough.CompleteProgram.html
- https://medium.com/@ravi72munde/using-spark-streaming-with-dynamodb-d325b9a73c79
- and our previous fork implementation

As a result, instead of maintaining a complete fork of `spark-kinesis-asl`, we only maintain a copy of the relevant classes, which should result in much faster build times (especially in the CI).

It is still not possible to test the streaming feature locally (thus not in the CI either), see scylladb#113. These changes were tested with my actual AWS account.
julienrf added a commit to julienrf/scylla-migrator that referenced this issue Apr 16, 2024
…nt classes.

Fixes scylladb#119
julienrf added a commit to julienrf/scylla-migrator that referenced this issue Apr 26, 2024
…nt classes.

Fixes scylladb#119
@julienrf julienrf self-assigned this Jun 25, 2024
@julienrf julienrf added the enhancement New feature or request label Jul 10, 2024
@julienrf
Collaborator Author

Commenting on this issue instead of creating a new one because this is related to the testing infrastructure.

Currently, our testing infrastructure recreates the AWS stack (S3 and DynamoDB) in Docker containers. This works okay but comes with limitations:

  • we cannot test the streamChanges feature because it has a hard-coded dependency on the real AWS (we cannot customize the endpoint used by some clients)
  • some authentication scenarios are hard to test properly: ultimately, we want to test that our authentication logic works with the real AWS, so there is no point in mocking AWS in a Docker container
  • our infrastructure does not allow us to perform benchmarks to ensure that we do not introduce performance regressions

While the first point could (and ideally should) be fixed by removing the hard-coded dependency on AWS, to address the second point we have no choice but to have tests that use the real AWS. And, in practice, fixing the first point would require changing our copy of the spark-kinesis project, which is undesirable (it is better to keep our copy as close as possible to the original so that we can merge upstream improvements into it).

I believe these points motivate the need for tests that use the real AWS instead of a containerized implementation of AWS. Except for the benchmarks, such tests should not be expensive because they would not consume much bandwidth.

I propose the following course of action:

  • create AWS credentials that we can use for creating/deleting DynamoDB tables on the real AWS
  • create a new test module that tests the supported AWS authentication scenarios and the streaming feature
  • create a new GitHub workflow triggered by an explicit comment such as “Test on AWS” that runs the new test module
  • create a new test module that tests the migration throughput between real AWS (or real Apache Cassandra) and real ScyllaDB, and create a new GitHub workflow triggered by an explicit comment such as “Test performance” to run the module.
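The comment-triggered workflow in the third step could look roughly like this; the workflow name, job id, sbt module name, and secret names are all assumptions for illustration, not the repository's actual configuration:

```yaml
# Hypothetical sketch of a comment-triggered workflow.
name: Test on AWS
on:
  issue_comment:
    types: [created]
jobs:
  aws-tests:
    # Run only when a maintainer comments “Test on AWS” on a pull request.
    if: >
      github.event.issue.pull_request &&
      contains(github.event.comment.body, 'Test on AWS') &&
      contains(fromJSON('["OWNER", "MEMBER", "COLLABORATOR"]'),
               github.event.comment.author_association)
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: sbt "testsOnAws/test"   # hypothetical module name
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
```

The `author_association` guard addresses the abuse concern: only comments from owners, members, or collaborators trigger the AWS-backed run.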

@julienrf julienrf changed the title Unable to test the streamChanges feature in a sandboxed environment The sandboxed testing environment cannot use AWS Jul 13, 2024
@guy9
Collaborator

guy9 commented Jul 22, 2024

Thanks @julienrf, @tzach please have a look

@guy9
Collaborator

guy9 commented Aug 15, 2024

@julienrf please proceed with your suggestion.
@tzach if you have any objections please let us know.

@tzach

tzach commented Aug 15, 2024

I'm worried that using AWS for testing in a public repo may lead to abuse.

@julienrf
Collaborator Author

Regarding this point:

  • we can trigger the workflow only under some conditions: e.g. one of the maintainers posts a comment “Test on AWS” on the PR
  • by default, GitHub does not automatically run workflows when someone who is not a maintainer submits a PR

@tzach

tzach commented Aug 15, 2024

OK, if this is just for maintainers I'm good with that.
In other words: run on merge to master, not for every PR.

julienrf added a commit to julienrf/scylla-migrator that referenced this issue Aug 23, 2024
The previous implementation was failing when used on AWS DynamoDB because `conf.get(DynamoDBConstants.ENDPOINT)` was `null`.

This was not caught by our tests because our tests always use a custom endpoint (see scylladb#113)
julienrf added a commit that referenced this issue Aug 23, 2024
The previous implementation was failing when used on AWS DynamoDB because `conf.get(DynamoDBConstants.ENDPOINT)` was `null`.

This was not caught by our tests because our tests always use a custom endpoint (see #113)
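A sketch of why the tests missed this, in Java, with a hypothetical `"dynamodb.endpoint"` key standing in for `DynamoDBConstants.ENDPOINT`: the sandboxed tests always set a custom endpoint, so the `null` branch that real AWS DynamoDB exercises was never taken. A null-safe lookup avoids the failure:

```java
import java.util.HashMap;
import java.util.Map;

public class EndpointResolution {
    // Hypothetical names: a Hadoop-style configuration map where the custom
    // endpoint key is only set when targeting a sandboxed (containerized) AWS.
    // Against real AWS the key is absent, so conf.get(...) returns null and
    // any unguarded use of the value fails.
    static String resolveEndpoint(Map<String, String> conf, String defaultEndpoint) {
        String endpoint = conf.get("dynamodb.endpoint");
        return (endpoint != null) ? endpoint : defaultEndpoint;
    }

    public static void main(String[] args) {
        Map<String, String> sandboxConf = new HashMap<>();
        sandboxConf.put("dynamodb.endpoint", "http://localhost:8000");
        // The branch our tests always exercised:
        System.out.println(resolveEndpoint(sandboxConf, "dynamodb.us-east-1.amazonaws.com"));
        // The branch only real AWS hits, which the tests never covered:
        System.out.println(resolveEndpoint(new HashMap<>(), "dynamodb.us-east-1.amazonaws.com"));
    }
}
```

This is an illustration of the coverage gap, not the migrator's actual code.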
@julienrf
Collaborator Author

#208 is addressing the first part of the action plan. I’ve created #210 to track the remaining steps.
