The sandboxed testing environment cannot use AWS #113
Saving the initial data to the target database works, but saving additional changes fails with the exception “org.apache.hadoop.io.Text is not Serializable”. I do not fully understand which difference between the initial data and the streamed changes causes the error. The only difference is that the initial data is read by the connector, whereas the streamed changes are created by us. I changed the way we create them to mix in the `Serializable` interface. This change allowed me to successfully run a data migration with stream changes enabled. Such a scenario cannot be added to our test suite, though, because KCL only works with the real AWS servers (see scylladb#113)
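For illustration, a minimal sketch of the kind of fix described above, assuming the streamed change keys are built around `org.apache.hadoop.io.Text` (the actual construction site in the migrator is not shown here; the example value is hypothetical):

```scala
import org.apache.hadoop.io.Text

// Text does not implement java.io.Serializable, which is what makes Spark
// fail with "org.apache.hadoop.io.Text is not Serializable" when the streamed
// changes cross a serialization boundary. Mixing the interface in at the
// instantiation site works around it.
val key: Text = new Text("some-change-key") with Serializable
```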
The RDD keys are not serializable, which can make some RDD operations fail. We now create the RDD element keys _after_ repartitioning, so that they are never serialized across partitions. This change allowed me to successfully run a data migration with stream changes enabled. Such a scenario cannot be added to our test suite, though, because KCL only works with the real AWS servers (see scylladb#113)
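A minimal sketch of the idea, with assumed names (the real pipeline lives in the migrator code): the `repartition()` shuffle serializes whatever elements the RDD contains at that point, so we shuffle the plain values first and only then attach the non-serializable keys.

```scala
import org.apache.hadoop.io.Text
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("sketch"))
val values = sc.parallelize(Seq("a", "b", "c"))
val keyed = values
  .repartition(4)             // the shuffle only moves serializable strings
  .map(v => (new Text(v), v)) // non-serializable keys are created post-shuffle
```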
…nt classes. We adapted the `KinesisReceiver` and its related classes to work with DynamoDB Streams, and renamed it to `KinesisDynamoDBReceiver`. These classes are based on the code of the original `spark-kinesis-asl` module, with slight modifications based on the following resources:
- https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.KCLAdapter.Walkthrough.CompleteProgram.html
- https://medium.com/@ravi72munde/using-spark-streaming-with-dynamodb-d325b9a73c79
- our previous fork implementation

As a result, instead of maintaining a complete fork of `spark-kinesis-asl`, we only maintain a copy of the relevant classes, which should result in much faster build times (especially in the CI). It is still not possible to test the streaming feature locally (and thus not in the CI either), see scylladb#113. These changes were tested with my actual AWS account. Fixes scylladb#119
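As a rough illustration of the pattern those copied classes follow (based on the AWS walkthrough linked above, not on the actual `KinesisDynamoDBReceiver` code; the stream ARN and the no-op record processor are placeholders):

```scala
import com.amazonaws.auth.DefaultAWSCredentialsProviderChain
import com.amazonaws.services.cloudwatch.AmazonCloudWatchClientBuilder
import com.amazonaws.services.dynamodbv2.{AmazonDynamoDBClientBuilder, AmazonDynamoDBStreamsClientBuilder}
import com.amazonaws.services.dynamodbv2.streamsadapter.{AmazonDynamoDBStreamsAdapterClient, StreamsWorkerFactory}
import com.amazonaws.services.kinesis.clientlibrary.interfaces.v2.{IRecordProcessor, IRecordProcessorFactory}
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisClientLibConfiguration
import com.amazonaws.services.kinesis.clientlibrary.types.{InitializationInput, ProcessRecordsInput, ShutdownInput}

// The adapter exposes a DynamoDB stream through the Kinesis client interface,
// so a regular KCL worker can consume it.
val adapterClient = new AmazonDynamoDBStreamsAdapterClient(
  AmazonDynamoDBStreamsClientBuilder.defaultClient())

// Placeholder processor: a real receiver would store the records into Spark.
val processorFactory = new IRecordProcessorFactory {
  def createProcessor(): IRecordProcessor = new IRecordProcessor {
    def initialize(input: InitializationInput): Unit = ()
    def processRecords(input: ProcessRecordsInput): Unit =
      input.getRecords.forEach(r => println(r.getSequenceNumber))
    def shutdown(input: ShutdownInput): Unit = ()
  }
}

val streamArn = "arn:aws:dynamodb:us-east-1:123456789012:table/my-table/stream/label" // placeholder
val kclConfig = new KinesisClientLibConfiguration(
  "migrator-app", streamArn, new DefaultAWSCredentialsProviderChain(),
  java.util.UUID.randomUUID().toString)

// Note the DynamoDB and CloudWatch clients: KCL uses them for its lease table
// and metrics, and they talk to the real AWS, which is the crux of scylladb#113.
val worker = StreamsWorkerFactory.createDynamoDbStreamsWorker(
  processorFactory, kclConfig, adapterClient,
  AmazonDynamoDBClientBuilder.defaultClient(),
  AmazonCloudWatchClientBuilder.defaultClient())
new Thread(worker).start() // Worker implements Runnable
```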
Commenting on this issue instead of creating a new one because this is related to the testing infrastructure. Currently, our testing infrastructure recreates the AWS stack (S3 and DynamoDB) in Docker containers. This works okay but comes with limitations:
While the first point could (and ideally should) be fixed by removing the hard-coded dependency on AWS, to address the second point we have no choice but to run tests against the real AWS. And, in practice, fixing the first point would require changing our copy of the spark-kinesis project, which is undesirable (it is better to keep our copy as close as possible to the original so that we can merge upstream improvements into it). I believe these points motivate the need for tests that use the real AWS instead of a containerized implementation of it. Except for the benchmarks, such tests should not be expensive because they would not consume a lot of bandwidth. I propose the following course of action:
> I'm worried that using AWS for testing in a public repo may lead to abuse.
Regarding this point:
> OK, if this is just for maintainers I'm good with that.
The previous implementation was failing when used on AWS DynamoDB because `conf.get(DynamoDBConstants.ENDPOINT)` was `null`. This was not caught by our tests because our tests always use a custom endpoint (see scylladb#113)
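A minimal sketch of the guard this implies, assuming the connector's Hadoop `Configuration` and AWS SDK v1 client builder (`region` and the bare `Configuration` are placeholders; the connector derives them from the job):

```scala
import com.amazonaws.client.builder.AwsClientBuilder.EndpointConfiguration
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.dynamodb.DynamoDBConstants

val conf = new Configuration() // placeholder: comes from the job in practice
val region = "us-east-1"       // placeholder

// Only install a custom endpoint when one is actually configured. On real
// AWS DynamoDB, conf.get(DynamoDBConstants.ENDPOINT) returns null, and the
// SDK should then resolve the endpoint from the region instead.
val builder = AmazonDynamoDBClientBuilder.standard()
Option(conf.get(DynamoDBConstants.ENDPOINT)) match {
  case Some(endpoint) => builder.setEndpointConfiguration(new EndpointConfiguration(endpoint, region))
  case None           => builder.setRegion(region)
}
val dynamoDb = builder.build()
```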
In #107 we introduced a testing infrastructure that allows us to test several migration scenarios. Unfortunately, the `streamChanges` feature uses the `spark-kinesis` module under the hood, and this module performs calls to the real AWS servers instead of using the containerized service. Solutions to this problem could be to either fix the code of `spark-kinesis` to stay in the sandbox environment (this is a known issue, see localstack/localstack#677 and https://issues.apache.org/jira/browse/SPARK-27950), or to use something other than `spark-kinesis`.

Related:
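To make the limitation concrete, here is a hedged sketch of how the stream would be consumed through `spark-kinesis` (stream name, application name, and sandbox endpoint are placeholders). Even though the builder lets us override the Kinesis endpoint, the KCL worker it creates internally still instantiates its DynamoDB (lease table) and CloudWatch (metrics) clients against the real AWS endpoints, which is exactly what SPARK-27950 describes:

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.KinesisInputDStream

val ssc = new StreamingContext("local[*]", "sketch", Seconds(1))
val stream = KinesisInputDStream.builder
  .streamingContext(ssc)
  .streamName("table-changes")           // placeholder stream name
  .checkpointAppName("migrator-sandbox") // placeholder KCL application name
  .endpointUrl("http://localhost:4566")  // honored for the Kinesis client only
  .regionName("us-east-1")
  .storageLevel(StorageLevel.MEMORY_AND_DISK_2)
  .build()
// The KCL lease-table and metrics clients created under the hood do not see
// the custom endpoint, so they reach out to the real AWS services.
```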