Tasks progress shows 0/1 (running) - segments #223
Comments
@julienrf made two adjustments to the config.yaml.
When I look at my spreadsheet, we have 41 GB of data and I was expecting ~345 scanSegments.
Here's the relevant info from the source table:
@julienrf ... I want lots of scanSegments. I don't want big chunks; I want lots of smaller chunks, because that gives me the opportunity to add workers and increase the throughput of the system. When scanSegments is low, it restricts my flexibility to speed things up, and I also cannot measure progress as well. I want to stay under roughly 128 MB per chunk, so 15 scanSegments is way too low. Why are we getting so few scanSegments?
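As a rough sizing check (assuming the ~128 MB-per-chunk target mentioned above, and treating the 41 GB figure as approximate):

$$
\text{segments} \approx \frac{41 \times 1024\ \text{MiB}}{128\ \text{MiB}} \approx 328
$$

which is in the same ballpark as the ~345 estimate.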
I have started a new job run to a new target table, with the config below (updated yaml file attached: config.dynamodb_upd.yml.txt). The Spark job started with 200 segments (running 16 at a time in parallel).
Source DynamoDB table count: 237,003,926
Target Scylla Cloud disk used (per node, 3 nodes total): approx. 50 GB per instance (count run in progress)
submit-alternator-job.sh ==>> --Master
See https://github.com/scylladb/emr-dynamodb-connector/blob/scylla-5.x/emr-dynamodb-hadoop/src/main/java/org/apache/hadoop/dynamodb/read/AbstractDynamoDBInputFormat.java#L49. So if we need a better default, we need to change the code above.
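For readers without the attachment, here is a minimal sketch of the source section of such a config. The field names (type, table, region, scanSegments, maxMapTasks) are assumed from the Scylla Migrator's DynamoDB configuration format, and the placeholders are illustrative; the scanSegments value mirrors the 200-segment run described above, and pinning it explicitly sidesteps the connector default linked above:

```yaml
source:
  type: dynamodb
  table: <source-table-name>   # illustrative placeholder
  region: <aws-region>         # illustrative placeholder
  # Pin the segment count explicitly instead of relying on the connector's
  # computed default; the target discussed above is roughly 128 MB per segment.
  scanSegments: 200
  # Optionally cap how many segments are read concurrently; otherwise in-flight
  # parallelism is bounded by the executor cores Spark has available.
  # maxMapTasks: 16
```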
I had one more run, which completed in 11 minutes. DynamoDB table record count: ~240 M. I started with 2 workers, then 3, then added 1 more (4 total), without errors.
We had to overprovision the worker instances with worker tasks for Spark to increase its throughput. We did prove that the current code base can still go fast, and we can continue to use the old method of starting light on workers and adding more to increase throughput, which is also useful for controlling the migration's impact on the live workload that uses the source table.
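For a rough sense of scale, the 11-minute run over the ~240 M items quoted above works out to an approximate cluster-wide rate of:

$$
\frac{240 \times 10^{6}\ \text{items}}{11 \times 60\ \text{s}} \approx 3.6 \times 10^{5}\ \text{items/s}
$$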
Hello,
I was running the Scylla Migrator from AWS DynamoDB to Scylla Cloud (Alternator).
I have used VPC peering between the AWS Spark VM and the Scylla Cloud Alternator cluster.
I have used the configurations below:
--executor-memory 16G
--executor-cores 6
--driver-memory 8G
Master:
  Mem: 248Gi
  CPU(s): 8
Workers (2):
  Mem: 60Gi
  CPU(s): 8
When I ran the Spark job, the tasks progress in the Spark UI on port 4040 shows 0/1 (running), but ideally this number should be higher and should show progress across the segments.
I'm able to see that data is loading into the target table (not completed yet).
Attached are the screenshots & config.yaml.
Let me know if you need more details.
config.dynamodb_yml.txt