Tasks progress shows 0/1 (running) - segments #223

Open · anumod1234 opened this issue Dec 11, 2024 · 8 comments

Comments

@anumod1234 commented Dec 11, 2024

Hello,

I was running the Scylla Migrator from AWS DynamoDB to Scylla Cloud (Alternator).
I used VPC peering between the AWS Spark VM and the Scylla Cloud Alternator cluster.

I used the configuration below; a sketch of the full spark-submit invocation follows.

--executor-memory 16G
--executor-cores 6
--driver-memory 8G

Master:
Mem: 248 GiB
CPU(s): 8

Workers (2):
Mem: 60 GiB
CPU(s): 8
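For reference, here is a minimal sketch of how these settings would typically feed into the migrator's spark-submit invocation; the master hostname, config path, and assembly jar path are placeholders rather than values taken from the actual submit-alternator-job.sh:

```bash
# Sketch of a submit-alternator-job.sh invocation using the settings above.
# <spark-master-host> and the file paths are placeholders.
spark-submit --class com.scylladb.migrator.Migrator \
  --master spark://<spark-master-host>:7077 \
  --executor-memory 16G \
  --executor-cores 6 \
  --driver-memory 8G \
  --conf spark.scylla.config=/path/to/config.dynamodb.yml \
  /path/to/scylla-migrator-assembly.jar
```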

When I ran the Spark job, the Tasks progress on port 4040 shows 0/1 (running), but ideally this number should be higher and should reflect the segment progress.

I'm able to see data loading into the target table (not yet complete).

Attached the screenshots and config.yaml.

Let me know if you need more details.
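As a side note, if it helps with debugging, the task counts shown in the UI can also be cross-checked against Spark's monitoring REST API, which the driver serves on the same port as the UI (the driver host below is a placeholder):

```bash
# List applications known to the driver, then inspect per-stage task counts.
curl -s http://<driver-host>:4040/api/v1/applications
curl -s http://<driver-host>:4040/api/v1/applications/<app-id>/stages
```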

[screenshot]

config.dynamodb_yml.txt

@pdbossman (Contributor)

@julienrf made two adjustments to the config.yaml:

1. Ran with scanSegments and maxMapTasks both set to 500. This produced 15 scan segments, running 12 concurrently.
2. Commented out scanSegments and maxMapTasks; it ran the same as 1 above.

When I look at my spreadsheet, we have 41 GB of data and I was expecting ~345 scanSegments.

@pdbossman (Contributor)

Here's the relevant info from the source table (a quick expected-segment-count check follows below):

"ProvisionedThroughput": {
    "NumberOfDecreasesToday": 0,
    "ReadCapacityUnits": 0,
    "WriteCapacityUnits": 0
},
"TableSizeBytes": 44174574774,
"ItemCount": 237003926,

@pdbossman (Contributor)

@julienrf ... I want lots of scanSegments. I don't want big chunks; I want lots of smaller chunks. That gives me the opportunity to add workers and increase the throughput of the system. When scanSegments is low, it restricts my flexibility to speed things up, and I can't measure progress as well either.

15 scanSegments is way too low; I'd expect chunks under 128 MB each. Why are we getting so few scanSegments?

@anumod1234 (Author)

I started a new job run to a new target table with the config below. Attached the updated yaml file: config.dynamodb_upd.yml.txt

The Spark job started with 200 segments (running 16 in parallel at a time, which presumably corresponds to the 2 workers × 8 executor cores).
The job completed successfully in approximately 1 hour.

Source DynamoDB table count: 237,003,926 items.

Target Scylla Cloud disk used (per node, 3 nodes total): approx. 50 GB per instance (count verification still in progress).

submit-alternator-job.sh:
--executor-memory 16G
--executor-cores 8
--driver-memory 8G

Master:
Mem: 248 GiB
CPU(s): 8

Worker:
Mem: 60 GiB
CPU(s): 8

@tarzanek (Contributor)

@anumod1234 (Author) commented Dec 13, 2024

@tarzanek

Tried with the below config:

1 × x2iedn.2xlarge (8 vCPU, 256 GB RAM) to run spark-submit and the master
2 × i4i.4xlarge (16 vCPU, 128 GB RAM) for the worker nodes, 250 GB disk

6 × i4i.8xlarge <-- actual Scylla target for the 100x cluster, 250 GB disk

spark-env on both worker nodes:

export MAX_CORES=12 # max number of cores to use on the machine
export MAX_MEMORY=8G # max amount of memory to use on the machine
export SPARK_WORKER_INSTANCES=3
#export SLAVESIZE="--cores $MAX_CORES --memory $MAX_MEMORY"

(Started with SPARK_WORKER_INSTANCES=2 and added one more during the job run; a sketch of the equivalent standard Spark worker settings follows.)
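For context, here is a sketch of the standard Spark standalone worker settings that these variables typically map onto when the worker processes are started; the exact wiring inside the migrator's start scripts may differ, so treat this as an assumption rather than a description of what those scripts actually do:

```bash
# spark-env.sh sketch: standard Spark standalone variables controlling how
# many worker processes start per machine and what each one offers.
export SPARK_WORKER_INSTANCES=3   # worker processes per machine
export SPARK_WORKER_CORES=12      # cores offered by each worker process
export SPARK_WORKER_MEMORY=8G     # memory offered by each worker process
```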

submit-alternator-job.sh:
--driver-memory 64G

The table loaded into Scylla Cloud in 14 minutes.

200 segments (64 running in parallel with SPARK_WORKER_INSTANCES=2; once the 3rd instance was added, it ran 96 at a time).

The CPU on the worker nodes went up to 60-70% (with 3 worker instances).
Initially I started with 4 worker instances, but that gave an error, so I started with 2 and then added 1 more; I didn't go to 4.

Scylla Cloud screenshot attached.

[screenshot]

@anumod1234 (Author)

I had one more run; it completed in 11 minutes.

DynamoDB table record count: ~240 M.

Started with 2 workers, then 3, then added 1 more (4), without error.
If we start with 3 or 4 directly, it gives an error.
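For anyone reproducing the incremental approach, adding a worker to a running standalone cluster is just a matter of starting another worker process pointed at the existing master; the master hostname is a placeholder, and on older Spark releases the script is named start-slave.sh instead of start-worker.sh:

```bash
# On the new worker machine: join the existing standalone cluster.
# Running tasks keep going; Spark simply schedules new tasks onto the
# extra capacity, so workers can be added one at a time mid-migration.
$SPARK_HOME/sbin/start-worker.sh spark://<spark-master-host>:7077
```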

@pdbossman (Contributor)

We had to overprovision the worker machines with worker instances to increase Spark's throughput. We did prove that the current code base can still go fast, and we can continue to use the old method of starting light on workers and adding more to increase throughput, which is also useful for controlling the migration's impact on the live workload using the source.
