Tasks progress shows 0/1 (running) - segments #223

Open · anumod1234 opened this issue Dec 11, 2024 · 8 comments

Comments

@anumod1234 commented Dec 11, 2024

Hello,

I was running the Scylla Migrator from AWS DynamoDB to Scylla Cloud (Alternator).
I used VPC peering between the AWS Spark VM and the Scylla Cloud Alternator cluster.

I used the configuration below; a sketch of the full spark-submit invocation follows.

--executor-memory 16G
--executor-cores 6
--driver-memory 8G

Master:
Mem: 248 GiB
CPU(s): 8

Workers (2):
Mem: 60 GiB
CPU(s): 8
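For reference, here is a minimal sketch of how these settings would typically feed into the migrator's spark-submit invocation; the master hostname, config path, and assembly jar path are placeholders rather than values taken from the actual submit-alternator-job.sh:

```bash
# Sketch of a submit-alternator-job.sh invocation using the settings above.
# <spark-master-host> and the file paths are placeholders.
spark-submit --class com.scylladb.migrator.Migrator \
  --master spark://<spark-master-host>:7077 \
  --executor-memory 16G \
  --executor-cores 6 \
  --driver-memory 8G \
  --conf spark.scylla.config=/path/to/config.dynamodb.yml \
  /path/to/scylla-migrator-assembly.jar
```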

When I ran the Spark job, the Tasks progress on port 4040 shows 0/1 (running), but ideally this number should be higher and should reflect the segment progress.

I'm able to see data loading into the target table (not yet complete).

Attached the screenshots and config.yaml.

Let me know if you need more details.
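As a side note, if it helps with debugging, the task counts shown in the UI can also be cross-checked against Spark's monitoring REST API, which the driver serves on the same port as the UI (the driver host below is a placeholder):

```bash
# List applications known to the driver, then inspect per-stage task counts.
curl -s http://<driver-host>:4040/api/v1/applications
curl -s http://<driver-host>:4040/api/v1/applications/<app-id>/stages
```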

[screenshot]

config.dynamodb_yml.txt

@pdbossman (Contributor)

@julienrf made two adjustments to the config.yaml:

1. Ran with scanSegments and maxMapTasks both set to 500. This produced 15 scan segments, running 12 concurrently.
2. Commented out scanSegments and maxMapTasks; it ran the same as 1 above.

When I look at my spreadsheet, we have 41 GB of data and I was expecting ~345 scanSegments.

@pdbossman (Contributor)

Here's the relevant info from the source table (a quick expected-segment-count check follows below):

"ProvisionedThroughput": {
    "NumberOfDecreasesToday": 0,
    "ReadCapacityUnits": 0,
    "WriteCapacityUnits": 0
},
"TableSizeBytes": 44174574774,
"ItemCount": 237003926,

@pdbossman (Contributor)

@julienrf ... I want lots of scanSegments. I don't want big chunks; I want lots of smaller chunks. That gives me the opportunity to add workers and increase the throughput of the system. When scanSegments is low, it restricts my flexibility to speed things up, and I can't measure progress as well either.

15 scanSegments is way too low; I'd expect chunks under 128 MB each. Why are we getting so few scanSegments?

@anumod1234 (Author)

I started a new job run to a new target table with the config below. Attached the updated yaml file: config.dynamodb_upd.yml.txt

The Spark job started with 200 segments (running 16 in parallel at a time, which presumably corresponds to the 2 workers × 8 executor cores).
The job completed successfully in approximately 1 hour.

Source DynamoDB table count: 237,003,926 items.

Target Scylla Cloud disk used (per node, 3 nodes total): approx. 50 GB per instance (count verification still in progress).

submit-alternator-job.sh:
--executor-memory 16G
--executor-cores 8
--driver-memory 8G

Master:
Mem: 248 GiB
CPU(s): 8

Worker:
Mem: 60 GiB
CPU(s): 8

@tarzanek (Contributor)

@anumod1234 (Author) commented Dec 13, 2024

@tarzanek

Tried with the below config:

1 × x2iedn.2xlarge (8 vCPU, 256 GB RAM) to run spark-submit and the master
2 × i4i.4xlarge (16 vCPU, 128 GB RAM) for the worker nodes, 250 GB disk

6 × i4i.8xlarge <-- actual Scylla target for the 100x cluster, 250 GB disk

spark-env on both worker nodes:

export MAX_CORES=12 # max number of cores to use on the machine
export MAX_MEMORY=8G # max amount of memory to use on the machine
export SPARK_WORKER_INSTANCES=3
#export SLAVESIZE="--cores $MAX_CORES --memory $MAX_MEMORY"

(Started with SPARK_WORKER_INSTANCES=2 and added one more during the job run; a sketch of the equivalent standard Spark worker settings follows.)
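For context, here is a sketch of the standard Spark standalone worker settings that these variables typically map onto when the worker processes are started; the exact wiring inside the migrator's start scripts may differ, so treat this as an assumption rather than a description of what those scripts actually do:

```bash
# spark-env.sh sketch: standard Spark standalone variables controlling how
# many worker processes start per machine and what each one offers.
export SPARK_WORKER_INSTANCES=3   # worker processes per machine
export SPARK_WORKER_CORES=12      # cores offered by each worker process
export SPARK_WORKER_MEMORY=8G     # memory offered by each worker process
```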

submit-alternator-job.sh:
--driver-memory 64G

The table loaded into Scylla Cloud in 14 minutes.

200 segments (64 running in parallel with SPARK_WORKER_INSTANCES=2; once the 3rd instance was added, it ran 96 at a time).

The CPU on the worker nodes went up to 60-70% (with 3 worker instances).
Initially I started with 4 worker instances, but that gave an error, so I started with 2 and then added 1 more; I didn't go to 4.

Scylla Cloud screenshot attached.

[screenshot]

@anumod1234 (Author)

I had one more run; it completed in 11 minutes.

DynamoDB table record count: ~240 M.

Started with 2 workers, then 3, then added 1 more (4), without error.
If we start with 3 or 4 directly, it gives an error.
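For anyone reproducing the incremental approach, adding a worker to a running standalone cluster is just a matter of starting another worker process pointed at the existing master; the master hostname is a placeholder, and on older Spark releases the script is named start-slave.sh instead of start-worker.sh:

```bash
# On the new worker machine: join the existing standalone cluster.
# Running tasks keep going; Spark simply schedules new tasks onto the
# extra capacity, so workers can be added one at a time mid-migration.
$SPARK_HOME/sbin/start-worker.sh spark://<spark-master-host>:7077
```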

@pdbossman (Contributor)

We had to overprovision the worker machines with worker instances to increase Spark's throughput. We did prove that the current code base can still go fast, and we can continue to use the old method of starting light on workers and adding more to increase throughput, which is also useful for controlling the migration's impact on the live workload using the source.
