Configuring the Spark cluster and the migration is manual and tedious #193

Open
julienrf opened this issue Aug 5, 2024 · 4 comments
Labels
enhancement New feature or request

Comments

@julienrf
Collaborator

julienrf commented Aug 5, 2024

As mentioned in #191 and #192, setting up the Spark cluster and configuring the migrator to correctly utilize the Spark resources is manual and tedious.

How much of this could be automated? Ideally, users would only supply the table size and the throughput supported by the source and target tables, and they should get a Spark cluster correctly sized to transfer the data as efficiently as the source and target databases support.

Some ideas to explore:

  1. Provide such input as parameters of the Ansible playbook and automatically configure the corresponding Spark resources
  2. Publish a tool (e.g. using Pulumi or Terraform) that automatically provisions a cloud-based cluster (e.g. using AWS EC2), ready to run the migrator.
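A rough sketch of the sizing step behind both ideas, assuming the inputs are the table size and the sustained throughput of the source and target tables. The function name `size_cluster` and the `per_executor_bps` parameter (the throughput one Spark executor can move, which would have to be measured) are hypothetical and not part of the migrator or the Ansible playbook:

```python
import math


def size_cluster(table_size_bytes, source_bps, target_bps, per_executor_bps):
    """Estimate executor count and duration for a migration.

    All throughputs are in bytes per second. per_executor_bps is an
    assumed, measured figure, not something the migrator reports.
    """
    if min(table_size_bytes, source_bps, target_bps, per_executor_bps) <= 0:
        raise ValueError("all inputs must be positive")

    # The transfer cannot go faster than the slower of the two databases.
    effective_bps = min(source_bps, target_bps)

    # Enough executors to saturate the bottleneck, and at least one.
    executors = max(1, math.ceil(effective_bps / per_executor_bps))

    return {
        "effective_bps": effective_bps,
        "executors": executors,
        "estimated_seconds": table_size_bytes / effective_bps,
    }


plan = size_cluster(
    table_size_bytes=10**12,       # 1 TB table
    source_bps=200_000_000,        # source sustains 200 MB/s
    target_bps=100_000_000,        # target sustains 100 MB/s
    per_executor_bps=30_000_000,   # assumed 30 MB/s per executor
)
print(plan)
```

The resulting executor count could then be fed into the playbook variables or into a provisioning tool; how it maps to actual Spark settings is left open here.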
julienrf added the enhancement label on Aug 5, 2024
@guy9
Collaborator

guy9 commented Aug 6, 2024

Thanks @julienrf.
@GeoffMontee, @tarzanek, @pdbossman, please have a look and add any input you might have.

@guy9
Collaborator

guy9 commented Aug 25, 2024

ping @GeoffMontee, @tarzanek, @pdbossman

@pdbossman
Contributor

Hello,
Users usually want to think in terms of "here are my items and the average item size", plus a migration target. I usually discount the migration target. Basically, if we can be clear about converting items and time into the throughput we want, then that works too.

@pdbossman
Contributor

... by migration target, I mean duration. "I want the migration to complete in 10 hours", then compute that into throughput to indicate what the source and target would need to sustain.
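The conversion described above can be sketched as plain arithmetic. This is only an illustration: the function name `required_throughput` and the example numbers are made up, and it ignores retries, skew, and any per-item overhead on either database.

```python
def required_throughput(item_count, avg_item_size_bytes, duration_hours):
    """Return (items_per_second, bytes_per_second) that the source and
    target must sustain for the migration to finish within duration_hours."""
    if item_count < 0 or avg_item_size_bytes < 0:
        raise ValueError("item_count and avg_item_size_bytes must be >= 0")
    if duration_hours <= 0:
        raise ValueError("duration_hours must be positive")

    seconds = duration_hours * 3600
    items_per_second = item_count / seconds
    bytes_per_second = items_per_second * avg_item_size_bytes
    return items_per_second, bytes_per_second


# Example: 360 million items of 1000 bytes each, to be migrated in 10 hours.
items_ps, bytes_ps = required_throughput(360_000_000, 1000, 10)
print(f"{items_ps:.0f} items/s, {bytes_ps / 1_000_000:.1f} MB/s")
```

With these inputs, both the source and the target would need to sustain about 10,000 items per second (10 MB/s) for the whole duration.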


3 participants