
Docs: How-to - Scaling #78

Open
Tracked by #39 ...
bmtcril opened this issue Jul 11, 2023 · 6 comments
Comments

@bmtcril
Contributor

bmtcril commented Jul 11, 2023

No description provided.

@pomegranited
Contributor

pomegranited commented Jan 25, 2024

@bmtcril these scaling doc tickets are nice and meaty... mind if I take them up? CC @Ian2012

So far I've only run Aspects for dev, but I'd like to use this as an opportunity to get Aspects deployment working using openedx-k8s-harmony and Grove (like we proposed here), and sort out any kinks in that process that prevent us from using hosted database solutions or scaling self-hosted resources.

@bmtcril
Contributor Author

bmtcril commented Jan 25, 2024

Please do, I'm sure there will be a lot to cover there. Some thoughts I've had percolating:

  • I'd generally recommend cloud-hosted databases for production; they're going to have better support, automatic backups, and autoscaling, and will generally be much less of a headache and production pain point
  • We've done almost nothing on scaling Superset itself. I had wanted to double-check that we can override both its MySQL database and Redis cache if it becomes too resource-intensive to share them with LMS/Studio
  • I haven't checked our autoscaling k8s resources, but we'd probably want to be able to scale up Ralph and the various Superset nodes if necessary
  • I know @Ian2012 had to do some scaling of LMS/Studio Celery workers when Ralph was slower; we'll want to make sure that gets covered
  • I'd definitely love to hear how things go with Harmony and Grove, and get ahead of any issues there!
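To make the Superset point above concrete, pointing Superset at dedicated backing services could look something like the sketch below. The variable names are illustrative assumptions, not confirmed tutor-contrib-aspects settings; check the plugin's docs for the actual names before using:

```yaml
# Hypothetical Tutor config.yml overrides giving Superset its own MySQL and
# Redis instead of sharing with LMS/Studio. Setting names and hosts here are
# placeholders -- verify against tutor-contrib-aspects.
SUPERSET_DB_HOST: "superset-mysql.internal.example.com"
SUPERSET_DB_PORT: 3306
SUPERSET_REDIS_HOST: "superset-redis.internal.example.com"
SUPERSET_REDIS_PORT: 6379
```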

@Ian2012
Contributor

Ian2012 commented Jan 25, 2024

@bmtcril Something we have found running in production is that this fills up the LMS Celery low-priority queue. It would be better to move the Aspects-related tasks to the high-priority queue, so they perform better and don't block other LMS tasks.
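A minimal sketch of that queue move, assuming the LMS routes tasks via its `EXPLICIT_QUEUES` setting and that the event-routing-backends task paths below are correct for your release (both are assumptions to verify):

```python
# Hypothetical LMS settings override routing Aspects' event-routing-backends
# tasks to the high-priority Celery queue instead of the default low queue.
# The setting name, task paths, and queue name are assumptions -- check your
# edx-platform and event-routing-backends versions.
EXPLICIT_QUEUES = {
    "event_routing_backends.tasks.dispatch_event": {
        "queue": "edx.lms.core.high",
    },
    "event_routing_backends.tasks.dispatch_bulk_events": {
        "queue": "edx.lms.core.high",
    },
}
```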

For scaling, we need to consider the following aspects:

  • Ralph: We need to set up autoscaling based mainly on CPU usage, as CPU is Ralph's main resource consumer. If Ralph isn't scaled, it can become a hard bottleneck in Aspects: the Aspects tasks can take up to 2 minutes to finish at a rate of 3 tracking logs per second.
  • Superset: We need to set up autoscaling based on CPU and RAM usage. I'm not sure what values should be used, but the same thresholds as LMS or CMS would probably be fine.
  • LMS workers: If there is autoscaling, make sure to boost it before the initial deployment of Aspects, so the Celery workers are prepared for the new high-volume tasks. If there is no autoscaling, make sure to define one. See pod-autoscaling for more information.
  • Clickhouse: This would probably need a contribution to Harmony with multiple setups for ClickHouse and ClickHouse Keeper, using the ClickHouse Operator:
    • 3 ClickHouse Keeper nodes to form the quorum.
    • N ClickHouse nodes to perform replication, with custom rules to assign 1 ClickHouse node per high-resource node.
      Or, for small installations that will scale later on:
    • 1 ClickHouse Keeper node.
    • 1 ClickHouse node.
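The Ralph bullet above could be sketched as a standard Kubernetes HorizontalPodAutoscaler. The deployment name, namespace, and thresholds are illustrative assumptions, not values from an Aspects chart:

```yaml
# Hypothetical CPU-based HPA for Ralph; adjust name/namespace/limits to match
# your actual deployment. Requires resource requests set on the Ralph pods.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ralph
  namespace: openedx
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ralph
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

The same shape (with a memory metric added) would cover the Superset and LMS-worker bullets.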

@pomegranited
Contributor

Hi @Ian2012 , @bmtcril mentioned you have a Clickhouse Helm chart from Altinity?

@Ian2012
Contributor

Ian2012 commented Mar 14, 2024

@pomegranited The Clickhouse operator is the following: https://github.com/Altinity/clickhouse-operator/tree/master

For replication you will need ClickHouse Keeper or ZooKeeper; ClickHouse Keeper generally performs better and avoids errors. You can set it up with this documentation: https://github.com/Altinity/clickhouse-operator/blob/master/docs/zookeeper_setup.md

I would recommend this setup for the ClickHouse Keeper quorum: https://github.com/Altinity/clickhouse-operator/blob/master/docs/chk-examples/02-extended-3-nodes.yaml

For ClickHouse replication I would recommend this setup, with replication and persistent volumes: https://github.com/Altinity/clickhouse-operator/blob/master/docs/chi-examples/03-persistent-volume-02-pod-template.yaml
It's better if it includes resizable volumes, if supported by the k8s provider: https://github.com/Altinity/clickhouse-operator/blob/master/docs/chi-examples/03-persistent-volume-05-resizeable-volume-2.yaml
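Stripped down, the linked examples amount to a ClickHouseInstallation resource like this minimal sketch (names and counts are illustrative; see the chi-examples above for the full pod and volume templates):

```yaml
# Minimal Altinity ClickHouseInstallation sketch: one shard, two replicas.
# Replication additionally requires a ClickHouse Keeper (or ZooKeeper) quorum
# configured per the linked docs; values here are placeholders.
apiVersion: "clickhouse.altinity.com/v1"
kind: "ClickHouseInstallation"
metadata:
  name: "aspects"
spec:
  configuration:
    clusters:
      - name: "replicated"
        layout:
          shardsCount: 1
          replicasCount: 2
```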

There are a lot of details that aren't worth adding here; let me know if you need anything else.

@bmtcril
Contributor Author

bmtcril commented Mar 28, 2024

We'll also want to add some details about the new ERB (event-routing-backends) batching, which is probably the single biggest pipeline change to impact scalability.
