
Docs: How-to - Scaling #78

Open
Tracked by #39 ...
bmtcril opened this issue Jul 11, 2023 · 6 comments
Comments

@bmtcril
Contributor

bmtcril commented Jul 11, 2023

No description provided.

@pomegranited
Contributor

pomegranited commented Jan 25, 2024

@bmtcril these scaling doc tickets are nice and meaty... mind if I take them up? CC @Ian2012

So far I've only run Aspects for dev, but I'd like to use this as an opportunity to get Aspects deployment working using openedx-k8s-harmony and Grove (like we proposed here), and sort out any kinks in that process that prevent us from using hosted database solutions or scaling self-hosted resources.

@bmtcril
Contributor Author

bmtcril commented Jan 25, 2024

Please do, I'm sure there will be a lot to cover there. Some thoughts I've had percolating:

  • I'd generally recommend cloud-hosted databases for production; they're going to have better support, automatic backups, and autoscaling, and will generally be much less of a headache and production pain point
  • We've done almost nothing on scaling Superset itself. I had wanted to double-check that we can override both its MySQL database and Redis cache if it becomes too resource-intensive to share them with LMS/Studio
  • I haven't checked our autoscaling k8s resources, but we'd probably want to be able to scale up Ralph and the various Superset nodes if necessary
  • I know @Ian2012 had to do some scaling of LMS/Studio Celery workers when Ralph was slower; we'll want to make sure that gets covered
  • I'd definitely love to hear how things go with Harmony and Grove, and get ahead of any issues there!
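To make the Superset point above concrete, pointing Superset at dedicated backing services could look something like the sketch below. The variable names are illustrative assumptions, not confirmed tutor-contrib-aspects settings; check the plugin's docs for the actual names before using:

```yaml
# Hypothetical Tutor config.yml overrides giving Superset its own MySQL and
# Redis instead of sharing with LMS/Studio. Setting names and hosts here are
# placeholders -- verify against tutor-contrib-aspects.
SUPERSET_DB_HOST: "superset-mysql.internal.example.com"
SUPERSET_DB_PORT: 3306
SUPERSET_REDIS_HOST: "superset-redis.internal.example.com"
SUPERSET_REDIS_PORT: 6379
```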

@Ian2012
Contributor

Ian2012 commented Jan 25, 2024

@bmtcril Something we have found running in production is that this fills up the LMS Celery low-priority queue. It would be better to move the Aspects-related tasks to the high-priority queue, so they perform better and don't block other LMS tasks.
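A minimal sketch of that queue move, assuming the LMS routes tasks via its `EXPLICIT_QUEUES` setting and that the event-routing-backends task paths below are correct for your release (both are assumptions to verify):

```python
# Hypothetical LMS settings override routing Aspects' event-routing-backends
# tasks to the high-priority Celery queue instead of the default low queue.
# The setting name, task paths, and queue name are assumptions -- check your
# edx-platform and event-routing-backends versions.
EXPLICIT_QUEUES = {
    "event_routing_backends.tasks.dispatch_event": {
        "queue": "edx.lms.core.high",
    },
    "event_routing_backends.tasks.dispatch_bulk_events": {
        "queue": "edx.lms.core.high",
    },
}
```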

For scaling, we need to consider the following aspects:

  • Ralph: We need to set up autoscaling based mainly on CPU usage, as CPU is Ralph's main resource consumer. If Ralph isn't scaled, it can become a hard bottleneck in Aspects: the Aspects tasks can take up to 2 minutes to finish at a rate of 3 tracking logs per second.
  • Superset: We need to set up autoscaling based on CPU and RAM usage. I'm not sure what values should be used, but the same thresholds as LMS or CMS would probably be fine.
  • LMS workers: If there is autoscaling, make sure to boost it before the initial deployment of Aspects, so the Celery workers are prepared for the new high-volume tasks. If there is no autoscaling, make sure to define one. See pod-autoscaling for more information.
  • Clickhouse: This would probably need a contribution to Harmony with multiple setups for ClickHouse and ClickHouse Keeper, using the ClickHouse Operator:
    • 3 ClickHouse Keeper nodes to form the quorum.
    • N ClickHouse nodes to perform replication, with custom rules to assign 1 ClickHouse node per high-resource node.
      Or, for small installations that will scale later on:
    • 1 ClickHouse Keeper node.
    • 1 ClickHouse node.
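The Ralph bullet above could be sketched as a standard Kubernetes HorizontalPodAutoscaler. The deployment name, namespace, and thresholds are illustrative assumptions, not values from an Aspects chart:

```yaml
# Hypothetical CPU-based HPA for Ralph; adjust name/namespace/limits to match
# your actual deployment. Requires resource requests set on the Ralph pods.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ralph
  namespace: openedx
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ralph
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

The same shape (with a memory metric added) would cover the Superset and LMS-worker bullets.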

@pomegranited
Contributor

Hi @Ian2012 , @bmtcril mentioned you have a Clickhouse Helm chart from Altinity?

@Ian2012
Contributor

Ian2012 commented Mar 14, 2024

@pomegranited The Clickhouse operator is the following: https://github.com/Altinity/clickhouse-operator/tree/master

For replication you will need ClickHouse Keeper or ZooKeeper; ClickHouse Keeper generally performs better and avoids errors. You can set it up with this documentation: https://github.com/Altinity/clickhouse-operator/blob/master/docs/zookeeper_setup.md

I would recommend this setup for the ClickHouse Keeper quorum: https://github.com/Altinity/clickhouse-operator/blob/master/docs/chk-examples/02-extended-3-nodes.yaml

For ClickHouse replication I would recommend this setup, with replication and persistent volumes: https://github.com/Altinity/clickhouse-operator/blob/master/docs/chi-examples/03-persistent-volume-02-pod-template.yaml
It's better if it includes resizable volumes, if supported by the k8s provider: https://github.com/Altinity/clickhouse-operator/blob/master/docs/chi-examples/03-persistent-volume-05-resizeable-volume-2.yaml
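Stripped down, the linked examples amount to a ClickHouseInstallation resource like this minimal sketch (names and counts are illustrative; see the chi-examples above for the full pod and volume templates):

```yaml
# Minimal Altinity ClickHouseInstallation sketch: one shard, two replicas.
# Replication additionally requires a ClickHouse Keeper (or ZooKeeper) quorum
# configured per the linked docs; values here are placeholders.
apiVersion: "clickhouse.altinity.com/v1"
kind: "ClickHouseInstallation"
metadata:
  name: "aspects"
spec:
  configuration:
    clusters:
      - name: "replicated"
        layout:
          shardsCount: 1
          replicasCount: 2
```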

There are a lot of details that aren't worth adding here; let me know if you need anything else.

@bmtcril
Contributor Author

bmtcril commented Mar 28, 2024

We'll also want to add some details about the new ERB (event-routing-backends) batching, which is probably the single biggest pipeline change to impact scalability.
