
Problems I have encountered so far #2

Open
daklqw opened this issue Mar 7, 2025 · 0 comments

Environment

  • OS: x86_64 Linux 6.13.5-arch1-1
  • Kind: 0.28.0-alpha
  • Kind backend: Docker version 28.0.1, build 068a01ea94
  • Memory: 32 GiB
  • Branch: main
  • Scenario: SRE

Issues (Based on the SRE scenario)

  1. Dependencies part 3: ansible-galaxy install -r requirements.yaml fails on my machine unless the --force option is passed.
  2. Step 1: I didn't notice 'update the kubeconfig field with the path to the configuration' at first.
    • I think it would be a good idea to add a shell snippet such as nano group_vars/all.yaml or a sed one-liner to the code block in step 1.
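For example, the step 1 block could suggest something like the snippet below. The key name kubeconfig and the file layout are assumptions on my part; the demo runs against a throwaway copy of the file so it is safe to try:

```shell
# Demo of a sed one-liner that step 1 could include; in the repo the target
# would be group_vars/all.yaml and the key "kubeconfig" (both assumptions).
printf 'kubeconfig: /path/to/kubeconfig\nother_setting: 1\n' > /tmp/all.yaml
sed -i "s|^kubeconfig:.*|kubeconfig: $HOME/.kube/config|" /tmp/all.yaml
cat /tmp/all.yaml   # confirm the field now points at your kubeconfig
```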
  3. Step 2: Elasticsearch elastic-elasticsearch-data often gets OOMKilled.
    • It falls into a CrashLoopBackOff loop, preventing deployment.
    • How I patch this:
  4. Step 8&9: Error from server (NotFound): services "topology-monitor" not found.
    • I couldn't find any instruction for topology-monitor. If there are any, please let me know.
  5. Some astronomy_shop pods also got OOMKilled.
    • This also blocks the deployment.
    • How I patch this:
      • adService: increase the memory limit to 500Mi.
      • frauddetectionService: increase the memory limit to 500Mi.
      • kafka: increase the memory limit to 700Mi.
      • recommendationService: increase the memory limit to 700Mi.
    • In my test on astronomy shop, after patching (including elasticsearch), total memory usage is 16-17 GiB, which is close to what the local cluster setup mentions.
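Concretely, each of these patches just raises resources.limits.memory on the corresponding deployment. A sketch of one strategic-merge patch is below; the deployment and container names here are assumptions based on the chart's naming, so check them against your cluster first:

```yaml
# patch.yaml - raise adservice's memory limit to 500Mi; applied with e.g.:
#   kubectl -n <namespace> patch deployment adservice --patch-file patch.yaml
spec:
  template:
    spec:
      containers:
        - name: adservice        # container name is an assumption
          resources:
            limits:
              memory: 500Mi
```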
  6. The deployment took 15 minutes, which is too long and not debug-friendly. Is there a way to disable certain options for a faster deployment?
  7. Grafana: I followed the instructions and still see errors on the Grafana alerting page (http://localhost:8080/prometheus/alerting/list):

    Errors loading rules
    Failed to load the data source configuration for loki: Unable to fetch alert rules. Is the loki data source properly configured?
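For reference, this error usually means the loki data source either is not provisioned in Grafana or is unreachable. Grafana's standard data source provisioning for Loki looks roughly like the following; the URL and file location are assumptions on my part, not taken from this repo:

```yaml
# Grafana data source provisioning, e.g. under provisioning/datasources/.
apiVersion: 1
datasources:
  - name: loki
    type: loki
    access: proxy
    url: http://loki:3100   # in-cluster service URL is an assumption
```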

  8. Chaos Mesh: install_chaos_mesh.yaml hard-codes containerd as the container runtime.
    • Some Kubernetes clusters use Docker as the container runtime, so this specification may not work on those clusters.
    • It would be better to mention this in the README, e.g. advising Docker-runtime users to modify their install_chaos_mesh.yaml.
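For Docker-runtime clusters, Chaos Mesh exposes Helm values to switch the daemon's runtime. A sketch of the override such a README note could point to is below; the key paths follow the upstream chart, but exactly how they wire into install_chaos_mesh.yaml is an assumption:

```yaml
# Chaos Mesh Helm values for clusters whose container runtime is Docker.
# The containerd default is runtime: containerd with
# socketPath: /run/containerd/containerd.sock.
chaosDaemon:
  runtime: docker
  socketPath: /var/run/docker.sock
```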
  9. INCIDENT_NUMBER=26&27:
    • Based on the patches from problems 3 & 5, I ran these two injections, but I didn't observe any change:
      • No pod/container errors in Kubernetes.
      • No alerts on the Grafana page.
    • If there is no issue in the code, please tell me how I can observe the faults.
    • Moreover, it often hangs while removing the fault (e.g. INCIDENT_NUMBER=26 make remove_incident_fault).
      • Please let me know if any log is needed.