
Problems I have encountered so far #2

Open
daklqw opened this issue Mar 7, 2025 · 0 comments

Environment

  • OS: x86_64 Linux 6.13.5-arch1-1
  • Kind: 0.28.0-alpha
  • Kind backend: Docker version 28.0.1, build 068a01ea94
  • Memory: 32 GiB
  • Branch: main
  • Scenario: SRE

Issues (Based on the SRE scenario)

  1. Dependencies part 3: ansible-galaxy install -r requirements.yaml fails on my machine unless the --force option is passed.
  2. Step 1: I didn't notice 'update the kubeconfig field with the path to the configuration' at first.
    • I think it would be a good idea to add a shell snippet such as nano group_vars/all.yaml or a sed one-liner to the code block in step 1.
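For example, the step 1 block could suggest something like the snippet below. The key name kubeconfig and the file layout are assumptions on my part; the demo runs against a throwaway copy of the file so it is safe to try:

```shell
# Demo of a sed one-liner that step 1 could include; in the repo the target
# would be group_vars/all.yaml and the key "kubeconfig" (both assumptions).
printf 'kubeconfig: /path/to/kubeconfig\nother_setting: 1\n' > /tmp/all.yaml
sed -i "s|^kubeconfig:.*|kubeconfig: $HOME/.kube/config|" /tmp/all.yaml
cat /tmp/all.yaml   # confirm the field now points at your kubeconfig
```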
  3. Step 2: Elasticsearch elastic-elasticsearch-data often gets OOMKilled.
    • It falls into a CrashLoopBackOff loop, preventing deployment.
    • How I patch this:
  4. Step 8&9: Error from server (NotFound): services "topology-monitor" not found.
    • I couldn't find any instruction for topology-monitor. If there are any, please let me know.
  5. Some astronomy_shop pods also got OOMKilled.
    • This also blocks the deployment.
    • How I patch this:
      • adService: increase the memory limit to 500Mi.
      • frauddetectionService: increase the memory limit to 500Mi.
      • kafka: increase the memory limit to 700Mi.
      • recommendationService: increase the memory limit to 700Mi.
    • In my test on astronomy shop, after patching (including elasticsearch), total memory usage is 16-17 GiB, which is close to what the local cluster setup mentions.
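Concretely, each of these patches just raises resources.limits.memory on the corresponding deployment. A sketch of one strategic-merge patch is below; the deployment and container names here are assumptions based on the chart's naming, so check them against your cluster first:

```yaml
# patch.yaml - raise adservice's memory limit to 500Mi; applied with e.g.:
#   kubectl -n <namespace> patch deployment adservice --patch-file patch.yaml
spec:
  template:
    spec:
      containers:
        - name: adservice        # container name is an assumption
          resources:
            limits:
              memory: 500Mi
```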
  6. The deployment took 15 minutes, which is too long and not debug-friendly. Is there a way to disable certain options for a faster deployment?
  7. Grafana: I followed the instructions and still see errors on the Grafana alerting page (http://localhost:8080/prometheus/alerting/list):

    Errors loading rules
    Failed to load the data source configuration for loki: Unable to fetch alert rules. Is the loki data source properly configured?
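For reference, this error usually means the loki data source either is not provisioned in Grafana or is unreachable. Grafana's standard data source provisioning for Loki looks roughly like the following; the URL and file location are assumptions on my part, not taken from this repo:

```yaml
# Grafana data source provisioning, e.g. under provisioning/datasources/.
apiVersion: 1
datasources:
  - name: loki
    type: loki
    access: proxy
    url: http://loki:3100   # in-cluster service URL is an assumption
```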

  8. Chaos Mesh: install_chaos_mesh.yaml hard-codes containerd as the container runtime.
    • Some Kubernetes clusters use Docker as the container runtime, so this specification may not work on those clusters.
    • It would be better to mention this in the README, e.g. advising Docker-runtime users to modify their install_chaos_mesh.yaml.
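For Docker-runtime clusters, Chaos Mesh exposes Helm values to switch the daemon's runtime. A sketch of the override such a README note could point to is below; the key paths follow the upstream chart, but exactly how they wire into install_chaos_mesh.yaml is an assumption:

```yaml
# Chaos Mesh Helm values for clusters whose container runtime is Docker.
# The containerd default is runtime: containerd with
# socketPath: /run/containerd/containerd.sock.
chaosDaemon:
  runtime: docker
  socketPath: /var/run/docker.sock
```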
  9. INCIDENT_NUMBER=26&27:
    • Based on the patches from problems 3 & 5, I ran these two injections, but I didn't observe any change:
      • No pod/container errors in Kubernetes.
      • No alerts on the Grafana page.
    • If there is no issue in the code, please tell me how I can observe the faults.
    • Moreover, it often hangs while removing the fault (e.g. INCIDENT_NUMBER=26 make remove_incident_fault).
      • Please let me know if any log is needed.