
[Work in progress] Metricbeat - fix Azure duplicates #36778

Draft · wants to merge 11 commits into main

Conversation

@tdancheva (Contributor) commented Oct 6, 2023

Proposed commit message

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Author's Checklist

  • [ ]

How to test this PR locally

Related issues

Use cases

Screenshots

Logs

@tdancheva tdancheva added the bug label Oct 6, 2023
@tdancheva tdancheva self-assigned this Oct 6, 2023
@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Oct 6, 2023
mergify bot (Contributor) commented Oct 6, 2023

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @tdancheva? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v8./d.0 is the label to automatically backport to the 8./d branch. /d is the digit

@tdancheva (Contributor, Author):

For context, you can look at this diagram.

[Diagram: Azure_fix_duplicates]

@elasticmachine (Collaborator) commented Oct 6, 2023

💚 Build Succeeded



Build stats

  • Start Time: 2023-10-18T16:16:33.246+0000

  • Duration: 53 min 21 sec

Test stats 🧪

  • Failed: 0
  • Passed: 1479
  • Skipped: 96
  • Total: 1575

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments


To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@tdancheva tdancheva added the Team:Cloud-Monitoring Label for the Cloud Monitoring team label Oct 9, 2023
@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Oct 9, 2023
@tdancheva (Contributor, Author):

Example for time grain 5 min, period 5 minutes:
[Diagram: Azure_duplicates]

Example for time grain 1 h, period 5 minutes:
[Diagram: Azure_larger_timegrain]

@kaiyan-sheng (Contributor):

Thanks @tdancheva for working on this!! The diagrams are super helpful. Two things here:

  1. For time grain 1 h, period 5 minutes: with this implementation we will not have duplicates. But within a 1-hour time range, Metricbeat will collect 12 times and only 1 of those collections gets data. This means extra API calls will be made, which might be a problem if cost goes up. Maybe still consider separating metrics with different time grains into different data sets to avoid the extra API calls with empty returns? Either way, this change is needed :)

  2. Maybe we should also introduce a latency parameter to the Azure Monitor module, so that instead of defaulting to a delay of one interval, the user can configure it based on the metrics.

@tdancheva (Contributor, Author) commented Oct 19, 2023

> Thanks @tdancheva for working on this!! The diagrams are super helpful. Two things here:

Thanks so much for the feedback @kaiyan-sheng !

> 1. For time grain 1 h, period 5 minutes: with this implementation we will not have duplicates. But within a 1-hour time range, Metricbeat will collect 12 times and only 1 of those collections gets data. This means extra API calls will be made, which might be a problem if cost goes up. Maybe still consider separating metrics with different time grains into different data sets to avoid the extra API calls with empty returns? Either way, this change is needed :)

With the change, for time grain 1h the module makes the API call to get the values only once per hour, so we are not calling the API 12 times. I should change that diagram and remove the time span except where the values are collected, because that is the only time it actually makes the call. In the other iterations I remove the metric metadata from the client and move on to the next metric, skipping the API call.

This holds for any time grain larger than the period. I couldn't test all of them because not all are present in our metricsets, but it would be great if you could also have a look at the logic to make sure I am not missing anything.

In any case, we are also looking with @zmoog into the possibility of separating them.

> 2. Maybe we should also introduce a latency parameter to the Azure Monitor module, so that instead of defaulting to a delay of one interval, the user can configure it based on the metrics.

Agreed!! We had that discussion with @zmoog as well. Given the info from Azure around latency, we are definitely overdoing it, especially if, say, the period is set to 1h. And we are not even documenting it or communicating to the user that this is what happens in the background.

@zmoog (Contributor) commented Nov 3, 2023

@tdancheva, this solution is working fine. However, some metricset configurations do not specify the time grain.

Here's an example from the database account manifest.yml file:

    - resource_group: ""
      resource_type: "Microsoft.DocumentDb/databaseAccounts"
      metrics:
      - name: 
        - "AddRegion"
        - "RemoveRegion"
        - "UpdateAccountReplicationSettings"
        - "UpdateAccountNetworkSettings"
        - "UpdateAccountKeys"
        - "ServiceAvailability"
        - "ReplicationLatency"
        - "RegionFailover"
        - "DeleteAccount"
        - "CreateAccount"
        - "UpdateDiagnosticsSettings"
        namespace: "Microsoft.DocumentDb/databaseAccounts"

This metricset only specifies the metric names and namespace. I guess it's intentional, probably because the metrics in this list have different time grains.

The API works with and without time grains. If we don't specify the time grain, Azure returns the data with some default time grain.

If we don't have the time grain from the config, we can't juggle the time intervals before making the request to the API.

So I tried a slightly different approach:

  1. Call the API, with or without a time grain.
  2. After the call, keep track of the actual time grain (the API response always contains the time grain for the data returned).
  3. Also keep track of the collection timestamp.
  4. On the next iteration, check the collection time and time grain to decide whether to fetch the data or skip the collection.

Here's the implementation.

This approach seems to work fine:

  • on the first iteration, the metric registry does not have any data, so the metricset collects all the values
  • on the following iterations, the metricset only collects new metric values if enough time has passed since the last collection timestamp.

mergify bot (Contributor) commented Nov 21, 2023

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b metricbeat-azure-duplicates upstream/metricbeat-azure-duplicates
git merge upstream/main
git push upstream metricbeat-azure-duplicates

1 similar comment (mergify bot, Feb 5, 2024)

mergify bot repeated the backport label reminder on Feb 5, 2024.

mergify bot (Contributor) commented Dec 26, 2024

backport-8.x has been added to help with the transition to the new 8.x branch.
If you don't need it, please add the backport-skip label and remove the backport-8.x label.

@mergify mergify bot added the backport-8.x Automated backport to the 8.x branch with mergify label Dec 26, 2024
mergify bot repeated the merge-conflict notice on Dec 26, 2024.

Labels: backport-8.x (Automated backport to the 8.x branch with mergify) · bug · Team:Cloud-Monitoring (Label for the Cloud Monitoring team)

4 participants