Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add EKS integration test for FluentBit log emission with Entity field #427

Merged
merged 4 commits into from
Nov 13, 2024

Conversation

zhihonl
Copy link
Contributor

@zhihonl zhihonl commented Nov 12, 2024

Description of the issue

FluentBit has a new feature to emit CloudWatch logs with entity field which helps users navigate between their telemetry in the CloudWatch console. We want to add an integration test which helps us validate this behavior.

This PR focuses on the test scenario where FluentBit falls back to using K8sWorkload as the service name and uses default environment name.

Description of changes

  • Add new terraform that uses helm chart to set up the cluster
  • Create a worker pod that output log every second
  • Add test code to validate the entity fields contain the necessary values.

License

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Tests

Test run: https://github.com/aws/amazon-cloudwatch-agent/actions/runs/11790939699/job/32939527206 (Ran after commit 861e08b)

@zhihonl zhihonl requested a review from a team as a code owner November 12, 2024 05:14
@zhihonl zhihonl requested review from lisguo and varunch77 November 13, 2024 15:25
varunch77
varunch77 previously approved these changes Nov 13, 2024
role_arn = module.basic_components.role_arn
version = var.k8s_version
enabled_cluster_log_types = [
"api",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need these? Not sure what these do

Copy link
Contributor Author

@zhihonl zhihonl Nov 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tried running the setup without these and still works. I think these provide control plane logs from EKS under /aws/eks log group but we only care about application logs from Container Insights log group for these tests. Will remove in next commit.

// Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
// SPDX-License-Identifier: MIT

module "common" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there no common terraform to create an eks cluster? seems like we have duplicate code here from our EKS integ tests...can we just reuse it and then apply the helm chart?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no common terraform to create EKS cluster. All terraform setups for each test under terraform/eks folder have their own cluster setup. I think your point is valid though, having a common code to avoid potential duplication is better but I think it's out of scope for this PR.

entityServiceNameSource = "@entity.Attributes.AWS.ServiceNameSource"
)

type expectedEntity struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't this just be the cloudwatch.Entity struct? Then we can have checks to ensure the KeyAttributes/Attributes are valid in the struct?

Copy link
Contributor Author

@zhihonl zhihonl Nov 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When querying the logs, the response doesn't return in cloudwatch.Entity type, it's Results [][]*ResultField which is basically arrays of strings. We could try to use cloudwatch.Entity just to hold the expected output but it would require additional logics to deference the maps just to turn them into strings so thought using a custom test struct is cleaner.


// ValidateEntity queries a given LogGroup/LogStream combination given the start and end times, and executes an
// log query for entity attributes to ensure the entity values are correct
func ValidateEntity(t *testing.T, logGroup, logStream string, end *time.Time, queryString string, expectedEntity expectedEntity, entityPlatformType string) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func ValidateEntity(t *testing.T, logGroup, logStream string, end *time.Time, queryString string, expectedEntity expectedEntity, entityPlatformType string) {
func ValidateLogEntity(t *testing.T, logGroup, logStream string, end *time.Time, queryString string, expectedEntity expectedEntity, entityPlatformType string) {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will fix in next commit

t.Fatalf("application log group used for entity validation doesn't exsit: %s", logGroup)
}
begin := end.Add(-sleepForFlush * 2)
log.Printf("Start time is " + begin.String() + " and end time is " + end.String())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make sure these are accurate...how are we tracking the begin time? It should be when the terraform is initiated?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Begin time is 2 time before the end point. This is the current flow:

  1. Terraform initialize all resources(cluster, helm chart, etc)
  2. Test starts running
  3. Sleep 3 minutes
  4. End time is set after 3 minutes sleep time
  5. Begin time is 6 minutes before end time since it's doing the *2 multiplication. We could remove the multiplication but I did it just to ensure we don't somehow miss a query by barely missing the start time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but this assumes that the cluster was created within 6 minutes before the end time right? Wondering if there is a way to start the timer at terraform init. Not sure how we are doing it for our other tests

if !assert.NotZero(t, len(result)) {
return
}
if entityPlatformType == "EC2" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feel like we should have a defined constant for required KeyAttributes and Attributes for each of the platforms.

Something like

ec2RequiredKeyAttributes := map[string]struct{}{} {
    "Service": {},
    "Environment":{},
    ...
}

And have some validator in a util func. Having a map[string]struct{}{} is just easier to do the checks...not sure if we have a List abstraction for this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Going to create a validator struct that performs different logics based on platforms.

@@ -197,6 +197,47 @@ func GetLogQueryStats(logGroupName string, startTime, endTime int64, queryString
}
}

// GetLogQueryResults for the log group between start/end (in epoch seconds) for the
// query string.
func GetLogQueryResults(logGroupName string, startTime, endTime int64, queryString string) ([][]types.ResultField, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't have something like this already? Something that checks for the existence of logs in a log group within a certain time period?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have GetLogQueryStats function which does what you described but that returns the statistics of the query result which does not include log fields so we can't validate entity from that. GetLogQueryResults function is a copy of that function except that it returns results.Results

agentConfigPath: filepath.Join("resources", "compass_default_log.json"),
podName: "log-generator",
expectedEntity: expectedEntity{
entityType: "Service",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we could have an enum for different entity types: Service, Resource

Same for platformType: AWS::EKS, AWS::K8s, etc

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will address in the next CR where the values get re-used

@zhihonl zhihonl merged commit c0aafbb into main Nov 13, 2024
2 checks passed
@zhihonl zhihonl deleted the eks-entity branch November 13, 2024 20:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants