Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify documentation of beats output errors #27763

Open
jsoriano opened this issue Sep 6, 2021 · 17 comments
Open

Clarify documentation of beats output errors #27763

jsoriano opened this issue Sep 6, 2021 · 17 comments

Comments

@jsoriano
Copy link
Member

jsoriano commented Sep 6, 2021

There are several metrics reporting output errors from the beat module, clarify their meaning in the docs, focusing on how worrisome these errors are for users.

For example it is not clear if a non-zero beat.stats.libbeat.output.write.errors implied some data loss, though it probably didn't if beat.stats.libbeat.output.events.dropped is zero.

For confirmed bugs, please report:

@jsoriano jsoriano added docs Metricbeat Metricbeat Team:Integrations Label for the Integrations team Feature:Stack Monitoring labels Sep 6, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/integrations (Team:Integrations)

@elasticmachine
Copy link
Collaborator

Pinging @elastic/stack-monitoring (Stack monitoring)

@nkakouros
Copy link

The user facing part of this (i.e. the kibana interface) could also benefit from a bit more clarification. For instance, in the image below, what should I be worried about? Do I have dropped events that got lost forever without being logged? How does "Output Errors" relate to "Fail Rates"?

Screen Shot 2021-09-08 at 14 03 15

@tom-x1
Copy link

tom-x1 commented Oct 10, 2021

After reading the discussion and the docs I still don't know about the exact meaning of those metrics in the stack monitoring. I think we all agree that one of the most important questions a user has is: "Do I lose data or not?" I still can't answer this question. For example: I'm using packetbeat to gather DNS traffic and from time to time the "Failed in Pipeline" counter is jumping from 0 to 1000 and as stated above I'm asking myself: Do I lose those DNS queries or not?

Also I don't really get the difference between the two graphs "Fail Rates" and "Output Errors". To me it sounds like they are both doing the same, but they probably don't?

@sachin-frayne
Copy link
Contributor

I would also like to know the meaning of these values, especially beat.stats.libbeat.output.events.dropped, if that is larger than 0 am I losing data?

@6fears7
Copy link

6fears7 commented Oct 6, 2022

Inside the code for Libbeat:

	//
	// Output event stats
	//
	batches *monitoring.Uint // total number of batches processed by output
	events  *monitoring.Uint // total number of events processed by output

	acked      *monitoring.Uint // total number of events ACKed by output
	failed     *monitoring.Uint // total number of events failed in output
	active     *monitoring.Uint // events sent and waiting for ACK/fail from output
	duplicates *monitoring.Uint // events sent and waiting for ACK/fail from output
	dropped    *monitoring.Uint // total number of invalid events dropped by the output
	tooMany    *monitoring.Uint // total number of too many requests replies from output

	//
	// Output network connection stats
	//
	writeBytes  *monitoring.Uint // total amount of bytes written by output
	writeErrors *monitoring.Uint // total number of errors on write

	readBytes  *monitoring.Uint // total amount of bytes read
	readErrors *monitoring.Uint // total number of errors while waiting for response on output
}

When you're querying in Elastic for results of Libbeat (see below), the Output Errors is derived from the measured delta between the initial timestamp's readErrors + writeErrors and the latest timestamp's readErrors + writeErrors. According to the code commentary then, Output Errors is the number of network packets experiencing errors.

The below example is utilizing apm-server as the beat type, but you can replace it to suit your needs.

GET _search

{
    "query": {
      "bool": {
        "filter": [
          {
            "bool": {
              "should": [
                {
                  "term": {
                    "data_stream.dataset": "beats.stats"
                  }
                },
                {
                  "term": {
                    "metricset.name": "stats"
                  }
                },
                {
                  "term": {
                    "type": "beats_stats"
                  }
                }
              ]
            }
          },
          {
            "term": {
              "cluster_uuid": "CLUSTER_UUID"
            }
          },
          {
            "range": {
              "beats_stats.timestamp": {
                "format": "epoch_millis",
                "gte": 1665053615330,
                "lte": 1665054515330
              }
            }
          },
          {
            "bool": {
              "must": {
                "term": {
                  "beats_stats.beat.type": "apm-server"
                }
              }
            }
          }
        ]
      }
    },
    "collapse": {
      "field": "beats_stats.metrics.beat.info.ephemeral_id",
      "inner_hits": {
        "name": "earliest",
        "size": 1,
        "sort": [
          {
            "beats_stats.timestamp": {
              "order": "asc",
              "unmapped_type": "long"
            }
          },
          {
            "@timestamp": {
              "order": "asc",
              "unmapped_type": "long"
            }
          }
        ]
      }
    },
    "sort": [
      {
        "beats_stats.beat.uuid": {
          "order": "asc",
          "unmapped_type": "long"
        }
      },
      {
        "timestamp": {
          "order": "desc",
          "unmapped_type": "long"
        }
      }
    ]
  }

@zez3
Copy link

zez3 commented Oct 6, 2022

According to the code commentary then, Output Errors is the number of network packets experiencing errors.

@6fears7
So, we have event loss or not?
I don't have any other errors, like 429

@zez3
Copy link

zez3 commented Oct 6, 2022

I've added this beat.stats.libbeat.output.events.dropped to my Agent dashboard

image

  • I get an daily alarm telling me that,

eg.

monitor filebeat drops - * is in a state of ALERT

Reason:
beat.stats.libbeat.output.events.dropped is 3,313,220.83422 in the last 1 day for all hosts. Alert when > 0.

That 3 mil./day events are kind of a lot from my customer point of view

@tom-x1
Copy link

tom-x1 commented Oct 7, 2022

@6fears7
Well that still does not answer if I have lost events or not. An "error" doesn't necessarily mean that I have lost events, I don't know if after those errors occur some kind of mechanism tries to process those faulty events again or not. Telling the users just "Error" without clearly letting them know the consequences of it does not help. A graph which says "Events lost" would be clear and helpful in my opinion.

@6fears7
Copy link

6fears7 commented Oct 7, 2022

I've added this beat.stats.libbeat.output.events.dropped to my Agent dashboard

image

* I get an daily alarm telling me that,

eg.

monitor filebeat drops - * is in a state of ALERT

Reason:
beat.stats.libbeat.output.events.dropped is 3,313,220.83422 in the last 1 day for all hosts. Alert when > 0.

That 3 mil./day events are kind of a lot from my customer point of view

In client.go, we are given an initial struct of:

type bulkResultStats struct {
	acked        int // number of events ACKed by Elasticsearch
	duplicates   int // number of events failed with `create` due to ID already being indexed
	fails        int // number of failed events (can be retried)
	nonIndexable int // number of failed events (not indexable)
	tooMany      int // number of events receiving HTTP 429 Too Many Requests
}

Later, we see what constitutes a drop:

failed := len(failedEvents)
	span.Context.SetLabel("events_failed", failed)
	if st := client.observer; st != nil {
		dropped := stats.nonIndexable 
		duplicates := stats.duplicates
		acked := len(data) - failed - dropped - duplicates

		st.Acked(acked)
		st.Failed(failed)
		st.Dropped(dropped)
		st.Duplicate(duplicates)
		st.ErrTooMany(stats.tooMany)
	}

So dropped events would be those of the nonIndexable type.

In order to determine what a "nonIndexable" type is, the code iterates through the Bulk results:

... 
		if status < 500 { 
			if status == http.StatusTooManyRequests {
				stats.tooMany++
			} else {
				// hard failure, apply policy action
				result, _ := data[i].Content.Meta.HasKey(dead_letter_marker_field)
				if result {
					stats.nonIndexable++
					client.log.Errorf("Can't deliver to dead letter index event %#v (status=%v): %s", data[i], status, msg)
					// poison pill - this will clog the pipeline if the underlying failure is non transient.
				} else if client.NonIndexableAction == dead_letter_index {
					client.log.Warnf("Cannot index event %#v (status=%v): %s, trying dead letter index", data[i], status, msg)
					if data[i].Content.Meta == nil {
						data[i].Content.Meta = common.MapStr{
							dead_letter_marker_field: true,
						}
					} else {
						data[i].Content.Meta.Put(dead_letter_marker_field, true)
					}
					data[i].Content.Fields = common.MapStr{
						"message":       data[i].Content.Fields.String(),
						"error.type":    status,
						"error.message": string(msg),
					}
				} else { // drop
					stats.nonIndexable++
					client.log.Warnf("Cannot index event %#v (status=%v): %s, dropping event!", data[i], status, msg)
					continue
				}

So any event that cannot be indexed (as an example here) will become tagged under the dead_letter_marker_field and dropped from the pipeline.

To answer your question, I do believe that those events are worth looking into. My first guess would be to check how the pipeline is being parsed and if there's some Grok that needs to be done.

@zez3
Copy link

zez3 commented Oct 9, 2022

Thank you for clarifying @6fears7

I use official pipelines from Fleet integrations but also custom ones. I also use dissect and the rest of filebeat processor.(dns is still missing from elasticsearch processor)
@sachin-frayne
Can you help now to further identify those events and/or check the pipelines? as sugested above.

@zez3
Copy link

zez3 commented Oct 9, 2022

Here is what I see in my Agent filbeat metrics

{
"_index": ".ds-metrics-elastic_agent.filebeat-ece-2022.09.28-000017",
"_id": "H4Apu4MBF-Kn07JyC-C-",
"_version": 1,
"_score": 0,
"_source": {
"agent": {
"name": "my_host.my.dom",
"id": "4ba5a4aa-848d-4435-a13e-ace9584cddaa",
"ephemeral_id": "2a66e283-d27a-433b-8b39-ae7105878348",
"type": "metricbeat",
"version": "8.4.2"
},
"@timestamp": "2022-10-09T05:12:51.923Z",
"ecs": {
"version": "8.0.0"
},
"service": {
"address": "http://unix/stats",
"name": "beat",
"type": "beat"
},
"data_stream": {
"namespace": "ece",
"type": "metrics",
"dataset": "elastic_agent.filebeat"
},
"beat": {
"elasticsearch": {
"cluster": {
"id": "tOaF05W8QCG1Q3z25K4WZQ"
}
},
"stats": {
"handles": {
"limit": {
"hard": 524288,
"soft": 1024
},
"open": 148
},
"system": {
"load": {
"1": 7.16,
"5": 8.6,
"15": 8.53,
"norm": {
"1": 0.0559,
"5": 0.0672,
"15": 0.0666
}
},
"cpu": {
"cores": 128
}
},
"beat": {
"host": "my_host.my.dom",
"name": "my_host.my.dom",
"type": "filebeat",
"uuid": "a97378a4-c48f-45c2-901b-7b8b5f6b6502",
"version": "8.4.2"
},
"runtime": {
"goroutines": 289
},
"cpu": {
"total": {
"ticks": 13560390,
"time": {
"ms": 13560390
},
"value": 13560390
},
"system": {
"ticks": 1185830,
"time": {
"ms": 1185830
}
},
"user": {
"ticks": 12374560,
"time": {
"ms": 12374560
}
}
},
"memstats": {
"memory": {
"total": 1581247076272,
"alloc": 286642512
},
"rss": 546992128,
"gc_next": 322126080
},
"libbeat": {
"output": {
"read": {
"bytes": 515468121,
"errors": 0
},
"type": "elasticsearch",
"write": {
"bytes": 33985770829,
"errors": 0
},
"events": {
"batches": 479155,
"duplicates": 71111,
"total": 23936376,
"dropped": 23671,
"toomany": 0,
"active": 50,
"failed": 0,
"acked": 23841544
}
},
"pipeline": {
"clients": 20,
"queue": {
"acked": 23936326
},
"events": {
"total": 23940457,
"filtered": 12,
"dropped": 0,
"active": 4119,
"failed": 0,
"published": 23940442,
"retry": 2048
}
},
"config": {
"running": 0,
"stops": 0,
"starts": 20,
"reloads": 0
}
},
"cgroup": {
"memory": {
"mem": {
"usage": {
"bytes": 2462949376
},
"limit": {
"bytes": 53687091200
}
},
"id": "elastic-agent.service"
},
"cpu": {
"cfs": {
"period": {
"us": 100000
},
"quota": {
"us": 0
}
},
"stats": {
"periods": 0,
"throttled": {
"ns": 0,
"periods": 0
}
},
"id": "elastic-agent.service"
},
"cpuacct": {
"total": {
"ns": 4050343270691067
},
"id": "elastic-agent.service"
}
},
"uptime": {
"ms": 27149727
},
"info": {
"name": "filebeat",
"ephemeral_id": "09bd12a2-0e46-4f6a-9c34-cf67b4a8ae42",
"version": "8.4.2",
"uptime": {
"ms": 27149727
}
}
},
"id": "a97378a4-c48f-45c2-901b-7b8b5f6b6502",
"type": "filebeat"
},
"host": {
"hostname": "my_host.my.dom",
"os": {
"kernel": "5.4.0-104-generic",
"codename": "focal",
"name": "Ubuntu",
"type": "linux",
"family": "debian",
"version": "20.04.5 LTS (Focal Fossa)",
"platform": "ubuntu"
},
"containerized": false,
"ip": [
"fe80::1ac0:4dff:fe7a:5796",
"1.1.1.1",
"fe80::ba59:9fff:fed2:222",
"1.1.1.1",
"192.168.43.1",
"fe80::42:73ff:fe00:5c0a",
"10.0.3.1",
"fe80::c5e:4ff:feff:61bb",
"fe80::b4c7:89ff:fe39:ef4d",
"fe80::8014:f6ff:fe19:bd2a",
"fe80::f49b:71ff:fef1:1993",
"fe80::9454:a4ff:fee7:cdbf",
"fe80::dcd6:6ff:fee9:c14d",
"fe80::34fa:29ff:fe12:af8a",
"fe80::866:4ff:fe83:8873",
"fe80::7059:d7ff:fe3d:76a4",
"fe80::446d:53ff:febc:ede3",
"fe80::841e:e9ff:feda:42e",
"fe80::6086:33ff:feee:6f81",
"fe80::d45a:b8ff:fe91:c903",
"fe80::14a4:11ff:fed1:a6a5",
"fe80::dc56:88ff:fe6d:a3ba",
"fe80::2c5d:f2ff:fec6:b0eb",
"fe80::2803:d9ff:fe72:8edd",
"fe80::8ce0:aff:fef4:f9df",
"fe80::2c9d:77ff:fed3:b27a",
"fe80::8875:6ff:fe86:64f9",
"fe80::bc5e:83ff:feb7:348a",
"fe80::d483:27ff:fecb:8415",
"fe80::c0ba:99ff:fe90:cedf",
"fe80::ac20:56ff:fe39:da07",
"fe80::7c8b:dfff:fe76:c1ee",
"fe80::7016:58ff:fef8:e362",
"fe80::5482:78ff:fefc:2464",
"fe80::586d:18ff:fe90:f306"
],
"name": "my_host.my.dom",
"id": "06d09b925a0246dcbd044c0036e01c9c",
"mac": [
"00:16:3e:00:00:00",
"02:42:73:00:5c:0a",
"0a:66:04:83:88:73",
"0e:5e:04:ff:61:bb",
"16:a4:11:d1:a6:a5",
"18:c0:4d:7a:57:96",
"18:c0:4d:7a:57:97",
"2a:03:d9:72:8e:dd",
"2a:f8:b1:d1:37:35",
"2e:5d:f2:c6:b0:eb",
"2e:9d:77:d3:b2:7a",
"36:fa:29:12:af:8a",
"46:6d:53:bc:ed:e3",
"56:82:78:fc:24:64",
"5a:6d:18:90:f3:06",
"62:86:33:ee:6f:81",
"72:16:58:f8:e3:62",
"72:59:d7:3d:76:a4",
"7e:8b:df:76:c1:ee",
"82:14:f6:19:bd:2a",
"86:1e:e9:da:04:2e",
"8a:75:06:86:64:f9",
"8e:e0:0a:f4:f9:df",
"96:54:a4:e7:cd:bf",
"ae:20:56:39:da:07",
"b6:c7:89:39:ef:4d",
"b8:59:9f:d2:02:22",
"be:5e:83:b7:34:8a",
"c2:ba:99:90:ce:df",
"d6:5a:b8:91:c9:03",
"d6:83:27:cb:84:15",
"de:56:88:6d:a3:ba",
"de:d6:06:e9:c1:4d",
"f6:9b:71:f1:19:93"
],
"architecture": "x86_64"
},
"elastic_agent": {
"id": "4ba5a4aa-848d-4435-a13e-ace9584cddaa",
"version": "8.4.2",
"snapshot": false
},
"metricset": {
"period": 10000,
"name": "stats"
},
"event": {
"duration": 28331585,
"agent_id_status": "verified",
"ingested": "2022-10-09T05:12:52Z",
"module": "beat",
"dataset": "elastic_agent.filebeat"
}
},
"fields": {
"beat.stats.cgroup.cpu.id": [
"elastic-agent.service"
],
"beat.stats.libbeat.output.write.bytes": [
33985770829
],
"beat.stats.libbeat.pipeline.events.failed": [
0
],
"elastic_agent.version": [
"8.4.2"
],
"beat.stats.libbeat.pipeline.events.published": [
23940442
],
"host.os.name.text": [
"Ubuntu"
],
"host.hostname": [
"my_host.my.dom"
],
"host.mac": [
"00:16:3e:00:00:00",
"02:42:73:00:5c:0a",
"0a:66:04:83:88:73",
"0e:5e:04:ff:61:bb",
"16:a4:11:d1:a6:a5",
"18:c0:4d:7a:57:96",
"18:c0:4d:7a:57:97",
"2a:03:d9:72:8e:dd",
"2a:f8:b1:d1:37:35",
"2e:5d:f2:c6:b0:eb",
"2e:9d:77:d3:b2:7a",
"36:fa:29:12:af:8a",
"46:6d:53:bc:ed:e3",
"56:82:78:fc:24:64",
"5a:6d:18:90:f3:06",
"62:86:33:ee:6f:81",
"72:16:58:f8:e3:62",
"72:59:d7:3d:76:a4",
"7e:8b:df:76:c1:ee",
"82:14:f6:19:bd:2a",
"86:1e:e9:da:04:2e",
"8a:75:06:86:64:f9",
"8e:e0:0a:f4:f9:df",
"96:54:a4:e7:cd:bf",
"ae:20:56:39:da:07",
"b6:c7:89:39:ef:4d",
"b8:59:9f:d2:02:22",
"be:5e:83:b7:34:8a",
"c2:ba:99:90:ce:df",
"d6:5a:b8:91:c9:03",
"d6:83:27:cb:84:15",
"de:56:88:6d:a3:ba",
"de:d6:06:e9:c1:4d",
"f6:9b:71:f1:19:93"
],
"beat.stats.cpu.user.time.ms": [
12374560
],
"service.type": [
"beat"
],
"beat.stats.beat.type": [
"filebeat"
],
"beat.stats.libbeat.config.starts": [
20
],
"host.os.version": [
"20.04.5 LTS (Focal Fossa)"
],
"beat.stats.cgroup.cpu.cfs.quota.us": [
0
],
"agent.name": [
"my_host.my.dom"
],
"beat.stats.libbeat.config.stops": [
0
],
"beat.stats.cgroup.memory.id": [
"elastic-agent.service"
],
"event.agent_id_status": [
"verified"
],
"beat.stats.cpu.total.value": [
13560390
],
"beat.stats.libbeat.output.events.duplicates": [
71111
],
"host.os.type": [
"linux"
],
"beat.stats.libbeat.output.events.batches": [
479155
],
"beat.stats.libbeat.pipeline.events.filtered": [
12
],
"host.architecture": [
"x86_64"
],
"agent.id": [
"4ba5a4aa-848d-4435-a13e-ace9584cddaa"
],
"beat.stats.cpu.user.ticks": [
12374560
],
"beat.stats.cgroup.cpu.stats.periods": [
0
],
"host.containerized": [
false
],
"service.address": [
"http://unix/stats"
],
"beat.stats.libbeat.output.events.active": [
50
],
"beat.stats.libbeat.pipeline.events.total": [
23940457
],
"beat.stats.cpu.total.ticks": [
13560390
],
"beat.stats.info.version": [
"8.4.2"
],
"beat.stats.handles.limit.hard": [
524288
],
"beat.stats.runtime.goroutines": [
289
],
"beat.stats.beat.name": [
"my_host.my.dom"
],
"beat.stats.beat.version": [
"8.4.2"
],
"beat.stats.libbeat.pipeline.queue.acked": [
23936326
],
"beat.stats.libbeat.config.running": [
0
],
"host.ip": [
"fe80::1ac0:4dff:fe7a:5796",
"1.1.1.1",
"fe80::ba59:9fff:fed2:222",
"1.1.1.1",
"192.168.43.1",
"fe80::42:73ff:fe00:5c0a",
"10.0.3.1",
"fe80::c5e:4ff:feff:61bb",
"fe80::b4c7:89ff:fe39:ef4d",
"fe80::8014:f6ff:fe19:bd2a",
"fe80::f49b:71ff:fef1:1993",
"fe80::9454:a4ff:fee7:cdbf",
"fe80::dcd6:6ff:fee9:c14d",
"fe80::34fa:29ff:fe12:af8a",
"fe80::866:4ff:fe83:8873",
"fe80::7059:d7ff:fe3d:76a4",
"fe80::446d:53ff:febc:ede3",
"fe80::841e:e9ff:feda:42e",
"fe80::6086:33ff:feee:6f81",
"fe80::d45a:b8ff:fe91:c903",
"fe80::14a4:11ff:fed1:a6a5",
"fe80::dc56:88ff:fe6d:a3ba",
"fe80::2c5d:f2ff:fec6:b0eb",
"fe80::2803:d9ff:fe72:8edd",
"fe80::8ce0:aff:fef4:f9df",
"fe80::2c9d:77ff:fed3:b27a",
"fe80::8875:6ff:fe86:64f9",
"fe80::bc5e:83ff:feb7:348a",
"fe80::d483:27ff:fecb:8415",
"fe80::c0ba:99ff:fe90:cedf",
"fe80::ac20:56ff:fe39:da07",
"fe80::7c8b:dfff:fe76:c1ee",
"fe80::7016:58ff:fef8:e362",
"fe80::5482:78ff:fefc:2464",
"fe80::586d:18ff:fe90:f306"
],
"agent.type": [
"metricbeat"
],
"beat.stats.libbeat.output.read.errors": [
0
],
"beat.stats.memstats.gc_next": [
322126080
],
"elastic_agent.snapshot": [
false
],
"beat.id": [
"a97378a4-c48f-45c2-901b-7b8b5f6b6502"
],
"host.id": [
"06d09b925a0246dcbd044c0036e01c9c"
],
"beat.stats.system.load.15": [
8.53
],
"beat.stats.libbeat.pipeline.clients": [
20
],
"elastic_agent.id": [
"4ba5a4aa-848d-4435-a13e-ace9584cddaa"
],
"beat.stats.info.uptime.ms": [
27149727
],
"beat.stats.cgroup.cpuacct.total.ns": [
4050343270691067
],
"metricset.period": [
10000
],
"host.os.codename": [
"focal"
],
"beat.stats.libbeat.output.events.dropped": [
23671
],
"event.duration": [
28331585
],
"beat.stats.system.cpu.cores": [
128
],
"event.ingested": [
"2022-10-09T05:12:52.000Z"
],
"@timestamp": [
"2022-10-09T05:12:51.923Z"
],
"host.os.platform": [
"ubuntu"
],
"data_stream.dataset": [
"elastic_agent.filebeat"
],
"beat.stats.system.load.norm.15": [
0.0666
],
"agent.ephemeral_id": [
"2a66e283-d27a-433b-8b39-ae7105878348"
],
"beat.stats.handles.limit.soft": [
1024
],
"beat.stats.info.name": [
"filebeat"
],
"beat.type": [
"filebeat"
],
"beat.stats.libbeat.config.reloads": [
0
],
"beat.stats.libbeat.output.events.acked": [
23841544
],
"beat.stats.beat.uuid": [
"a97378a4-c48f-45c2-901b-7b8b5f6b6502"
],
"host.os.name": [
"Ubuntu"
],
"beat.stats.uptime.ms": [
27149727
],
"beat.stats.libbeat.output.type": [
"elasticsearch"
],
"host.name": [
"my_host.my.dom"
],
"beat.stats.libbeat.output.events.toomany": [
0
],
"beat.stats.libbeat.output.events.failed": [
0
],
"beat.stats.libbeat.pipeline.events.dropped": [
0
],
"beat.stats.beat.host": [
"idsece02.log.unibe.ch"
],
"data_stream.type": [
"metrics"
],
"ecs.version": [
"8.0.0"
],
"beat.stats.cgroup.cpu.cfs.period.us": [
100000
],
"agent.version": [
"8.4.2"
],
"host.os.family": [
"debian"
],
"beat.stats.cpu.total.time.ms": [
13560390
],
"beat.stats.libbeat.output.write.errors": [
0
],
"beat.stats.memstats.memory.total": [
1581247076272
],
"beat.stats.handles.open": [
148
],
"beat.stats.libbeat.output.events.total": [
23936376
],
"beat.stats.memstats.memory.alloc": [
286642512
],
"beat.stats.system.load.norm.5": [
0.0672
],
"beat.stats.system.load.norm.1": [
0.0559
],
"beat.stats.libbeat.pipeline.events.active": [
4119
],
"event.module": [
"beat"
],
"host.os.kernel": [
"5.4.0-104-generic"
],
"beat.stats.cgroup.cpu.stats.throttled.ns": [
0
],
"beat.stats.info.ephemeral_id": [
"09bd12a2-0e46-4f6a-9c34-cf67b4a8ae42"
],
"beat.stats.cgroup.cpu.stats.throttled.periods": [
0
],
"beat.stats.cpu.system.time.ms": [
1185830
],
"service.name": [
"beat"
],
"beat.elasticsearch.cluster.id": [
"tOaF05W8QCG1Q3z25K4WZQ"
],
"data_stream.namespace": [
"ece"
],
"beat.stats.system.load.1": [
7.16
],
"beat.stats.cgroup.memory.mem.usage.bytes": [
2462949376
],
"beat.stats.cpu.system.ticks": [
1185830
],
"beat.stats.cgroup.memory.mem.limit.bytes": [
53687091200
],
"beat.stats.libbeat.output.read.bytes": [
515468121
],
"beat.stats.system.load.5": [
8.6
],
"beat.stats.cgroup.cpuacct.id": [
"elastic-agent.service"
],
"metricset.name": [
"stats"
],
"beat.stats.libbeat.pipeline.events.retry": [
2048
],
"beat.stats.memstats.rss": [
546992128
],
"event.dataset": [
"elastic_agent.filebeat"
]
}
}

my issue is with:

"beat.stats.libbeat.output.events.dropped": [
23671
],

and not with

"beat.stats.libbeat.pipeline.events.dropped": [
0
],

@6fears7
Does this mean that my events are not dropped in the pipelines?

@zez3
Copy link

zez3 commented Oct 9, 2022

So, I get a lot of such messages:

Click to expand!

{
"_index": ".ds-logs-elastic_agent.filebeat-ece-2022.09.28-000029",
"_id": "u0iIu4MBozNqveEQYO8z",
"_version": 1,
"_score": 0,
"_source": {
"container": {
"id": "elastic-agent-d3eb3e"
},
"agent": {
"name": "myhost.my.dom",
"id": "7b48d297-3b7b-4993-9525-8d398719e062",
"type": "filebeat",
"ephemeral_id": "b841ef51-a943-47b0-b224-2aa86de88024",
"version": "8.4.2"
},
"service.name": "filebeat",
"log": {
"file": {
"path": "/opt/Elastic/Agent/data/elastic-agent-d3eb3e/logs/default/filebeat-20221009-8.ndjson"
},
"offset": 378916652
},
"elastic_agent": {
"id": "7b48d297-3b7b-4993-9525-8d398719e062",
"version": "8.4.2",
"snapshot": false
},
"message": "Cannot index event publisher.Event{Content:beat.Event{Timestamp:time.Date(2022, time.October, 9, 8, 56, 53, 910440446, time.Local), Meta:{"input_id":"udp-fortinet_fortigate-4cd23f99-4dfb-401f-817f-5ef6f9d4ca05","raw_index":"logs-fortinet_fortigate.log-ece_forti","stream_id":"udp-fortinet_fortigate.log-4cd23f99-4dfb-401f-817f-5ef6f9d4ca05","truncated":false}, Fields:{"agent":{"ephemeral_id":"f754f48a-8030-41f9-ae3f-dd4e7a8edbd9","id":"7b48d297-3b7b-4993-9525-8d398719e062","name":"myhost.my.dom","type":"filebeat","version":"8.4.2"},"data_stream":{"dataset":"fortinet_fortigate.log","namespace":"ece_forti","type":"logs"},"ecs":{"version":"8.0.0"},"elastic_agent":{"id":"7b48d297-3b7b-4993-9525-8d398719e062","snapshot":false,"version":"8.4.2"},"event":{"dataset":"fortinet_fortigate.log"},"input":{"type":"udp"},"log":{"source":{"address":"1.1.1.1:24676"}},"message":"\u003c190\u003edate=2022-10-09 time=08:56:53 devname=\"UniLabFW\" devid=\"FG100FTK20025470\" eventtime=1665298613860494320 tz=\"+0200\" logid=\"0112053203\" type=\"event\" subtype=\"connector\" level=\"information\" vd=\"Border\" logdesc=\"Dynamic address updated.\" addr=\"FCTEMS_ALL_FORTICLOUD_SERVERS\" msg=\"Updated tag FCTEMS_ALL_FORTICLOUD_SERVERS\"","tags":["fortinet-fortigate","fortinet-firewall","forwarded"]}, Private:interface {}(nil), TimeSeries:false}, Flags:0x1, Cache:publisher.EventCache{m:mapstr.M(nil)}} (status=400): {"type":"mapper_parsing_exception","reason":"failed to parse field [fortinet.firewall.addr] of type [ip] in document with id 'VkiIu4MBozNqveEQSZDG'. Preview of field's value: 'FCTEMS_ALL_FORTICLOUD_SERVERS'","caused_by":{"type":"illegal_argument_exception","reason":"'FCTEMS_ALL_FORTICLOUD_SERVERS' is not an IP string literal."}}, dropping event!",
"log.logger": "elasticsearch",
"log.origin": {
"file.line": 429,
"file.name": "elasticsearch/client.go"
},
"input": {
"type": "filestream"
},
"@timestamp": "2022-10-09T06:56:54.742Z",
"ecs": {
"version": "8.0.0"
},
"data_stream": {
"namespace": "ece",
"type": "logs",
"dataset": "elastic_agent.filebeat"
},
"host": {
"hostname": "myhost.my.dom",
"os": {
"kernel": "5.4.0-125-generic",
"codename": "focal",
"name": "Ubuntu",
"type": "linux",
"family": "debian",
"version": "20.04.5 LTS (Focal Fossa)",
"platform": "ubuntu"
},
"containerized": false,
"ip": [
"fe80::1ac0:4dff:fe7a:560a",
"1.1.1.1",
"fe80::ba59:9fff:fed2:4e6",
"10.0.3.1",
"192.168.43.1",
"fe80::42:d3ff:fe9e:1acf",
"fe80::602f:cbff:fee8:aba1",
"fe80::b09f:c2ff:feda:9a3f",
"fe80::5c79:49ff:febe:1ace",
"fe80::d850:fbff:fe99:98fe",
"fe80::f872:d1ff:fece:32cc",
"fe80::ccd5:2dff:fe5b:32f8",
"fe80::a6:65ff:fe4d:3a83",
"fe80::80c8:c2ff:fe88:3e9d",
"fe80::88a3:97ff:feb0:d411",
"fe80::6c0e:2ff:fe33:8663",
"fe80::e8cd:80ff:fe0b:94e8",
"fe80::d429:7eff:fe77:efdf",
"fe80::b882:f4ff:fefa:32dc",
"fe80::cc7d:73ff:fe22:422f",
"fe80::f0cc:c5ff:fe65:d8ec",
"fe80::82:47ff:fe79:a2f9",
"fe80::680e:d3ff:fe45:9e6",
"fe80::2490:2aff:fe9f:1157",
"fe80::944b:eaff:fef8:1ce6",
"fe80::b4c8:fdff:fe59:6a6c",
"fe80::45b:edff:fe5d:305e",
"fe80::301b:16ff:fe60:7713",
"fe80::a42d:eff:fe18:442a",
"fe80::64cb:b9ff:fe97:6305",
"fe80::600b:ebff:fedb:2e4f",
"fe80::dc91:3bff:fe29:b75b",
"fe80::5042:7eff:fe93:f008",
"fe80::605f:84ff:fea7:9128"
],
"name": "myhost.my.dom",
"id": "aa3d81edad7148488d332274f0ec28e1",
"mac": [
"00:16:3e:00:00:00",
"02:42:d3:9e:1a:cf",
"02:82:47:79:a2:f9",
"02:a6:65:4d:3a:83",
"06:5b:ed:5d:30:5e",
"18:c0:4d:7a:56:0a",
"18:c0:4d:7a:56:0b",
"26:90:2a:9f:11:57",
"32:1b:16:60:77:13",
"52:42:7e:93:f0:08",
"5e:79:49:be:1a:ce",
"62:0b:eb:db:2e:4f",
"62:2f:cb:e8:ab:a1",
"62:5f:84:a7:91:28",
"66:cb:b9:97:63:05",
"68:05:ca:13:70:e2",
"68:05:ca:13:70:e3",
"6a:0e:d3:45:09:e6",
"6e:0e:02:33:86:63",
"82:c8:c2:88:3e:9d",
"8a:a3:97:b0:d4:11",
"96:4b:ea:f8:1c:e6",
"a6:2d:0e:18:44:2a",
"b2:9f:c2:da:9a:3f",
"b6:c8:fd:59:6a:6c",
"b8:59:9f:d2:04:e6",
"ba:82:f4:fa:32:dc",
"ce:7d:73:22:42:2f",
"ce:d5:2d:5b:32:f8",
"d6:29:7e:77:ef:df",
"da:50:fb:99:98:fe",
"de:91:3b:29:b7:5b",
"ea:cd:80:0b:94:e8",
"f2:cc:c5:65:d8:ec",
"fa:72:d1:ce:32:cc"
],
"architecture": "x86_64"
},
"log.level": "warn",
"event": {
"agent_id_status": "verified",
"ingested": "2022-10-09T06:57:00Z",
"dataset": "elastic_agent.filebeat"
}
},
"fields": {
"elastic_agent.version": [
"8.4.2"
],
"host.os.name.text": [
"Ubuntu"
],
"host.hostname": [
"myhost.my.dom"
],
"host.mac": [
"00:16:3e:00:00:00",
"02:42:d3:9e:1a:cf",
"02:82:47:79:a2:f9",
"02:a6:65:4d:3a:83",
"06:5b:ed:5d:30:5e",
"18:c0:4d:7a:56:0a",
"18:c0:4d:7a:56:0b",
"26:90:2a:9f:11:57",
"32:1b:16:60:77:13",
"52:42:7e:93:f0:08",
"5e:79:49:be:1a:ce",
"62:0b:eb:db:2e:4f",
"62:2f:cb:e8:ab:a1",
"62:5f:84:a7:91:28",
"66:cb:b9:97:63:05",
"68:05:ca:13:70:e2",
"68:05:ca:13:70:e3",
"6a:0e:d3:45:09:e6",
"6e:0e:02:33:86:63",
"82:c8:c2:88:3e:9d",
"8a:a3:97:b0:d4:11",
"96:4b:ea:f8:1c:e6",
"a6:2d:0e:18:44:2a",
"b2:9f:c2:da:9a:3f",
"b6:c8:fd:59:6a:6c",
"b8:59:9f:d2:04:e6",
"ba:82:f4:fa:32:dc",
"ce:7d:73:22:42:2f",
"ce:d5:2d:5b:32:f8",
"d6:29:7e:77:ef:df",
"da:50:fb:99:98:fe",
"de:91:3b:29:b7:5b",
"ea:cd:80:0b:94:e8",
"f2:cc:c5:65:d8:ec",
"fa:72:d1:ce:32:cc"
],
"log.logger": [
"elasticsearch"
],
"container.id": [
"elastic-agent-d3eb3e"
],
"host.ip": [
"fe80::1ac0:4dff:fe7a:560a",
"1.1.1.1",
"fe80::ba59:9fff:fed2:4e6",
"10.0.3.1",
"192.168.43.1",
"fe80::42:d3ff:fe9e:1acf",
"fe80::602f:cbff:fee8:aba1",
"fe80::b09f:c2ff:feda:9a3f",
"fe80::5c79:49ff:febe:1ace",
"fe80::d850:fbff:fe99:98fe",
"fe80::f872:d1ff:fece:32cc",
"fe80::ccd5:2dff:fe5b:32f8",
"fe80::a6:65ff:fe4d:3a83",
"fe80::80c8:c2ff:fe88:3e9d",
"fe80::88a3:97ff:feb0:d411",
"fe80::6c0e:2ff:fe33:8663",
"fe80::e8cd:80ff:fe0b:94e8",
"fe80::d429:7eff:fe77:efdf",
"fe80::b882:f4ff:fefa:32dc",
"fe80::cc7d:73ff:fe22:422f",
"fe80::f0cc:c5ff:fe65:d8ec",
"fe80::82:47ff:fe79:a2f9",
"fe80::680e:d3ff:fe45:9e6",
"fe80::2490:2aff:fe9f:1157",
"fe80::944b:eaff:fef8:1ce6",
"fe80::b4c8:fdff:fe59:6a6c",
"fe80::45b:edff:fe5d:305e",
"fe80::301b:16ff:fe60:7713",
"fe80::a42d:eff:fe18:442a",
"fe80::64cb:b9ff:fe97:6305",
"fe80::600b:ebff:fedb:2e4f",
"fe80::dc91:3bff:fe29:b75b",
"fe80::5042:7eff:fe93:f008",
"fe80::605f:84ff:fea7:9128"
],
"agent.type": [
"filebeat"
],
"host.os.version": [
"20.04.5 LTS (Focal Fossa)"
],
"host.os.kernel": [
"5.4.0-125-generic"
],
"host.os.name": [
"Ubuntu"
],
"log.level": [
"warn"
],
"agent.name": [
"myhost.my.dom"
],
"elastic_agent.snapshot": [
false
],
"host.name": [
"myhost.my.dom"
],
"event.agent_id_status": [
"verified"
],
"host.id": [
"aa3d81edad7148488d332274f0ec28e1"
],
"log.origin.file.line": [
429
],
"service.name": [
"filebeat"
],
"host.os.type": [
"linux"
],
"elastic_agent.id": [
"7b48d297-3b7b-4993-9525-8d398719e062"
],
"data_stream.namespace": [
"ece"
],
"host.os.codename": [
"focal"
],
"input.type": [
"filestream"
],
"log.offset": [
378916652
],
"message": [
"Cannot index event publisher.Event{Content:beat.Event{Timestamp:time.Date(2022, time.October, 9, 8, 56, 53, 910440446, time.Local), Meta:{"input_id":"udp-fortinet_fortigate-4cd23f99-4dfb-401f-817f-5ef6f9d4ca05","raw_index":"logs-fortinet_fortigate.log-ece_forti","stream_id":"udp-fortinet_fortigate.log-4cd23f99-4dfb-401f-817f-5ef6f9d4ca05","truncated":false}, Fields:{"agent":{"ephemeral_id":"f754f48a-8030-41f9-ae3f-dd4e7a8edbd9","id":"7b48d297-3b7b-4993-9525-8d398719e062","name":"myhost.my.dom","type":"filebeat","version":"8.4.2"},"data_stream":{"dataset":"fortinet_fortigate.log","namespace":"ece_forti","type":"logs"},"ecs":{"version":"8.0.0"},"elastic_agent":{"id":"7b48d297-3b7b-4993-9525-8d398719e062","snapshot":false,"version":"8.4.2"},"event":{"dataset":"fortinet_fortigate.log"},"input":{"type":"udp"},"log":{"source":{"address":"1.1.1.1:24676"}},"message":"\u003c190\u003edate=2022-10-09 time=08:56:53 devname=\"UniLabFW\" devid=\"FG100FTK20025470\" eventtime=1665298613860494320 tz=\"+0200\" logid=\"0112053203\" type=\"event\" subtype=\"connector\" level=\"information\" vd=\"Border\" logdesc=\"Dynamic address updated.\" addr=\"FCTEMS_ALL_FORTICLOUD_SERVERS\" msg=\"Updated tag FCTEMS_ALL_FORTICLOUD_SERVERS\"","tags":["fortinet-fortigate","fortinet-firewall","forwarded"]}, Private:interface {}(nil), TimeSeries:false}, Flags:0x1, Cache:publisher.EventCache{m:mapstr.M(nil)}} (status=400): {"type":"mapper_parsing_exception","reason":"failed to parse field [fortinet.firewall.addr] of type [ip] in document with id 'VkiIu4MBozNqveEQSZDG'. Preview of field's value: 'FCTEMS_ALL_FORTICLOUD_SERVERS'","caused_by":{"type":"illegal_argument_exception","reason":"'FCTEMS_ALL_FORTICLOUD_SERVERS' is not an IP string literal."}}, dropping event!"
],
"data_stream.type": [
"logs"
],
"host.architecture": [
"x86_64"
],
"event.ingested": [
"2022-10-09T06:57:00.000Z"
],
"@timestamp": [
"2022-10-09T06:56:54.742Z"
],
"log.origin.file.name": [
"elasticsearch/client.go"
],
"agent.id": [
"7b48d297-3b7b-4993-9525-8d398719e062"
],
"host.containerized": [
false
],
"ecs.version": [
"8.0.0"
],
"host.os.platform": [
"ubuntu"
],
"data_stream.dataset": [
"elastic_agent.filebeat"
],
"log.file.path": [
"/opt/Elastic/Agent/data/elastic-agent-d3eb3e/logs/default/filebeat-20221009-8.ndjson"
],
"agent.ephemeral_id": [
"b841ef51-a943-47b0-b224-2aa86de88024"
],
"agent.version": [
"8.4.2"
],
"host.os.family": [
"debian"
],
"event.dataset": [
"elastic_agent.filebeat"
]
}
}

In the above case they come from an official Fortinet integration

but I also see other from my custom parsing:


"message": "Cannot index event publisher.Event{Content:beat.Event{Timestamp:time.Date(2022, time.October, 9, 7, 17, 38, 79000000, time.UTC), Meta:{\"input_id\":\"udp-udp-e455bb9c-05f0-4d42-bd13-60593992bb55\",\"raw_index\":\"logs-udp.generic-ece_wlc_logs\",\"stream_id\":\"udp-udp.generic-e455bb9c-05f0-4d42-bd13-60593992bb55\",\"truncated\":false}, Fields:{\"agent\":{\"ephemeral_id\":\"09bd12a2-0e46-4f6a-9c34-cf67b4a8ae42\",\"id\":\"4ba5a4aa-848d-4435-a13e-ace9584cddaa\",\"name\":\"myhost.my.dom\",\"type\":\"filebeat\",\"version\":\"8.4.2\"},\"client\":{\"mac\":\"22:7c:71:90:17:a5\"},\"data_stream\":{\"dataset\":\"udp.generic\",\"namespace\":\"ece_wlc_logs\",\"type\":\"logs\"},\"ecs\":{\"version\":\"8.0.0\"},\"elastic_agent\":{\"id\":\"4ba5a4aa-848d-4435-a13e-ace9584cddaa\",\"snapshot\":false,\"version\":\"8.4.2\"},\"event\":{\"action\":\"Client Authenticated\",\"dataset\":\"udp.generic\",\"provider\":\"APF-3-AUTHENTICATION_TRAP\",\"timezone\":\"CEST\"},\"input\":{\"type\":\"udp\"},\"log\":{\"logger\":\"haSSOServiceTask5\",\"source\":{\"address\":\"1.1.1.1:32837\"},\"syslog\":{\"hostname\":\"cisco-wlc-mgmt\"}},\"message\":\"\\u003c139\\u003ecisco-wlc-mgmt: *haSSOServiceTask5: Oct 09 09:17:38.079: %APF-3-AUTHENTICATION_TRAP: [PS]apf_80211.c:19558 Client Authenticated: MACAddress:22:7c:71:90:17:a5 Base Radio MAC:00:f2:8b:4d:bf:00 Slot:1 User Name:unknown Ip Address:unknown SSID:public-unibe\",\"oldwlc\":{\"orig\":{}},\"orig\":{\"timestamp\":\"Oct 09 09:17:38.079\"},\"source\":{\"ip\":\"unknown\",\"mac\":\"22:7c:71:90:17:a5\"},\"tags\":[\"wlc\",\"forwarded\",\"_dns_reverse_lookup_failed\"],\"user\":{\"name\":\"unknown\"},\"wlc\":{\"baseradiomac\":\"00:f2:8b:4d:bf:00\",\"ssid\":\"public-unibe\"}}, Private:interface {}(nil), TimeSeries:false}, Flags:0x1, Cache:publisher.EventCache{m:mapstr.M(nil)}} (status=400): {\"type\":\"mapper_parsing_exception\",\"reason\":\"failed to parse field [source.ip] of type [ip] in document with id 'YWObu4MBF-Kn07JyVca4'. Preview of field's value: 'unknown'\",\"caused_by\":{\"type\":\"illegal_argument_exception\",\"reason\":\"'unknown' is not an IP string literal.\"}}, dropping event!",

@nkakouros
Copy link

I think it is important, whatever the explanation given here, to update the Kibana UI to explain the metrics in layman's terms and to give optical feedback of when data loss is occurring, together with links to docs on how to deal with this.

@botelastic
Copy link

botelastic bot commented Oct 11, 2023

Hi!
We just realized that we haven't looked into this issue in a while. We're sorry!

We're labeling this issue as Stale to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough :+1.
Thank you for your contribution!

@botelastic botelastic bot added the Stalled label Oct 11, 2023
@ajax-shmyrko-o
Copy link

ajax-shmyrko-o commented Jan 5, 2024

+1, We are using Datadog to collect these metrics:

  • libbeat.output.events.dropped
  • libbeat.pipeline.events.dropped
  • libbeat.output.events.failed
  • libbeat.pipeline.events.failed

So it would be nice to have an explanation about their meaning in the documentation.

@botelastic botelastic bot removed the Stalled label Jan 5, 2024
@botelastic
Copy link

botelastic bot commented Jan 4, 2025

Hi!
We just realized that we haven't looked into this issue in a while. We're sorry!

We're labeling this issue as Stale to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough :+1.
Thank you for your contribution!

@botelastic botelastic bot added the Stalled label Jan 4, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants