S3logsbeat is a beat to read logs from AWS S3 and send them to ElasticSearch. AWS uses S3 as the destination for several internal logs: ALB, CloudFront, CloudTrail, etc. This beat relies on S3 event notifications, which send a notification to an SQS queue when a new object is created on S3. S3logsbeat then polls these SQS queues, reads new-object messages (ignoring others), downloads the S3 objects, parses the logs to convert them into events, and finally publishes them to ElasticSearch. If all events are published correctly, the SQS message is deleted from the queue.
Lambdas are perfect when you have few log entries. However, Lambda pricing is based on how much time it takes to send events to ElasticSearch; if you have many log entries, the cost can be elevated.
As S3logsbeat consumes few resources (~20MB RAM in my tests), it can be installed on your ELK host or in a container.
For instance, assume we have 10 million new S3 objects of ALB logs per month. Each object takes 1 second to be downloaded, parsed, and sent to our ElasticSearch (I'm assuming an average of 100 events per object). If we use the minimum amount of memory allowed by Lambda (128MB), our cost is (according to the AWS Lambda pricing calculator): $2 for requests + $20.84 for execution = $22.84/month.
As S3logsbeat can read up to 10 SQS messages per request, we need to perform 1,000,000 requests to obtain these 10 million new S3 objects from SQS. As we read each message and delete it after processing, we need 2,000,000 requests, which implies an SQS cost of $0.80/month. S3 costs would be: 10,000,000 GETs = $0.40 (as the instance on which S3logsbeat is running is in the same region, I'm ignoring data transfer costs). The total cost with S3logsbeat is: $1.20/month.
S3logsbeat has the following features:
- Limited workers to poll from SQS and download objects from S3 to avoid exceeding AWS request limits
- Usage of internal bounded queues to avoid overloading outputs
- If output is overloaded or inaccessible, no more messages are read from SQS
- High availability: you can run several S3logsbeat instances in parallel
- Reliability: SQS messages are only deleted when output contains all events
- Duplicate avoidance on supported outputs
- Support for several S3 log formats (see Supported log formats)
- Extra fields based on S3 key
- Delayed shutdown based on timeout and pending messages to be acked by outputs
- Limited amount of resources: ~20MB RAM in my tests
However, it has some use cases not supported yet:
- S3 objects already present in the bucket when S3 event notifications are activated are not processed
- CloudTrail logs not supported yet
- Custom logs not supported yet
Unsupported features are on the roadmap, so just wait for them.
We have to configure the internal beat queue in order to reduce overloading on outputs. Edit your configuration file and add the following configuration at root level:
queue.mem:
  events: 128 # default: 4096
  flush.min_events: 64 # default: 2048
In this way, if we try to send 128 messages to outputs and the output is offline or saturated, publishing is blocked. As publishing is blocked, all internal bounded queues eventually fill up and no more SQS messages are read (avoiding extra requests to SQS or S3). Once the output is available again, messages present in the internal queues are sent to the output and new messages are then read from SQS.
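Conceptually, this backpressure behaves like a bounded Go channel: once the buffer is full, producers block until the consumer drains events. A minimal sketch of the idea (illustration only, not S3logsbeat's actual code):

package main

import (
    "fmt"
    "time"
)

func main() {
    // Bounded queue: sends block once 4 events are buffered, which
    // propagates backpressure all the way back to the producer.
    events := make(chan string, 4)

    // Slow consumer standing in for a saturated output.
    go func() {
        for e := range events {
            time.Sleep(100 * time.Millisecond)
            fmt.Println("published:", e)
        }
    }()

    for i := 0; i < 10; i++ {
        events <- fmt.Sprintf("event-%d", i) // blocks while the queue is full
    }
    close(events)
    time.Sleep(time.Second) // let the consumer drain (demo only)
}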
S3logsbeat is based on the SQS visibility timeout feature. When a message is read from SQS, it just "disappears" from the queue. However, if the message is not deleted, it reappears on the queue after the visibility timeout expires, making it available to be read again. This is perfect for those cases in which a consumer dies unexpectedly.
S3logsbeat uses this principle. The workflow to process an SQS message is the following (a minimal sketch in Go follows the list):
- An SQS message is read from an SQS queue
- The new S3 objects referenced in the SQS message are downloaded from S3 and parsed to generate events
- Each event is sent to the output
- When the output confirms the reception of all events related to an SQS message, the SQS message is deleted from the queue
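For illustration, here is a sketch of this receive-process-delete pattern using aws-sdk-go. The queue URL is a placeholder and this is not S3logsbeat's actual implementation:

package main

import (
    "log"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/sqs"
)

func main() {
    queueURL := "https://sqs.eu-west-1.amazonaws.com/123456789012/my-queue" // placeholder
    svc := sqs.New(session.Must(session.NewSession()))

    out, err := svc.ReceiveMessage(&sqs.ReceiveMessageInput{
        QueueUrl:            aws.String(queueURL),
        MaxNumberOfMessages: aws.Int64(10), // up to 10 messages per request
        WaitTimeSeconds:     aws.Int64(20), // long polling
    })
    if err != nil {
        log.Fatal(err)
    }

    for _, m := range out.Messages {
        // ... download the S3 objects referenced in m.Body, parse them,
        // and publish the resulting events to the output ...

        // Delete only after the output has acknowledged all events;
        // otherwise the message reappears after the visibility timeout.
        if _, err := svc.DeleteMessage(&sqs.DeleteMessageInput{
            QueueUrl:      aws.String(queueURL),
            ReceiptHandle: m.ReceiptHandle,
        }); err != nil {
            log.Println("delete failed:", err)
        }
    }
}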
S3logsbeat can avoid duplicates on the ElasticSearch output transparently by adding an event identifier (the document ID on ES) based on the event's content. It can also be used with the Logstash output because a @metadata field called _id is added with this information. You can configure your Logstash ES output in the following way to take this value and use it as the document identifier:
output {
elasticsearch {
···
document_id => "%{[@metadata][_id]}"
···
}
}
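The 40-character hexadecimal IDs shown in the event examples below look like SHA-1 digests. A hypothetical sketch of how such a content-based ID can be derived (the exact fields S3logsbeat hashes are not documented here):

package main

import (
    "crypto/sha1"
    "fmt"
)

// contentID derives a deterministic document ID from the raw log line,
// so re-processing the same line yields the same ID and ES overwrites
// the document instead of duplicating it.
func contentID(rawLine string) string {
    return fmt.Sprintf("%x", sha1.Sum([]byte(rawLine)))
}

func main() {
    line := `https 2018-09-14T19:11:02.796Z app/yourenvironment-yourapp/... GET https://www.example.com/`
    fmt.Println(contentID(line)) // stable across runs
}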
If you are sending logs from several origins to the same S3 bucket and you want to distinguish them on ElasticSearch, you can set a regular expression on key_regex_fields in order to parse S3 keys and add the extracted fields to each event extracted from the object.
For instance, imagine you are sending production and testing logs to the same S3 bucket from your ALBs. In order to distinguish them, you added a prefix based on the pattern {environment}-{application}/ to your ALB configuration.
By default, S3logsbeat is not aware of this and generates events based only on their content, not on where they were extracted from. Because of that, you could not distinguish between production and testing logs. In order to make S3logsbeat aware of the prefix, you have to configure your input as follows:
s3logsbeat:
  inputs:
    - type: sqs
      queues_url:
        - https://sqs.{aws-region}.amazonaws.com/{account ID}/{queue name}
      log_format: alb
      key_regex_fields: ^(?P<environment>[^\-]+)-(?P<application>[^/\-]+)
      poll_frequency: 1m
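With this pattern, a hypothetical S3 key such as production-myapp/... yields environment=production and application=myapp. You can check the named groups with Go's regexp package:

package main

import (
    "fmt"
    "regexp"
)

func main() {
    re := regexp.MustCompile(`^(?P<environment>[^\-]+)-(?P<application>[^/\-]+)`)
    key := "production-myapp/AWSLogs/123456789012/elasticloadbalancing/..." // hypothetical key

    match := re.FindStringSubmatch(key)
    for i, name := range re.SubexpNames() {
        if i > 0 && name != "" {
            fmt.Printf("%s=%s\n", name, match[i]) // environment=production, application=myapp
        }
    }
}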
By default, when S3logsbeat is stopped, SQS messages being processed are cancelled. This is not problematic because an SQS message is not deleted until all its events are present on the output. Because of that, when you start S3logsbeat again, the cancelled SQS messages are processed again. Combined with duplicate avoidance, this produces the same events on ElasticSearch as if the initial message had been processed entirely.
However, this repeats the same processing on the same SQS message and the same S3 objects associated with it. In order to avoid that, we can configure a shutdown timeout as follows:
s3logsbeat:
  shutdown_timeout: 5s
Then, when S3logsbeat is stopped, it stops processing new SQS messages and waits until (whichever happens first):
- All current SQS messages are processed, or
- Timeout expires
The first case is the typical one when everything is OK, because an SQS message is processed in less than a second and we do not have to repeat the same work.
The second case can happen when we have a lot of S3 objects in pending SQS messages. This case is similar to the one exposed initially, when shutdown_timeout is not configured: unprocessed S3 objects will be processed again when S3logsbeat starts.
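Conceptually, the shutdown behaviour is a race between pending acknowledgements and the timeout, along these lines (a sketch, not the actual implementation):

package main

import (
    "log"
    "time"
)

// waitShutdown blocks until every in-flight SQS message has been
// acked by the output, or until the shutdown timeout expires.
func waitShutdown(allAcked <-chan struct{}, timeout time.Duration) {
    select {
    case <-allAcked: // all pending messages processed
        log.Println("clean shutdown: nothing to reprocess")
    case <-time.After(timeout): // e.g. shutdown_timeout: 5s
        log.Println("timeout: unprocessed messages will reappear on SQS")
    }
}

func main() {
    allAcked := make(chan struct{})
    go func() {
        // Simulate the output acknowledging the last events quickly.
        time.Sleep(200 * time.Millisecond)
        close(allAcked)
    }()
    waitShutdown(allAcked, 5*time.Second)
}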
S3logsbeat requires the following IAM permissions:
- SQS permissions: sqs:ReceiveMessage and sqs:DeleteMessage.
- S3 permissions: s3:GetObject.
IAM policy:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"sqs:ReceiveMessage",
"sqs:DeleteMessage"
],
"Resource": "arn:aws:sqs:*:123456789012:<QUEUE_NAME>"
},
{
"Effect": "Allow",
"Action": [
"s3:GetObject"
],
"Resource": "arn:aws:s3:::<BUCKET-NAME>/*"
}
]
}
You may already have S3 log files when you configure an SQS queue to import new files via s3logsbeat. If this is the case, you can import those files by using the command s3imports and a configuration file like this:
s3logsbeat:
  inputs:
    # S3 inputs (only taken into account when command `s3imports` is executed)
    - type: s3
      buckets:
        - s3://mybucket/mypath
      log_format: alb
      # Optional fields extractor from key. E.g. key=staging-myapp/eu-west-1/2018/06/01/
      key_regex_fields: ^(?P<environment>[^\-]+)-(?P<application>[^/]+)/(?P<awsregion>[^/]+)
      since: 2018-10-15T01:00 # ISO8601 format - optional
      to: 2018-11-20T01:00 # ISO8601 format - optional
Using the command ./s3logsbeat s3imports -c config.yml, you can import all those S3 files that you already have on S3. This command is not executed as a daemon and exits when all S3 objects are processed. SQS inputs present in the configuration are ignored when the s3imports command is executed.
This command is useful for a first import. However, you should take care because, in combination with the standard mode of s3logsbeat, it can generate duplicates. In order to avoid this problem you can:
- Use @metadata._id in order to avoid duplicates on ElasticSearch (see section Avoid duplicates).
- Configure the SQS queue and S3 event notifications. Wait until the first element is present on the queue. Via console or CLI, analyse the element present on the SQS queue without deleting it (it will reappear later). Then edit the yaml configuration, set the to property to just one second before the timestamp obtained, and execute the s3imports command.
s3logsbeat supports the following log formats:
- elb: parses Elastic Load Balancer (classic ELB) logs.
- alb: parses Application Load Balancer (ALB) logs.
- cloudfront: parses CloudFront logs.
- waf: parses WAF logs.
- json: parses JSON logs. Requires the following options (set via parameter log_format_options):
  - timestamp_field: field that represents the timestamp of the log event. Mandatory.
  - timestamp_format: format in which the timestamp is represented and from which it is converted into date/time. See Supported timestamp formats. Mandatory.
The following timestamp formats are supported:
- timeUnixMilliseconds: long or string with epoch millis.
- timeISO8601: string with ISO8601 format.
- time:layout: string with the layout format given after the prefix time:. Valid layouts correspond to the ones parsed by Go's time.Parse (see the example after this list).
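For instance, a hypothetical timestamp_format of time:2006-01-02 15:04:05 would parse timestamps using Go's reference layout:

package main

import (
    "fmt"
    "time"
)

func main() {
    // The layout after the "time:" prefix follows Go's reference time
    // (Mon Jan 2 15:04:05 MST 2006).
    layout := "2006-01-02 15:04:05" // hypothetical layout for "time:2006-01-02 15:04:05"
    t, err := time.Parse(layout, "2018-10-15 01:00:00")
    if err != nil {
        panic(err)
    }
    fmt.Println(t.UTC()) // 2018-10-15 01:00:00 +0000 UTC
}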
The following log event example is generated when log_format: alb is present:
{
"_index" : "yourindex-2018.09.14",
"_type" : "doc",
"_id" : "678d4e087033da5bc429c1834941b71868b1cb6c",
"_score" : 1.0,
"_source" : {
"@timestamp" : "2018-09-14T19:11:02.796Z",
"trace_id" : "Root=1-5b9c07c6-686bc18f187f75f58d06cd63",
"target_group_arn" : "arn:aws:elasticloadbalancing:region:123456789012:targetgroup/yourtg/f3b5cb16d14453e9",
"application" : "yourapp",
"beat" : {
"hostname" : "macbook-pro-4.home",
"version" : "6.0.2",
"name" : "macbook-pro-4.home"
},
"environment" : "yourenvironment",
"client_port" : 37054,
"received_bytes" : 272,
"target_port" : 80,
"user_agent" : "cURL",
"request_url" : "https://www.example.com/",
"ssl_protocol" : "TLSv1.2",
"elb" : "app/yourenvironment-yourapp/ad4ceee8a897566c",
"request_proto" : "HTTP/1.0",
"response_processing_time" : 0,
"ssl_cipher" : "ECDHE-RSA-AES128-GCM-SHA256",
"client_ip" : "1.2.3.4",
"target_status_code" : 200,
"sent_bytes" : 11964,
"request_processing_time" : 0,
"elb_status_code" : 200,
"target_processing_time" : 0.049,
"request_verb" : "GET",
"type" : "https",
"target_ip" : "2.3.4.5"
}
}
These fields correspond to the ones established by AWS on this page.
The following log event example is generated when log_format: cloudfront is present:
{
"_index" : "yourindex-2018.09.02",
"_type" : "doc",
"_id" : "4d2f4dceb5ed5b4a4300eccc657608a7cc57923d",
"_score" : 1.0,
"_source" : {
"@timestamp" : "2018-09-02T10:10:11.000Z",
"ssl_protocol" : "TLSv1.2",
"cs_protocol_version" : "HTTP/1.1",
"x_edge_response_result_type" : "Hit",
"cs_user_agent" : "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0",
"x_edge_request_id" : "9QuHsIZRJxPbF_F-3to6KberARZ7Ddd6wthSnivYJEWalJWqlYCe7A==",
"ssl_cipher" : "ECDHE-RSA-AES128-GCM-SHA256",
"x_edge_result_type" : "Hit",
"x_edge_location" : "MAD50",
"x_host_header" : "www.example.com",
"beat" : {
"name" : "macbook-pro-4.home",
"hostname" : "macbook-pro-4.home",
"version" : "6.0.2"
},
"cs_bytes" : 457,
"sc_status" : 200,
"cs_cookie" : "-",
"cs_protocol" : "https",
"cs_method" : "GET",
"cs_uri_stem" : "/",
"cs_host" : "d111111abcdef8.cloudfront.net",
"c_ip" : "1.2.3.4",
"time_taken" : 0.004,
"sc_bytes" : 1166
}
}
These fields correspond to the ones established by AWS on this page.
- Golang 1.7
To get running with S3logsbeat and also install the dependencies, run the following command:
make setup
It will create a clean git history for each major step. Note that you can always rewrite the history if you wish before pushing your changes.
To push S3logsbeat to the git repository, run the following commands:
git remote set-url origin https://github.com/sequra/s3logsbeat
git push origin master
For further development, check out the beat developer guide.
To build the binary for S3logsbeat run the command below. This will generate a binary in the same directory with the name s3logsbeat.
make
To run S3logsbeat with debugging output enabled, run:
./s3logsbeat -c s3logsbeat.yml -e -d "*"
To test S3logsbeat, run the following command:
make testsuite
Alternatively:
make unit-tests
make system-tests
make integration-tests
make coverage-report
The test coverage is reported in the folder ./build/coverage/
Each beat has a template for the mapping in ElasticSearch and documentation for the fields, which are automatically generated based on fields.yml by running the following command:
make update
To format the S3logsbeat source code, run the following command:
make fmt
To clean up the build directory and generated artifacts, run:
make clean
To clone S3logsbeat from the git repository, run the following commands:
mkdir -p ${GOPATH}/src/github.com/sequra/s3logsbeat
git clone https://github.com/sequra/s3logsbeat ${GOPATH}/src/github.com/sequra/s3logsbeat
For further development, check out the beat developer guide.
The beat framework provides tools to cross-compile and package your beat for different platforms. This requires Docker and vendoring as described above. To build packages of your beat, run the following command:
make package
This will fetch and create all images required for the build process. The whole process can take several minutes to finish.