This package creates a Glue Python Shell Job that will automatically generate a Glue table that can be used to query AWS Config Snapshot data stored in an S3 Bucket. This is useful for performing advanced queries of Config data using Amazon Athena.
This script uses the AWS Data Wrangler package to execute the Athena SQL statement that creates the table.
- The script runs as a Glue Python Shell Job and generates a table in Glue that contains Config snapshot data.
- Config snapshot objects follow this prefix naming convention:
  `s3://[bucketname]/[optional-prefix]/AWSLogs/[accountId]/Config/[region]/YYYY/m/d/ConfigSnapshot/[snapshotName].json.gz`
- A table is registered in the Glue Data Catalog with the Config snapshot schema, and uses partition projection to handle partitioning of the data. This enables the Config data to be queried in place.
- The Python script identifies the organization id, accounts, and regions, then creates `enum` partition projections in the table properties for these values.
- A Glue Trigger, when activated, will execute the Glue Python Shell Job once every 12 hours to recreate the table. This is needed to update the `enum` partition projections (e.g. if new accounts or regions begin sending data to the bucket).
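The prefix convention above maps directly onto the table's partition columns. As a minimal sketch (this helper is illustrative, not part of the shipped script), a snapshot key can be parsed into its `org`, `account`, `region`, and `date` partition values like this:

```python
import re

# Assumes the prefix convention described above, where the optional prefix
# is the organization id used by the storage.location.template:
#   [org]/AWSLogs/[accountId]/Config/[region]/YYYY/m/d/ConfigSnapshot/...
KEY_PATTERN = re.compile(
    r"^(?P<org>[^/]+)/AWSLogs/(?P<account>\d{12})/Config/"
    r"(?P<region>[a-z0-9-]+)/(?P<date>\d{4}/\d{1,2}/\d{1,2})/ConfigSnapshot/"
)

def parse_snapshot_key(key):
    """Return a dict of org/account/region/date, or None if the key doesn't match."""
    m = KEY_PATTERN.match(key)
    return m.groupdict() if m else None
```

Note that the `date` group is left in the non-zero-padded `YYYY/m/d` form, matching the `projection.date.format` of `yyyy/M/d` in the table properties.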
Below is the SQL statement used to generate the table. The `enum` value placeholders are populated by the script.
```sql
CREATE EXTERNAL TABLE `:database_name;`.`:table_name;`
(
    fileversion STRING,
    configSnapshotId STRING,
    configurationitems ARRAY < STRUCT <
        configurationItemVersion : STRING,
        configurationItemCaptureTime : STRING,
        configurationStateId : BIGINT,
        awsAccountId : STRING,
        configurationItemStatus : STRING,
        resourceType : STRING,
        resourceId : STRING,
        resourceName : STRING,
        relationships : ARRAY < STRING >,
        ARN : STRING,
        awsRegion : STRING,
        availabilityZone : STRING,
        configurationStateMd5Hash : STRING,
        configuration : STRING,
        supplementaryConfiguration : MAP < STRING, STRING >,
        tags : MAP < STRING, STRING >,
        resourceCreationTime : STRING
    > >
)
PARTITIONED BY (
    `org` string,
    `account` string,
    `region` string,
    `date` string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://:bucket_name;/'
TBLPROPERTIES
(
    "projection.enabled" = "true",
    "projection.account.type" = "enum",
    "projection.account.values" = ":accounts;",
    "projection.region.type" = "enum",
    "projection.region.values" = ":regions;",
    "projection.org.type" = "enum",
    "projection.org.values" = ":orgs;",
    "projection.date.format" = "yyyy/M/d",
    "projection.date.interval" = "1",
    "projection.date.interval.unit" = "DAYS",
    "projection.date.range" = ":begindate;,NOW",
    "projection.date.type" = "date",
    "storage.location.template" = "s3://:bucket_name;/${org}/AWSLogs/${account}/Config/${region}/${date}/ConfigSnapshot"
)
```
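The `:placeholder;` tokens in the statement above are replaced by the script with the values it discovers. A minimal sketch of that substitution step (the function name and template excerpt are illustrative, not the actual script's code):

```python
# Hypothetical helper: fill :name; tokens in the SQL template with
# discovered values before handing the statement to Athena.
def render_sql(template, values):
    for name, value in values.items():
        template = template.replace(f":{name};", value)
    return template

# Example usage against a fragment of the CREATE TABLE template.
template = (
    "LOCATION 's3://:bucket_name;/'\n"
    '"projection.account.values" = ":accounts;"'
)
sql = render_sql(
    template,
    {
        "bucket_name": "aws-controltower-logs-123456789012-us-east-1",
        "accounts": "111111111111,222222222222",
    },
)
```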
- AWS Control Tower with AWS Config Enabled: The template and scripts assume that the Config Snapshot data is being stored in a bucket in the Control Tower Log Archive account.
The CloudFormation template used in the setup below creates the following resources:
- **Glue Python Shell Job** - This job contains the Python script that performs the table creation.
- **Glue Database** - The database stores metadata (i.e. table schemas) for the Config snapshot data, primarily so we can perform queries in Athena later.
- **S3 Bucket** - This bucket stores the Python shell scripts. Glue Python Shell Jobs execute a Python script that is stored in S3. Additional Python libraries can also be loaded from .whl and .egg files located in S3.
- **IAM Role for Glue Jobs** - Glue Jobs require an IAM role with permissions to access the resources they interact with.
The Python script used by Glue leverages the AWS Data Wrangler library. This package extends the popular Pandas library to AWS services, making it easy to load and save Pandas dataframes using many AWS services, including S3, Glue, Redshift, EMR, Athena, and CloudWatch Logs Insights. In a few lines of code, the script performs the following steps:
- Iterates over prefixes in the Control Tower logs bucket to find the organization identifiers, account numbers, and regions associated with the data.
- Creates a table in the Glue database for the Config snapshot data, using the organization identifiers, accounts, and regions as `enum` values in the table's partition projections.
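The aggregation step above can be sketched in plain Python. This is an illustrative reimplementation (not the shipped script), assuming keys follow the documented prefix convention:

```python
# Collect the distinct orgs, accounts, and regions found under the bucket
# as comma-separated strings, ready to drop into the enum partition
# projection properties (projection.org.values, etc.).
def build_enum_values(keys):
    orgs, accounts, regions = set(), set(), set()
    for key in keys:
        parts = key.split("/")
        # Expected shape: org/AWSLogs/account/Config/region/...
        if len(parts) > 4 and parts[1] == "AWSLogs" and parts[3] == "Config":
            orgs.add(parts[0])
            accounts.add(parts[2])
            regions.add(parts[4])
    return {
        "orgs": ",".join(sorted(orgs)),
        "accounts": ",".join(sorted(accounts)),
        "regions": ",".join(sorted(regions)),
    }
```

In practice the real script lists the prefixes from S3 rather than taking a list of keys, but the value-collection logic is the same idea.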
**Example Query**
An Athena saved query named `EC2InstanceSecurityGroupRelationships` is created. It returns all EC2 instances and their associated security groups for every account and region that is sending Config snapshots to the Control Tower bucket.
IMPORTANT: This template is intended to be run in the account where the S3 bucket containing your Config snapshots is located, and in the region where that bucket resides. Note: This script has been tested in a Control Tower environment, deployed in the "LogArchive" account. The script assumes the logs follow this S3 prefix structure: `s3://:bucket_name;/${org}/AWSLogs/${account}/Config/${region}/${date}/ConfigSnapshot`
1. Log into the account where the Config snapshot data is located and switch to the bucket's region.
2. Deploy the CloudFormation template, setup-generate_config_snapshot_table.yml. Optionally, specify the source bucket name. If you do not specify a name, the standard Control Tower logs bucket name `aws-controltower-logs-[accountid]-[region]` is used.
3. After the CloudFormation stack is created, go to its outputs and click the link to the S3 bucket that the stack created.
4. Upload awswrangler-2.5.0-py3-none-any.whl and ct_generate_config_snapshot_table.py to the bucket.
5. Go to the AWS Glue console and run the `ct-generate-config-snapshot-table` job.
6. Go to the Glue Triggers page. Select the `ct-generate-config-snapshot-table` trigger, then click Action > Activate Trigger. The job will now run once every 12 hours to regenerate the table.
7. Go to the Athena Saved Queries page and select `EC2InstanceSecurityGroupRelationships`. The following query should be displayed:

```sql
SELECT account,
       date,
       configurationItem.resourceId,
       configurationItem.resourceType,
       json_extract_scalar(relationship, '$.resourceid') related_resource,
       json_extract_scalar(relationship, '$.resourcetype') "related_type"
FROM "aws_config_data"."aws_config_configuration_snapshot"
CROSS JOIN UNNEST(configurationitems) AS t(configurationItem)
CROSS JOIN UNNEST(configurationItem.relationships) AS t(relationship)
WHERE region = 'us-east-1'
  AND date = date_format(current_date - interval '1' day, '%Y/%c/%e')
  AND configurationItem.resourceType = 'AWS::EC2::Instance'
  AND json_extract_scalar(relationship, '$.resourcetype') = 'AWS::EC2::SecurityGroup'
ORDER BY account, resourceid, related_resource
```
Execute the query. It should return a list of EC2 instances and their associated security groups for all AWS accounts and regions from yesterday's Config snapshot data.
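One detail worth noting: the `date` partition values use the non-zero-padded `yyyy/M/d` format (e.g. `2021/3/5`, not `2021/03/05`), which is why the query uses `date_format(current_date - interval '1' day, '%Y/%c/%e')`. The same partition string can be computed client-side, for example when building your own filtered queries:

```python
from datetime import date, timedelta

# Build the date partition value in the table's yyyy/M/d format:
# month and day are NOT zero-padded, matching projection.date.format.
def partition_for(d):
    return f"{d.year}/{d.month}/{d.day}"

# Yesterday's partition, as used by the saved query.
yesterday_partition = partition_for(date.today() - timedelta(days=1))
```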