This pattern is defined not by the components it uses but by how they send trace information back to the AWS X-Ray service, which helps you make your application perform better when viewed through the Serverless Well-Architected lens. A fully well-architected solution would also use embedded metric format for its logs, as in the Julian Wood reference below, but I am saving that for another pattern so as not to confuse the concepts.
Some useful references:
Author | Link |
---|---|
AWS X-Ray Developer Guide | X-Ray Developer Guide |
AWS X-Ray Concepts | X-Ray Concepts |
AWS Serverless Lens Whitepaper | Serverless Lens Whitepaper |
AWS Well Architected Whitepaper | Well Architected Whitepaper |
Julian Wood | Building Well Architected Applications |
AWS Developer Blog | Category: AWS X-Ray |
AWS Training | Introduction to AWS X-Ray |
The AWS Well-Architected Framework helps you understand the pros and cons of decisions you make while building systems on AWS. By using the Framework, you will learn architectural best practices for designing and operating reliable, secure, efficient, and cost-effective systems in the cloud. It provides a way for you to consistently measure your architectures against best practices and identify areas for improvement.
We believe that having well-architected systems greatly increases the likelihood of business success.
Note - The content for this section is a subset of the Serverless Lens Whitepaper with some minor tweaks.
The operational excellence pillar includes the ability to run and monitor systems to deliver business value and to continually improve supporting processes and procedures.
OPS 1: How do you understand the health of your Serverless application?
Similar to non-serverless applications, anomalies can occur at larger scale in distributed systems. Due to the nature of serverless architectures, it’s fundamental to have distributed tracing.
Making changes to your serverless application entails many of the same principles of deployment, change, and release management used in traditional workloads. However, there are subtle changes in how you use existing tools to accomplish these principles.
Active tracing with AWS X-Ray should be enabled to provide distributed tracing capabilities as well as to enable visual service maps for faster troubleshooting.
X-Ray helps you identify performance degradation and quickly understand anomalies, including latency distributions.
Service Maps are helpful to understand integration points that need attention and resiliency practices. For integration calls, retries, backoffs, and possibly circuit breakers are necessary to prevent faults from propagating to downstream services.
Another example is networking anomalies. You should not rely on default timeouts and retry settings. Instead, tune them to fail fast if a socket read/write timeout happens where the default can be seconds if not minutes in certain clients.
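As a rough illustration of the retry and backoff advice above, here is a minimal sketch in plain Python. The function name `call_with_retries`, its parameters, and the `flaky_call` stand-in are all hypothetical and are not part of this pattern's code; a production version would likely add jitter and only retry on errors that are known to be transient.

```python
import time


def call_with_retries(operation, max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call operation(); on failure wait base_delay * 2**attempt and try again.

    Re-raises the last exception once max_attempts is exhausted, so a
    persistent fault surfaces quickly instead of hanging the caller.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))


# Hypothetical downstream call that times out twice before succeeding.
attempts = {"count": 0}


def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TimeoutError("socket read timed out")
    return "ok"


# Record the delays instead of sleeping so the example runs instantly.
delays = []
result = call_with_retries(flaky_call, sleep=delays.append)
print(result, delays)
```

Injecting `sleep` keeps the backoff schedule easy to inspect; pairing this with a short, explicit client timeout is what makes the call fail fast.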
X-Ray also provides two powerful features that can improve the efficiency of identifying anomalies within applications: Annotations and Subsegments. Subsegments are helpful to understand how application logic is constructed and what external dependencies it has to talk to. Annotations are key-value pairs with string, number, or Boolean values that are automatically indexed by AWS X-Ray.
Combined, they can help you quickly identify performance statistics on specific operations and business transactions, for example, how long it takes to query a database, or how long it takes to process pictures with large crowds.
I wanted to make this pattern as "real" as possible for people, so I included most of the serverless components you will use every day. I have included:
- API Gateway -> SNS -> Lambda (not SQS for reasons documented later)
- Lambda -> DynamoDB
- Lambda -> SQS -> Lambda
- Lambda -> External Http Endpoint
- Lambda -> SNS -> Lambda
When I map the flow in a high level conceptual image it looks like this:
After deployment, the X-Ray service map looks something like this (you get two circles per Lambda):
Or if you look at CloudWatch Service Lens:
You can see that these two diagrams are not far from my high-level conceptual flow. The difference is that the X-Ray generated diagram is 100% accurate because it is created from real traces based on user flow. When viewed through the AWS Console, it cannot become the out-of-date diagram you found on a wiki last updated 12 months ago; it is always accurate. If a developer checks in a piece of code that changes the flow, you will see it immediately.
I will eventually refactor this pattern to be pure Python, but right now the Lambdas are Node.js and they need an external dependency installed: the X-Ray SDK. To accommodate this I added an extra line to app.py that does it for you.
# install node dependencies for lambdas
subprocess.check_call("npm i".split(), cwd="lambdas")
I separated each of the different SNS subscriber flows above into their own CDK stacks and passed in the SNS Topic ARN as a parameter. Note, in a production system if you want to properly separate these stacks you could use AWS Systems Manager Parameter Store for the SNS Topic ARN.
If you look inside app.py you will see how this works:
xray_tracer = TheXrayTracerStack(app, "the-xray-tracer")
http_flow = TheHttpFlowStack(app, 'the-http-flow-stack', sns_topic_arn=xray_tracer.sns_topic_arn)
dynamo_flow = TheDynamoFlowStack(app, 'the-dynamo-flow-stack', sns_topic_arn=xray_tracer.sns_topic_arn)
sns_flow = TheSnsFlowStack(app, 'the-sns-flow-stack', sns_topic_arn=xray_tracer.sns_topic_arn)
sqs_flow = TheSqsFlowStack(app, 'the-sqs-flow-stack', sns_topic_arn=xray_tracer.sns_topic_arn)
http_flow.add_dependency(xray_tracer)
dynamo_flow.add_dependency(xray_tracer)
sns_flow.add_dependency(xray_tracer)
sqs_flow.add_dependency(xray_tracer)
All of this logic could have reasonably lived in one stack but this way I can add more technologies later that integrate with X-Ray without increasing the overall complexity of the pattern.
This also means you get 5 vanilla CloudFormation templates bundled:
- xray_tracer_template.yaml
- dynamo_flow_template.yaml
- http_flow_template.yaml
- sns_flow_template.yaml
- sqs_flow_template.yaml
After you deploy this pattern you will have an API Gateway whose URL is output in the cdk deploy logs.
Any URL you hit on that gateway will trigger this flow; it uses your URL as the message sent to SNS.
This URL is inserted into DynamoDB with a counter of how many times it was hit, and the SNS and SQS consumer Lambdas both log the message to CloudWatch.
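The hit counter described above can be expressed as a single atomic DynamoDB update. The sketch below only builds the request parameters and is an illustration rather than the pattern's actual Lambda code (which is Node.js); the table name `hits-table` and the attribute names `path` and `hits` are assumptions.

```python
def build_hit_counter_update(table_name, path):
    """Return keyword arguments for a DynamoDB UpdateItem call that
    atomically adds 1 to the hit counter stored against `path`."""
    return {
        "TableName": table_name,
        "Key": {"path": {"S": path}},
        # ADD creates the attribute on first use and increments it afterwards.
        "UpdateExpression": "ADD #hits :incr",
        # A name placeholder avoids any clash with DynamoDB reserved words.
        "ExpressionAttributeNames": {"#hits": "hits"},
        "ExpressionAttributeValues": {":incr": {"N": "1"}},
    }


# Hypothetical table name and URL path, for illustration only.
params = build_hit_counter_update("hits-table", "/some/url")
print(params["UpdateExpression"])
```

With boto3 these parameters could be passed as `client.update_item(**params)`; when that client is instrumented by the X-Ray SDK, the call shows up as its own node on the service map.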
To see the random service errors, hit the URL at least 10 times, then navigate to the X-Ray section of the AWS Console and click "Service Map" in the sidebar. Be aware that there can sometimes be a delay of 30 seconds or so before your calls show up in the service map.
Alternatively, you can go to CloudWatch in the AWS Console and click "Service Lens" in the sidebar. Both offer views of the system.
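If you would rather script those requests than click refresh, something like the following sketch would do. The function `generate_traffic` is hypothetical and the base URL is a placeholder; substitute the API Gateway URL from your own deploy output. The `fetch` argument exists only so the sketch can be run without network access.

```python
import urllib.request


def generate_traffic(base_url, paths, fetch=None):
    """Request each path once and return the HTTP status codes."""
    if fetch is None:
        def fetch(url):
            # Explicit timeout so a stuck request fails fast.
            with urllib.request.urlopen(url, timeout=5) as response:
                return response.status
    statuses = []
    for path in paths:
        url = base_url.rstrip("/") + "/" + path.lstrip("/")
        statuses.append(fetch(url))
    return statuses


# Dry run with a stub fetch so the sketch runs without network access.
print(generate_traffic("https://example.com/prod", ["a", "b"], fetch=lambda url: 200))
```

Calling it with ten or more distinct paths and no `fetch` argument should produce enough traces for the random errors to appear on the map.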
I introduced a random SSL certificate error into the Lambda that connects to the external HTTP endpoint so that you can experiment with using X-Ray to track down the source of an error.
Service map showing something isn't healthy:
Depending on the component you are using and what it is integrating with, you need to enable X-Ray in a different way.
For API Gateway, this is done by setting tracing_enabled=True on deploy_options:
api_gw.RestApi(self, 'xrayTracerAPI',
deploy_options=api_gw.StageOptions(
metrics_enabled=True,
logging_level=api_gw.MethodLoggingLevel.INFO,
data_trace_enabled=True,
tracing_enabled=True,
stage_name='prod'
))
For Lambda functions, you set the tracing property to _lambda.Tracing.ACTIVE:
_lambda.Function(self, "httpLambdaHandler",
runtime=_lambda.Runtime.NODEJS_12_X,
handler="http.handler",
code=_lambda.Code.from_asset("lambdas"),
tracing=_lambda.Tracing.ACTIVE
)
The other managed services in this pattern (SNS, SQS and DynamoDB) pick up tracing when they are called from a component with tracing enabled; they do not need a specific setting to enable it.
You need to make sure your AWS SDK is wrapped with X-Ray. This applies to any SDK call, e.g. a direct Lambda-to-Lambda invoke, DynamoDB queries, publishing to SNS, etc.
const AWSXRay = require('aws-xray-sdk');
// Wrap AWS SDK with X-Ray
const AWS = AWSXRay.captureAWS(require('aws-sdk'));
exports.handler = async function(event:any) {
// Create an SQS service object as normal
var sqs = new AWS.SQS({apiVersion: '2012-11-05'});
For calls to external HTTP endpoints, you need to wrap the https module with X-Ray:
const AWSXRay = require('aws-xray-sdk');
// Wrap HTTPS module with X-Ray
var https = AWSXRay.captureHTTPs(require('https'));
exports.handler = async function(event:any) {
// Make a call to a webservice as normal
const req = https.get("https://url.com", (res:any) => {
A segment can break down the data about the work done into subsegments. Subsegments provide more granular timing information and details about downstream calls that your application made to fulfill the original request. A subsegment can contain additional details about a call to an AWS service, an external HTTP API, or an SQL database. You can even define arbitrary subsegments to instrument specific functions or lines of code in your application.
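To make the timing model concrete, here is a small self-contained sketch of what a subsegment records. It is an illustrative model only, not the X-Ray SDK: the `subsegment` helper is hypothetical, and the field names loosely follow X-Ray segment documents as I understand them.

```python
import contextlib
import secrets
import time


@contextlib.contextmanager
def subsegment(parent, name):
    """Record a timed child entry under `parent`, closing it on exit."""
    child = {
        "id": secrets.token_hex(8),  # 16 hexadecimal digits
        "name": name,
        "start_time": time.time(),
        "in_progress": True,
    }
    parent.setdefault("subsegments", []).append(child)
    try:
        yield child
    finally:
        # Closing the subsegment stamps the end time, even on error.
        child["end_time"] = time.time()
        del child["in_progress"]


# Hypothetical parent segment for one Lambda invocation.
segment = {"name": "httpLambdaHandler"}
with subsegment(segment, "external HTTP Request") as sub:
    # Stand-in for the real downstream call being timed.
    sub["metadata"] = {"default": {"response": {"status": 200}}}

print(segment["subsegments"][0]["name"])
```

The context manager guarantees that the subsegment is closed even when the wrapped call throws, which is the same reason to always call close on a real subsegment.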
I have included some custom subsegments in this pattern, like "external HTTP Request" below:
These are easy to create inside the Lambda Functions:
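// Assumes AWSXRay is the 'aws-xray-sdk' module required as in the snippets above
const segment = AWSXRay.getSegment();
const subsegment = segment.addNewSubsegment('external HTTP Request');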
let response = await new Promise((resolve:any, reject:any) => {
// Make a call to a webservice
const req = https.get("https://url.com", (res:any) => {
... //resolve promise
});
... //reject promise
});
subsegment.addMetadata("response", response)
subsegment.close();
You can put whole objects inside metadata, which is brilliant for showing things like the response from a web service.
Metadata are key-value pairs that can have values of any type, including objects and lists, but are not indexed for use with filter expressions. Use metadata to record additional data that you want stored in the trace but don't need to use with search.
Annotations are key-value pairs with string, number, or Boolean values. Annotations are indexed for use with filter expressions. Use annotations to record data that you want to use to group traces in the console, or when calling the GetTraceSummaries API.
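One way to keep the two apart is to decide by value type: simple scalars become annotations, everything else becomes metadata. The helper below is a hypothetical sketch of that rule and not part of the X-Ray SDK; the key check reflects my understanding that annotation keys should be alphanumeric (underscores allowed) to work with filter expressions.

```python
import re

# Assumed rule: annotation keys are alphanumeric, underscores allowed.
ANNOTATION_KEY = re.compile(r"^[A-Za-z0-9_]+$")


def split_for_xray(fields):
    """Split a dict into (annotations, metadata) by the value-type rule."""
    annotations, metadata = {}, {}
    for key, value in fields.items():
        is_scalar = isinstance(value, (str, int, float, bool))
        if is_scalar and ANNOTATION_KEY.match(key):
            annotations[key] = value  # indexed, usable in filter expressions
        else:
            metadata[key] = value  # stored with the trace, not searchable
    return annotations, metadata


# Hypothetical fields recorded while handling one request.
annotations, metadata = split_for_xray({
    "flow": "http",
    "retries": 2,
    "cache_hit": False,
    "response": {"status": 200, "body": "..."},
})
print(annotations)
print(metadata)
```

In the Lambda code, each annotation would then go through the SDK's annotation call and each metadata entry through its metadata call on the current subsegment.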
X-Ray groups enable customers to slice and dice their X-Ray service graph and focus on certain workflows, applications, or routes.
Customers can create a group by setting a filter expression. All the traces that match the set filter expression will be part of that group. Customers can then view service graphs for the selected group, and understand performance bottlenecks, errors, or faults in services belonging to that service graph.
Deep dive into AWS X-Ray groups and use cases
A useful tip: because annotations can be used in filter expressions, you can easily create groups for smaller bounded contexts within a larger service map and then create CloudWatch alarms per group.
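As a sketch of what such a group's filter expression might look like, the helper below builds one from annotation key-value pairs. The function `annotation_filter` and the annotation key `flow` are hypothetical; the `annotation.key = value` form and the `AND` operator follow the X-Ray filter expression syntax as I understand it, and string values are not escaped here.

```python
def annotation_filter(**annotations):
    """Build a filter expression matching every given annotation."""
    clauses = []
    for key, value in annotations.items():
        if isinstance(value, bool):
            literal = "true" if value else "false"
        elif isinstance(value, (int, float)):
            literal = str(value)
        else:
            # Strings are quoted; embedded quotes are not handled.
            literal = '"{}"'.format(value)
        clauses.append("annotation.{} = {}".format(key, literal))
    return " AND ".join(clauses)


print(annotation_filter(flow="http"))
```

The resulting string is what you would paste into the filter expression field when creating a group in the X-Ray console.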
There are a couple of X-Ray quirks that I need to document. I thought it better to show them than to refactor the pattern to hide them and have you hit one later. If these are a deal breaker for you, there are other tracing tools, like Epsagon, that I have been promised will integrate with no extra code changes.
There is a known bug where the trace does not connect across SQS to the consuming Lambda, so you end up with two paths on your service map.
I have included some logic inside the SQS subscriber Lambda to move an X-Ray custom subsegment from the new second flow to where it should be, but this is a workaround and hopefully that bug gets closed sooner rather than later.
X-Ray does work as expected with SNS when using the AWS SDK. For some reason, though, when I do a direct integration with API Gateway through VTL, the service map shows the subscribers of the SNS topic as being connected to API Gateway rather than SNS. That is acceptable because no information is missing, but it is not correct. If I work out a fix for this I will update the pattern.
Unlike other tracing solutions I have used, which let you position all of the circles wherever gives you a feeling of inner calm, X-Ray randomly positions the circles on every refresh, which can lead to some interesting map layouts. The important thing is being able to spot anomalies, which you can definitely still do, so this is purely aesthetic.
The cdk.json file tells the CDK Toolkit how to execute your app.
This project is set up like a standard Python project. The initialization process also creates a virtualenv within this project, stored under the .env directory. To create the virtualenv it assumes that there is a python3 (or python for Windows) executable in your path with access to the venv package. If for any reason the automatic creation of the virtualenv fails, you can create the virtualenv manually.
To manually create a virtualenv on macOS and Linux:
$ python3 -m venv .env
After the init process completes and the virtualenv is created, you can use the following step to activate your virtualenv.
$ source .env/bin/activate
If you are on a Windows platform, you would activate the virtualenv like this:
% .env\Scripts\activate.bat
Once the virtualenv is activated, you can install the required dependencies.
$ pip install -r requirements.txt
At this point you can now synthesize the CloudFormation template for this code.
$ cdk synth
To add additional dependencies, for example other CDK libraries, just add them to your setup.py file and rerun the pip install -r requirements.txt command.
- cdk ls: list all stacks in the app
- cdk synth: emits the synthesized CloudFormation template
- cdk deploy: deploy this stack to your default AWS account/region
- cdk diff: compare deployed stack with current state
- cdk docs: open CDK documentation
Enjoy!