Update Workers Runbook

This runbook is a general reference for the update workers and, in particular, a guide to follow when something goes wrong with them.

What are the update workers?

The update workers are the processes that publish our customers' posts to the social networks. They run on Kubernetes, and there are over 200 of these processes. They listen to AWS SQS queues and act on the messages in them; each message contains an update id and some additional metadata.
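
As a rough illustration of the message shape described above, the Python sketch below parses an SQS message body into an update id plus metadata. The JSON layout and the field names `update_id` and `metadata` are assumptions made for this example; the real message format is not documented here.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class UpdateMessage:
    """An update id plus whatever metadata travelled with it (hypothetical shape)."""
    update_id: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def parse_update_message(body: str) -> UpdateMessage:
    """Parse a raw SQS message body, assumed here to be a JSON object."""
    payload = json.loads(body)
    if not isinstance(payload, dict) or "update_id" not in payload:
        raise ValueError("message does not contain an update id")
    return UpdateMessage(
        update_id=str(payload["update_id"]),
        metadata=payload.get("metadata") or {},
    )
```

Rejecting a message that has no update id early makes a malformed message show up as an explicit error rather than as a silent no-op further down the worker.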

DataDog dashboards

Twitter
Facebook
Instagram
LinkedIn
Google
Pinterest

These dashboards break down each social network's posting activity across several dimensions. For example, it is possible to see how successful image posts to Facebook groups are, or how successful video posts to Twitter are. The social networks often treat different post types differently, so this is valuable for seeing whether a problem is isolated to a particular network and post type.

Rather than clicking between all these dashboards, there is also a generic dashboard that can be used to drill down into specific combinations. It works across networks as well, for example to investigate whether there is a global image problem when posting from Buffer (as might happen if our S3 connection fails).

DataDog monitors

A number of monitors are set up in DataDog for the update workers. Although they are mainly intended to trigger alerts, they are also worth looking at to see what 'normal' looks like.

Consistent update broadcaster delay problem
Updates queue backlog
Abnormal Error Rates
Twitter posting delay
Lower percentage of on-time updates
Not enough update workers on Kubernetes running
Updates look to be delayed!
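
To check a queue backlog by hand rather than through the monitor, something like the following Python sketch could be used. It relies on the SQS `GetQueueAttributes` call as exposed by boto3; the queue URL is whatever you pass in, and the threshold of 1000 is an arbitrary example value, not the one the DataDog monitor uses.

```python
def queue_backlog(sqs_client, queue_url: str) -> int:
    """Return the approximate number of visible messages waiting in a queue.

    `sqs_client` is expected to behave like `boto3.client("sqs")`.
    """
    response = sqs_client.get_queue_attributes(
        QueueUrl=queue_url,
        AttributeNames=["ApproximateNumberOfMessages"],
    )
    # SQS returns attribute values as strings.
    return int(response["Attributes"]["ApproximateNumberOfMessages"])


def is_backlogged(sqs_client, queue_url: str, threshold: int = 1000) -> bool:
    """True when the backlog exceeds `threshold` (an arbitrary example value)."""
    return queue_backlog(sqs_client, queue_url) > threshold
```

With boto3 installed and AWS credentials configured, this would be called as `is_backlogged(boto3.client("sqs"), queue_url)`. Passing the client in as an argument keeps the functions easy to exercise with a stub.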

In case of problems with the update workers

Look to see what networks are affected
Look to see what post types are affected

Next steps depend on the answers to these questions.
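
The two questions above amount to grouping failures by network and by post type. A minimal Python sketch of that triage follows, assuming the failures are available as `(network, post_type)` pairs. That input shape is hypothetical; in practice the numbers would be read off the dashboards.

```python
from collections import Counter
from typing import Iterable, Tuple


def summarise_failures(failures: Iterable[Tuple[str, str]]):
    """Count failures per network and per post type."""
    pairs = list(failures)
    by_network = Counter(network for network, _ in pairs)
    by_post_type = Counter(post_type for _, post_type in pairs)
    return by_network, by_post_type


def is_localised(by_network: Counter) -> bool:
    """True when every observed failure belongs to a single network."""
    return len(by_network) == 1
```

For example, `summarise_failures([("facebook", "image"), ("facebook", "video")])` yields a single network in `by_network`, so `is_localised` returns `True`. The sketch treats Facebook and Instagram as separate networks; whether to merge them remains a judgement call.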

If the problem is localised to a single network

(Here, Facebook and Instagram may or may not count as a single network - use your best judgement!)

Try posting directly on the service
Scan Twitter for any reports of problems (provided Twitter isn't down!)
Look at the network's status page for any clues

If you are unable to post directly, that suggests a problem at the network's end. In this situation, there is not much we can do except keep customers informed:

Liaise with advocates with a view to potentially putting up an in-app banner and updating our status page
See if you can reproduce the problem and see what error (if any) returns from the network
If the error is detectable and new, potentially deploy a change in the error handling to give a more intuitive error to customers

If you are able to post directly, that suggests something might have broken at Buffer's end.

See if there are any recent deploys that might be relevant
If so, look into rolling these back to see if that gives any improvement

If it's still not clear, it's time to roll up the sleeves and investigate!

Liaise with advocates with a view to potentially putting up an in-app banner and updating our status page
Make use of a dev server for some live debugging to find out what might be going wrong
Try to find out whether we are making a request to the social network (we should be) and what response, if any, it returns
Report progress and investigation in the #eng-incidents Slack channel
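
One way to answer the request/response question during live debugging is to wrap the outbound call and log what went out and what came back. The sketch below is generic Python; `send` stands in for whatever function actually calls the social network and is an assumption of this example, not a name from the codebase.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger("update_worker.debug")


def logged_call(send: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
    """Call `send`, logging the attempt and either its response or its error."""
    logger.info("sending request: args=%r kwargs=%r", args, kwargs)
    try:
        response = send(*args, **kwargs)
    except Exception:
        # logger.exception records the traceback alongside the message.
        logger.exception("request raised before a response was received")
        raise
    logger.info("received response: %r", response)
    return response
```

If the "sending request" line never appears, the worker is not reaching the network call at all; if it appears without a matching response line, the failure is in the call itself. Take care not to log access tokens or other credentials when using a wrapper like this.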

If the problem is not localised to a single network

In this case, the problem is likely something wrong at Buffer's end, or else some sort of infrastructure / network problem.

Check if the Buffer Publish app itself is working at https://buffer.com
Check for any recent deploys that may have had an effect, rolling back any that may be the cause
Try posting directly on some of the services
Scan Twitter for any reports of problems
Look at the AWS status page

At this point it may still not be clear what the problem is and it's time to investigate by:

Making use of a dev server for some live debugging to find out what might be going wrong
Trying to find out whether we are making a request to the social network (we should be) and what response, if any, it returns
Reporting progress and investigation in the #eng-incidents Slack channel
