This page collects resources on the update workers, both for general reference and for use when there is a problem with them.
The update workers are the processes that post our customers' posts to the social networks. They run on Kubernetes, and there are over 200 of these processes. They listen to AWS SQS queues and act on the messages therein; each message contains an update id and some additional metadata.
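As a rough illustration of the message shape described above, here is a minimal Python sketch of turning a raw queue message body into an update id plus metadata. The JSON encoding, the `update_id` key, and the names `UpdateMessage` and `parse_update_message` are assumptions made for illustration; the real workers may encode and name things differently.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class UpdateMessage:
    """One queue message: an update id plus whatever metadata came with it."""
    update_id: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def parse_update_message(body: str) -> UpdateMessage:
    """Parse a raw message body into an UpdateMessage.

    Assumption (hypothetical format): the body is a JSON object with an
    "update_id" key, and every other key is treated as metadata.
    """
    payload = json.loads(body)
    if not isinstance(payload, dict) or "update_id" not in payload:
        raise ValueError("message body has no update_id")
    update_id = str(payload.pop("update_id"))
    return UpdateMessage(update_id=update_id, metadata=payload)
```

Rejecting a body without an update id up front makes malformed messages easy to spot when debugging a worker, rather than failing later in the posting path.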
These dashboards break down posting results for each social network across several dimensions. For example, it is possible to see how successful image posts to Facebook groups are, or how successful video posts to Twitter are. The social networks often treat different post types differently, so this is valuable for seeing whether a problem is isolated to a particular network and post type.
Rather than clicking between all these different dashboards, there is also a generic dashboard which can be used to drill down into specific combinations. It can be used across networks as well, for example to investigate whether there is a global image problem when posting from Buffer (as might happen if our S3 connection fails).
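To make the drill-down idea concrete, here is a small Python sketch that computes success rates per (network, post type) pair, with optional filters. It is not how the dashboards are built; the function name `success_rates` and the `(network, post_type, succeeded)` tuple format are hypothetical, chosen only to show the kind of slicing involved.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple


def success_rates(
    attempts: Iterable[Tuple[str, str, bool]],
    network: Optional[str] = None,
    post_type: Optional[str] = None,
) -> Dict[Tuple[str, str], float]:
    """Return the success rate for each (network, post_type) pair.

    `attempts` is a hypothetical stream of (network, post_type, succeeded)
    tuples. Passing `network` and/or `post_type` narrows the result,
    mirroring a drill-down into one combination.
    """
    totals: Dict[Tuple[str, str], int] = defaultdict(int)
    successes: Dict[Tuple[str, str], int] = defaultdict(int)
    for net, ptype, ok in attempts:
        if network is not None and net != network:
            continue
        if post_type is not None and ptype != post_type:
            continue
        key = (net, ptype)
        totals[key] += 1
        if ok:
            successes[key] += 1
    return {key: successes[key] / totals[key] for key in totals}
```

Filtering on `post_type="image"` with no network filter corresponds to the cross-network check: if image posts are failing on every network at once, the cause is more likely on our side than at any single network.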
There are a number of monitors set up in DataDog around this area. While generally intended to act as triggers of alerts, they can also be useful to take a look at, to see what 'normal' looks like.
- Consistent update broadcaster delay problem
- Updates queue backlog
- Abnormal Error Rates
- Twitter posting delay
- Lower percentage of on-time updates
- Not enough update workers on Kubernetes running
- Updates look to be delayed!
- Look to see what networks are affected
- Look to see what post types are affected
Next steps depend on the answers to these questions.
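The branching on how many networks are affected can be sketched in Python as follows. This is only an illustration of the decision; the function name `triage_scope`, its return labels, and the `merge_facebook_instagram` flag are hypothetical, and whether Facebook and Instagram count as one network remains a judgement call.

```python
from typing import Iterable, Set


def triage_scope(
    affected_networks: Iterable[str],
    merge_facebook_instagram: bool = True,
) -> str:
    """Classify an incident by how many networks are affected.

    Returns "none", "single-network" or "multiple-networks" (labels are
    hypothetical). When `merge_facebook_instagram` is True, Facebook and
    Instagram are counted as one network.
    """
    networks: Set[str] = {name.strip().lower() for name in affected_networks}
    if merge_facebook_instagram and {"facebook", "instagram"} <= networks:
        # Count the pair once by dropping one of the two names.
        networks = networks - {"instagram"}
    if not networks:
        return "none"
    return "single-network" if len(networks) == 1 else "multiple-networks"
```

A single affected network points towards a problem at that network or in our integration with it; several affected networks point towards Buffer's end or shared infrastructure.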
(Here, Facebook and Instagram may or may not count as a single network - use your best judgement!)
- Try posting directly on the service
- Scan Twitter for any reports of problems (provided Twitter isn't down!)
- Look at the network's status page for any clues
If you are unable to post directly, that suggests a problem at the network's end. In this situation, there is not much we can do except provide information to customers:
- Liaise with advocates with a view to potentially putting up an in-app banner and updating our status page
- See if you can reproduce the problem and see what error (if any) returns from the network
- If the error is detectable and new, potentially deploy a change in the error handling to give a more intuitive error to customers
If you are able to post directly, that suggests something might have broken at Buffer's end.
- See if there are any recent deploys that might be relevant
- If so, look into rolling these back to see if that gives any improvement
If it's still not clear, it's time to roll up the sleeves and investigate!
- Liaise with advocates with a view to potentially putting up an in-app banner and updating our status page
- Make use of a dev server for some live debugging to find out what might be going wrong
- Try to find out whether we are making a request to the social network (we should be) and what response, if any, it returns
- Report progress and investigation in the #eng-incidents Slack channel
In this case, the problem is likely something wrong at Buffer's end, or else some sort of infrastructure / network problem.
- Check if the Buffer Publish app itself is working at https://buffer.com
- Check for any recent deploys that may have had an effect, and roll back any that may be the cause
- Try posting directly on some of the services
- Scan Twitter for any reports of problems
- Look at the AWS status page
At this point it may still not be clear what the problem is and it's time to investigate by:
- Making use of a dev server for some live debugging to find out what might be going wrong
- Trying to find out whether we are making a request to the social network (we should be) and what response, if any, it returns
- Reporting progress and investigation in the #eng-incidents Slack channel