- 15:00 UTC(Sat) All objects have been restored.
- 20:30 UTC Script is still running, 137 missing objects have been recovered.
- 18:00 UTC Restore script begins execution.
- 15:30 UTC Fixed a few Factories by hand. We begin finding everything affected and create a restoration script.
- 15:00 UTC Root cause identified. Container builds done on Jan 10 and 11 may have issues:
- build cache layers used by buildkit may be missing
- compose apps may be missing.
- 14:45 UTC Compose applications on hub.foundries.io are no longer available after routine maintenance. Restoring compose applications on hub.foundries.io from backups.
- 14:30 UTC We are investigating a report of devices not being able to update or switch compose applications on hub.foundries.io
The container registry implementation we use from Docker is designed to use shared content-addressable blob storage. Because of this design, operators must run a garbage collection utility to free up unused blobs.
We'd practiced running this tool in a "dry run" mode and results looked okay. We then ran the tool in production on Jan 10. The tool takes a while to run and completed on Jan 11. During this window, customers who'd built compose apps and container images were having issues. After some debug we found that the garage collector seemed to unsafe for:
- Buildkit's build cache layers we push to speed up builds
- Compose app manifests
These objects had been deleted. Fortunately, we have lifecycle management rules set for our production buckets. We also had a log of all objects that were deleted by the garbage collector. We quickly built a tool to parse our log file and restore all objects that had been deleted during this window.