Write-up published
Resolved
Overview: Following a deployment of updates to our monitoring system, some monitors failed to execute. Once the issue was resolved, all missed monitor runs were back-filled and executed so that no time range was left uncovered for any monitor.
Resolution and follow-up: We identified a problem with the caching mechanism at the heart of the monitor execution system: when a model change is introduced, monitor runs can fail if the various caches are not fully invalidated. A job was running to flush individual cache keys, but its effects were being overridden by stale data still held elsewhere in the system. To resolve the incident we had to flush all monitor-related caches while also restarting the execution pods.
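To illustrate the failure mode, here is a minimal sketch of a delete-then-write-back race in a read-through cache. All names are hypothetical and simplified for illustration; this is not Axiom's actual implementation.

```python
# Hypothetical sketch: an in-flight worker re-poisons a cache key
# after the invalidation job has flushed it.

cache = {}
db = {"monitor:1": {"schema": "v2"}}  # the model change has landed in the DB


def read_through(key):
    # Workers read from the cache, falling back to the DB on a miss.
    if key not in cache:
        cache[key] = db[key]
    return cache[key]


# 1. Before the deploy, a worker populated the cache with the old model
#    and still holds a reference to that stale copy.
cache["monitor:1"] = {"schema": "v1"}
stale = cache["monitor:1"]

# 2. The invalidation job flushes the key...
cache.pop("monitor:1", None)

# 3. ...but the in-flight worker writes its stale copy back, so a later
#    read-through returns the old model instead of hitting the DB.
cache["monitor:1"] = stale

assert read_through("monitor:1")["schema"] == "v1"  # stale data persists
```

Flushing every monitor-related cache and restarting the execution pods closes this window: once no in-flight worker holds a pre-deploy copy, nothing can write stale state back after the flush.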
May 17, 2024
13:10 UTC - Failed monitor runs were observed
Following a deployment introducing a new type of monitor, we started seeing an increased number of failed monitor runs in our monitor execution system.
13:34 UTC - A job to invalidate all monitor caches is started
We start seeing a steady reduction in the number of failing monitor runs.
14:00 UTC - Some failures remain following monitor cache invalidation
The cache invalidation job has a race condition that allows some stale data to persist in the system, although the number of failing monitor runs is now down to 20%.
14:30 UTC - Cache invalidation job is restarted with the race condition fixed
The invalidation job is run a second time, and we see a steady reduction of failed monitor runs towards zero.
15:15 UTC - Incident is fully resolved
Based on our investigation, 58% of monitors experienced a delay in their execution. Throughout the incident, our team communicated updates via the status page until the issue was resolved.
In response to this incident, we are taking several actions to prevent similar issues in the future:
Thorough Investigation: We have conducted an in-depth analysis to understand the cause of the caching issue and identified improvements in our deployment processes and cache management.
Improved Cache Management: We will enhance our cache invalidation strategies to prevent similar issues when introducing model changes.
Deployment Process Review: We are evaluating our deployment process to better handle changes that might affect the monitor execution system, ensuring such changes do not cause outages.
Enhanced Monitoring and Alerts: We are implementing more sophisticated monitoring and alerting mechanisms to detect potential issues more rapidly and to prevent service degradation.
We apologise for any inconvenience this may have caused and understand how critical monitor execution is for confidence in system health. In this instance, our playback system ensured complete coverage of monitors that could not run due to the incident, but our next steps are focused on reducing the need for this playback behavior altogether.
We thank customers for their patience and understanding and invite anyone with concerns or questions to reach out to our team directly at support@axiom.co.
Resolved
The incident has been resolved; all monitors should be back to normal.
Monitoring
A fix has been implemented and we are monitoring the results.
Identified
The issue has been identified and a fix is being implemented. The majority of monitors should be working again.
Investigating
A subset of monitors are currently experiencing issues and are offline. The team is investigating.