Monitors partially unavailable
Updates

Write-up published


Resolved

Overview: Following a deployment of updates to our monitoring system, some monitors failed to execute. Once the issue was rectified, any missed monitor runs were back-filled and executed to ensure no time range was omitted from any monitor's coverage.
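The back-fill behaviour described above can be sketched as enumerating the time windows a monitor skipped and executing each one. This is a minimal illustration only; the helper name and schedule model are assumptions, not Axiom's actual implementation.

```python
from datetime import datetime, timedelta

def missed_windows(last_run: datetime, now: datetime, interval: timedelta):
    """Enumerate the [start, end) ranges a monitor skipped.

    Hypothetical helper: it only illustrates how full coverage can be
    reconstructed from the last successful run and the schedule interval.
    """
    windows = []
    start = last_run
    while start + interval <= now:
        windows.append((start, start + interval))
        start += interval
    return windows

# Example: a 1-minute monitor that last ran at 13:10 UTC, back-filled at 13:13.
runs = missed_windows(datetime(2024, 5, 17, 13, 10),
                      datetime(2024, 5, 17, 13, 13),
                      timedelta(minutes=1))
# Three contiguous windows, so no time range is omitted.
```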

Resolution and follow-up: We identified a problem with the caching mechanism at the heart of the monitor execution system: when a slight model change occurs, monitor runs can fail if the various caches are not fully invalidated. A job was running to flush individual cache keys, but it was being overridden by stale data in the system. To resolve the issue, we flushed all monitor-related caches and restarted the execution pods.
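The failure mode described above, where a flush is overridden by stale data, is a classic write-back race: a worker that read a value before the flush writes it back afterwards. The sketch below reproduces that race and one common guard (a generation counter checked before write-back). All names here are illustrative assumptions, not Axiom's code.

```python
cache = {}          # in-memory stand-in for the shared monitor cache
generation = 0      # bumped on every full flush

def flush_all():
    """Flush job: clears every key and bumps the generation."""
    global generation
    generation += 1
    cache.clear()

def stale_writer(snapshot_gen, key, old_value):
    """A worker that read `old_value` before the flush and writes it back.

    Without the generation check, this re-introduces stale data after the
    flush -- the race described above. With the check, the write is dropped.
    """
    if snapshot_gen == generation:
        cache[key] = old_value

# Reproduce the race:
cache["monitor:1"] = "old-model"
snap = generation                    # worker snapshots state pre-flush
flush_all()                          # caches cleared, generation bumped
stale_writer(snap, "monitor:1", "old-model")
assert "monitor:1" not in cache      # the guarded write-back is rejected
```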

Timeline

May 17, 2024

13:10 UTC - Failed monitor runs were observed

Following a deployment introducing a new type of monitor, we started seeing an increased number of failed monitor runs in our monitor execution system.

13:34 UTC - A job to invalidate all monitor caches is started

We start seeing a steady reduction in the number of failing monitor runs.

14:00 UTC - Some failures remain following monitor cache invalidation

The cache invalidation job has a race condition that is causing some stale data to persist in the system, although the failed run count is now down to 20%.

14:30 UTC - Cache invalidation job is restarted with fixed race condition

The invalidation job is run a second time, and we see a steady reduction of failed monitor runs toward zero.

15:15 UTC - Incident is fully resolved

Impact

Based on our investigation, 58% of monitors experienced a delay in their execution. During the incident, our team communicated the outage through the status page until it was resolved.

Next steps

In response to this incident, we are taking several actions to prevent similar issues in the future:

  • Thorough Investigation: We have conducted an in-depth analysis to understand the cause of the caching issue and identified improvements in our deployment processes and cache management.

  • Improved Cache Management: We will enhance our cache invalidation strategies to prevent similar issues when introducing model changes.

  • Deployment Process Review: We are evaluating our deployment process to better handle changes that might affect the monitor execution system, ensuring such changes do not cause outages.

  • Enhanced Monitoring and Alerts: We are implementing more sophisticated monitoring and alerting mechanisms to detect potential issues more rapidly and to prevent service degradation.
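One common way to improve cache invalidation for model changes, in the spirit of the steps above, is to embed a model version in every cache key, so a version bump makes all old entries unreachable at once, with no per-key flush job and no window for stale write-backs. This is a generic technique sketched under assumed names, not a description of Axiom's implementation.

```python
MODEL_VERSION = 2   # bumped whenever the monitor model changes

def cache_key(monitor_id: str) -> str:
    # Embedding the version in the key means a bump implicitly invalidates
    # every entry written under the previous model.
    return f"monitors:v{MODEL_VERSION}:{monitor_id}"
```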

We apologise for any inconvenience this may have caused and understand how critical monitor execution is for confidence in system health. In this instance, our playback system ensured complete coverage of monitors that could not run due to the incident, but our next steps are focused on reducing the need for this playback behavior altogether.

We thank customers for their patience and understanding and invite anyone with concerns or questions to reach out to our team directly at support@axiom.co.

Fri, May 24, 2024, 12:06 PM

Resolved

Incident resolved, all monitors should be back to normal.

Fri, May 17, 2024, 03:15 PM (6 days earlier)

Monitoring

A fix has been implemented and we are monitoring the results.

Fri, May 17, 2024, 02:52 PM (23 minutes earlier)

Identified

The issue has been identified and a fix is being implemented. The majority of monitors should be working again.

Fri, May 17, 2024, 01:44 PM (1 hour earlier)

Investigating

A subset of monitors is currently experiencing issues and is offline. The team is investigating.

Fri, May 17, 2024, 01:10 PM (34 minutes earlier)