Some monitors are being auto-disabled
Incident Report for Axiom
Postmortem

Incident and immediate action (Jan 25, 2024): Query service degradation was detected and resolved within the same evening; service restored by 2213 UTC.

Investigation and secondary impact (Jan 26-30, 2024): Uncovered and fixed an issue where 84 organizations had monitors auto-disabled due to the service degradation.

Resolution and follow-up: Deployed fixes to prevent query failures from auto-disabling monitors, addressed the root cause that triggered query degradation, and enhanced monitoring for such incidents; all affected customers were notified and monitors re-enabled.

Timeline

All times are in UTC starting on January 25, 2024.

January 25, 2024

1921 - Uplift in query error rates detected

Our monitoring systems flag an uplift in query error rates, which is acknowledged with no further action required.

2104 - Degradation to query service detected

Our monitoring systems alert on a more significant spike in query error rates, which is acknowledged and escalated for incident investigation. Axiom’s status page is updated to reflect the investigation.

2148 - Query service is restarted

With queries starting to fail consistently for customers, our infrastructure team restart Axiom’s query runners, which quickly shows positive signs of service restoration.

2213 - Known incident is marked resolved

Our monitoring systems show that the query service has returned to its healthy state, so the known incident is marked as resolved. Work starts to understand the root cause of the query service degradation to avoid a recurrence.

January 26, 2024

0815 - Team identify potential impact to monitor service

Our team recognise that query service degradation might have inadvertently disabled some monitors for customers, prompting investigation.

1010 - Cause of disabled monitors is understood and fix begins

Upon investigation, our team discover that a feature designed to auto-disable monitors in cases such as schema changes that render a monitor configuration invalid can also trigger during incidents. Work begins to narrow the conditions under which this disabled state is applied.

1015 - Customer report received

Our team receives a report through Slack Connect from a customer who had seen three monitors placed in an auto-disabled state unexpectedly, with emails from Axiom having notified their team.

1447 - Monitors are re-enabled

Our team run a script to re-enable monitors that had been auto-disabled during the incident.
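The selection step of such a script might look roughly like the sketch below. This is not Axiom's actual script: the `Monitor` record, its `auto_disabled_at` field, and the `monitors_to_reenable` helper are all hypothetical names used for illustration. The idea is to pick only monitors that were auto-disabled inside the incident window, leaving manually disabled monitors untouched.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class Monitor:
    # Hypothetical record shape; not Axiom's real data model.
    id: str
    disabled: bool
    # Set only when the system (not a user) disabled the monitor.
    auto_disabled_at: Optional[datetime] = None


def monitors_to_reenable(
    monitors: List[Monitor], start: datetime, end: datetime
) -> List[Monitor]:
    """Return monitors that were auto-disabled within [start, end]."""
    return [
        m
        for m in monitors
        if m.disabled
        and m.auto_disabled_at is not None
        and start <= m.auto_disabled_at <= end
    ]


# Example window: from the first uplift in error rates to the script run.
window_start = datetime(2024, 1, 25, 19, 21, tzinfo=timezone.utc)
window_end = datetime(2024, 1, 26, 14, 47, tzinfo=timezone.utc)
```

Filtering on an explicit auto-disable timestamp, rather than on the disabled flag alone, is what keeps the script from re-enabling monitors that customers switched off deliberately.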

January 29, 2024

1127 - Metrics indicate higher numbers of auto-disabled monitors since Friday

We begin investigation into whether this is related to the incident from last week.

1425 - Remediation for disabling of monitors is deployed

After identifying a stochastic process that continues to trigger auto-disabling of monitors, our team adjusts system configurations to reduce the likelihood of this occurring whilst a fix is being implemented.

1852 - Fix for permissive disabling of monitors is merged

Our team tests and merges changes that narrow the conditions under which a monitor is auto-disabled to ensure query service degradation does not impact monitor service.

2243 - Fix for permissive disabling of monitors is deployed

Fix is rolled out to our production environment.

January 30, 2024

0804 - Fix verified and remaining impacted monitors re-enabled

Our team confirm that no further monitors have been auto-disabled since the fix was rolled out, and re-run the script to re-enable monitors that were disabled since the last run.

Impact

Based on our investigation, a handful of organizations experienced query failures when working with Axiom during the period of query service degradation on January 25th. Regular query behavior was restored within 35 minutes.

As fallout from the query service degradation, 84 organizations had one or more monitors auto-disabled unexpectedly. Our systems were designed to disable monitors automatically when associated queries fail due to dataset schema changes. This helps prevent alert fatigue for large organizations, with the option to re-enable available as needed. Unfortunately, because this code path did not check the cause of a query failure, monitors were disabled when they should not have been. The code path has since been amended to rectify this.
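To illustrate the kind of check described above, here is a minimal sketch in Python. It is not Axiom's implementation: the `FailureCause` categories, the `QueryFailure` record, and the `should_auto_disable` function are hypothetical names chosen for this example. The point is only that the decision to disable depends on why the query failed, not merely on the fact that it failed.

```python
from dataclasses import dataclass
from enum import Enum, auto


class FailureCause(Enum):
    # Hypothetical categories, for illustration only.
    SCHEMA_INVALID = auto()  # e.g. the monitor's query references a removed field
    SERVICE_ERROR = auto()   # e.g. timeouts or errors from a degraded query service
    UNKNOWN = auto()


@dataclass
class QueryFailure:
    cause: FailureCause
    message: str


def should_auto_disable(failure: QueryFailure) -> bool:
    """Auto-disable only when the monitor's own configuration is invalid.

    Service-side or unclassified failures are treated as transient, so the
    monitor stays enabled and is evaluated again on its next run.
    """
    return failure.cause is FailureCause.SCHEMA_INVALID
```

Under this sketch, a failure such as `QueryFailure(FailureCause.SERVICE_ERROR, "query runner unavailable")` leaves the monitor enabled, whereas a schema-related failure still disables it as originally intended.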

Throughout the incident lifecycle, customers were notified automatically by email, as usual, when a monitor was auto-disabled, and could re-enable monitors manually. Our team also ran scripts to re-enable monitors that had been disabled inadvertently during the incident.

Next steps

We take the reliability of our systems very seriously, knowing that our customers depend on our products to have confidence in the health of their own. The investigation and remediation of issues described were our top priority and all impacted organizations have been notified.

To help prevent similar incidents from happening in the future, we have prioritized a series of follow-up actions. Alerts have been configured to track abnormalities in the auto-disabling of monitors, which could have surfaced the fallout from query service degradation sooner. We will revise our incident response plan to account for second-order effects and to communicate with customers more effectively throughout the incident lifecycle. We also continue to investigate the root cause of the temporary rise in query service error rates to build resiliency.
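One simple way an alert on abnormal auto-disabling could work is to compare the number of monitors auto-disabled in the current window against a historical baseline. The sketch below is an assumption about the general approach, not a description of the alerts Axiom configured; the function name `is_abnormal`, the parameters `k` and `min_count`, and the thresholds are all illustrative.

```python
from statistics import mean, pstdev
from typing import Sequence


def is_abnormal(
    history: Sequence[int], current: int, k: float = 3.0, min_count: int = 5
) -> bool:
    """Flag the current window if auto-disable volume is unusually high.

    history:   auto-disable counts from previous windows (e.g. per hour).
    current:   auto-disable count in the window being evaluated.
    k:         how many standard deviations above the mean counts as abnormal.
    min_count: floor below which we never alert, to avoid noise at low volume.
    """
    if current < min_count:
        return False
    if not history:
        # No baseline yet: any count at or above the floor is worth a look.
        return True
    threshold = mean(history) + k * pstdev(history)
    return current > threshold
```

A floor such as `min_count` keeps the alert quiet when only one or two monitors are disabled for legitimate schema reasons, while a sudden jump across many organizations would exceed the baseline and page the team.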

Thank you for your patience and understanding during this incident. Please feel free to contact our team if you have any questions or concerns at https://axiom.co/contact.

Axiom Team

Posted Feb 13, 2024 - 19:41 UTC

Resolved
This incident has been resolved.
Posted Jan 30, 2024 - 07:39 UTC
Monitoring
We've deployed the fix and are monitoring results.
Posted Jan 29, 2024 - 22:45 UTC
Identified
We're currently addressing an incident impacting alerting, where some monitors are being auto-disabled unintentionally. Affected organizations will receive an email in case their monitor gets disabled, and you can re-enable it manually. Our team is prioritizing a fix and implementing measures to prevent recurrence. We apologize for the inconvenience and appreciate your patience.
Posted Jan 29, 2024 - 07:08 UTC
This incident affected: Alerting.