Service returning 503s
Incident Report for Axiom
Postmortem

Incident overview: A deployment of several database services inadvertently caused an overload of the supporting relational database, leading to a system-wide outage. Restoration of the service took around 60 minutes because of continuous thrashing in the database system. After service was restored, a small percentage of requests kept timing out due to an issue with our caching layer.

Investigation and Mitigation Efforts: We identified and rolled back the code changes that preceded the overload, upgraded both our database infrastructure and our caching service to handle higher loads more efficiently, and implemented changes to reduce the load on the caching service.

Resolution and Follow-up: The deployment was rolled back, database and caching infrastructure was upgraded, and changes were made to reduce the load on the caching layer, fully restoring operational status.

Timeline

All times are in UTC starting on February 05, 2024.

14:55 - Start of Rollout of New Code

Our team deploys the latest code to our core database services.

15:01 - Deployment Finished

All services finish upgrading. We start seeing increased latencies when processing requests.

15:02 - Huge Drop in Available Memory in Relational Database Service Detected

Our monitoring systems observe a significant drop in available memory in our relational database service, as well as an increase in swap usage, following the deployment. The alert is acknowledged and escalated for incident investigation.
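
For illustration, the signal involved here is host-level: available memory falling while swap usage climbs. The sketch below is hypothetical monitoring code, not our actual alerting stack; the thresholds and the gopsutil dependency are our own illustration of what such a check might look like.

    // Hypothetical sketch of a host-level check for the pattern we alerted on:
    // available memory dropping while swap usage climbs. Thresholds and the
    // gopsutil dependency are illustrative, not our production alerting code.
    package main

    import (
        "fmt"
        "log"

        "github.com/shirou/gopsutil/v3/mem"
    )

    func main() {
        vm, err := mem.VirtualMemory()
        if err != nil {
            log.Fatal(err)
        }
        sw, err := mem.SwapMemory()
        if err != nil {
            log.Fatal(err)
        }

        availablePct := float64(vm.Available) / float64(vm.Total) * 100

        // Illustrative thresholds: page someone if less than 10% of memory is
        // available while more than 20% of swap is in use.
        if availablePct < 10 && sw.UsedPercent > 20 {
            fmt.Printf("ALERT: available memory %.1f%%, swap used %.1f%%\n",
                availablePct, sw.UsedPercent)
        }
    }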

15:08 - Rollback of New Code is Initiated

We begin rolling back the new code. Due to the ongoing incident, replacing services takes longer than usual, resulting in a slower rollback.

15:25 - Relational Database Service is Restarted

In response to the issues, the relational database service is restarted. Right after the restart, we still see a drop in available memory, extensive swap usage and thrashing.

15:27 - Rollback Completed

All services are running the previously deployed version, yet the issue persists.

15:38 - Relational Database Service Restarts Again

The relational database service automatically restarts due to failing health checks.
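
For context, this is the standard behavior of health-checked services: a liveness probe that cannot reach the database within its deadline returns a failure, and the orchestrator restarts the instance. The handler below is a hypothetical sketch of such a check; the route, connection string, timeouts, and Postgres driver are illustrative assumptions, not our actual service code.

    // Hypothetical liveness-style health check: if the database cannot be
    // pinged within a short deadline, the handler returns 503 and the
    // orchestrator eventually restarts the service. Connection string, route,
    // and timeouts are illustrative only.
    package main

    import (
        "context"
        "database/sql"
        "net/http"
        "time"

        _ "github.com/lib/pq" // Postgres driver, assumed for the sketch
    )

    func healthHandler(db *sql.DB) http.HandlerFunc {
        return func(w http.ResponseWriter, r *http.Request) {
            ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
            defer cancel()

            // A database thrashing on swap will miss this deadline, so repeated
            // failures here are what lead to the automatic restart.
            if err := db.PingContext(ctx); err != nil {
                http.Error(w, "database unreachable", http.StatusServiceUnavailable)
                return
            }
            w.WriteHeader(http.StatusOK)
        }
    }

    func main() {
        db, err := sql.Open("postgres", "postgres://localhost:5432/example?sslmode=disable")
        if err != nil {
            panic(err)
        }
        http.HandleFunc("/healthz", healthHandler(db))
        _ = http.ListenAndServe(":8080", nil)
    }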

15:50 - Database Instance Upgraded

Our team upgrades the relational database service to better handle the load and mitigate memory and swap issues.

15:56 - Service Metrics Start to Return to Normal

Following the RDS upgrade, memory usage remains within acceptable limits, the database is fully back online, and we see a gradual decrease in the number of failed requests.

15:59 - We Start Seeing Issues with Our Caching Layer

Our caching layer is overloaded and is slowing down caching operations. This causes failures for about 1% of incoming requests, as our proxies use an increasing amount of memory, leading to a crash after several minutes.
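
To make the failure mode concrete: each in-flight request to the slow cache holds memory in the proxy until it completes, so a slow cache turns into unbounded memory growth. The sketch below is a generic illustration of one way to bound that (a semaphore plus a timeout), assuming a hypothetical cache client interface; it is not our proxy code.

    // Generic illustration of bounding cache lookups so a slow cache cannot
    // pile up unlimited in-flight requests (and their memory) inside a proxy.
    // The cacheClient interface and the limits are assumptions for the sketch.
    package proxy

    import (
        "context"
        "errors"
        "time"
    )

    type cacheClient interface {
        Get(ctx context.Context, key string) ([]byte, error)
    }

    var errCacheBusy = errors.New("cache lookup skipped: too many in-flight requests")

    type boundedCache struct {
        client   cacheClient
        inflight chan struct{} // semaphore capping concurrent lookups
        timeout  time.Duration
    }

    func newBoundedCache(c cacheClient, maxInflight int, timeout time.Duration) *boundedCache {
        return &boundedCache{
            client:   c,
            inflight: make(chan struct{}, maxInflight),
            timeout:  timeout,
        }
    }

    // Get fails fast instead of queueing without bound when the cache is slow;
    // callers can then fall back to the database or return an error.
    func (b *boundedCache) Get(ctx context.Context, key string) ([]byte, error) {
        select {
        case b.inflight <- struct{}{}:
            defer func() { <-b.inflight }()
        default:
            return nil, errCacheBusy
        }

        ctx, cancel := context.WithTimeout(ctx, b.timeout)
        defer cancel()
        return b.client.Get(ctx, key)
    }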

16:08 - Caching Service Continues Performing Sub-optimally

We continue seeing issues with our caching service and start investigating this second-order problem.

16:43 - We Scale Up Our Caching Service

We scale up our caching layer, but this does not significantly improve caching performance.

16:54 - Problem with a Spinning Distributed Lock is Identified

We identify a few code paths that might be causing a number of distributed locks to be continuously acquired and released following the cache invalidations from the initial part of the incident. We begin work to mitigate these.
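
The pattern in question looks roughly like the sketch below: after an invalidation, every request that misses the cache retries a distributed lock in a tight loop while a single holder rebuilds the entry. The lock and cache interfaces here are hypothetical stand-ins, not our actual code.

    // Illustrative sketch of a spinning distributed lock, assuming hypothetical
    // lock and cache interfaces. After an invalidation, many concurrent callers
    // miss the cache and hammer TryAcquire with no backoff, producing a
    // continuous stream of acquire/release traffic against the lock service.
    package cacheexample

    import "context"

    type distLock interface {
        TryAcquire(ctx context.Context, key string) (bool, error)
        Release(ctx context.Context, key string) error
    }

    type cache interface {
        Get(ctx context.Context, key string) ([]byte, error)
        Set(ctx context.Context, key string, val []byte) error
    }

    // getOrRebuild spins: on a cache miss it loops on TryAcquire without any
    // backoff, so callers that lose the race keep retrying as fast as they can.
    func getOrRebuild(ctx context.Context, c cache, l distLock, key string,
        rebuild func() ([]byte, error)) ([]byte, error) {
        for {
            if val, err := c.Get(ctx, key); err == nil {
                return val, nil
            }

            ok, err := l.TryAcquire(ctx, key)
            if err != nil {
                return nil, err
            }
            if !ok {
                continue // <-- no backoff: this is the spin
            }

            val, err := rebuild()
            if err == nil {
                _ = c.Set(ctx, key, val)
            }
            _ = l.Release(ctx, key)
            return val, err
        }
    }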

17:35 - 18:39 - We Deploy Multiple Mitigations

We deploy multiple code changes to reduce the load on our caching service. This fully restores the service.
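
The exact changes are specific to our codebase, but one generic technique for this class of problem, shown below as a sketch rather than necessarily what we deployed, is to collapse concurrent rebuilds of the same key with golang.org/x/sync/singleflight, so only one caller per process touches the lock and cache services.

    // Sketch of collapsing concurrent cache rebuilds with singleflight; the
    // store interface is a hypothetical stand-in, and this illustrates the
    // general technique rather than the exact change we shipped.
    package cacheexample

    import (
        "context"

        "golang.org/x/sync/singleflight"
    )

    type store interface {
        Get(ctx context.Context, key string) ([]byte, error)
        Set(ctx context.Context, key string, val []byte) error
    }

    type collapsedCache struct {
        store store
        group singleflight.Group
    }

    // Get lets only one in-process caller rebuild a missing key; concurrent
    // callers for the same key wait for and share that single result instead
    // of each spinning on a distributed lock.
    func (c *collapsedCache) Get(ctx context.Context, key string,
        rebuild func() ([]byte, error)) ([]byte, error) {
        if val, err := c.store.Get(ctx, key); err == nil {
            return val, nil
        }

        v, err, _ := c.group.Do(key, func() (interface{}, error) {
            val, err := rebuild()
            if err != nil {
                return nil, err
            }
            _ = c.store.Set(ctx, key, val)
            return val, nil
        })
        if err != nil {
            return nil, err
        }
        return v.([]byte), nil
    }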

Impact

The incident resulted in a complete service outage, affecting all customers’ ability to ingest and query data for one hour, followed by reduced reliability for the next ~90 minutes.

Next steps

In response to this incident, we are taking several actions to prevent similar issues in the future:

  1. Thorough Investigation: We’ve conducted an in-depth analysis to understand the cause of the database service overload and identified improvements in our deployment processes and infrastructure configuration.
  2. Infrastructure and Architecture Improvements: We are upgrading our database infrastructure based on learnings and separating shared resources to prevent cascading failures.
  3. Deployment Process Review: We are evaluating our deployment process to identify how we can roll out changes more safely and effectively, including better handling of large batches of changes and improving health checks.
  4. Enhanced Monitoring and Alerts: We are implementing more sophisticated monitoring and alerting mechanisms to detect potential issues more rapidly and to prevent service degradation.
  5. Community and Customer Communication: We are committed to transparent communication with our community and customers regarding both the incident and our steps to improve.

We take this incident seriously and are committed to learning from it to improve our systems and processes. We thank customers for their patience and understanding and invite anyone with concerns or questions to reach out to our team directly at support@axiom.co.

Axiom Team

Posted Feb 13, 2024 - 19:48 UTC

Resolved
This incident has been resolved. We will follow up with a complete post-mortem within the coming week as we learn more.
Posted Feb 05, 2024 - 19:37 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Feb 05, 2024 - 18:39 UTC
Update
We continue to investigate this issue, but we see service levels return to normal.
Posted Feb 05, 2024 - 16:08 UTC
Update
We are continuing to investigate this issue. So far our mitigations have not been successful.
Posted Feb 05, 2024 - 15:40 UTC
Investigating
We are currently investigating this issue.
Posted Feb 05, 2024 - 15:02 UTC
This incident affected: API and Ingest.