Outage in Azure US-East-1
Incident Report for InfluxDB Cloud
Postmortem

Summary

From 17:01:11 to 18:45:44 on 29 July 2022, users in the Azure US-East-1 region experienced read and write unavailability. An Azure outage led to rate limits on disk management operations, which left pods stuck in the ContainerCreating state when they were restarted.
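As a rough illustration of the symptom, the sketch below flags pods that have sat in ContainerCreating longer than a threshold. It is not the tooling used during this incident. It assumes pod records shaped like the items returned by `kubectl get pods -o json`. The function name `stuck_creating`, the 10-minute threshold, and the sample pod names and timestamps are all invented for illustration.

```python
from datetime import datetime, timedelta, timezone


def stuck_creating(pods, now, threshold=timedelta(minutes=10)):
    """Return names of pods with a container waiting in ContainerCreating
    whose age exceeds `threshold` (hypothetical helper, not from the report)."""
    stuck = []
    for pod in pods:
        # containerStatuses may be absent, so default to an empty list.
        statuses = pod.get("status", {}).get("containerStatuses", [])
        waiting = any(
            s.get("state", {}).get("waiting", {}).get("reason") == "ContainerCreating"
            for s in statuses
        )
        # Kubernetes serializes creationTimestamp as RFC 3339 in UTC.
        created = datetime.strptime(
            pod["metadata"]["creationTimestamp"], "%Y-%m-%dT%H:%M:%SZ"
        ).replace(tzinfo=timezone.utc)
        if waiting and now - created > threshold:
            stuck.append(pod["metadata"]["name"])
    return stuck


# Sample data invented for illustration only.
NOW = datetime(2022, 7, 29, 17, 30, 0, tzinfo=timezone.utc)
SAMPLE_PODS = [
    {
        "metadata": {"name": "storage-0", "creationTimestamp": "2022-07-29T17:00:00Z"},
        "status": {"containerStatuses": [
            {"state": {"waiting": {"reason": "ContainerCreating"}}}
        ]},
    },
    {
        "metadata": {"name": "storage-1", "creationTimestamp": "2022-07-29T17:25:00Z"},
        "status": {"containerStatuses": [
            {"state": {"waiting": {"reason": "ContainerCreating"}}}
        ]},
    },
    {
        "metadata": {"name": "gateway-0", "creationTimestamp": "2022-07-29T16:00:00Z"},
        "status": {"containerStatuses": [
            {"state": {"running": {"startedAt": "2022-07-29T16:00:05Z"}}}
        ]},
    },
]

if __name__ == "__main__":
    print(stuck_creating(SAMPLE_PODS, NOW))
```

The age threshold is there because ContainerCreating is a normal, short-lived state during startup. Only pods that linger in it suggest a problem such as a volume that cannot be attached.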

Background

InfluxDB Cloud is built to be elastic: it resizes available storage to meet the shifting storage demands of our users. As usage of the platform grows, it requests additional storage from the cloud provider, with the general expectation that the provider has disks available.

Trigger

This incident was triggered by:

  1. Lowered API rate limits from Azure due to an outage in the US-East-1 region.
  2. Multiple scaling operations starting in the US-East-1 region.

Learnings & Mitigation Actions

We manually removed invalid Persistent Volume Claims (PVCs) and the nodes whose disks were failing to detach.

Posted Aug 09, 2022 - 22:51 UTC

Resolved
The issue is resolved and all the services are fully available.
Posted Jul 29, 2022 - 23:31 UTC
Update
We are continuing to monitor for any further issues.
Posted Jul 29, 2022 - 23:29 UTC
Update
We are continuing to monitor for any further issues.
Posted Jul 29, 2022 - 23:10 UTC
Monitoring
The issue is resolved and we are monitoring the system now.
Posted Jul 29, 2022 - 23:09 UTC
Update
We are experiencing query performance issues in Azure US-East-1 and are continuing to investigate. Writes were impaired for 10 minutes at 20:40 UTC.
Posted Jul 29, 2022 - 21:58 UTC
Identified
This outage appears to have been caused by an earlier outage in Azure. We are working on bringing services back online.
Posted Jul 29, 2022 - 21:09 UTC
Investigating
We are currently investigating this issue.
Posted Jul 29, 2022 - 21:03 UTC
This incident affected: Azure: Virginia, East US (API Writes, API Queries, Tasks, Persistent Storage, Compute).