Elevated error rates in US East 1
Incident Report for InfluxDB Cloud
Postmortem

Background

InfluxDB stores data in TSM files. Within these files, each series is identified by a series key, which is a combination of the measurement, tag values, field names, and the organization and bucket IDs. When processes such as compactions or queries need to access a series, they read the series key from the TSM file(s). There is an enforced limit of 64 KB for a series key, but few series keys ever come close to this limit.

For each series, the TSM files store:

  1. The key size in bytes
  2. The key itself
  3. The data for that series

The reader code reads the key size from the first two bytes (stored as a uint16) and then uses that value to read the series key itself: it reads that many contiguous bytes, offset by 2 to account for the size prefix.
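As an illustration only, the following minimal Go sketch shows how such a length-prefixed entry could be decoded. It is not the actual InfluxDB reader; the helper name and the big-endian size prefix are assumptions.

package main

import (
    "encoding/binary"
    "fmt"
)

// readSeriesKey decodes one hypothetical entry: the first two bytes hold
// the key size as a big-endian uint16, followed by the key bytes themselves.
func readSeriesKey(buf []byte) ([]byte, error) {
    if len(buf) < 2 {
        return nil, fmt.Errorf("entry too short: %d bytes", len(buf))
    }
    keySize := binary.BigEndian.Uint16(buf[0:2])
    // Skip the 2-byte size prefix, then read keySize bytes of key.
    end := 2 + int(keySize)
    if end > len(buf) {
        return nil, fmt.Errorf("key size %d exceeds buffer", keySize)
    }
    return buf[2:end], nil
}

func main() {
    // An 11-byte key, preceded by its size (0x000B) as the prefix.
    entry := append([]byte{0x00, 0x0B}, []byte("m,host=a f1")...)
    key, err := readSeriesKey(entry)
    fmt.Println(string(key), err) // m,host=a f1 <nil>
}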

InfluxDB has a design principle of returning a 500 error whenever there is a risk of returning wrong or incomplete data without the user being aware of it.

For redundancy and other reasons, InfluxDB Cloud implements a Kafka queue in front of writes, which can store unsuccessful writes for long periods of time.

Trigger

A user happened to upload data that, by coincidence, had a series key exactly 64 KB in length. The read code, in pseudocode, is roughly:

read(start: 2, end: 2 + key_size)

However, because the key size was stored in a uint16 and the addition was performed in uint16 arithmetic, the end offset overflowed and wrapped around to 1. This resulted in a call such as:

read(start: 2, end: 1)

This code panicked because the end point was before the start point, causing a restart loop in the storage engine. While this only impacted a single TSM file, the system, as designed, returned errors rather than potentially returning incomplete query results. Additionally, the Kafka queue ensured that writes were safe while the impacted storage pods were unavailable.
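To illustrate the wraparound, the following minimal Go sketch (not the engine code itself) shows the uint16 arithmetic and the resulting slice panic; 65535 is the maximum value the uint16 size prefix can hold for a key at the 64 KB limit.

package main

import "fmt"

func main() {
    // The size prefix for a series key at the 64 KB limit.
    var keySize uint16 = 65535

    // The addition is performed in uint16 arithmetic, so it wraps:
    // 2 + 65535 = 65537, which is 1 modulo 65536.
    end := 2 + keySize
    fmt.Println(end) // prints 1

    buf := make([]byte, 70000)
    // Slicing with an end index smaller than the start index panics:
    // "slice bounds out of range [2:1]".
    _ = buf[2:end]
}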

Customer Impact

Subscribers in the impacted region faced read interruptions during this incident. Writes to some buckets were queued, but recovered quickly as service was restored.

Mitigation

The pseudocode shown above has been modified:

2 + key_size

Became

2 + int(key_size)

We have ensured that this mitigation has been applied consistently throughout the codebase.
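As a minimal Go sketch of the difference (the variable name is illustrative), converting the size to int promotes the addition to int arithmetic, so a maximum-size key no longer overflows:

package main

import "fmt"

func main() {
    var keySize uint16 = 65535

    before := 2 + keySize     // uint16 arithmetic: wraps to 1
    after := 2 + int(keySize) // int arithmetic: 65537, as intended

    fmt.Println(before, after) // 1 65537
}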

Posted Jan 27, 2022 - 10:10 UTC

Resolved
This incident has been resolved. An RCA will be posted shortly.
Posted Jan 26, 2022 - 20:00 UTC
Monitoring
A fix has been implemented and queries should be working, although data may be incomplete until service has been fully restored.
Posted Jan 26, 2022 - 19:31 UTC
Update
We have identified a fix and are in the process of rolling it out.
Posted Jan 26, 2022 - 18:33 UTC
Identified
The team have identified the root cause of the issue and are working to restore service.
Posted Jan 26, 2022 - 17:12 UTC
Update
The team are continuing to work to resolve the underlying issues in order to restore service.
Posted Jan 26, 2022 - 16:36 UTC
Update
The team are continuing to work to mitigate the impact on customers' workloads.
Posted Jan 26, 2022 - 15:40 UTC
Update
Some users may find that their tasks and alerts are not currently working.

Whilst writes are working, there may be some delay in data appearing within the UI.
Posted Jan 26, 2022 - 14:41 UTC
Update
We believe that some users in the region are unaffected by this, whilst others will be experiencing a read outage.

The team continue to investigate and resolve this issue.
Posted Jan 26, 2022 - 14:37 UTC
Investigating
We are aware of elevated error rates in both Flux and InfluxQL in the AWS US East region.

Writes continue to succeed, so there is no data loss.

The team is working to correct the problem.
Posted Jan 26, 2022 - 14:30 UTC
This incident affected: AWS: Virginia, US-East-1 (API Queries).