Queries do not return data for some series

Incident Report for InfluxDB Cloud

Postmortem

Summary

A refactoring intended to improve readability and support future improvements, combined with a limited amount of invalid data in AWS us-west-2, caused the series file and index code path to encounter read and write problems.

At the code level, a change was made to guarantee that certain files were closed on error. However, existing code relied on those files not being closed on error.
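The failure mode described above can be illustrated with a minimal sketch. This is not the InfluxDB code: `MAGIC`, `InvalidHeader`, `open_lenient`, `open_strict`, `read_with_recovery`, and `demo` are hypothetical names invented for illustration. The sketch assumes a caller that recovers from a validation error by reusing the file handle, which only works while the handle stays open.

```python
import os
import tempfile

MAGIC = b"SERIES"  # hypothetical header marker, for illustration only


class InvalidHeader(Exception):
    """Raised when a file does not start with the expected marker."""

    def __init__(self, fh):
        super().__init__("invalid header")
        self.fh = fh  # handle passed back to the caller


def open_lenient(path):
    """Behaviour before the change: on error the file is left open."""
    fh = open(path, "rb")
    if fh.read(len(MAGIC)) != MAGIC:
        raise InvalidHeader(fh)
    return fh


def open_strict(path):
    """Behaviour after the change: the file is always closed on error."""
    fh = open(path, "rb")
    if fh.read(len(MAGIC)) != MAGIC:
        fh.close()
        raise InvalidHeader(fh)
    return fh


def read_with_recovery(opener, path):
    """Caller that relies on the handle staying open after an error."""
    try:
        fh = opener(path)
    except InvalidHeader as err:
        fh = err.fh
        try:
            fh.seek(0)  # raises ValueError if the handle was closed
        except ValueError:
            return None  # the content is silently skipped
    with fh:
        return fh.read()


def demo():
    """Read one file with invalid content through both openers."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"bad-data")
        path = tmp.name
    try:
        return (
            read_with_recovery(open_lenient, path),
            read_with_recovery(open_strict, path),
        )
    finally:
        os.remove(path)
```

In this sketch the same invalid file is still readable through `open_lenient` but yields nothing through `open_strict`. That loosely mirrors how a cleanup change that looks safe, combined with invalid data, can make content disappear from reads without any write failing.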

Finding and fixing the issue was complicated by some extra factors:

The line of code that triggered the bug was not intentionally added, but was rather a remnant of previous debugging. Therefore, it was not immediately the focus of troubleshooting.
A secondary bug existed in the test bucket that we were using to verify the changes. This was the result of a known issue related to illegal characters in that particular bucket.

Why weren't tests included with that refactoring? The team did include a new test with the refactoring, but the specific code that caused the problem was purposefully excluded from the refactor, and therefore, was not covered by the new test.

Why wasn’t the code immediately reverted? The team did revert the code immediately, but we follow the principle of not releasing changes to production unless we have positive confirmation that they will help. The above-mentioned secondary bug caused the verification tests to fail, which blocked that confirmation until the secondary bug was discovered.

Why did this only impact one region? As mentioned above, the problem was triggered by the combination of the code change to guarantee closing files and the presence of certain “illegal” data in storage. Only us-west-2 contained such data.

Did customers lose data? No. All writes succeeded except for a few hours of data that was not written but was saved in our backups (our data durability system is described here: https://docs.influxdata.com/influxdb/cloud/reference/internals/durability/). The missing data was replayed to guarantee that all data was written. Because certain series were not returned in queries, users’ tasks likely did not behave as expected. For example, checks may have stopped writing statuses for specific series, or downsampling may not have occurred as expected. End users can rerun tasks for the impacted time if needed.

Where did the “bad” data come from? The team traced the source of the bad data (zero-length tag values) to a Flux function that bypasses our usual storage client. In doing so it also bypasses the safeguards present in the storage client, because the Flux function does not ensure that all data is valid line protocol.
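As a rough illustration of the kind of safeguard described above, the sketch below rejects points with zero-length tag keys or values before they reach storage. It is an assumption-laden example, not the storage client's actual check: `check_tags`, `write_point`, and the list used as a stand-in for storage are all hypothetical.

```python
def check_tags(tags):
    """Raise ValueError if any tag key or tag value has zero length."""
    for key, value in tags.items():
        if len(key) == 0:
            raise ValueError("empty tag key")
        if len(value) == 0:
            raise ValueError(f"tag {key!r} has a zero-length value")
    return tags


def write_point(sink, measurement, tags, fields):
    """Validate a point, then append it to `sink` (a stand-in for storage)."""
    check_tags(tags)
    sink.append((measurement, dict(tags), dict(fields)))
```

A write path that skips a check like `check_tags` would accept a point such as `{"host": ""}`, which is the shape of data the report identifies as the trigger.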

Timeline

Friday June 11 - Refactoring PR is released, makes its way through CI and CD, and then into production.

Saturday June 12 - Alerts start showing that errors are occurring in the internal environment. Code is reverted there, and the problem appears mitigated. Series begin to be missed in queries in production, but this goes unnoticed.

Sunday June 13 - The team believed that production was stable, so it mostly monitored the internal environment.

Monday June 14 - The team prepares and tests several new versions with different combinations of code revisions, and begins debugging files produced in production. Images with reverted code are built, but tests continue to fail. A small number of customers report that they are missing series in their queries. All of these customers are in the same region (AWS us-west-2). The team confirms that data continues to be written successfully, so there is no underlying data loss.

Tuesday June 15 - More customers report missing data in their queries. Later that evening, the secondary bug causing test failures is uncovered, and it is then confirmed that a code reversion will work.

Wednesday June 16 - The reverted code is released to the internal environment and heavily tested there to ensure that it fixes the problem and does not create new problems. The code is then released only to the impacted cluster. We experience a query outage of approximately 7 or 8 minutes during the deployment, but all series become queryable. It takes approximately 3 hours for all data to be recovered from S3. The team does an investigation and identifies a small amount of data that was collected and not written. The team isolates that data and writes a custom tool to replay this data subset in an efficient manner.

Thursday June 17 - The team replays the small amount of missing data. The team updates all production clusters with the reverted code. The team embarks on a final investigation to confirm that all the problems were fixed, and that no new problems have arisen. The incident is closed.

Next Steps and Process Changes

Enhanced code reviews.
Full audit of notifications to ensure sufficient deadman alerting.
Reviewing error levels in logging to amplify signal and reduce noise in log messages.
Schedule time to clean up cruft in the data (for example the data involved in the secondary bug mentioned above).

Posted Jun 25, 2021 - 22:34 UTC

Resolved

All data has been restored and all known query issues have been resolved. Tasks scheduled within the timeframe of the incident may need to be rerun to ensure the consistency of the data those task runs produce.

A minor code refactoring was merged to improve the query path and support future improvements. This, in combination with unexpected malformed time series data, caused the series file and index code path to encounter read and write problems in the AWS us-west-2 region. The problematic refactored code has been reverted and all malformed data has been removed. All data has been replayed to this region to restore full data integrity.

A full root cause analysis will be added to this incident when it is completed.

Posted Jun 17, 2021 - 23:24 UTC

Update

We are continuing to monitor for any further issues.

Posted Jun 17, 2021 - 15:45 UTC

Update

The vast majority of customer data has been replayed, and customers can expect their data to be present for the most part. However, there remains a very small portion of data that may be missing. The team is working to recover this data from backups today. We don’t have a timeline yet. It should affect a very small portion of customers.

Posted Jun 17, 2021 - 13:15 UTC

Update

We are continuing to monitor for any further issues.

Posted Jun 17, 2021 - 08:43 UTC

Update

We are continuing to monitor for any further issues.

Posted Jun 17, 2021 - 07:05 UTC

Update

We are continuing to monitor and will provide more updates as they become available.

Posted Jun 16, 2021 - 23:49 UTC

Monitoring

We have implemented a fix for query issues (for those affected) that has remedied this known issue over the last 1-2 hours. Queries going forward are no longer affected. We are working to implement a fix for historical queries covering the last 72 hours or more. We will provide more updates as they become available.

Posted Jun 16, 2021 - 22:17 UTC

Identified

We are working on a fix. API query response is improving. This may have affected historical queries or tasks run in the last 72 to 96 hours. We will provide more updates as they become available. Discovery is still ongoing as we work on additional fixes.

Posted Jun 16, 2021 - 21:26 UTC