Incident Report for InfluxDB Cloud Shutdown in GCP Belgium and AWS Sydney
MESSAGE FROM INFLUXDATA FOUNDER & CTO PAUL DIX
A number of process, planning, and human errors came together at the most inopportune time and led to the incident and its consequences following the shutdown of our multi-tenant InfluxDB Cloud 2.0 service operated on infrastructure in GCP Belgium and AWS Sydney. As we reviewed what happened, we started with the simple requirement that we must never let this kind of thing happen again. Any user or customer on our hosted services must be able to retrieve their data and must have ample warning of large changes or shutdowns. Even with perfect communication, we should expect that some users may not be aware of an impending shutdown ahead of time. On review, we realized that our communications had reached fewer users than we had expected, and that our process for managing the shutdown didn’t include temporarily turning the service off before shutting it down completely (i.e., a scream test).
We deeply apologize for this. We’ve learned from this painful experience and will continue to improve our processes, planning, and communications with our customers. We take our responsibility as the stewards of all of our customers’ data very seriously and will do everything we can to protect it and make it available when you need it.
In early July, as part of broader efforts to consolidate cloud regions in anticipation of the InfluxDB 3.0 product suite, InfluxData discontinued its multi-tenant InfluxDB Cloud 2.0 service in two regions: AWS Sydney and GCP Belgium.
InfluxData communicated the planned shutdown of the regions to customers and users over the span of several months. While many users knew of the upcoming closure and successfully migrated their data, some users were caught unaware by the shutdown of the InfluxDB Cloud service in those regions and requested a copy of their data after the closure. InfluxData ultimately recovered the time series data for the GCP Belgium users. Data in the AWS Sydney region could not be recovered.
InfluxData took the following steps to inform users of the July 2023 shutdown and support them in migrating their data to a different InfluxDB Cloud region:
- Laid out a schedule of email updates to all customers, sending three emails in total to the account contact information we had on file. We sent those emails on February 23, April 6, and May 15, 2023.
- Reached out to customers with whom we had direct sales relationships to help them understand the situation and the timeline, held direct conversations with them, and assisted them in migrating when needed.
- Updated the homepage of the UI for InfluxDB Cloud 2 in those regions with a notice that the service was going to be shut down on June 30, 2023.
- Actively supported every customer or user who responded to any of our communications to migrate their data and workloads to a different InfluxDB Cloud region.
- Continued operating the InfluxDB Cloud regions for a few extra days in case customers had not completed their migrations.
Many users received the above notifications and successfully migrated their data to another cloud region. Unfortunately, some users did not receive the notifications and lost their data when we discontinued the two cloud regions, for which we deeply apologize. While we successfully recovered the time series data for the GCP Belgium users, data in the AWS Sydney region could not be recovered.
Our assumption that the emails, sales outreach, and web notifications would be sufficient was overly optimistic. We could have mitigated the risk that some users might not be aware of our notifications and the shutdown by taking the following additional steps:
- Create a separate category of “Service Notification” emails that customers could not opt out of. These emails would only be for critical service updates and would come from a support/formal alert alias.
- Increase email communications in frequency as the shutdown date approached, with additional reminders 24 hours ahead of, and on the day of, the shutoff.
- Redouble efforts to contact users who have not reduced their reads or writes within the 30 or 45 days before the end-of-life date for the region.
- Conduct at least one scream test by shutting down the services for an hour or more to give users who did not register the notifications a chance to notice that their workloads were not running, and then turn the service back on for a short time period to give those users one last chance to migrate their data.
- Implement a 30-day data retention grace period where we export and store customer and user data in a form where we could readily restore it, before deleting it.
- Add a banner at the top of the status.influxdata.com page as soon as the initial notifications went out (and leave it there until the service was terminated).
PRODUCT & FEATURE END-OF-LIFE PROCEDURES
After analyzing the timeline of events that led up to the shutdown and fallout, we are formalizing and updating our EOL procedures. InfluxData will adhere to the following process for any End-of-Life of Products and Features, including the shutdown of specific cloud regions.
Email Notices: Customers will be given earlier notice (at least six months) of the change in service via the following communication methods:
- Email notifications will be sent to all users and billing contacts associated with the account.
- Emails will be sent as “Service Notification” emails that customers and users cannot opt out of. Service Notifications will be limited to critical service changes/updates and come from a support/formal alert alias.
- The initial email notification will be followed by at least two additional reminders. For longer notice periods (one year plus), reminders will be sent at least quarterly. As the date approaches, additional reminders will be sent. Any new accounts established after the date of the first email notification that would be impacted by the service shutdown will be included in the subsequent reminders.
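As an illustration only, the reminder cadence described above could be sketched as a small scheduling helper. The function name `reminder_dates`, the approximation of six months as 180 days, the 91-day "quarter", and the final-approach offsets (30, 7, 1, and 0 days before the end-of-life date) are all assumptions made for this sketch; the procedure itself only requires at least two reminders, quarterly reminders for notice periods of a year or more, and additional reminders as the date approaches.

```python
from datetime import date, timedelta

# Assumption: "at least six months" is approximated as 180 days.
MIN_NOTICE = timedelta(days=180)
# Assumption: a "quarter" is approximated as 91 days.
QUARTER = timedelta(days=91)
# Assumption: hypothetical final-approach reminders, in days before end-of-life.
FINAL_OFFSETS = (30, 7, 1, 0)


def reminder_dates(first_notice: date, eol: date) -> list[date]:
    """Return sorted reminder dates between the first notice and the EOL date."""
    notice = eol - first_notice
    if notice < MIN_NOTICE:
        raise ValueError("notice period is shorter than six months")

    reminders = set()
    if notice >= timedelta(days=365):
        # Long notice period: remind at least quarterly.
        d = first_notice + QUARTER
        while d < eol:
            reminders.add(d)
            d += QUARTER
    else:
        # Shorter notice period: two evenly spaced reminders.
        reminders.add(first_notice + timedelta(days=notice.days // 3))
        reminders.add(first_notice + timedelta(days=2 * notice.days // 3))

    # Additional reminders as the date approaches.
    for days_before in FINAL_OFFSETS:
        reminders.add(eol - timedelta(days=days_before))
    return sorted(reminders)


if __name__ == "__main__":
    for d in reminder_dates(date(2023, 1, 1), date(2023, 7, 1)):
        print(d.isoformat())
```

A real schedule would also need to pick up accounts created after the first notice, which this sketch does not model.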
Non-Email Notification: Notifications and reminders will be added to the following areas:
- For InfluxDB products that include a UI, add an in-app notification that users must read or acknowledge in order to proceed in the app.
- For annual accounts or accounts billing more than $500 per month, an additional personal outreach by a Sales Team Member or Tech Support will be attempted.
- Add notification to docs.influxdata.com informing visitors of the upcoming end-of-life. The notification will remain from the date of the first email through the end-of-life date.
- Publish reminders in the Community Slack channel on the day of the initial email and again on the day before the event.
- For any planned cluster removal, a banner notification will be added to the Status page located at status.influxdata.com at the same time the first notification gets sent (at least six months in advance of the shutdown).
Fail-Safe Controls: Because the above communication methods may not be 100% effective, InfluxData will implement the following fail-safe controls to allow for a “scream test” with the ability to notify customers via an outage and wait to see who responds before shutting down the service or feature, and before deleting any data. At least 30 days before the service is scheduled to be removed, InfluxData will temporarily disable the service (in a fully reversible manner) for up to 24 hours so that all users relying on the service should be able to detect the service loss and can get assistance. After 24 hours, the service will be returned to normal operation. Depending on the results of the first “scream test” we may perform additional scream tests.
- As soon as the first “scream test” is started, a banner will be added to the top of the status.influxdata.com page, advising about the service that is being removed. This banner will stay in place until the service has been fully removed.
- Feature Flagging: For reversible actions and other changes that can be feature flagged, we will add a feature flag, which enables flexibility to turn off the service for everyone, but still allow it for individual accounts in case of emergency.
- System Backup: Back up the system at the point in time of execution of the end-of-life, including data and configuration when appropriate.
- Export Capability: For any end-of-life event that would cause customer data to be deleted, an export capability will exist to give the customer a reasonable method of exporting their data.
- Waiting Period: There will be at least a 30-day waiting period from disablement of service before the service is actually taken down. This gives customers and users 30 days to report a service outage and to notify us if they were not aware of the end-of-life.
- Data Recovery: If a customer contacts us within the 30-day waiting period, we will restore the service when possible, or will provide backups of data if a restore of the service is not possible.
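To make the feature-flag and waiting-period controls above concrete, here is a minimal sketch. It is not InfluxData's implementation: the `ServiceFlag` class, the account IDs, and the helper names are hypothetical, and a production system would typically back the flag with a configuration service rather than an in-memory object.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta

# From the procedure above: at least 30 days between disablement and teardown.
WAITING_PERIOD = timedelta(days=30)


@dataclass
class ServiceFlag:
    """Hypothetical feature flag: off for everyone except allow-listed accounts."""
    enabled_globally: bool = True
    allowed_accounts: set[str] = field(default_factory=set)

    def is_enabled_for(self, account_id: str) -> bool:
        # An allow-listed account keeps access even while the service is off.
        return self.enabled_globally or account_id in self.allowed_accounts


def earliest_teardown(disabled_on: date) -> date:
    """First day on which the service may actually be taken down."""
    return disabled_on + WAITING_PERIOD


def can_tear_down(disabled_on: date, today: date) -> bool:
    """True only once the full waiting period has elapsed."""
    return today >= earliest_teardown(disabled_on)


if __name__ == "__main__":
    # Disable the service for everyone, then re-enable it for one account
    # that reported the outage (account ID is made up for illustration).
    flag = ServiceFlag(enabled_globally=False)
    flag.allowed_accounts.add("acct-123")
    print(flag.is_enabled_for("acct-123"), flag.is_enabled_for("acct-999"))
    print(earliest_teardown(date(2023, 6, 30)))
```

Keeping the disablement reversible through a flag, rather than removing infrastructure, is what allows the service to be restored for accounts that respond during the waiting period.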
Data Retention: Data retention in InfluxDB Cloud is described in InfluxData’s documentation and SOC-2 Statement.