Dekel Tankel
posted this on April 29, 2011 15:43
We had two different service interruptions on April 25th and 26th. We want to provide you with more details as to what caused those interruptions and what we are doing to address those problems.
For those of you fluent in Cloud Foundry architecture and our open source code base, some of the component names will be recognizable. For others, here are a few pointers that you can use to follow along:
April 25, 2011
At 5:45am PDT our monitoring systems detected an intermittent failure in two DEA nodes. While not a normal event, this particular condition is one that Cloud Foundry is designed to recover from with ease, and in this case the system did exactly that. Over the next 30 minutes, however, the failure became more persistent, spread to additional DEA nodes, and started to impact the Cloud Controller.
At 6:11am PDT our monitoring systems issued a high volume of alerts. The DEA nodes had stabilized but the Cloud Controllers (all 8) had lost all connectivity to portions of the storage subsystem. As we indicated on support.cloudfoundry.com, this event caused the Cloud Controller and Health Manager to enter into a read-only mode. The customer visible impact of this was that all developer facing control operations (login, logoff, create app, start app, stop app, etc.) were no longer possible. Virtually all commands normally issued from VMC and Spring Tool Suite client resulted in failure.
Existing applications were not impacted by this event and continued to operate normally. The folks most impacted by this event were the developers who had received their access credentials the night before. They could not log in until 3:30pm, when system health and storage connectivity were restored to 100% availability.
The root cause of the failure has been identified. It was a partial outage of a power supply in a storage cabinet, which impacted access to a single LUN. While not a “normal event”, it is something that can and will happen from time to time, and it is the reason that highly available systems are always built at both the hardware and software layers. In this case, our software, our monitoring systems, and our operational practices were not in sync. The combined effect of these three factors was that the Cloud Controller saw a partial loss of connectivity to a single LUN. We did not handle this event properly, and as a result the Cloud Controller declared a loss of connectivity to a piece of storage that it needs in order to process many control operations.
Once the system had entered this state, it took us several hours to validate that we had no loss of data and that the storage cabinet was operating correctly and at full reliability and redundancy.
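As a rough illustration of the read-only behavior described above, here is a minimal Python sketch. It is not the actual Cloud Controller code: the class, method, and operation names are hypothetical, and it assumes a simple model in which any reported loss of storage connectivity switches the controller into read-only mode. Control operations are then refused, while applications that are already running are left alone.

```python
from enum import Enum


class Mode(Enum):
    READ_WRITE = "read-write"
    READ_ONLY = "read-only"


class StorageUnavailable(Exception):
    """Raised when a control operation needs storage that cannot be reached."""


class ControllerSketch:
    """Hypothetical stand-in for a controller; names are illustrative only."""

    def __init__(self):
        self.mode = Mode.READ_WRITE
        self.running_apps = set()

    def report_storage_health(self, reachable):
        # Assumption: even a partial loss of connectivity drops us to read-only.
        self.mode = Mode.READ_WRITE if reachable else Mode.READ_ONLY

    def start_app(self, app_name):
        # Control operations need to write state, so refuse them in read-only mode.
        if self.mode is Mode.READ_ONLY:
            raise StorageUnavailable("control operations are disabled")
        self.running_apps.add(app_name)

    def is_running(self, app_name):
        # Reads do not depend on writable storage in this simplified model.
        return app_name in self.running_apps


controller = ControllerSketch()
controller.start_app("existing-app")
controller.report_storage_health(reachable=False)

try:
    controller.start_app("new-app")
    new_app_started = True
except StorageUnavailable:
    new_app_started = False
```

In this sketch the previously started application keeps running after storage connectivity is lost, while the new start request is rejected until `report_storage_health(reachable=True)` is called again.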
April 26, 2011
One of the action items from the previous day’s partial outage was to develop a full operational playbook for early detection, prevention, and restoration should our systems fail to properly handle any sort of intermittent loss of connectivity to storage. At 8am this effort was kicked off with explicit instructions to develop the playbook, with a formal review by our operations and engineering team scheduled for noon. This was to be a paper-only, hands-off-the-keyboard exercise until the playbook was reviewed.
Unfortunately, at 10:15am PDT, one of the operations engineers developing the playbook touched the keyboard. This resulted in a full outage of the network infrastructure sitting in front of Cloud Foundry. This took out all load balancers, routers, and firewalls; caused a partial outage of portions of our internal DNS infrastructure; and resulted in a complete external loss of connectivity to Cloud Foundry.
This was our first total outage, that is, an event where we need to put up a maintenance page. Some of you may have noticed that while the www.cloudfoundry.com maintenance page was posted correctly, the page designed to cover all applications at *.cloudfoundry.com was not. This issue has been corrected.
During this outage, all applications and system components continued to run. However, with the front-end network down, we were the only ones who knew that the system was up. By 11:30am PDT the front-end network infrastructure was fully operational.
Summary
We take full responsibility for these issues and apologize to our users who were impacted by them. We can and will do better, having already learned from these incidents. We greatly appreciate your patience as we improve our service and the underlying technology, while building capacity to deal with the extraordinary level of demand that we are experiencing.
We will continue to invest in making Cloud Foundry the best and easiest place to deploy and use applications anywhere. We hope you’ll join us, and work with us, as we do so.
-The Cloud Foundry Team