myshophosting Incident Report
Storage unavailability – 13 January 2011
Executive Summary
On Thursday 13 January 2011 at 16:30 AEDST, myshophosting hosting services in Sydney were interrupted for 2 hours as a result of reduced storage availability.
At 16:30, the myshophosting Network Operations Centre detected multiple failures of hosted services located within the Sydney data centre. Investigation and resolution activities commenced immediately. The following timeline shows the key activities undertaken to identify and resolve the incident:
16:30 – initial failure detected and investigation commenced.
17:40 – fault isolated to storage servicing VMware blade centres.
18:00 – fault logged with hardware vendor.
18:15 – hardware vendor escalated to T2 support.
18:22 – specific hardware fault identified.
18:30 – fault rectified and storage taken out of data protection mode. VM services hosted in Sydney began responding. Process commenced to check all VMs and rectify servers that did not recover automatically.
22:00 – checks and rectification of the entire Sydney VM fleet completed.
Most services were restored shortly after the fault was resolved at 18:30; however, some services remained affected until 22:00.
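The elapsed time between detection and each milestone can be derived directly from the timeline. The following is a minimal illustrative sketch only, not part of myshophosting's tooling; the event labels (such as `isolated` and `fleet_checked`) are hypothetical names chosen for this example, and the timestamps are copied from the timeline above.

```python
from datetime import datetime

# Timestamps copied from the timeline above (13 January 2011, local time).
# The keys are hypothetical labels introduced for this example.
TIMELINE = {
    "detected": "16:30",
    "isolated": "17:40",
    "logged_with_vendor": "18:00",
    "escalated_to_t2": "18:15",
    "fault_identified": "18:22",
    "rectified": "18:30",
    "fleet_checked": "22:00",
}


def minutes_since_detection(event: str) -> int:
    """Return whole minutes between initial detection and the given event."""
    start = datetime.strptime(TIMELINE["detected"], "%H:%M")
    end = datetime.strptime(TIMELINE[event], "%H:%M")
    return int((end - start).total_seconds() // 60)


if __name__ == "__main__":
    # Print each milestone as an offset from the 16:30 detection time.
    for event in TIMELINE:
        print(f"{event}: +{minutes_since_detection(event)} min")
```

On these figures, the fault was rectified 120 minutes after detection, which matches the 2 hour interruption stated above, and the full fleet check completed 330 minutes after detection.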
The post-incident investigation identified the cause as a faulty hardware component that had partially failed, causing the storage system to go into data protection mode to ensure the integrity of all data. This made the data inaccessible to the VMware blade centres and caused the outage. Data protection mode is expected behaviour; however, the intrinsic High Availability of the device should have prevented service from being disrupted. Additionally, the device should have generated an alarm indicating that it had experienced a hardware failure, which would have greatly assisted in troubleshooting. This alarm was not generated.
The incident was not the result of malicious or accidental activity.
Steps are being taken in conjunction with the hardware vendor to prevent a recurrence, and to identify why High Availability did not prevent the service outage and why the device did not raise an alarm.
As always, myshophosting work diligently to identify, resolve and report on incidents as rapidly and professionally as possible.
myshophosting value customer partnerships and aim to provide maximum availability of all services at all times.