As if we needed another reminder of the increasingly interconnected nature of society these days, we’re in the midst of an incident at Microsoft Azure that is affecting (amongst other things):
1) some AEMO market systems; and
2) in parallel (and partly as a result) our own Market Data services as well.
It is quite ironic that a power surge should take down Azure systems (and affect the electricity market as a result) only hours before the publication of the 2023 ESOO (which warns of other challenges).
(A) Problems at Azure … stemming from a power surge
The Azure status page (located here) currently reads as follows:
‘Multiple services recovering after power/cooling issue – Australia East
Impact Statement: Starting at approximately 08:30 UTC on 30 August 2023, a utility power surge in the Australia East region tripped a subset of the cooling units offline in one datacenter, within one of the Availability Zones. While working to restore cooling, temperatures in the datacenter increased so we proactively powered down a small subset of selected compute and storage scale units, to avoid damage to hardware.
Multiple downstream services were impacted, with targeted communications being distributed via Azure Service Health. Impact to services is limited to Australia East, except for Azure Kubernetes Service (AKS) which has impact in both Australia East and Australia Southeast due to a dependency in the former. If your workloads are protected by Azure Site Recovery or Azure Backup, and you need critical services back online before all services in this datacenter are fully recovered, we recommend either to initiate a failover to the recovery region or recover using Cross Region Restore. Note that any new allocation requests for the Australia East region will automatically avoid the impacted scale units.
Current Status: We have made significant progress in restoring core services, and expect that the vast majority of remaining services should be back online in the next 1-2 hours. After restoring power and stabilizing temperatures, all network infrastructure and 95% of storage services are back online. All premium disk storage has fully recovered, we continue to work towards mitigating the final remaining storage devices. The majority of underlying compute services are back online, with more than 85% of Virtual Machines (VMs) that were impacted now back online and healthy. For the remaining VMs, we are investigating potential issues in connecting to their corresponding storage services.
While many customers have already recovered, we continue to work with downstream impacted services to ensure that they are coming back online in the next 1-2 hours as expected. Further updates will be provided in 60 minutes, or as events warrant.
This message was last updated at 19:51 UTC on 30 August 2023’
Translating into NEM time (which is UTC+10 all year round), this event began at 18:30 on Wednesday 30th August 2023 and this latest update above came a little under 11.5 hours later (11 hours and 21 minutes, at 05:51 on Thursday 31st August 2023).
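For readers who want to check the conversion, here is a minimal sketch of the arithmetic in Python. It assumes only that NEM time is a fixed UTC+10 offset with no daylight saving; the helper name `to_nem_time` and the variable names are illustrative, not part of any AEMO or Azure tooling.

```python
from datetime import datetime, timedelta, timezone

# Assumption: NEM time is AEST (UTC+10) all year, with no daylight saving shift.
NEM_TIME = timezone(timedelta(hours=10), name="NEM")


def to_nem_time(utc_dt: datetime) -> datetime:
    """Convert a timezone-aware UTC datetime to NEM time (hypothetical helper)."""
    return utc_dt.astimezone(NEM_TIME)


# Timestamps quoted in the Azure status message above.
event_start_utc = datetime(2023, 8, 30, 8, 30, tzinfo=timezone.utc)
last_update_utc = datetime(2023, 8, 30, 19, 51, tzinfo=timezone.utc)

event_start_nem = to_nem_time(event_start_utc)
last_update_nem = to_nem_time(last_update_utc)
elapsed = last_update_utc - event_start_utc

print(event_start_nem.strftime("%H:%M %d/%m/%Y"))  # 18:30 30/08/2023
print(last_update_nem.strftime("%H:%M %d/%m/%Y"))  # 05:51 31/08/2023
print(elapsed)                                     # 11:21:00
```

Using a fixed offset rather than a named zone such as Australia/Sydney is deliberate here, because Sydney local time moves with daylight saving while NEM time does not.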
(B) Possibly due to storms in Sydney?
It may be. I note that Ausgrid still uses Twitter for status updates (at least for now) and posted this at 18:53 on Wednesday 30th August:
Is it possible that this storm was the cause of the power surge that affected Azure?
(C) Causing problems for AEMO
At 22:00 NEM time on Wednesday 30th August 2023 the AEMO posted an alert as follows:
‘INC0118077 – Alert – Issue with Azure Australia East Region causing impact to multiple services
AEMO has declared a major incident for the applications hosted Azure Australia East Region. This is an issue at Microsoft end which is being investigated by Microsoft Support.’
… and from that time several more updates were provided. For instance at 03:15 NEM time the AEMO update noted:
‘All impacted hardware previously powered off as a preventative measure has now been powered back on by Microsoft. Microsoft continues mitigation efforts to bring all affected services back online.
Impact to below services remains same.
AEMO Website
MSATS – eMDM
DERR – Distributed Energy Resources Register
Connections Simulation Tool
Consumer Data Platform (CDP) ‘
(D) Causing issues for us, and our clients
As I post this, our Service Status Page (accessible here) still contains too many alerts:
https://www.theregister.com/2023/09/04/microsoft_australia_outage_incident_report/
The Register has some detail from a Microsoft report, which confirms the power issue occurred during the thunderstorms. It notes that only 3 staff were on site to try to get 5 tripped chillers and 2 backup units up and running. Only one restarted successfully.
The loss of cooling created a cascade of problems. “Storage hardware damaged by the data hall temperatures ‘required extensive troubleshooting’ but Microsoft’s diagnostic tools could not find relevant data because the storage servers were down.”
It does raise a question of whether the “public cloud” is a suitable place to house critical infrastructure.