RCA - PrismPOS September 16th, 2023

PrismRBS Incident Review

CS-111326

Incident Date: 9/16/2023 2:33am-3:15pm CDT

Issue Date: 9/21/2023

 

Description of Incident

Microsoft notified PrismRBS at 2:33am CDT that we were a customer affected by a power outage impacting Azure SQL hosted in the East US region, and provided updates every 60 minutes.  Internal monitoring alerted PrismRBS to an issue with US PrismPOS customers starting at 3:03am CDT; Canadian customers were unaffected.  Customer reports started at 7:30am CDT, at which point PrismRBS support and engineering teams began investigating.

PrismRBS engaged Microsoft directly at 9:30am CDT to escalate the issue, which Microsoft acknowledged.

Individual customer databases began recovering at 1:35pm CDT, and the last customer database was remediated at 3:15pm CDT.  At that point PrismRBS considered the incident resolved.

Root Cause

A brief power disruption impacted several compute racks and the underlying network infrastructure, leaving some compute nodes unable to boot.  This resulted in the unavailability of SQL databases hosted in the East US Azure region.

Canadian PrismPOS customers were unaffected.

 

Timeline and Remediation

9/16/2023 2:33am CDT – Microsoft notifies PrismRBS that it is an affected customer of an Azure SQL outage in the East US region.

9/16/2023 3:03am CDT – First alert notification sent to internal resources (delayed by monitoring sampling).

9/16/2023 7:30am CDT – First customer report of outage to frontline support.  Frontline support engages engineering team to investigate.

9/16/2023 9:30am CDT – PrismRBS engages Microsoft to escalate.

9/16/2023 1:35pm CDT – First batch of customer databases is remediated.

9/16/2023 3:15pm CDT – Last batch of customer databases is remediated.
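
The intervals between the timeline entries above can be derived with simple clock arithmetic. The following is a minimal Python sketch for readers who want to reproduce them; the event labels and the helper names (to_minutes, elapsed_minutes, fmt) are illustrative and are not part of any PrismRBS tooling.

```python
# Timestamps copied from the timeline above (all CDT, 9/16/2023).
# The labels are shorthand chosen for this sketch.
EVENTS = [
    ("Microsoft notification", "2:33am"),
    ("First internal alert", "3:03am"),
    ("First customer report", "7:30am"),
    ("Escalation to Microsoft", "9:30am"),
    ("First databases remediated", "1:35pm"),
    ("Last databases remediated", "3:15pm"),
]


def to_minutes(clock):
    """Convert a 12-hour clock string such as '2:33am' to minutes after midnight."""
    time_part, suffix = clock[:-2], clock[-2:].lower()
    hours, minutes = (int(part) for part in time_part.split(":"))
    hours = hours % 12 + (12 if suffix == "pm" else 0)
    return hours * 60 + minutes


def elapsed_minutes(start, end):
    """Minutes between two same-day clock strings."""
    return to_minutes(end) - to_minutes(start)


def fmt(minutes):
    """Render a minute count as 'Xh YYm'."""
    return f"{minutes // 60}h {minutes % 60:02d}m"


if __name__ == "__main__":
    # Gap between each consecutive pair of timeline entries.
    for (prev_label, prev_time), (label, time) in zip(EVENTS, EVENTS[1:]):
        print(f"{prev_label} -> {label}: {fmt(elapsed_minutes(prev_time, time))}")
    # Overall incident window, first entry to last entry.
    print("Total:", fmt(elapsed_minutes(EVENTS[0][1], EVENTS[-1][1])))
```

By this arithmetic the overall incident window (2:33am to 3:15pm CDT) is 12 hours 42 minutes, and the internal alert fired 30 minutes after the Microsoft notification.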

Preventative Measures

Please see the attached RCA from Microsoft for the list of Microsoft preventative measures.

PrismRBS is enabling zone redundancy for US-hosted Azure SQL infrastructure on Monday, September 25, at 12:00am CDT.  This change is not anticipated to cause any downtime.  While PrismPOS Azure SQL infrastructure will still be limited to the East US region, zone redundancy will prevent outages in up to two datacenters from impacting Azure SQL availability.
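
As a rough illustration of why zone redundancy tolerates the loss of up to two datacenters, the toy Python model below treats a database as available while at least one zone holding a replica is still up. It is a simplified sketch, not a description of how Azure SQL implements failover: the three-zone layout, the zone names, and the function name database_available are assumptions made for this example.

```python
# Assumed layout for this sketch: replicas spread across three zones in one region.
REPLICA_ZONES = ("zone-1", "zone-2", "zone-3")


def database_available(zones_down, replica_zones=REPLICA_ZONES):
    """Return True while at least one zone holding a replica is still up."""
    return any(zone not in zones_down for zone in replica_zones)


if __name__ == "__main__":
    # Single-zone deployment: losing that one zone takes the database down.
    print(database_available({"zone-1"}, replica_zones=("zone-1",)))
    # Zone-redundant deployment: losing two of three zones is survivable.
    print(database_available({"zone-1", "zone-2"}))
    # Losing all three zones (a region-wide event) is not.
    print(database_available({"zone-1", "zone-2", "zone-3"}))
```

The last case is the scenario that the geo-replication evaluation is meant to address.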

PrismRBS is evaluating a fully geo-replicated POS infrastructure that extends outside of the Azure East US region.  This would allow full availability across the US in the event that an entire Azure region becomes unavailable.

 

Appendix (all times in CDT)

See attachment for preliminary RCA from Microsoft.

 

Microsoft Azure Issue Summary