Technical difficulties

Incident Report for Proton Services

Postmortem

May 4th 2023 incident report:

At around 18:25 Zurich time, Proton experienced an outage that prevented users from connecting to Proton services. Services were restored starting at around 19:00 Zurich time, with no data loss, thanks to Proton's redundant architecture.

The immediate cause was a failure of Proton's Redis cluster. Proton's Redis cluster is designed in a highly redundant way, with dozens of nodes spread across multiple datacenters, in order to be insulated from hardware failures or even the failure of an entire datacenter. As a result, downtime of our Redis cluster is a highly unusual and unlikely event.

Starting at around 18:11 Zurich time, Proton started losing Redis cluster nodes unexpectedly. Eventually, at 18:25, enough nodes were lost that the entire cluster failed. Because of the loss of so many nodes, it would have taken too long to restore nodes one by one and the entire cluster needed to be restarted.

As this is not a routine operation, Proton's infrastructure team needed some time to prepare the operation. Our recovery was also delayed because Redis is a caching layer, which means that traffic needed to be gradually introduced while the rebuilt Redis cluster was populated, in order to avoid overloading downstream systems. For this reason, recovery took approximately half an hour.
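The gradual reintroduction of traffic can be illustrated with a small sketch. It is not Proton's actual procedure: the function name, the request rates, and the idea of deriving the admitted share from the cache hit rate are all assumptions made for illustration. The point is only that while the cache is cold, nearly every request falls through to downstream systems, so only a fraction of traffic can be admitted safely.

```python
def admitted_fraction(total_rps, hit_rate, backend_capacity_rps):
    """Fraction of incoming traffic that can be admitted without the
    cache misses exceeding what downstream systems can absorb.

    Hypothetical model: every cache miss becomes one downstream request.
    """
    miss_rps = total_rps * (1.0 - hit_rate)
    if miss_rps <= backend_capacity_rps:
        return 1.0  # cache is warm enough to admit all traffic
    return backend_capacity_rps / miss_rps


# Illustrative numbers only: 10,000 requests/s of demand,
# downstream systems able to absorb 1,000 requests/s of misses.
for hit_rate in (0.0, 0.5, 0.95):
    share = admitted_fraction(10_000, hit_rate, 1_000)
    print(f"hit rate {hit_rate:.2f}: admit {share:.0%} of traffic")
```

Under these assumed numbers, an empty cache allows only a small share of traffic through, and the share grows as the rebuilt cluster fills and the hit rate rises.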

After an investigation, the root cause was found to be Redis nodes running out of memory. Proton's Redis infrastructure usually prevents this with strict memory limits that are put in place on each node. However, due to a mistake in setting these limits, some of Proton's Redis nodes were configured with a limit that exceeded the actual system memory. When a Redis node goes down, its traffic is generally moved to the remaining Redis nodes, increasing the load on them. When this happened, it tipped more of the misconfigured Redis nodes above their available memory, causing them to crash as well, and increasing the load on the fewer remaining nodes.

Ordinarily, this would not be fatal due to redundancy in the Redis cluster, but unfortunately the memory misconfiguration error was present on around half of Proton's Redis nodes, and the cluster was unable to function with half of the nodes down.
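The cascade described above can be reproduced with a toy model. This is a simplified sketch, not a model of Proton's real cluster: the node count, memory sizes, and the even spreading of load across surviving nodes are assumptions chosen for illustration. A node whose limit fits in physical memory is assumed to stay up, while a node whose limit exceeds physical memory crashes once its share of the load no longer fits.

```python
def simulate_cascade(physical_gb, limit_gb, demand_gb, first_failure):
    """Return a list of rounds, each listing the nodes that crash in it.

    Toy model (all assumptions): demand is spread evenly over surviving
    nodes. A node is misconfigured if its limit exceeds its physical
    memory; such a node crashes when its share exceeds physical memory.
    Correctly limited nodes cap their usage at the limit and survive.
    """
    n = len(physical_gb)
    crashed = {first_failure}
    rounds = []
    while len(crashed) < n:
        alive = [i for i in range(n) if i not in crashed]
        share = demand_gb / len(alive)
        newly = [
            i for i in alive
            if limit_gb[i] > physical_gb[i] and share > physical_gb[i]
        ]
        if not newly:
            break  # the cluster has stabilised
        rounds.append(newly)
        crashed.update(newly)
    return rounds


# Hypothetical 8-node cluster: nodes 0-3 have a limit above physical memory.
physical = [16, 18, 21, 25, 16, 16, 16, 16]
limits = [32, 32, 32, 32, 12, 12, 12, 12]

# Node 7 fails for an unrelated reason; each round tips one more node over.
rounds = simulate_cascade(physical, limits, demand_gb=120, first_failure=7)
for number, nodes in enumerate(rounds, start=1):
    print(f"round {number}: nodes {nodes} crash")
```

In this example a single unrelated failure takes down all four misconfigured nodes one after another, leaving the cluster with fewer than half of its nodes. With every limit set below physical memory, the same initial failure causes no further crashes.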

The memory limits have now been correctly set on all nodes, which will prevent a recurrence of this issue. We are also introducing additional checks to detect such misconfigurations in the future so they can be proactively found and fixed.
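A check of this kind could look something like the sketch below. It is an illustration under stated assumptions, not the check Proton deployed: the function name, the input format, and the 90% headroom threshold are hypothetical. In practice the two values per node could be read from the `maxmemory` and `total_system_memory` fields that Redis reports in `INFO memory`; a `maxmemory` of 0 means Redis applies no limit.

```python
GIB = 1024 ** 3


def find_misconfigured(nodes, max_fraction=0.9):
    """Flag nodes whose memory limit is missing or does not fit in RAM.

    nodes: iterable of (name, maxmemory_bytes, system_memory_bytes).
    max_fraction: assumed headroom policy, limit must stay below this
    share of system memory (0.9 is an arbitrary illustrative value).
    """
    problems = []
    for name, maxmemory, system_memory in nodes:
        if maxmemory == 0:
            # Redis treats maxmemory 0 as "no limit".
            problems.append((name, "no memory limit set"))
        elif maxmemory > system_memory * max_fraction:
            problems.append((name, "limit does not fit in system memory"))
    return problems


# Hypothetical inventory of three nodes with 16 GiB of RAM each.
inventory = [
    ("redis-a", 8 * GIB, 16 * GIB),   # fine
    ("redis-b", 32 * GIB, 16 * GIB),  # limit exceeds system memory
    ("redis-c", 0, 16 * GIB),         # no limit at all
]
for name, reason in find_misconfigured(inventory):
    print(f"{name}: {reason}")
```

Running such a check against every node on a schedule, or as part of configuration rollout, would surface a limit that exceeds system memory before load ever reaches it.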

We want to apologize again for the inconvenience and thank you for your patience.

Posted May 04, 2023 - 20:25 CEST

Resolved

This incident was resolved starting at approximately 7 PM CEST on May 4, 2023. We apologize for the inconvenience. We confirm no data has been lost.

Posted May 04, 2023 - 19:05 CEST

Update

We apologize for the current situation. No data is lost, but access and notifications may be delayed. We are working on the issue and hope to fully restore our services shortly. We’ll continue sharing updates here as they happen.

Posted May 04, 2023 - 18:57 CEST

Investigating

We are currently experiencing technical difficulties affecting some of our Mail, Calendar, and Drive users. We are working to fully restore services as soon as possible. We apologize for the inconvenience.

Posted May 04, 2023 - 18:31 CEST

This incident affected: Proton Mail (Web Application, Incoming Mail, Outgoing Mail, Bridge, Mobile Apps, Push Notifications), Proton Calendar (Mobile Apps, Web Application), and Proton Drive (Web Application, Mobile and Desktop Apps).