502 Bad Gateway
Incident Report for Wecomplish
Postmortem

On May 21st, between 08:28 and 10:30 Norwegian time, the Wecomplish Platform was unavailable.

We now know that the unavailability was caused by PHP reaching its memory limit, which in turn caused OPcache preloading to fail.

As a short-term solution, OPcache preloading was disabled.

As a long-term solution, the PHP memory limit has since been increased and OPcache preloading has been re-enabled.
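
For context, OPcache preloading compiles a configured set of PHP files into shared memory when PHP starts up. If PHP runs out of memory while executing the preload script, preloading fails, the application workers cannot serve requests, and the web server responds with 502 Bad Gateway. The php.ini excerpt below is a minimal sketch of the settings involved; the path and values are illustrative assumptions, not our actual production configuration.

    ; Illustrative php.ini excerpt (example values, not our production settings)

    ; The preload script is compiled into shared memory at PHP startup.
    ; If PHP hits memory_limit while it runs, preloading fails and the
    ; application starts answering with 502 Bad Gateway.
    opcache.preload=/app/config/preload.php
    opcache.preload_user=web

    ; Raising the limit gives the preload step enough headroom.
    memory_limit=512M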

Why was the Platform down for such a long time?

Once we understood what the issue was, fixing it took only a few minutes. There were several reasons why diagnosing the problem took so long:

Overlapping infrastructure incident

Our infrastructure vendor Platform.sh was experiencing a region outage (https://status.platform.sh/incidents/qwjjtz7zdt9m) at the same time. We therefore assumed that the error was related to this incident and initially waited for them to resolve it.

Once they reported that the issue had been resolved and we were still seeing the error, we created a high priority ticket with Platform.sh.

Unfamiliar, hard-to-diagnose problem

The combination of issues was unfamiliar to both the Platform.sh representative and ourselves. The Platform.sh representative initially in charge of the ticket suspected a problem processing the queue and spent some time investigating it, but this turned out to be a dead end.

It was not until the issue was escalated to a senior Platform.sh support representative (at 10:10) that a valid hypothesis was presented and the issue was mitigated.

Incorrect email recipient for ticket replies

The user account from which the ticket was created was registered with an email address that is no longer in use. As a result, it initially took us a little longer to get back to Platform.sh with additional information on the ticket.

What have we learned?

Incidents are always going to occur, and some will be more difficult to diagnose than others. Our main concern is making sure that downtime is as short as possible.

We have added to our troubleshooting documentation that Platform.sh should always be contacted right away when an incident occurs, and that we should request escalation of the ticket to a senior representative if the current representative appears to be taking a long time to diagnose the problem.

In addition, we have established this status page to facilitate transparent communication with our stakeholders during and after an incident.

Posted Jun 22, 2021 - 07:38 UTC

Resolved
The incident is considered resolved in the short term as a result of disabling OPcache preloading (which has a minimal effect on application performance). We will follow up with a long-term fix and a postmortem indicating which steps we will take to avoid a similar issue in the future.

Thank you for your patience and understanding!
Posted May 21, 2021 - 08:35 UTC
Monitoring
We have identified an issue which appears to be related to OPcache preloading. We are disabling preloading in an attempt to fix the issue in the short term. This appears to have made the application accessible again.
Posted May 21, 2021 - 08:31 UTC
Investigating
Clearing the queue turned out to be insufficient. The issue has been escalated to a senior infrastructure engineer for further investigation.
Posted May 21, 2021 - 08:13 UTC
Update
The infrastructure engineer is still working on clearing the queue.
Posted May 21, 2021 - 07:47 UTC
Identified
The hosting provider has identified an error processing items in the queue (the queue is the functionality that processes time-consuming requests). They are working on clearing the queue to get things back to normal.
Posted May 21, 2021 - 07:27 UTC
Investigating
The platform is unavailable and has been so since 08:28 Norwegian time. Our hosting provider is investigating the issue.
Posted May 21, 2021 - 07:20 UTC
This incident affected: SaaS Platform.