On May 21st, between 08:28 and 10:30 Norwegian time, the Wecomplish Platform was unavailable.
We now know that the outage was caused by PHP reaching its memory limit, which in turn caused OPCache preloading to fail.
As a short-term solution, OPCache preloading was disabled.
As a long-term solution, the PHP memory limit has since been increased and OPCache preloading has been re-enabled.
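For readers unfamiliar with these settings, both are standard php.ini directives. The fragment below is only a sketch: the memory value and the preload script path are assumptions chosen for illustration, not our actual configuration.

```ini
; Illustrative values only; not the real production configuration.

; Upper bound on the memory a PHP script may allocate.
memory_limit = 512M

; Script that OPCache runs at startup to preload code (hypothetical path).
; Leaving this directive empty disables preloading.
opcache.preload = /app/config/preload.php
```

In terms of this sketch, the short-term mitigation corresponds to leaving `opcache.preload` empty, and the long-term fix to raising `memory_limit` and setting `opcache.preload` again.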
Once we understood what the issue was, fixing it took only a few minutes. There were several reasons why it took so long to diagnose the problem:
Our infrastructure vendor Platform.sh was experiencing a region outage (https://status.platform.sh/incidents/qwjjtz7zdt9m) at the same time. We therefore assumed that the error was related to this incident and initially waited for them to resolve the issue.
Once they reported that the issue had been resolved and we were still seeing the error, we created a high-priority ticket with Platform.sh.
The combination of issues was unfamiliar to both the Platform.sh representative and ourselves. The initial Platform.sh representative in charge of the ticket suspected that there was a problem processing the queue and spent some time investigating this, but it turned out to be a dead end.
It was not until the issue was escalated to a senior Platform.sh support representative (at 10:10) that a valid hypothesis was presented and the issue was mitigated.
The user account from which the ticket was created was registered with an email address that is no longer in use. As a result, we initially took a little longer to get back to Platform.sh with additional information on the ticket.
Incidents are always going to occur, and some will be more difficult to diagnose than others. Our main concern is making sure that downtime is as short as possible.
We have updated our troubleshooting documentation to state that Platform.sh should always be contacted right away when an incident occurs, and that we should request escalation of the ticket to a senior representative if the current representative appears to be taking a long time to diagnose the problem.
In addition, we have established this status page to facilitate transparent communication with our stakeholders during and after an incident.