On Saturday, the 7th of August, we had a disruption of our services from 12:35 to 13:15 GMT+1.
At around 12:10 GMT+1, one database server in a cluster malfunctioned in such a way that it appeared healthy while stalling every request sent to it. Requests piled up on the remaining servers as they waited for confirmation from the failed server, and eventually the entire cluster became unresponsive.
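As a purely illustrative sketch (the node names, timeout value, and language here are assumptions for the example, not a description of our production setup), the following Go snippet shows the general failure mode: a caller that waits indefinitely for a reply from a stalled node will let requests pile up, whereas a per-request deadline lets it give up and move on.

```go
// Illustrative only: db-1/db-2/db-3 and the 200ms deadline are hypothetical.
// A stalled node accepts requests but never answers; without a deadline,
// callers block forever and work queues up behind them.
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// queryNode simulates a database node. A healthy node answers quickly;
// a stalled node accepts the request but never responds.
func queryNode(ctx context.Context, node string, stalled bool) (string, error) {
	reply := make(chan string, 1)
	go func() {
		if stalled {
			return // never sends: the request would hang forever
		}
		time.Sleep(10 * time.Millisecond)
		reply <- "ok from " + node
	}()

	select {
	case r := <-reply:
		return r, nil
	case <-ctx.Done():
		return "", ctx.Err() // give up instead of queueing indefinitely
	}
}

func main() {
	nodes := []struct {
		name    string
		stalled bool
	}{
		{"db-1", false},
		{"db-2", true}, // the malfunctioning node
		{"db-3", false},
	}

	for _, n := range nodes {
		// Without this deadline, the call to the stalled node would block
		// forever and later requests would pile up behind it.
		ctx, cancel := context.WithTimeout(context.Background(), 200*time.Millisecond)
		r, err := queryNode(ctx, n.name, n.stalled)
		cancel()
		if errors.Is(err, context.DeadlineExceeded) {
			fmt.Printf("%s: timed out, marking node as suspect\n", n.name)
			continue
		}
		fmt.Println(r)
	}
}
```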
At 12:35 GMT+1, the failed database server was restored to a functional state. However, the sudden rush of queued requests caused an otherwise unrelated process to fail in a way that drove a secondary database to 100% CPU usage. Requests were still getting through, but slowly.
By 13:15 GMT+1, this process had been identified and stopped, and service resumed as normal.
We apologize for the disruption of service, and we are looking into ways to prevent this kind of failure from occurring again.