On Saturday, the 7th of August, we had a disruption of our services from 12:35 to 13:15 GMT+1.
At around 12:10 GMT+1, one database server in a cluster malfunctioned in such a way that it appeared healthy while stalling every request sent to it. Requests piled up on the remaining servers as they waited for confirmation from the failed server, and eventually the entire cluster became unresponsive.
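As a purely illustrative sketch (the node names, timeout value, and language here are assumptions for the example, not a description of our production setup), the following Go snippet shows the general failure mode: a caller that waits indefinitely for a reply from a stalled node will let requests pile up, whereas a per-request deadline lets it give up and move on.

```go
// Illustrative only: db-1/db-2/db-3 and the 200ms deadline are hypothetical.
// A stalled node accepts requests but never answers; without a deadline,
// callers block forever and work queues up behind them.
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// queryNode simulates a database node. A healthy node answers quickly;
// a stalled node accepts the request but never responds.
func queryNode(ctx context.Context, node string, stalled bool) (string, error) {
	reply := make(chan string, 1)
	go func() {
		if stalled {
			return // never sends: the request would hang forever
		}
		time.Sleep(10 * time.Millisecond)
		reply <- "ok from " + node
	}()

	select {
	case r := <-reply:
		return r, nil
	case <-ctx.Done():
		return "", ctx.Err() // give up instead of queueing indefinitely
	}
}

func main() {
	nodes := []struct {
		name    string
		stalled bool
	}{
		{"db-1", false},
		{"db-2", true}, // the malfunctioning node
		{"db-3", false},
	}

	for _, n := range nodes {
		// Without this deadline, the call to the stalled node would block
		// forever and later requests would pile up behind it.
		ctx, cancel := context.WithTimeout(context.Background(), 200*time.Millisecond)
		r, err := queryNode(ctx, n.name, n.stalled)
		cancel()
		if errors.Is(err, context.DeadlineExceeded) {
			fmt.Printf("%s: timed out, marking node as suspect\n", n.name)
			continue
		}
		fmt.Println(r)
	}
}
```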
At 12:35 GMT+1, the failed database server was restored to a functional state. However, the sudden rush of queued requests caused an otherwise unrelated process to fail in a way that drove a secondary database to 100% CPU usage. Requests were still getting through, but slowly.
By 13:15 GMT+1, this process had been identified and stopped, and service resumed as normal.
We apologize for the disruption of service, and we are looking into ways to prevent this kind of failure from occurring again.