Degraded performance and partial outage

Incident Report for Proto

Postmortem

We recently detected a traffic surge of roughly 100 to 150 times our regular traffic, which caused the majority of our systems to crash and restart repeatedly. The CPU-based auto-scaling on our Kubernetes cluster could not keep up with the rate of increase and the volume of requests, so our pods remained affected after the surge.

We are also investigating the possibility of an infinite loop, infinite recursion, or a memory leak, as well as the possibility of an attack.

The engineering team quickly remediated the issue by killing and restarting all affected pods while investigating the cause.
We are currently implementing changes to our auto-scalers so that they respond more effectively to such events.

We apologize for the impact of this sudden event. The engineering team will continue to monitor and resolve the areas of concern.

Posted Jun 26, 2023 - 11:39 PDT

Resolved

Our services remain fully restored and operational.
The engineering team has identified the cause and is currently implementing measures to prevent similar events.

We will mark this issue as resolved.

Posted Jun 26, 2023 - 11:19 PDT

Update

Our services remain fully restored and operational.

We will continue monitoring and leave the issue open until we have a more detailed postmortem to share with you.

Posted Jun 26, 2023 - 01:48 PDT

Monitoring

We have restarted our services and are currently seeing signs of full recovery. We will continue monitoring until the issue can be deemed fully resolved.

We are also still investigating the cause of the issue.

Posted Jun 25, 2023 - 23:59 PDT

Investigating

We are receiving reports of degraded performance and a partial outage for Livechat and our chatbots. We are currently investigating the issue.

Posted Jun 25, 2023 - 23:30 PDT

This incident affected: Platform (Chatbots, Livechat).