Recently, we experienced a traffic surge of roughly 4–5 times our regular load within less than an hour, driven by several events and campaigns in the Asia region. Our database CPUs were provisioned to handle only 3–4 times the regular traffic, so the surge caused throttling.
The surge overloaded our Bot and Gateway services, causing degraded performance and, for some clients and end users of our chatbots, downtime.
The engineering team quickly upgraded our resources, raising the database CPU capacity to handle 9–10 times the regular traffic and horizontally scaling our Kubernetes pods so that database requests could be processed at full speed again.
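For readers curious what horizontal scaling of this kind looks like in practice, a Kubernetes HorizontalPodAutoscaler such as the sketch below adds pods automatically as CPU utilization climbs. The resource names, replica counts, and threshold here are illustrative placeholders, not our actual production configuration.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: gateway-hpa          # hypothetical name for illustration
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gateway            # hypothetical Deployment name
  minReplicas: 4             # baseline capacity
  maxReplicas: 20            # headroom for surges well above baseline
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60   # scale out before pods saturate
```

With a configuration like this in place, a sudden surge triggers scale-out automatically rather than requiring a manual intervention during the incident.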
Since the upgrade, our services have returned to normal.
This incident affected our users for about one hour, and we sincerely apologize for the inconvenience it caused. We are committed to anticipating such surges better and making our systems more robust against them in the future.
Thank you for your patience and cooperation.