Service interruptions: degraded performance and platform instability
Incident Report for Proto
Postmortem

Starting on October 7th, Proto’s AICX Platform experienced instability in its Livechat module.

An incident was raised, "GKE & Service interruptions: degraded performance and platform instability", with the root cause identified as a failed upgrade at our vendor, Google Kubernetes Engine (GKE): https://status.cloud.google.com/incidents/WMmjrixdPfBGFKCohYGd

The GKE disruption consisted of repeated failures in its node pool upgrades, which caused Proto’s Livechat module to load continuously. Proto implemented the workarounds advised by GKE around the clock.

GKE reported resolution on Oct 12th. However, certain Proto microservices continued to be affected. GKE gave no advance warning of these effects, which left the platform susceptible to 600+ malicious IPs, bots and crawlers. This susceptibility caused the continuous loading in Proto’s Livechat module to persist.

Proto’s internal and third-party cybersecurity experts immediately imposed strict security measures, including IP and geography banning. These measures provided immediate security and platform stability for our clients while our engineering team worked to investigate, identify, and resolve the underlying microservice issues caused by GKE’s unannounced upgrade.

We sincerely apologize for the inconvenience. We are in direct contact with Google Cloud regarding the assessment of service credits and assurances of advance notice for future changes to its services.

Posted Oct 25, 2023 - 09:30 PDT

Resolved
We have monitored the platform, and it has been stable for 48+ hours.
The incident has been resolved.

We will carry out additional maintenance, for which a new incident will be raised to report progress.
The maintenance process should not affect users.
Posted Oct 25, 2023 - 08:31 PDT
Monitoring
We believe we have found the root cause.
We have deployed a fix, and the platform has been stable for 24+ hours since.

We will continue monitoring 24/7 until the incident is deemed fully resolved.
Posted Oct 24, 2023 - 11:07 PDT
Update
We are performing maintenance.
Users may experience brief instability during the maintenance process.
Posted Oct 22, 2023 - 23:54 PDT
Identified
The messaging system vulnerabilities following the Google Cloud GKE incident have been isolated. Proto has taken measures to restore stability to the Chatbots and Livechat modules. Fixes for the vulnerabilities are in progress. Stability is confirmed across all modules; however, clients may see momentary instability due to our maintenance process. Stand by for confirmation that maintenance is complete. We are in contact with Google Cloud regarding resolution and receipt of service credits.
Posted Oct 21, 2023 - 14:15 PDT
Update
Investigation for Philippines region:

Investigation is complete for the Philippines region and IP blocking has been removed.
Posted Oct 21, 2023 - 09:59 PDT
Investigating
Investigation for Philippines region:

We detected potentially suspicious activity coming from the Philippines region and will be applying feature restrictions over the weekend. These may affect the Chatbots and Livechat modules for select clients without IP whitelisting enabled. We apologize for the inconvenience and will update this page when maintenance is complete.
Posted Oct 20, 2023 - 22:29 PDT
Update
We are implementing various security measures. Livechat and Chatbots modules may be unstable.
Posted Oct 20, 2023 - 21:31 PDT
Monitoring
Proto has imposed immediate security measures and continues to monitor for further suspicious activity.
Posted Oct 20, 2023 - 17:55 PDT
Update
We are taking measures to continue investigating the issue. Livechat and Chatbots modules currently remain unstable.
Posted Oct 20, 2023 - 15:20 PDT
Update
We have noticed that the Chatbots and Livechat modules are taking an excessive amount of time to load. Our security team is currently investigating the root cause and restarting underlying services.
Posted Oct 20, 2023 - 12:24 PDT
Investigating
We have new findings regarding the ongoing incident:

Following the Google Cloud GKE incident, Proto's engineering team detected large spikes in activity from multiple abusive IPs targeting client chatbot deployments. The activity includes SQL injection attempts, brute force and automated attacks. We suspect both automated and manually targeted attempts to exploit the potential vulnerabilities created by the Google Cloud GKE incident.

- We are in contact with Google Cloud.
- Proto has blocked abusive IPs and imposed immediate security measures while continuing to monitor the platform 24/7.
- Cyber forensic/security experts have been brought in for further investigation.

Proto will continue to deliver updates during the investigation. We thank you for your patience and apologize for the impact of this incident.
Posted Oct 20, 2023 - 07:58 PDT
Update
We have deployed a fix and continue monitoring 24/7 for any instability.
Posted Oct 20, 2023 - 03:49 PDT
Update
We have noticed that the Livechat module is taking an excessive amount of time to load. The engineering team is currently investigating.
Posted Oct 19, 2023 - 18:06 PDT
Update
We have deployed a fix and continue monitoring 24/7 for any instability.
Posted Oct 19, 2023 - 13:23 PDT
Update
We have noticed that Livechat and parts of the platform are taking an excessive amount of time to load.
The engineering team is currently investigating the issue.
Posted Oct 19, 2023 - 10:41 PDT
Update
We are continuing to monitor for any further issues.
Posted Oct 19, 2023 - 08:34 PDT
Monitoring
We are continuing to monitor for any further issues.
Posted Oct 19, 2023 - 08:06 PDT
Investigating
We're receiving reports of service interruptions and degraded performance on our platform. We are currently investigating the root cause of this issue.
Posted Oct 19, 2023 - 01:53 PDT
This incident affected: Platform (Chatbots, Livechat, Tickets, Analytics, Hosting, Company).