Platform Instability

Incident Report for Proto

Postmortem

A long-running migration script performed large batch inserts without proper transaction handling. When one batch failed, the error was not followed by a rollback, which left the entire transaction open. In PostgreSQL, row-level locks acquired by a transaction are held until it commits or rolls back (MVCC lets readers proceed against older row versions, but writers must wait), so the open transaction blocked concurrent write operations on the affected rows.
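The failure mode can be reproduced in miniature. The sketch below (using SQLite in place of PostgreSQL, with illustrative table and column names, not the actual migration schema) runs every batch inside a single transaction; when one batch raises, the error is swallowed without a rollback, so the transaction stays open and its write locks stay held:

```python
import sqlite3

# Anti-pattern: all batches share ONE transaction, and a failure is
# caught without rolling back. Schema and data are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (id INTEGER PRIMARY KEY, email TEXT NOT NULL)")

batches = [
    [(1, "a@example.com"), (2, "b@example.com")],
    [(3, None)],                      # bad row: violates NOT NULL
    [(4, "d@example.com")],
]

try:
    for batch in batches:
        conn.executemany("INSERT INTO contacts (id, email) VALUES (?, ?)", batch)
    conn.commit()                     # never reached: batch 2 raises first
except sqlite3.IntegrityError:
    pass                              # error swallowed, no rollback issued

# The transaction is still open, so its write locks are still held --
# in PostgreSQL this is what blocked concurrent writers.
print(conn.in_transaction)            # True
```

In PostgreSQL the equivalent stuck session shows up in `pg_stat_activity` with state `idle in transaction`, which is how it was found during the incident.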

Resolution:

  • Identified the stuck transaction via pg_stat_activity and terminated its backend with pg_terminate_backend, rolling the transaction back.
  • Cleared pending jobs that had backed up behind the locked rows.
  • Re-ran affected parts of the migration using safer batching logic with commit checkpoints.
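The batching logic in the last step can be sketched as follows, again with SQLite and illustrative names rather than the actual migration code: each batch commits independently, and a failing batch is rolled back immediately so no locks outlive the error.

```python
import sqlite3

# Sketch of batching with commit checkpoints: a failure loses only the
# failing batch, and its locks are released right away by the rollback.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (id INTEGER PRIMARY KEY, email TEXT NOT NULL)")

batches = [
    [(1, "a@example.com"), (2, "b@example.com")],
    [(3, "c@example.com")],
    [(4, None)],                      # bad row: violates NOT NULL
]

failed = []
for i, batch in enumerate(batches):
    try:
        conn.executemany("INSERT INTO contacts (id, email) VALUES (?, ?)", batch)
        conn.commit()                 # checkpoint: this batch is now durable
    except sqlite3.IntegrityError:
        conn.rollback()               # release locks immediately
        failed.append(i)              # record the batch for a corrected re-run

print(conn.execute("SELECT COUNT(*) FROM contacts").fetchone()[0])  # 3
print(failed)                                                       # [2]
```

Only the failing batch needs to be fixed and re-run; the two committed batches survive, and the connection is back in autocommit state between batches.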
Posted Jul 03, 2025 - 22:00 PDT

Resolved

The platform became unstable at approximately 11:06 AM PHT on 2025-07-04 (8:06 PM PDT, July 3) and remained degraded for roughly 20 minutes. A data migration task issued a high volume of queries, causing the database to become unresponsive. We reduced the volume of data being migrated and the database stabilized. We will limit the volume of data processed during future migrations and continue monitoring to prevent recurrence.
Posted Jul 03, 2025 - 20:46 PDT

Identified

All systems are now operating normally. We will continue to monitor performance to ensure full stability.
Posted Jul 03, 2025 - 20:29 PDT

Update

We are continuing to investigate this issue.
Posted Jul 03, 2025 - 20:21 PDT

Update

We are continuing to investigate this issue.
Posted Jul 03, 2025 - 20:19 PDT

Investigating

We’re investigating increased Internal Server Errors (500) across the platform. We appreciate your patience.
Posted Jul 03, 2025 - 20:15 PDT
This incident affected: Hosting (AWS ec2-eu-central-1, AWS rds-eu-central-1, AWS s3-eu-central-1, AWS elb-eu-central-1, Cloudflare CDN/Cache, Cloudflare Pages, Cloudflare Load Balancing and Monitoring, Cloudflare Infrastructure, Cloudflare Gateway), Channels & Integrations (Webchat, Africa's Talking, Bitrix24, LINE, Messenger, WhatsApp, Telegram), and Dashboard, Inbox, AI Assistants, Livechats, Tickets, People.