JSON API Action Failure Due to Proxy Downtime

Incident Report for Proto

Postmortem

Root Cause

Azure reported an unexpected failure on the physical host of the virtual machine running the proxy service. As a result, the VM was automatically redeployed to a different host by Azure’s auto-recovery system.

After redeployment, the proxy process did not restart automatically, so the proxy remained unavailable. Because the JSON API Action in the chatbot platform is configured to route all outbound requests through this proxy for static IP purposes, all such requests failed (e.g., with HTTP 403, Connection Refused, or Timeout errors).

Resolution Steps

  1. Azure Auto-Recovery Triggered:
    Azure initiated a VM redeployment on June 29 due to a detected hardware failure.
  2. Post-Redeployment Validation Missed:
    The proxy process did not automatically restart after the VM was moved and booted.
  3. Manual Intervention Performed:
    On June 30, engineers identified the root cause and manually restarted the proxy service.
  4. Service Verified:
    Successful outbound proxy calls were verified using curl and internal platform tests.
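
The verification in step 4 could also be scripted. The following is a minimal sketch in Python, not the check that was actually run: the function name, the proxy URL, and the target URL are all hypothetical placeholders, and it treats any 2xx response received through the proxy as success.

```python
import http.client
import urllib.error
import urllib.request


def request_succeeds_via_proxy(proxy_url: str, target_url: str,
                               timeout: float = 5.0) -> bool:
    """Send one GET to target_url through proxy_url; True on a 2xx response.

    Both URLs are supplied by the caller; nothing here is specific to the
    production proxy.
    """
    # Route both http and https requests through the given proxy.
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    opener = urllib.request.build_opener(handler)
    try:
        with opener.open(target_url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Covers refused connections, timeouts, and HTTP errors such as 403.
        return False
```

For example, `request_succeeds_via_proxy("http://proxy.internal.example:3128", "https://example.com/")` would return False while the proxy is down and True once it accepts and forwards requests again (both addresses are made up for illustration).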

Preventive Actions

Action Item                                                                          | Owner        | Timeline
Configure the proxy to restart automatically on boot or failure                      | DevOps Team  | This week
Implement health check and alert for proxy service status                            | DevOps Team  | This week
Add fallback logic to JSON API Action when proxy is unreachable                      | Backend Team | This week
Migrate the proxy service to AWS with better monitoring and alert notification setup | DevOps Team  | This week
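
As an illustration of the health check and alert item, here is a minimal Python sketch, assuming the proxy listens on a TCP port that a monitor can reach. The names `proxy_is_up` and `ConsecutiveFailureAlert` are hypothetical, as is the threshold of three failed probes; the actual monitoring setup may look quite different.

```python
import socket


def proxy_is_up(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the proxy port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, host unreachable, or timed out.
        return False


class ConsecutiveFailureAlert:
    """Signal an alert once after `threshold` failed probes in a row."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.failures = 0

    def record(self, healthy: bool) -> bool:
        """Record one probe result; return True when an alert should fire."""
        if healthy:
            self.failures = 0
            return False
        self.failures += 1
        # Fire only on the probe that reaches the threshold, to avoid
        # repeating the alert on every later failure.
        return self.failures == self.threshold
```

A scheduler would call `proxy_is_up` at a fixed interval and pass the result to `record`, notifying the on-call channel whenever `record` returns True. A TCP probe only shows that something is listening on the port; a stricter check would send a real request through the proxy.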

Lessons Learned

  • Auto-recovery of infrastructure by Azure does not guarantee application-level readiness.
  • Critical services such as proxy layers should be monitored with service-level health checks post-recovery.
  • JSON API actions should have retry/backoff or graceful degradation paths in case of proxy unavailability.
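
The retry/backoff idea in the last point can be sketched as follows. This is a hypothetical Python helper, not the platform's implementation: `call_with_backoff`, its parameters, and the default delays are assumptions, and `send` stands in for whatever function issues the request through the proxy and raises `ConnectionError` or `TimeoutError` on failure.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_backoff(
    send: Callable[[], T],
    fallback: Optional[Callable[[], T]] = None,
    attempts: int = 4,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call send(), retrying with exponential backoff on connection errors.

    If every attempt fails, return fallback() when one is given,
    otherwise re-raise the last error.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            return send()
        except (ConnectionError, TimeoutError) as exc:
            last_error = exc
            if attempt < attempts - 1:
                # Double the delay each time, capped at max_delay, plus jitter
                # so that many workflows do not retry in lockstep.
                delay = min(max_delay, base_delay * 2 ** attempt)
                sleep(delay + random.uniform(0, delay / 2))
    if fallback is not None:
        return fallback()
    assert last_error is not None
    raise last_error
```

Since the proxy exists to provide a static outbound IP, a fallback that bypasses it would likely be rejected by services that whitelist that IP. A graceful degradation path, such as returning a clear "proxy unavailable" result to the workflow, may therefore be the safer fallback.
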
Posted Jun 29, 2025 - 23:46 PDT

Resolved

Affected Component:
- JSON API Action feature in chatbot workflows
- Azure-hosted virtual machine serving the proxy

Impact Summary:
All API calls made through the JSON API Action using the proxy service failed while the proxy was unavailable. This resulted in failed external HTTP(S) requests, particularly to services that require a static outbound IP for whitelisting.
Posted Jun 29, 2025 - 18:00 PDT