Summary – What Happened?
Heroku Service Impact
Beginning on October 20th, 2025 07:06 UTC, multiple Heroku services were initially impacted by a database service disruption at our cloud infrastructure provider in the Virginia region. This affected the monitoring, provisioning, networking, and availability of the cloud infrastructure services that power the Heroku platform.
This outage was a complex, multi-phased event primarily triggered by multiple upstream outages within the cloud infrastructure provider’s Virginia region, and compounded by Heroku’s single-region control plane architecture and automation failures. The service disruptions can be divided into two distinct phases based on root causes, timeline, and impacts.
Phase 1. Upstream Cloud Infrastructure Provider Database Outage (10/20/2025 06:53 UTC – 09:30 UTC)
During phase 1, Heroku’s cloud infrastructure provider’s database service disruption and subsequent throttling resulted in downstream impacts on Heroku services and some customers, including:
- Postgres, Key-Value Store, and hosted Apache Kafka provisioning delays
- Disruption of Heroku’s App Metrics, Autoscaling, and Threshold Alerting due to the inability to read or write metrics to the metrics backend database
- Degradation of Platform API and control plane operations resulting in failed dyno and app management requests triggered through the CLI and Dashboard
- Common Runtime and Private Space dyno provisioning failures and high latency, impacting Builds, deployments, Release Phase, One-off Dynos, and Continuous Integration
Heroku’s Platform API circuit breaker was triggered to block calls to the Runtime control plane, resulting in failed requests to the dyno and app management endpoints. At 09:30 UTC, the cloud provider announced that initial database recovery was complete. This restored CLI and Dashboard operations, as well as Builds.
Phase 2. Upstream Cloud Infrastructure Provider Networking Outage, Heroku Control Plane Failure, and Final Incident Cleanup (10/20/2025 11:27 UTC – 10/21/2025 03:33 UTC)
Phase 2 began at 11:27 UTC when Heroku identified dyno autoscaling issues on Private Spaces caused by the cloud infrastructure provider’s ongoing database issues. Additionally, Heroku began detecting networking degradation. At 13:21 UTC, the cloud infrastructure provider announced a second major service degradation, this time impacting network routing, including customers’ applications with custom domains. These upstream provider issues, combined with Heroku automation and control plane architectural issues, resulted in multiple downstream impacts to Heroku services and customers. During this phase, some customers may have observed the following:
- The inability to access applications, in particular those using custom domains, resulting in H99 platform errors and 5xx status code errors for client applications
- False positive or delayed Postgres, Key-Value Store, and Kafka database health alerts
- Autofailover and replacement of a small subset of customer databases before auto-remediation was manually disabled
- Disruption of App Metrics, Dyno Autoscaling, and Threshold Alerting
- Intermittent failures, and later disruption, of dyno and app management operations due to the deteriorating health of the Platform API’s Key Value Store
- Build, deployment, CLI and Dashboard failures
- Release Phase, Continuous Integration, and App Setup failures
- Delay in webhook and event delivery
- Disruption or delays in Connect synchronizations in the Virginia region
- AppLink request delays and disruptions
At 19:11 UTC, the stalled database failover was manually remediated and the Heroku Platform API recovered. CLI- and Dashboard-initiated dyno and app management operations, as well as metrics-driven functionality (App Metrics, Dyno Autoscaling, and Threshold Alerting), began recovering.
Following the Heroku Platform API recovery, a long-tail cleanup effort was undertaken to fully restore Heroku services and mitigate customer impact, including the following activities:
- All queued webhooks were processed and distributed to customer endpoints once network connectivity was restored.
- The technology team assessed and manually corrected partial failover and replacement operations auto-triggered by the loss in network connectivity.
- Control plane instances were cycled to remediate any residual issues from internal database failovers and replacements, restoring dyno provisioning, scaling and cycling operations.
- Data service auto-remediation was re-enabled.
After impact remediation and monitoring were complete for all Heroku services, the incident was resolved on October 21st, 2025 at 03:33 UTC.
Heroku Incident Communications
Some customers expressed concern about the cadence and level of detail of Heroku-specific status updates during the incident.
Detailed Timeline of Impacts, Investigation, and Root Cause
Phase 1. Upstream Cloud Infrastructure Provider Database Outage (10/20/2025 06:53 UTC – 09:30 UTC)
October 20th, 2025 06:53 UTC. Heroku engineering was first alerted to failures in writing application and language metrics to a database hosted by Heroku’s cloud infrastructure provider. This impacted Heroku’s metrics-driven features, including Application Metrics, Dyno Autoscaling, and Threshold Alerting. At the same time, Heroku Runtime alerts signaled degradation of Heroku’s Platform API and control plane due to the upstream cloud infrastructure provider’s database failure. API requests for dyno management operations, like scaling, stopping, and restarting, began to fail or exhibit very high latency.
October 20th, 2025 07:00 UTC. Heroku’s Postgres, KVS and Apache Kafka Data services experienced significant provisioning latency in the Virginia region. Some customers observed provisioning request delays of up to 4 hours in duration.
October 20th, 2025 07:11 UTC. Heroku’s cloud infrastructure provider notified Heroku and their other customers of increased error rates when interacting with specific platform APIs.
October 20th, 2025 07:26 UTC. Dyno provisioning failures due to upstream cloud infrastructure provider database issues resulted in increased latency for Release Phase, Heroku CI, and post-deploy scripts. Some customers observed deployment failures or slow deployments due to slow-running post-deploy scripts.
October 20th, 2025 07:40 UTC. The Heroku Platform API circuit breakers triggered to block requests to the Runtime control plane, and the Platform API started returning errors for the dyno API endpoints and other app endpoints. During this time, some customer requests to obtain dyno information or to stop, start, and scale dynos failed and returned a 500 or 503 status code.
October 20th, 2025 07:42 UTC. The first incident-related support ticket was logged from a customer who reported missing metrics in the App Metrics dashboard.
October 20th, 2025 09:00 UTC. Heroku’s Postgres, KVS and Apache Kafka Data services provisioning latency in the Virginia region returned to normal.
October 20th, 2025 09:15 UTC. High request retries were observed for Heroku Runtime. Heroku’s cloud provider outage affected dyno provisioning in Private Spaces apps. New Private Space dynos, including one-off dynos, could not be created. Existing dynos could not be scaled or cycled.
October 20th, 2025 09:30 UTC. Heroku’s cloud infrastructure provider updated their status page to indicate signs of recovery. Platform API operations recovered, restoring CLI and Dashboard operations as well as Builds functionality. Mitigated services were closely monitored for signs of stable recovery.
Phase 2. Upstream Cloud Infrastructure Provider Networking Outage, Heroku Control Plane Failure, and Final Incident Cleanup (10/20/2025 11:27 UTC – 10/21/2025 03:33 UTC)
October 20th, 2025 11:27 UTC. Heroku detected autoscaling failures due to the ongoing disruption of the cloud provider’s database service, which left Heroku’s internal metrics aggregation service unable to query the metrics that trigger autoscaling.
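For context on why a metrics backend outage halts autoscaling: scaling decisions are computed from metric queries, so when the backend cannot be read there is no signal to act on. The fragment below is a simplified, hypothetical illustration of that dependency; the function name, latency thresholds, and fallback behavior are invented for this example and do not describe Heroku’s actual implementation.

```python
def decide_dyno_count(query_p95_latency_ms, current_dynos,
                      scale_up_above_ms=500, scale_down_below_ms=100):
    """Hypothetical metrics-driven autoscaling decision.

    The decision depends entirely on a metrics query; if the metrics backend
    cannot be reached, there is no signal to act on, so the formation is left
    unchanged rather than scaled blindly.
    """
    try:
        p95 = query_p95_latency_ms()
    except ConnectionError:
        return current_dynos  # no metrics, no autoscaling decision
    if p95 > scale_up_above_ms:
        return current_dynos + 1
    if p95 < scale_down_below_ms and current_dynos > 1:
        return current_dynos - 1
    return current_dynos
```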
October 20th, 2025 12:15 UTC. Alerts for Heroku webhooks signaled increasing queue depth as Heroku’s platform provider’s networking also degraded. Some customers began to experience disruptions or delays in webhook delivery to customer endpoints, resulting in downstream process delays.
October 20th, 2025 13:21 UTC. Heroku’s cloud infrastructure provider announced a second major service degradation, this time impacting network routing. Some customers with Virginia region apps, particularly those with custom domains, began observing H99 platform errors and 5xx responses on client applications.
October 20th, 2025 13:38 UTC. Heroku’s cloud provider network routing degradation triggered Heroku Connect synchronization performance alerts. Some customers began observing synchronization delays between Salesforce organizations and Heroku Postgres in the Virginia region.
October 20th, 2025 14:55 UTC. Heroku’s cloud provider’s network connectivity issues left Heroku’s Data control plane unable to accurately determine service health. Auto-remediation was triggered for the Heroku Platform API’s Key Value Store, resulting in a partial failover. Customers began observing intermittent failures in dyno and app management operations.
October 20th, 2025 15:05 UTC. Heroku’s Data team paused auto-remediation of databases due to the inability to accurately assess the health of existing Data services as a result of the cloud platform vendor’s network disruption. Customers received false-positive database health notifications due to the loss of network connectivity. Prior to the auto-remediation pause, a small subset of customer databases began autoremediation, but for most database customers no action was taken.
October 20th, 2025 16:20 UTC. Alerts were triggered for Heroku’s Credentials Service, which had degraded along with the Heroku Platform API. AppLink requests failed, and this event signaled a major degradation of the Heroku Platform API.
October 20th, 2025 16:30 UTC. Heroku Platform API degraded due to the partially failed-over Key Value Store. This caused failures of Heroku Builds, Dashboard, and CLI operations.
October 20th, 2025 17:05 UTC. Due to Heroku’s Platform API degradation, Release Phase, Continuous Integration, and deployments failed or stalled.
October 20th, 2025 18:16 UTC. Heroku’s Platform API experienced a near-complete disruption, resulting in the failure of most CLI- and Dashboard-initiated dyno and app management operations. Some customers reported unexpected Dashboard errors.
October 20th, 2025 19:11 UTC. Heroku’s Data team manually corrected the Heroku Platform API’s Key Value Store that was stalled in a partial failover state.
October 20th, 2025 19:18 UTC. The Heroku Platform API began to recover. CLI- and Dashboard-executed dyno and app management operations began to complete successfully.
October 20th, 2025 19:21 UTC. The Heroku metrics backend that drives App Metrics, Autoscaling, and Threshold Alerting was recovered, enabling the services to resume operations. Metrics gaps remained for the outage period in App Metrics.
October 20th, 2025 20:00 UTC. Heroku Webhooks was fully recovered after network connectivity was restored and the event backlog was cleared. Queued webhooks were processed and sent to customer endpoints.
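As background on this recovery step, webhook events that cannot be delivered are held in a queue and replayed once connectivity returns rather than being dropped. The sketch below is a minimal, hypothetical illustration of that drain pattern, not Heroku’s webhook implementation.

```python
from collections import deque


def drain_webhook_backlog(queue, deliver):
    """Replay queued webhook events once connectivity is restored.

    `queue` holds events accumulated while delivery was failing; `deliver`
    posts one event to the customer endpoint and returns True on success.
    Events that still fail stay queued for a later drain pass, so nothing
    is lost while the network remains unreliable.
    """
    retry_later = deque()
    while queue:
        event = queue.popleft()
        if not deliver(event):
            retry_later.append(event)  # keep it for the next pass
    queue.extend(retry_later)


# Example: a stub delivery function that always succeeds drains the backlog.
backlog = deque([{"id": 1, "event": "api:release"}, {"id": 2, "event": "dyno:crashed"}])
drain_webhook_backlog(backlog, deliver=lambda event: True)
assert not backlog  # backlog fully cleared
```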
October 20th, 2025 20:15 UTC. The technology team began a manual cleanup of the small subset of customer databases where the loss of network connectivity had triggered auto-remediation before it was disabled. The Heroku Platform API outage interrupted these database failovers and replacements, leaving some databases in a partially remediated state. Due to the emergency nature of the maintenance, some customers did not receive a notification of the database recovery operation, or received one at the last minute.
October 20th, 2025 20:25 UTC. The technology team began cycling control plane instances to clear residual issues resulting from database failover and replacement failures.
October 20th, 2025 23:05 UTC. To remediate partially completed database failovers and replacements and ensure services were operational, Data service auto-remediation was paused and then re-enabled.
October 21st, 2025 00:09 UTC. Heroku control plane instance cycling was completed. Dyno provisioning, scaling, and cycling were restored for Private Spaces and Common Runtime.
October 21st, 2025 03:33 UTC. The heightened monitoring period for recovered Heroku services was completed, and the incident was resolved.
Corrective Actions
Areas of Improvement
Our retrospective identified several key areas of improvement:
- The Heroku Platform API and Data control plane circuit breakers, which are intended to insulate the platform from large-scale infrastructure outages, did not trigger early enough to sufficiently contain the impact.
- Because the Heroku control plane is located in a single region, the same region that was affected, automated remediation of Data service failures could not proceed during the impact window and instead required manual intervention.
During our review, we found opportunities to improve how we communicated during this incident:
- Update cadence did not fully align with customer expectations.
- Communications did not always provide the level of detail needed to clearly outline customer impact and any recommended actions.
Planned Corrective Actions
To address the root cause findings, Heroku commits to the following corrective actions to reduce the likelihood of these issues recurring:
Short-term
Control Plane Circuit Breakers. To address the delay in triggering the Platform API and Data control plane circuit breakers, we will immediately begin work on improving the circuit breakers so they trigger in a timely and appropriately scoped manner, improving the platform’s fault tolerance.
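For illustration only, the following is a minimal sketch of the general circuit-breaker pattern this work targets: when calls to a downstream dependency (such as a control plane) fail too often within a rolling window, further calls fail fast instead of piling up behind the degraded service. The class, thresholds, and cool-down values here are hypothetical and are not Heroku’s implementation.

```python
import time


class CircuitBreaker:
    """Minimal, illustrative circuit breaker (not Heroku's implementation).

    When calls fail too often inside a rolling window, the breaker opens and
    callers fail fast instead of queueing behind a degraded dependency. After
    a cool-down period the breaker closes again and calls are retried;
    repeated failures will trip it once more.
    """

    def __init__(self, failure_threshold=5, window_seconds=30, cooldown_seconds=60):
        self.failure_threshold = failure_threshold  # hypothetical tuning values
        self.window_seconds = window_seconds
        self.cooldown_seconds = cooldown_seconds
        self.failures = []      # timestamps of recent failed calls
        self.opened_at = None   # None means the breaker is closed

    def call(self, fn, *args, **kwargs):
        now = time.monotonic()
        if self.opened_at is not None:
            if now - self.opened_at < self.cooldown_seconds:
                # Fail fast while open: surface an error immediately rather
                # than sending more traffic at the degraded dependency.
                raise RuntimeError("circuit open: downstream dependency unavailable")
            self.opened_at = None  # cool-down elapsed: close and try again

        try:
            result = fn(*args, **kwargs)
        except Exception:
            # Record the failure; trip the breaker if the rolling window now
            # holds too many failures.
            self.failures = [t for t in self.failures if now - t < self.window_seconds]
            self.failures.append(now)
            if len(self.failures) >= self.failure_threshold:
                self.opened_at = now
            raise
        else:
            self.failures.clear()  # a success resets the failure count
            return result
```

In this pattern, the failure threshold, window, and cool-down determine how quickly the breaker trips and how narrowly it scopes the blocked traffic, which is the tuning this corrective action addresses.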
Incident Communications Process Optimization. Improve incident communication consistency by reinforcing existing communications standards for cadence, validated impact, observed behavior, and recommended actions.
Longer-term
Control Plane Resiliency for Data Services. To address the single-region Data services control plane vulnerability, we will investigate distributing the observability tooling that the control planes rely on across multiple regions to mitigate false-positive signals in the case of a regional outage. We will also investigate longer-term remediation of our over-reliance on a single region of our infrastructure provider.
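As a rough illustration of the direction described above (and not Heroku’s design), the sketch below shows why observers in multiple regions reduce false positives: automated failover is only allowed when independent regional health probes agree that a database is unhealthy, so a network partition near a single set of observers does not by itself trigger remediation. The probe inputs and quorum size are assumptions for this example.

```python
def should_auto_remediate(probe_results, quorum=2):
    """Only trigger automated failover/replacement when enough independent
    regional health probes report the database as unhealthy.

    probe_results: mapping of observer region -> True (healthy), False
    (unhealthy), or None (observer could not reach the database, e.g. due to
    a network partition in that region).
    """
    unhealthy_votes = sum(1 for status in probe_results.values() if status is False)
    unreachable = sum(1 for status in probe_results.values() if status is None)

    # If most observers simply cannot reach the database, treat the signal as
    # inconclusive (likely a network problem near the observers, not the
    # database itself) and defer to manual review instead of automation.
    if unreachable > len(probe_results) // 2:
        return False

    return unhealthy_votes >= quorum


# Example: us-east observers cannot connect (network outage), but probes from
# two other regions still see the database as healthy -> no automated failover.
print(should_auto_remediate({"us-east": None, "us-west": True, "eu-west": True}))    # False
print(should_auto_remediate({"us-east": False, "us-west": False, "eu-west": False}))  # True
```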
Control Plane Resiliency for Other Platform Services. We will be exploring ways to add additional resiliency for strategic platform components, including the Platform API.
Our commitment to you
We sincerely apologize for the impact this incident caused you and your business. Our goal is to provide world-class service to our customers, and we are continuously evaluating and improving our tools, processes, and architecture to provide our customers with the best service possible.
If you have additional questions or require further support, please open a case via the Heroku Help portal.