Stop database issues before they become customer-facing problems
The Problem: Silent Database Failures During Peak Traffic
Picture this: It's Black Friday, and your e-commerce platform is handling 10x the normal traffic. Everything seems fine: your application servers are scaling beautifully, your CDN is caching like a champ, and your monitoring dashboards show green across the board.
Then, suddenly, customer complaints start flooding in. Orders are timing out. Checkout pages are loading slowly. Your database is silently struggling, but you don't know it yet.
The culprit? RDS performance degradation that wasn't being caught by traditional monitoring.
Why Traditional Monitoring Falls Short
AWS RDS generates events for critical issues, but these events often go unnoticed:
- RDS-EVENT-0189: Burst balance exhaustion (IOPS performance degradation)
- RDS-EVENT-0225: Storage threshold at 80% (approaching auto-scaling limits)
These events are published to EventBridge, but without proper routing, they're just... sitting there. By the time you notice performance degradation in your application metrics, it's already impacting users.
During high-traffic events like Black Friday Cyber Monday (BFCM), this delay can mean the difference between a successful sale and a lost customer.
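For illustration, here is roughly the shape of the EventBridge payload for a burst-balance event. It is abbreviated and the values are made up, but the field names match what the Lambda function later in this post consumes:
{
  "source": "aws.rds",
  "detail-type": "RDS DB Instance Event",
  "time": "2024-11-29T12:00:00Z",
  "region": "us-east-1",
  "account": "123456789012",
  "detail": {
    "EventID": "RDS-EVENT-0189",
    "SourceIdentifier": "production-db-instance",
    "Message": "The database instance has exhausted its burst balance (illustrative text)"
  }
}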
The Solution: EventBridge + Lambda + Slack = Proactive Alerts
We built a lightweight, serverless monitoring system that:
✅ Catches issues immediately - EventBridge triggers Lambda within seconds of RDS events
✅ Provides actionable context - Rich Slack notifications with step-by-step resolution guides
✅ Scales automatically - No infrastructure to manage, pay-per-use pricing
✅ Integrates seamlessly - Works with existing Terraform infrastructure
Architecture Overview
┌──────────────┐    [Event]    ┌──────────────┐
│   AWS RDS    │──────────────▶│ EventBridge  │
│   Instance   │               └──────┬───────┘
└──────────────┘                      │ Triggers
                                      ▼
                             ┌──────────────────┐
                             │ Lambda Function  │
                             │  (Python 3.12)   │
                             └────────┬─────────┘
                                      │ Formats & Sends
                                      ▼
                             ┌──────────────────┐
                             │  Slack Channel   │
                             │  (Rich Alerts)   │
                             └──────────────────┘
Key Components:
- EventBridge Rule: Filters for RDS-EVENT-0189 and RDS-EVENT-0225
- Lambda Function: Processes events and formats Slack messages
- SSM Parameter Store: Securely stores Slack webhook URL
- CloudWatch Logs: Captures all Lambda execution logs
Implementation: Infrastructure as Code
Let's build this step by step. We'll use Terraform for infrastructure and Python for the Lambda function.
Step 1: Create the Terraform Module
First, let's create a reusable Terraform module:
# terraform/modules/rds-important-events-monitor/main.tf
resource "aws_iam_role" "rds_important_events_monitor" {
count = var.create_iam_role ? 1 : 0
name = "${var.function_name}-role"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Action = "sts:AssumeRole"
Effect = "Allow"
Principal = {
Service = "lambda.amazonaws.com"
}
}
]
})
}
resource "aws_iam_role_policy" "lambda_logs" {
count = var.create_iam_role ? 1 : 0
name = "${var.function_name}-logs-policy"
role = aws_iam_role.rds_important_events_monitor[0].id
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Effect = "Allow"
Action = [
"logs:CreateLogGroup",
"logs:CreateLogStream",
"logs:PutLogEvents"
]
Resource = "arn:aws:logs:*:${data.aws_caller_identity.current.account_id}:*"
}
]
})
}
resource "aws_iam_role_policy" "ssm_read" {
count = var.create_iam_role ? 1 : 0
name = "${var.function_name}-ssm-policy"
role = aws_iam_role.rds_important_events_monitor[0].id
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Effect = "Allow"
Action = [
"ssm:GetParameter",
"ssm:GetParameters"
]
Resource = "arn:aws:ssm:${var.aws_region}:${data.aws_caller_identity.current.account_id}:parameter${var.slack_webhook_url_parameter}"
}
]
})
}
resource "aws_lambda_function" "rds_important_events_monitor" {
depends_on = [
aws_cloudwatch_log_group.lambda_logs
]
function_name = var.function_name
description = "Monitor RDS issues and send Slack notifications"
role = var.create_iam_role ? aws_iam_role.rds_important_events_monitor[0].arn : var.lambda_role_arn
handler = var.handler
runtime = var.runtime
timeout = var.timeout
memory_size = var.memory_size
filename = "${var.source_path}/../lambda_function.zip"
source_code_hash = filebase64sha256("${var.source_path}/../lambda_function.zip")
environment {
variables = {
SLACK_WEBHOOK_URL = data.aws_ssm_parameter.slack_webhook_url.value
}
}
tags = {
Name = var.function_name
Description = "Monitors RDS events and sends Slack notifications"
}
}
resource "aws_cloudwatch_log_group" "lambda_logs" {
name = "/aws/lambda/${var.function_name}"
retention_in_days = var.cloudwatch_logs_retention_in_days
}
# EventBridge Rule to capture RDS events
resource "aws_cloudwatch_event_rule" "rds_iops_events" {
name = "${var.function_name}-events"
description = "Capture RDS events RDS-EVENT-0189 and RDS-EVENT-0225 for IOPS-related/StorageSize issues"
event_pattern = jsonencode({
source = ["aws.rds"]
detail-type = ["RDS DB Instance Event"]
detail = {
EventID = [
"RDS-EVENT-0189",
"RDS-EVENT-0225"
]
}
})
tags = {
Name = "${var.function_name}-events"
}
}
# Connect EventBridge to Lambda
resource "aws_cloudwatch_event_target" "lambda" {
target_id = "RDSIOPSMonitorLambdaTarget"
rule = aws_cloudwatch_event_rule.rds_iops_events.name
arn = aws_lambda_function.rds_important_events_monitor.arn
}
# Allow EventBridge to invoke Lambda
resource "aws_lambda_permission" "allow_eventbridge" {
statement_id = "AllowExecutionFromEventBridge"
action = "lambda:InvokeFunction"
function_name = aws_lambda_function.rds_important_events_monitor.function_name
principal = "events.amazonaws.com"
source_arn = aws_cloudwatch_event_rule.rds_iops_events.arn
}
# Data source to fetch Slack webhook from SSM
data "aws_ssm_parameter" "slack_webhook_url" {
name = var.slack_webhook_url_parameter
}
data "aws_caller_identity" "current" {}
Step 2: Module Variables
# terraform/modules/rds-important-events-monitor/vars.tf
variable "function_name" {
description = "Name of the Lambda function"
type = string
default = "rds-important-events-monitor"
}
variable "runtime" {
description = "Lambda runtime"
type = string
default = "python3.12"
}
variable "source_path" {
description = "Path to the Lambda source code directory"
type = string
}
variable "handler" {
description = "Lambda handler"
type = string
default = "app.lambda_handler"
}
variable "timeout" {
description = "Lambda timeout in seconds"
type = number
default = 30
}
variable "memory_size" {
description = "Lambda memory size in MB"
type = number
default = 256
}
variable "slack_webhook_url_parameter" {
description = "SSM Parameter Store path for Slack webhook URL"
type = string
default = "/us-east-1/rds-monitor/slack-webhook-url"
}
variable "aws_region" {
description = "AWS region"
type = string
default = "us-east-1"
}
variable "cloudwatch_logs_retention_in_days" {
description = "CloudWatch Logs retention in days"
type = number
default = 7
}
variable "create_iam_role" {
description = "Whether to create an IAM role for the Lambda function"
type = bool
default = true
}
variable "lambda_role_arn" {
description = "ARN of an existing IAM role to use for the Lambda function"
type = string
default = null
}
Step 3: Use the Module
# terraform/us-east-1/rds-important-events-monitor.tf
module "rds_important_events_monitor" {
source = "../modules/rds-important-events-monitor"
source_path = "../../rds-important-events-monitor/src"
create_iam_role = false
lambda_role_arn = "arn:aws:iam::YOUR_ACCOUNT_ID:role/rds-important-events-monitor"
}
The Lambda Function: Intelligent Alert Formatting
The Lambda function does more than just forward events; it provides actionable, context-rich notifications that help your team respond quickly.
Complete Lambda Code
# rds-important-events-monitor/src/app.py
import json
import os
import logging
import requests
logger = logging.getLogger()
logger.setLevel(logging.INFO)
aws_region = os.environ.get('AWS_REGION', 'us-east-1')
def lambda_handler(event, context):
logger.info(f"Received event: {json.dumps(event)}")
try:
detail = event.get('detail', {})
source_identifier = detail.get('SourceIdentifier', '')
event_message = detail.get('Message', '')
event_id = detail.get('EventID', '')
timestamp = event.get('time', '')
region = event.get('region', aws_region)
account_id = event.get('account', '')
expected_events = ['RDS-EVENT-0189', 'RDS-EVENT-0225']
if event_id not in expected_events:
logger.warning(f"Received unexpected event ID: {event_id}, expected one of {expected_events}")
return {
'statusCode': 200,
'body': json.dumps(f'Unexpected event ID: {event_id}')
}
db_name = source_identifier
if not db_name:
logger.warning("Could not extract DB name from event")
return {
'statusCode': 200,
'body': json.dumps('Event processed but no DB name found')
}
logger.info(f"Processing {event_id} for database: {db_name}")
message = format_rds_event_slack_message(
db_name=db_name,
event_message=event_message,
event_id=event_id,
timestamp=timestamp,
region=region,
account_id=account_id
)
success = post_to_slack(message)
if success:
logger.info(f"Successfully sent Slack notification for {db_name}")
else:
logger.error(f"Failed to send Slack notification for {db_name}")
return {
'statusCode': 200 if success else 500,
'body': json.dumps({
'message': 'Notification sent' if success else 'Notification failed',
'db_name': db_name,
'event_message': event_message
})
}
except Exception as e:
logger.error(f"Error processing event: {str(e)}", exc_info=True)
return {
'statusCode': 500,
'body': json.dumps(f'Error: {str(e)}')
}
def format_rds_event_slack_message(db_name, event_message, event_id, timestamp, region, account_id):
"""Format RDS event into a rich Slack message with actionable guidance"""
    if event_id == 'RDS-EVENT-0189':
        title = f"🚨 RDS Burst Balance Alert (0189): {db_name}"
        description = (
            "*📊 What This Means:*\n"
            "The RDS instance has exhausted its burst balance for General Purpose (SSD) storage. "
            "Burst balance is a performance credit system that allows temporary spikes in I/O performance. "
            "When depleted, the instance will be limited to baseline IOPS, which can cause significant "
            "performance degradation and application slowdowns.\n\n"
            "*⚠️ Impact:*\n"
            "• Database queries may become significantly slower\n"
            "• Application response times will increase\n"
            "• User experience will be negatively affected\n"
            "• If not addressed, the database may become unresponsive during peak loads\n\n"
            "*🔧 Resolution Steps:*\n"
            "1. *Immediate Action:* Navigate to the RDS Console and review the instance's current storage configuration\n"
            "2. *Check Current Setup:*\n"
            "   • Review current storage type (likely gp2/gp3)\n"
            "   • Check current IOPS configuration\n"
            "   • Review CloudWatch metrics for I/O patterns\n"
            "3. *Choose Resolution Path:*\n"
            "   • *Option A - Upgrade to Provisioned IOPS (io1/io2):* Best for consistent high-performance workloads\n"
            "     - Go to Modify → Storage → Change storage type to io1 or io2\n"
            "     - Set appropriate Provisioned IOPS (typically 3x-5x your current baseline)\n"
            "     - Apply changes during maintenance window or immediately if critical\n"
            "   • *Option B - Upgrade to gp3 with higher baseline:* Good balance of cost and performance\n"
            "     - Modify storage type to gp3\n"
            "     - Increase baseline IOPS (default 3000, can go up to 16000)\n"
            "     - Increase throughput if needed (up to 1000 MiB/s)\n"
            "4. *Monitor:* After changes, monitor CloudWatch metrics to ensure performance improves\n"
        )
    elif event_id == 'RDS-EVENT-0225':
        title = f"🚨 RDS Storage Threshold Alert (0225): {db_name}"
        description = (
            "*📊 What This Means:*\n"
            "The RDS instance storage has reached 80% of its maximum allocated storage capacity. "
            "While RDS supports storage auto-scaling, this threshold indicates that auto-scaling is "
            "approaching its configured maximum limit. If storage continues to grow, you may hit the "
            "maximum storage limit, which could cause the database to become read-only or unavailable.\n\n"
            "*⚠️ Impact:*\n"
            "• Database may become read-only if storage reaches 100%\n"
            "• New writes will fail, causing application errors\n"
            "• Auto-scaling may not be able to expand further if max storage is reached\n"
            "• Potential data loss risk if not addressed promptly\n\n"
            "*🔧 Resolution Steps:*\n"
            "1. *Immediate Assessment:*\n"
            "   • Check current storage usage in RDS Console\n"
            "   • Review storage auto-scaling configuration\n"
            "   • Identify what's consuming storage space\n"
            "2. *Investigate Storage Usage:*\n"
            "   • Connect to the database and check table sizes\n"
            "   • Look for large tables, indexes, or temporary files\n"
            "   • Check for unoptimized queries creating large temp tables\n"
            "   • Review binary logs, error logs, and slow query logs size\n"
            "3. *Immediate Actions (if needed):*\n"
            "   • *Increase Max Storage:* Go to Modify → Storage → Increase Max Storage\n"
            "     - Increase by at least 20-30% to provide buffer\n"
            "     - Consider future growth projections\n"
            "   • *Clean Up Space (if applicable):*\n"
            "     - Archive or delete old data\n"
            "     - Optimize tables (ANALYZE/OPTIMIZE)\n"
            "     - Remove unnecessary indexes\n"
            "     - Clean up old binary logs (if safe to do so)\n"
            "4. *Review Auto-Scaling Settings:*\n"
            "   • Ensure auto-scaling is enabled\n"
            "   • Verify max storage limit is appropriate for your needs\n"
            "   • Consider increasing max storage proactively\n"
        )
    else:
        # Fallback for any other events
        title = f"🚨 RDS Event Alert ({event_id}): {db_name}"
        description = (
            f"*Event Details:*\n"
            f"{event_message}\n\n"
            "*⚠️ Action Required:* Please review this event and take appropriate action."
        )
text_parts = [
f"*Event ID:* `{event_id}`",
f"*Account ID:* `{account_id}`",
f"*RDS Name:* `{db_name}`",
f"*Region:* `{region}`",
f"*Issue:* {event_message}",
"",
description
]
if timestamp:
text_parts.insert(-1, f"*Time:* {timestamp}")
rds_console_link = f"https://{region}.console.aws.amazon.com/rds/home?region={region}#database:id={db_name};is-cluster=false"
cloudwatch_link = f"https://{region}.console.aws.amazon.com/cloudwatch/home?region={region}#metricsV2:graph=~();query=AWS/RDS%20{db_name}"
return {
'attachments': [{
'fallback': title,
'color': 'danger',
'title': title,
'text': '\n'.join(text_parts),
'fields': [
{
'title': 'Quick Actions',
'value': f"<{rds_console_link}|View in RDS Console> | <{cloudwatch_link}|View CloudWatch Metrics>",
'short': False
}
],
'footer': 'RDS Important Events Monitor'
}]
}
def post_to_slack(message):
"""Send message to Slack via webhook"""
webhook_url = os.environ.get('SLACK_WEBHOOK_URL')
if not webhook_url:
logger.error("SLACK_WEBHOOK_URL environment variable not set")
return False
try:
response = requests.post(
webhook_url,
data=json.dumps(message),
headers={'Content-Type': 'application/json'},
timeout=10
)
if response.status_code == 200:
logger.info("Successfully posted to Slack")
return True
else:
logger.error(f"Failed to post to Slack: HTTP {response.status_code} - {response.text}")
return False
except Exception as e:
logger.error(f"Exception posting to Slack: {str(e)}")
return False
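Before deploying, you can exercise the handler locally against a synthetic event. Here is a minimal smoke-test sketch; the file name, event values, and the assumption that SLACK_WEBHOOK_URL points at a test-channel webhook are all illustrative, not part of the original module:
# local_smoke_test.py (hypothetical helper, run from rds-important-events-monitor/src/)
import json
from app import lambda_handler  # assumes app.py is on the import path

sample_event = {
    "source": "aws.rds",
    "detail-type": "RDS DB Instance Event",
    "time": "2024-11-29T12:00:00Z",
    "region": "us-east-1",
    "account": "123456789012",
    "detail": {
        "EventID": "RDS-EVENT-0189",
        "SourceIdentifier": "test-db",
        "Message": "Test burst-balance event (synthetic)",
    },
}

# Context is unused by the handler, so None is fine for a local run
result = lambda_handler(sample_event, context=None)
print(json.dumps(result, indent=2))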
Dependencies
# rds-important-events-monitor/src/requirements.txt
requests==2.31.0
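The Terraform module expects a pre-built lambda_function.zip one directory above src/, and requests is not bundled in the Python 3.12 runtime, so it has to be vendored into the archive. How you package is up to you; here is one hedged sketch (a hypothetical build_lambda.py run from the rds-important-events-monitor/ directory):
# build_lambda.py (hypothetical packaging helper)
import shutil
import subprocess
import sys
from pathlib import Path

src = Path("src")
build = Path("build")
if build.exists():
    shutil.rmtree(build)          # start from a clean build directory
shutil.copytree(src, build)       # copy app.py and requirements.txt

# Vendor third-party dependencies (requests) into the package directory
subprocess.run(
    [sys.executable, "-m", "pip", "install", "-r", str(src / "requirements.txt"), "-t", str(build)],
    check=True,
)

# Produces lambda_function.zip in the current directory, where the module expects it
shutil.make_archive("lambda_function", "zip", root_dir=build)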
Setting Up Slack Integration
Create a Slack Incoming Webhook:
- Go to your Slack workspace settings
- Navigate to Apps β Incoming Webhooks
- Create a new webhook for your monitoring channel
- Copy the webhook URL
Store the Webhook in SSM Parameter Store:
aws ssm put-parameter \
--name "/us-east-1/rds-monitor/slack-webhook-url" \
--value "https://hooks.slack.com/services/YOUR/WEBHOOK/URL" \
--type "SecureString" \
--region us-east-1
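If you want to sanity-check the parameter before deploying, a quick read-back works; this sketch assumes your local credentials can decrypt the SecureString and prints only a prefix of the secret:
# verify_webhook_param.py (optional sanity check)
import boto3

ssm = boto3.client("ssm", region_name="us-east-1")
param = ssm.get_parameter(
    Name="/us-east-1/rds-monitor/slack-webhook-url",
    WithDecryption=True,  # required to read the SecureString value
)
print(param["Parameter"]["Value"][:35] + "...")  # never log the full webhook URL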
Deploy with Terraform:
cd terraform/us-east-1
terraform init
terraform plan
terraform apply
What the Slack Alert Looks Like
When an RDS event occurs, your team receives a rich Slack message like this:
🚨 RDS Burst Balance Alert (0189): production-db-instance
Event ID: RDS-EVENT-0189
Account ID: 123456789012
RDS Name: production-db-instance
Region: us-east-1
Issue: Your database instance has exhausted its burst balance...
📊 What This Means:
The RDS instance has exhausted its burst balance for General Purpose (SSD) storage...
⚠️ Impact:
• Database queries may become significantly slower
• Application response times will increase
...
🔧 Resolution Steps:
1. Immediate Action: Navigate to the RDS Console...
2. Check Current Setup: ...
3. Choose Resolution Path: ...
Quick Actions: [View in RDS Console] | [View CloudWatch Metrics]
Key Features:
- 🚨 Color-coded (red for danger)
- 📊 Context about what the event means
- ⚠️ Clear impact assessment
- 🔧 Step-by-step resolution guide
- 🔗 Direct links to AWS Console and CloudWatch
Benefits: Why This Matters for BFCM
1. Proactive Problem Detection
Instead of waiting for customer complaints or application metrics to degrade, you're notified immediately when RDS issues occur. During BFCM, this can save hours of troubleshooting.
2. Actionable Alerts
The alerts don't just say "something's wrong"; they provide:
- Context about whatβs happening
- Impact assessment
- Step-by-step resolution instructions
- Direct links to relevant AWS resources
3. Team Collaboration
Slack notifications ensure the right people see the issue immediately. No more checking CloudWatch dashboards or waiting for automated reports.
4. Cost-Effective
- Lambda: Pay only for invocations (typically $0.20 per million requests)
- EventBridge: First 5 million events/month are free
- Total monthly cost: Usually under $1 for most workloads
5. Scalable Architecture
The system automatically handles:
- Multiple RDS instances
- High event volumes during peak traffic
- Regional deployments
6. Infrastructure as Code
Everything is version-controlled and reproducible:
- Easy to deploy to new environments
- Simple to modify and extend
- No manual configuration drift
Real-World Impact
Before: During a previous BFCM event, we discovered RDS performance issues 2 hours after they started, leading to:
- 3-hour incident response time
After: With this monitoring system:
- Issues detected within 30 seconds
- Team notified immediately via Slack
- Resolution time reduced to 15 minutes
- Zero customer-facing impact
Extending the Solution
This pattern can be extended to monitor other critical AWS events:
# Example: Add ECS service deployment failures
resource "aws_cloudwatch_event_rule" "ecs_deployment_failures" {
name = "ecs-deployment-failures"
event_pattern = jsonencode({
source = ["aws.ecs"]
detail-type = ["ECS Deployment State Change"]
detail = {
eventName = ["SERVICE_DEPLOYMENT_FAILED"]
}
})
}
Or add more RDS events:
detail = {
EventID = [
"RDS-EVENT-0189", # Burst balance
"RDS-EVENT-0225", # Storage threshold
"RDS-EVENT-0169", # DB instance restart
"RDS-EVENT-0151" # DB instance availability
]
}
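Whether you add more RDS event IDs or entirely new sources, routing several event types through one Lambda usually means a small dispatch layer in front of the formatters. A hedged sketch, not part of the module above (route_event and the ECS branch are hypothetical additions):
# Hypothetical routing layer if one Lambda handles several event sources
def route_event(event):
    source = event.get("source", "")
    detail = event.get("detail", {})
    if source == "aws.rds":
        # Hand off to the RDS formatter shown earlier in this post
        return {"text": f"RDS event {detail.get('EventID')} on {detail.get('SourceIdentifier')}"}
    if source == "aws.ecs":
        # A formatter you would write for ECS deployment state changes
        return {"text": f"ECS deployment event: {detail.get('eventName')}"}
    return {"text": f"Unhandled event from source: {source}"}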
Best Practices
- Test Your Alerts: Use AWS CLI to simulate events:
aws events put-events --entries '[
{
"Source": "aws.rds",
"DetailType": "RDS DB Instance Event",
"Detail": "{\"EventID\":\"RDS-EVENT-0189\",\"SourceIdentifier\":\"test-db\",\"Message\":\"Test event\"}"
}
]'
- Monitor the Monitor: Set up CloudWatch alarms for Lambda errors (see the sketch after this list)
- Rotate Secrets: Regularly rotate your Slack webhook URL
- Review Logs: Periodically review CloudWatch logs for patterns
- Update Documentation: Keep resolution steps current as your infrastructure evolves
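For the "monitor the monitor" item, here is a hedged sketch of an alarm on the function's own Errors metric; the alarm name matches the default function name from the module, and the SNS topic ARN is a placeholder (you could equally define this in Terraform):
# Hypothetical: alarm when the monitoring Lambda itself throws errors
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
cloudwatch.put_metric_alarm(
    AlarmName="rds-important-events-monitor-errors",
    Namespace="AWS/Lambda",
    MetricName="Errors",
    Dimensions=[{"Name": "FunctionName", "Value": "rds-important-events-monitor"}],
    Statistic="Sum",
    Period=300,                    # evaluate in 5-minute windows
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # placeholder SNS topic
)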
Conclusion
Proactive monitoring isn't just about catching problems; it's about preventing them from impacting your customers. During critical events like BFCM, every second counts.
This EventBridge + Lambda + Slack solution gives you:
- ✅ Immediate notification of RDS issues
- ✅ Actionable, context-rich alerts
- ✅ Zero infrastructure management
- ✅ Cost-effective scaling
- ✅ Infrastructure as Code best practices
The result? A monitoring system that works silently in the background, alerting your team only when action is needed, with all the context they need to resolve issues quickly.
Resources
- AWS EventBridge Documentation
- RDS Event Categories and Event Messages
- Slack Incoming Webhooks
- Terraform AWS Provider
Have you built similar monitoring solutions? Share your experiences in the comments below! 🚀