Stop database issues before they become customer-facing problems
The Problem: Silent Database Failures During Peak Traffic
Picture this: It's Black Friday, and your e-commerce platform is handling 10x the normal traffic. Everything seems fine: your application servers are scaling beautifully, your CDN is caching like a champ, and your monitoring dashboards show green across the board.
Then, suddenly, customer complaints start flooding in. Orders are timing out. Checkout pages are loading slowly. Your database is silently struggling, but you don't know it yet.
The culprit? RDS performance degradation that wasn't being caught by traditional monitoring.
Why Traditional Monitoring Falls Short
AWS RDS generates events for critical issues, but these events often go unnoticed:
- RDS-EVENT-0189: Burst balance exhaustion (IOPS performance degradation)
- RDS-EVENT-0225: Storage threshold at 80% (approaching auto-scaling limits)
These events are published to EventBridge, but without proper routing, they're just... sitting there. By the time you notice performance degradation in your application metrics, it's already impacting users.
During high-traffic events like Black Friday Cyber Monday (BFCM), this delay can mean the difference between a successful sale and a lost customer.
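For illustration, here is roughly the shape of the EventBridge payload for a burst-balance event. It is abbreviated and the values are made up, but the field names match what the Lambda function later in this post consumes:
{
  "source": "aws.rds",
  "detail-type": "RDS DB Instance Event",
  "time": "2024-11-29T12:00:00Z",
  "region": "us-east-1",
  "account": "123456789012",
  "detail": {
    "EventID": "RDS-EVENT-0189",
    "SourceIdentifier": "production-db-instance",
    "Message": "The database instance has exhausted its burst balance (illustrative text)"
  }
}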
The Solution: EventBridge + Lambda + Slack = Proactive Alerts
We built a lightweight, serverless monitoring system that:
✅ Catches issues immediately - EventBridge triggers Lambda within seconds of RDS events
✅ Provides actionable context - Rich Slack notifications with step-by-step resolution guides
✅ Scales automatically - No infrastructure to manage, pay-per-use pricing
✅ Integrates seamlessly - Works with existing Terraform infrastructure
Architecture Overview
┌──────────────┐    [Event]    ┌──────────────┐
│   AWS RDS    │──────────────▶│ EventBridge  │
│   Instance   │               └──────┬───────┘
└──────────────┘                      │ Triggers
                                      ▼
                             ┌──────────────────┐
                             │ Lambda Function  │
                             │  (Python 3.12)   │
                             └────────┬─────────┘
                                      │ Formats & Sends
                                      ▼
                             ┌──────────────────┐
                             │  Slack Channel   │
                             │  (Rich Alerts)   │
                             └──────────────────┘
Key Components:
- EventBridge Rule: Filters for RDS-EVENT-0189 and RDS-EVENT-0225
- Lambda Function: Processes events and formats Slack messages
- SSM Parameter Store: Securely stores Slack webhook URL
- CloudWatch Logs: Captures all Lambda execution logs
Implementation: Infrastructure as Code
Let's build this step by step. We'll use Terraform for infrastructure and Python for the Lambda function.
Step 1: Create the Terraform Module
First, let's create a reusable Terraform module:
# terraform/modules/rds-important-events-monitor/main.tf
resource "aws_iam_role" "rds_important_events_monitor" {
count = var.create_iam_role ? 1 : 0
name = "${var.function_name}-role"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Action = "sts:AssumeRole"
Effect = "Allow"
Principal = {
Service = "lambda.amazonaws.com"
}
}
]
})
}
resource "aws_iam_role_policy" "lambda_logs" {
count = var.create_iam_role ? 1 : 0
name = "${var.function_name}-logs-policy"
role = aws_iam_role.rds_important_events_monitor[0].id
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Effect = "Allow"
Action = [
"logs:CreateLogGroup",
"logs:CreateLogStream",
"logs:PutLogEvents"
]
Resource = "arn:aws:logs:*:${data.aws_caller_identity.current.account_id}:*"
}
]
})
}
resource "aws_iam_role_policy" "ssm_read" {
count = var.create_iam_role ? 1 : 0
name = "${var.function_name}-ssm-policy"
role = aws_iam_role.rds_important_events_monitor[0].id
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Effect = "Allow"
Action = [
"ssm:GetParameter",
"ssm:GetParameters"
]
Resource = "arn:aws:ssm:${var.aws_region}:${data.aws_caller_identity.current.account_id}:parameter${var.slack_webhook_url_parameter}"
}
]
})
}
resource "aws_lambda_function" "rds_important_events_monitor" {
depends_on = [
aws_cloudwatch_log_group.lambda_logs
]
function_name = var.function_name
description = "Monitor RDS issues and send Slack notifications"
role = var.create_iam_role ? aws_iam_role.rds_important_events_monitor[0].arn : var.lambda_role_arn
handler = var.handler
runtime = var.runtime
timeout = var.timeout
memory_size = var.memory_size
filename = "${var.source_path}/../lambda_function.zip"
source_code_hash = filebase64sha256("${var.source_path}/../lambda_function.zip")
environment {
variables = {
SLACK_WEBHOOK_URL = data.aws_ssm_parameter.slack_webhook_url.value
}
}
tags = {
Name = var.function_name
Description = "Monitors RDS events and sends Slack notifications"
}
}
resource "aws_cloudwatch_log_group" "lambda_logs" {
name = "/aws/lambda/${var.function_name}"
retention_in_days = var.cloudwatch_logs_retention_in_days
}
# EventBridge Rule to capture RDS events
resource "aws_cloudwatch_event_rule" "rds_iops_events" {
name = "${var.function_name}-events"
description = "Capture RDS events RDS-EVENT-0189 and RDS-EVENT-0225 for IOPS-related/StorageSize issues"
event_pattern = jsonencode({
source = ["aws.rds"]
detail-type = ["RDS DB Instance Event"]
detail = {
EventID = [
"RDS-EVENT-0189",
"RDS-EVENT-0225"
]
}
})
tags = {
Name = "${var.function_name}-events"
}
}
# Connect EventBridge to Lambda
resource "aws_cloudwatch_event_target" "lambda" {
target_id = "RDSIOPSMonitorLambdaTarget"
rule = aws_cloudwatch_event_rule.rds_iops_events.name
arn = aws_lambda_function.rds_important_events_monitor.arn
}
# Allow EventBridge to invoke Lambda
resource "aws_lambda_permission" "allow_eventbridge" {
statement_id = "AllowExecutionFromEventBridge"
action = "lambda:InvokeFunction"
function_name = aws_lambda_function.rds_important_events_monitor.function_name
principal = "events.amazonaws.com"
source_arn = aws_cloudwatch_event_rule.rds_iops_events.arn
}
# Data source to fetch Slack webhook from SSM
data "aws_ssm_parameter" "slack_webhook_url" {
name = var.slack_webhook_url_parameter
}
data "aws_caller_identity" "current" {}
Step 2: Module Variables
# terraform/modules/rds-important-events-monitor/vars.tf
variable "function_name" {
description = "Name of the Lambda function"
type = string
default = "rds-important-events-monitor"
}
variable "runtime" {
description = "Lambda runtime"
type = string
default = "python3.12"
}
variable "source_path" {
description = "Path to the Lambda source code directory"
type = string
}
variable "handler" {
description = "Lambda handler"
type = string
default = "app.lambda_handler"
}
variable "timeout" {
description = "Lambda timeout in seconds"
type = number
default = 30
}
variable "memory_size" {
description = "Lambda memory size in MB"
type = number
default = 256
}
variable "slack_webhook_url_parameter" {
description = "SSM Parameter Store path for Slack webhook URL"
type = string
default = "/us-east-1/rds-monitor/slack-webhook-url"
}
variable "aws_region" {
description = "AWS region"
type = string
default = "us-east-1"
}
variable "cloudwatch_logs_retention_in_days" {
description = "CloudWatch Logs retention in days"
type = number
default = 7
}
variable "create_iam_role" {
description = "Whether to create an IAM role for the Lambda function"
type = bool
default = true
}
variable "lambda_role_arn" {
description = "ARN of an existing IAM role to use for the Lambda function"
type = string
default = null
}
Step 3: Use the Module
# terraform/us-east-1/rds-important-events-monitor.tf
module "rds_important_events_monitor" {
source = "../modules/rds-important-events-monitor"
source_path = "../../rds-important-events-monitor/src"
create_iam_role = false
lambda_role_arn = "arn:aws:iam::YOUR_ACCOUNT_ID:role/rds-important-events-monitor"
}
The Lambda Function: Intelligent Alert Formatting
The Lambda function does more than just forward events; it provides actionable, context-rich notifications that help your team respond quickly.
Complete Lambda Code
# rds-important-events-monitor/src/app.py
import json
import os
import logging
import requests
logger = logging.getLogger()
logger.setLevel(logging.INFO)
aws_region = os.environ.get('AWS_REGION', 'us-east-1')
def lambda_handler(event, context):
logger.info(f"Received event: {json.dumps(event)}")
try:
detail = event.get('detail', {})
source_identifier = detail.get('SourceIdentifier', '')
event_message = detail.get('Message', '')
event_id = detail.get('EventID', '')
timestamp = event.get('time', '')
region = event.get('region', aws_region)
account_id = event.get('account', '')
expected_events = ['RDS-EVENT-0189', 'RDS-EVENT-0225']
if event_id not in expected_events:
logger.warning(f"Received unexpected event ID: {event_id}, expected one of {expected_events}")
return {
'statusCode': 200,
'body': json.dumps(f'Unexpected event ID: {event_id}')
}
db_name = source_identifier
if not db_name:
logger.warning("Could not extract DB name from event")
return {
'statusCode': 200,
'body': json.dumps('Event processed but no DB name found')
}
logger.info(f"Processing {event_id} for database: {db_name}")
message = format_rds_event_slack_message(
db_name=db_name,
event_message=event_message,
event_id=event_id,
timestamp=timestamp,
region=region,
account_id=account_id
)
success = post_to_slack(message)
if success:
logger.info(f"Successfully sent Slack notification for {db_name}")
else:
logger.error(f"Failed to send Slack notification for {db_name}")
return {
'statusCode': 200 if success else 500,
'body': json.dumps({
'message': 'Notification sent' if success else 'Notification failed',
'db_name': db_name,
'event_message': event_message
})
}
except Exception as e:
logger.error(f"Error processing event: {str(e)}", exc_info=True)
return {
'statusCode': 500,
'body': json.dumps(f'Error: {str(e)}')
}
def format_rds_event_slack_message(db_name, event_message, event_id, timestamp, region, account_id):
"""Format RDS event into a rich Slack message with actionable guidance"""
    if event_id == 'RDS-EVENT-0189':
        title = f"🚨 RDS Burst Balance Alert (0189): {db_name}"
        description = (
            "*📊 What This Means:*\n"
            "The RDS instance has exhausted its burst balance for General Purpose (SSD) storage. "
            "Burst balance is a performance credit system that allows temporary spikes in I/O performance. "
            "When depleted, the instance will be limited to baseline IOPS, which can cause significant "
            "performance degradation and application slowdowns.\n\n"
            "*⚠️ Impact:*\n"
            "• Database queries may become significantly slower\n"
            "• Application response times will increase\n"
            "• User experience will be negatively affected\n"
            "• If not addressed, the database may become unresponsive during peak loads\n\n"
            "*🔧 Resolution Steps:*\n"
            "1. *Immediate Action:* Navigate to the RDS Console and review the instance's current storage configuration\n"
            "2. *Check Current Setup:*\n"
            "   • Review current storage type (likely gp2/gp3)\n"
            "   • Check current IOPS configuration\n"
            "   • Review CloudWatch metrics for I/O patterns\n"
            "3. *Choose Resolution Path:*\n"
            "   • *Option A - Upgrade to Provisioned IOPS (io1/io2):* Best for consistent high-performance workloads\n"
            "     - Go to Modify → Storage → Change storage type to io1 or io2\n"
            "     - Set appropriate Provisioned IOPS (typically 3x-5x your current baseline)\n"
            "     - Apply changes during maintenance window or immediately if critical\n"
            "   • *Option B - Upgrade to gp3 with higher baseline:* Good balance of cost and performance\n"
            "     - Modify storage type to gp3\n"
            "     - Increase baseline IOPS (default 3000, can go up to 16000)\n"
            "     - Increase throughput if needed (up to 1000 MiB/s)\n"
            "4. *Monitor:* After changes, monitor CloudWatch metrics to ensure performance improves\n"
        )
    elif event_id == 'RDS-EVENT-0225':
        title = f"🚨 RDS Storage Threshold Alert (0225): {db_name}"
        description = (
            "*📊 What This Means:*\n"
            "The RDS instance storage has reached 80% of its maximum allocated storage capacity. "
            "While RDS supports storage auto-scaling, this threshold indicates that auto-scaling is "
            "approaching its configured maximum limit. If storage continues to grow, you may hit the "
            "maximum storage limit, which could cause the database to become read-only or unavailable.\n\n"
            "*⚠️ Impact:*\n"
            "• Database may become read-only if storage reaches 100%\n"
            "• New writes will fail, causing application errors\n"
            "• Auto-scaling may not be able to expand further if max storage is reached\n"
            "• Potential data loss risk if not addressed promptly\n\n"
            "*🔧 Resolution Steps:*\n"
            "1. *Immediate Assessment:*\n"
            "   • Check current storage usage in RDS Console\n"
            "   • Review storage auto-scaling configuration\n"
            "   • Identify what's consuming storage space\n"
            "2. *Investigate Storage Usage:*\n"
            "   • Connect to the database and check table sizes\n"
            "   • Look for large tables, indexes, or temporary files\n"
            "   • Check for unoptimized queries creating large temp tables\n"
            "   • Review binary logs, error logs, and slow query logs size\n"
            "3. *Immediate Actions (if needed):*\n"
            "   • *Increase Max Storage:* Go to Modify → Storage → Increase Max Storage\n"
            "     - Increase by at least 20-30% to provide buffer\n"
            "     - Consider future growth projections\n"
            "   • *Clean Up Space (if applicable):*\n"
            "     - Archive or delete old data\n"
            "     - Optimize tables (ANALYZE/OPTIMIZE)\n"
            "     - Remove unnecessary indexes\n"
            "     - Clean up old binary logs (if safe to do so)\n"
            "4. *Review Auto-Scaling Settings:*\n"
            "   • Ensure auto-scaling is enabled\n"
            "   • Verify max storage limit is appropriate for your needs\n"
            "   • Consider increasing max storage proactively\n"
        )
    else:
        # Fallback for any other events
        title = f"🚨 RDS Event Alert ({event_id}): {db_name}"
        description = (
            f"*Event Details:*\n"
            f"{event_message}\n\n"
            "*⚠️ Action Required:* Please review this event and take appropriate action."
        )
text_parts = [
f"*Event ID:* `{event_id}`",
f"*Account ID:* `{account_id}`",
f"*RDS Name:* `{db_name}`",
f"*Region:* `{region}`",
f"*Issue:* {event_message}",
"",
description
]
if timestamp:
text_parts.insert(-1, f"*Time:* {timestamp}")
rds_console_link = f"https://{region}.console.aws.amazon.com/rds/home?region={region}#database:id={db_name};is-cluster=false"
cloudwatch_link = f"https://{region}.console.aws.amazon.com/cloudwatch/home?region={region}#metricsV2:graph=~();query=AWS/RDS%20{db_name}"
return {
'attachments': [{
'fallback': title,
'color': 'danger',
'title': title,
'text': '\n'.join(text_parts),
'fields': [
{
'title': 'Quick Actions',
'value': f"<{rds_console_link}|View in RDS Console> | <{cloudwatch_link}|View CloudWatch Metrics>",
'short': False
}
],
'footer': 'RDS Important Events Monitor'
}]
}
def post_to_slack(message):
"""Send message to Slack via webhook"""
webhook_url = os.environ.get('SLACK_WEBHOOK_URL')
if not webhook_url:
logger.error("SLACK_WEBHOOK_URL environment variable not set")
return False
try:
response = requests.post(
webhook_url,
data=json.dumps(message),
headers={'Content-Type': 'application/json'},
timeout=10
)
if response.status_code == 200:
logger.info("Successfully posted to Slack")
return True
else:
logger.error(f"Failed to post to Slack: HTTP {response.status_code} - {response.text}")
return False
except Exception as e:
logger.error(f"Exception posting to Slack: {str(e)}")
return False
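Before deploying, you can exercise the handler locally against a synthetic event. Here is a minimal smoke-test sketch; the file name, event values, and the assumption that SLACK_WEBHOOK_URL points at a test-channel webhook are all illustrative, not part of the original module:
# local_smoke_test.py (hypothetical helper, run from rds-important-events-monitor/src/)
import json
from app import lambda_handler  # assumes app.py is on the import path

sample_event = {
    "source": "aws.rds",
    "detail-type": "RDS DB Instance Event",
    "time": "2024-11-29T12:00:00Z",
    "region": "us-east-1",
    "account": "123456789012",
    "detail": {
        "EventID": "RDS-EVENT-0189",
        "SourceIdentifier": "test-db",
        "Message": "Test burst-balance event (synthetic)",
    },
}

# Context is unused by the handler, so None is fine for a local run
result = lambda_handler(sample_event, context=None)
print(json.dumps(result, indent=2))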
Dependencies
# rds-important-events-monitor/src/requirements.txt
requests==2.31.0
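The Terraform module expects a pre-built lambda_function.zip one directory above src/, and requests is not bundled in the Python 3.12 runtime, so it has to be vendored into the archive. How you package is up to you; here is one hedged sketch (a hypothetical build_lambda.py run from the rds-important-events-monitor/ directory):
# build_lambda.py (hypothetical packaging helper)
import shutil
import subprocess
import sys
from pathlib import Path

src = Path("src")
build = Path("build")
if build.exists():
    shutil.rmtree(build)          # start from a clean build directory
shutil.copytree(src, build)       # copy app.py and requirements.txt

# Vendor third-party dependencies (requests) into the package directory
subprocess.run(
    [sys.executable, "-m", "pip", "install", "-r", str(src / "requirements.txt"), "-t", str(build)],
    check=True,
)

# Produces lambda_function.zip in the current directory, where the module expects it
shutil.make_archive("lambda_function", "zip", root_dir=build)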
Setting Up Slack Integration
Create a Slack Incoming Webhook:
- Go to your Slack workspace settings
- Navigate to Apps β Incoming Webhooks
- Create a new webhook for your monitoring channel
- Copy the webhook URL
Store the Webhook in SSM Parameter Store:
aws ssm put-parameter \
--name "/us-east-1/rds-monitor/slack-webhook-url" \
--value "https://hooks.slack.com/services/YOUR/WEBHOOK/URL" \
--type "SecureString" \
--region us-east-1
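If you want to sanity-check the parameter before deploying, a quick read-back works; this sketch assumes your local credentials can decrypt the SecureString and prints only a prefix of the secret:
# verify_webhook_param.py (optional sanity check)
import boto3

ssm = boto3.client("ssm", region_name="us-east-1")
param = ssm.get_parameter(
    Name="/us-east-1/rds-monitor/slack-webhook-url",
    WithDecryption=True,  # required to read the SecureString value
)
print(param["Parameter"]["Value"][:35] + "...")  # never log the full webhook URL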
Deploy with Terraform:
cd terraform/us-east-1
terraform init
terraform plan
terraform apply
What the Slack Alert Looks Like
When an RDS event occurs, your team receives a rich Slack message like this:
🚨 RDS Burst Balance Alert (0189): production-db-instance
Event ID: RDS-EVENT-0189
Account ID: 123456789012
RDS Name: production-db-instance
Region: us-east-1
Issue: Your database instance has exhausted its burst balance...
📊 What This Means:
The RDS instance has exhausted its burst balance for General Purpose (SSD) storage...
⚠️ Impact:
• Database queries may become significantly slower
• Application response times will increase
...
🔧 Resolution Steps:
1. Immediate Action: Navigate to the RDS Console...
2. Check Current Setup: ...
3. Choose Resolution Path: ...
Quick Actions: [View in RDS Console] | [View CloudWatch Metrics]
Key Features:
- 🚨 Color-coded (red for danger)
- 📊 Context about what the event means
- ⚠️ Clear impact assessment
- 🔧 Step-by-step resolution guide
- 🔗 Direct links to AWS Console and CloudWatch
Benefits: Why This Matters for BFCM
1. Proactive Problem Detection
Instead of waiting for customer complaints or application metrics to degrade, you're notified immediately when RDS issues occur. During BFCM, this can save hours of troubleshooting.
2. Actionable Alerts
The alerts don't just say "something's wrong"; they provide:
- Context about whatβs happening
- Impact assessment
- Step-by-step resolution instructions
- Direct links to relevant AWS resources
3. Team Collaboration
Slack notifications ensure the right people see the issue immediately. No more checking CloudWatch dashboards or waiting for automated reports.
4. Cost-Effective
- Lambda: Pay only for invocations (typically $0.20 per million requests)
- EventBridge: First 5 million events/month are free
- Total monthly cost: Usually under $1 for most workloads
5. Scalable Architecture
The system automatically handles:
- Multiple RDS instances
- High event volumes during peak traffic
- Regional deployments
6. Infrastructure as Code
Everything is version-controlled and reproducible:
- Easy to deploy to new environments
- Simple to modify and extend
- No manual configuration drift
Real-World Impact
Before: During a previous BFCM event, we discovered RDS performance issues 2 hours after they started, leading to:
- 3-hour incident response time
After: With this monitoring system:
- Issues detected within 30 seconds
- Team notified immediately via Slack
- Resolution time reduced to 15 minutes
- Zero customer-facing impact
Extending the Solution
This pattern can be extended to monitor other critical AWS events:
# Example: Add ECS service deployment failures
resource "aws_cloudwatch_event_rule" "ecs_deployment_failures" {
name = "ecs-deployment-failures"
event_pattern = jsonencode({
source = ["aws.ecs"]
detail-type = ["ECS Deployment State Change"]
detail = {
eventName = ["SERVICE_DEPLOYMENT_FAILED"]
}
})
}
Or add more RDS events:
detail = {
EventID = [
"RDS-EVENT-0189", # Burst balance
"RDS-EVENT-0225", # Storage threshold
"RDS-EVENT-0169", # DB instance restart
"RDS-EVENT-0151" # DB instance availability
]
}
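Whether you add more RDS event IDs or entirely new sources, routing several event types through one Lambda usually means a small dispatch layer in front of the formatters. A hedged sketch, not part of the module above (route_event and the ECS branch are hypothetical additions):
# Hypothetical routing layer if one Lambda handles several event sources
def route_event(event):
    source = event.get("source", "")
    detail = event.get("detail", {})
    if source == "aws.rds":
        # Hand off to the RDS formatter shown earlier in this post
        return {"text": f"RDS event {detail.get('EventID')} on {detail.get('SourceIdentifier')}"}
    if source == "aws.ecs":
        # A formatter you would write for ECS deployment state changes
        return {"text": f"ECS deployment event: {detail.get('eventName')}"}
    return {"text": f"Unhandled event from source: {source}"}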
Best Practices
- Test Your Alerts: Use AWS CLI to simulate events:
aws events put-events --entries '[
{
"Source": "aws.rds",
"DetailType": "RDS DB Instance Event",
"Detail": "{\"EventID\":\"RDS-EVENT-0189\",\"SourceIdentifier\":\"test-db\",\"Message\":\"Test event\"}"
}
]'
- Monitor the Monitor: Set up CloudWatch alarms for Lambda errors (see the sketch after this list)
- Rotate Secrets: Regularly rotate your Slack webhook URL
- Review Logs: Periodically review CloudWatch logs for patterns
- Update Documentation: Keep resolution steps current as your infrastructure evolves
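For the "monitor the monitor" item, here is a hedged sketch of an alarm on the function's own Errors metric; the alarm name matches the default function name from the module, and the SNS topic ARN is a placeholder (you could equally define this in Terraform):
# Hypothetical: alarm when the monitoring Lambda itself throws errors
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
cloudwatch.put_metric_alarm(
    AlarmName="rds-important-events-monitor-errors",
    Namespace="AWS/Lambda",
    MetricName="Errors",
    Dimensions=[{"Name": "FunctionName", "Value": "rds-important-events-monitor"}],
    Statistic="Sum",
    Period=300,                    # evaluate in 5-minute windows
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # placeholder SNS topic
)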
Conclusion
Proactive monitoring isn't just about catching problems; it's about preventing them from impacting your customers. During critical events like BFCM, every second counts.
This EventBridge + Lambda + Slack solution gives you:
- ✅ Immediate notification of RDS issues
- ✅ Actionable, context-rich alerts
- ✅ Zero infrastructure management
- ✅ Cost-effective scaling
- ✅ Infrastructure as Code best practices
The result? A monitoring system that works silently in the background, alerting your team only when action is needed, with all the context they need to resolve issues quickly.
Resources
- AWS EventBridge Documentation
- RDS Event Categories and Event Messages
- Slack Incoming Webhooks
- Terraform AWS Provider
Have you built similar monitoring solutions? Share your experiences in the comments below! 🚀