CloudWatch Alarm Triage with AWS Bedrock
Created by Wayne Workman
Overview
A reusable Terraform module that integrates AWS CloudWatch Alarms with AWS Bedrock models in agent mode to automatically investigate and triage alarms. The AI model operates as an agent with access to a Python execution tool that can run boto3 operations and analysis code with pre-imported libraries, providing deep investigation capabilities. This solution provides engineers with comprehensive contextual information and preliminary analysis before they respond to incidents, significantly reducing mean time to resolution (MTTR).
Model Compatibility:
- Works with all AWS Bedrock foundational models (Claude, Nova, Titan, etc.)
- Defaults to Claude Sonnet 4.5 (β¦
CloudWatch Alarm Triage with AWS Bedrock
Created by Wayne Workman
Overview
A reusable Terraform module that integrates AWS CloudWatch Alarms with AWS Bedrock models in agent mode to automatically investigate and triage alarms. The AI model operates as an agent with access to a Python execution tool that can run boto3 operations and analysis code with pre-imported libraries, providing deep investigation capabilities. This solution provides engineers with comprehensive contextual information and preliminary analysis before they respond to incidents, significantly reducing mean time to resolution (MTTR).
Model Compatibility:
- Works with all AWS Bedrock foundational models (Claude, Nova, Titan, etc.)
- Defaults to Claude Sonnet 4.5 (
global.anthropic.claude-sonnet-4-5-20250929-v1:0) - currently the best performing model while also being fast and cost-effective - Smaller/cheaper models are less capable; newer/larger models are more capable
- Use the
global.orus.prefix to enable cross-region inference - Module uses the region of your calling Terraform provider
Key Features
- System Inference Profile: Uses AWS-managed inference profile for reliable model invocation
- Robust Error Handling: Automatic retries with exponential backoff for API timeouts
- 5-Minute Read Timeout: Extended timeout for handling complex investigations
- 100 Tool Call Iterations: Supports thorough multi-step investigations
- DynamoDB Deduplication: Prevents duplicate investigations with configurable time window
- Pre-imported Python Modules: Fast execution with 40+ pre-imported Python libraries
- Concurrent Execution Control: Prevents overlapping investigations
- Security-First Import Control: Automatic import statement removal ensures only authorized pre-imported modules are used
- Enhanced Visibility: Full context tracking of all AI model interactions with detailed S3 reporting
- Configurable Logging: Environment-based logging levels (ERROR/INFO/DEBUG) for production optimization
Architecture
Two-Lambda Design
This module creates two Lambda functions working together:
Orchestrator Lambda (Python 3.13)
- Receives CloudWatch Alarm events
- Invokes Bedrock model in agent mode
- Sends investigation results to SNS
- Minimal IAM permissions (Bedrock, SNS, Logs)
Tool Lambda (Python 3.13)
- Called by the AI model as a tool during investigation
- Executes Python code with pre-imported modules
- Uses AWS managed
ReadOnlyAccesspolicy with deny statements - Prevents access to sensitive data (S3 objects, DynamoDB data, secrets)
- All standard library and AWS SDK modules pre-imported for performance
Workflow
CloudWatch Alarm (ALARM state)
β
Orchestrator Lambda
β
DynamoDB Deduplication Check
β (if not recently investigated)
Bedrock Model (agent mode)
β (multiple tool calls)
Tool Lambda (Python executor)
β (returns findings)
AI Analysis & Root Cause
β
Save Report to S3 Bucket
β
SNS Email Notification
Deduplication
The module uses DynamoDB to prevent duplicate investigations of the same alarm within a configurable time window (default: 1 hour). This prevents multiple emails when CloudWatch continuously evaluates an alarm in ALARM state. The DynamoDB entries automatically expire using TTL.
Investigation Reports Storage
All investigation reports are automatically saved to an S3 bucket with enhanced visibility features:
File Organization
-
Date-based structure:
reports/YYYY/MM/DD/ -
Timestamp-first naming:
YYYYMMDD_HHMMSS_UTC_{alarm_name}_{type}.{ext} -
Three files per investigation:
-
*_report.txt- Human-readable investigation report -
*_full_context.txt- Complete AI model conversation history -
*.json- Structured data with all metadata
Storage Features
- Encryption: Server-side encryption with AES256
- Versioning: Enabled for audit trail
- Access Control: Public access blocked with bucket ACLs
- Optional Logging: Configure access logging to track report access
- Optional Lifecycle: Auto-delete old reports after specified days
- Naming Convention:
{prefix}-alarm-reports-{random}for bucket uniqueness
Enhanced Visibility Data
Each investigation captures:
- Iteration count: Total number of Bedrock API invocations
- Tool call history: Complete record of all Python code executions
- Full conversation context: Every interaction between orchestrator and AI model
- Timestamps: Precise UTC timestamps for all operations
- Alarm metadata: Complete CloudWatch alarm event data
Module Structure
cloudwatch-alarm-triage/
βββ main.tf # Core module resources
βββ variables.tf # Module input variables
βββ outputs.tf # Module outputs
βββ versions.tf # Provider version constraints
βββ lambda/
β βββ triage_handler.py # Main Lambda function
β βββ bedrock_client.py # Bedrock integration
β βββ prompt_template.py # AI prompt template
β βββ requirements.txt # Python dependencies
βββ tool-lambda/
β βββ tool_handler.py # Python executor handler
βββ tests/
β βββ unit/ # Unit tests (204 tests)
β βββ integration/ # Integration tests (7 tests)
β βββ conftest.py # Pytest configuration
βββ demo/ # Complete working example
β βββ main.tf # Demo deployment
β βββ failing_lambda.tf # Intentionally failing Lambda
β βββ alarms.tf # CloudWatch alarm config
β βββ lambda_code/ # Demo Lambda code
β βββ README.md # Demo documentation
β βββ DEMO_SUMMARY.md # Technical summary
βββ README.md # This file
Quick Start
1. Deploy the Module
# Configure provider for region with Bedrock models
provider "aws" {
region = "us-east-2" # Ensure Bedrock models are available in your region
}
# Create SNS topic for notifications
resource "aws_sns_topic" "alarm_notifications" {
name = "cloudwatch-alarm-investigations"
}
resource "aws_sns_topic_subscription" "email" {
topic_arn = aws_sns_topic.alarm_notifications.arn
protocol = "email"
endpoint = "your-email@example.com"
}
# Deploy the triage module
module "alarm_triage" {
source = "github.com/wayneworkman/terraform-aws-module-cloudwatch-alarm-triage"
sns_topic_arn = aws_sns_topic.alarm_notifications.arn
# Optional: Override the default model (Claude Opus 4.1 with cross-region inference)
# bedrock_model_id = "us.amazon.nova-premier-v1:0" # For cost optimization
# Optional: Configure logging level
# log_level = "DEBUG" # Options: ERROR, INFO (default), DEBUG
tags = {
Environment = "production"
ManagedBy = "terraform"
}
}
2. Configure CloudWatch Alarms
Add the triage Lambda as an action on your CloudWatch alarms:
resource "aws_cloudwatch_metric_alarm" "lambda_errors" {
alarm_name = "lambda-high-error-rate"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = "2"
metric_name = "Errors"
namespace = "AWS/Lambda"
period = "60"
statistic = "Sum"
threshold = "10"
alarm_description = "Triggers when Lambda errors exceed threshold"
dimensions = {
FunctionName = aws_lambda_function.my_function.function_name
}
# Add triage Lambda as alarm action
alarm_actions = [
module.alarm_triage.triage_lambda_arn
]
}
3. Confirm Email Subscription
Check your email and confirm the SNS subscription to receive investigation results.
Module Inputs
| Variable | Description | Type | Default |
|---|---|---|---|
log_level | Logging level for Lambda functions (ERROR, INFO, DEBUG) | string | "INFO" |
sns_topic_arn | SNS topic ARN for sending investigation results | string | Required |
bedrock_model_id | Bedrock model identifier | string | "global.anthropic.claude-sonnet-4-5-20250929-v1:0" |
lambda_timeout | Timeout for orchestrator Lambda in seconds | number | 900 |
lambda_memory_size | Memory for orchestrator Lambda in MB | number | 1024 |
tool_lambda_timeout | Timeout for tool Lambda in seconds | number | 60 |
tool_lambda_memory_size | Memory for tool Lambda in MB | number | 2048 |
tool_lambda_reserved_concurrency | Reserved concurrent executions for tool Lambda | number | -1 (unreserved) |
investigation_window_hours | Hours before re-investigating same alarm | number | 1 |
resource_prefix | Prefix for all created resources | string | "" |
resource_suffix | Suffix for all created resources | string | "" |
tags | Tags to apply to all resources | map(string) | {} |
Module Outputs
| Output | Description |
|---|---|
triage_lambda_arn | ARN of the triage Lambda function |
triage_lambda_name | Name of the triage Lambda function |
tool_lambda_arn | ARN of the tool Lambda function |
tool_lambda_name | Name of the tool Lambda function |
triage_lambda_log_group | CloudWatch Logs group for the triage Lambda |
tool_lambda_log_group | CloudWatch Logs group for the tool Lambda |
bedrock_model_id | The Bedrock model ID being used |
dynamodb_table_name | Name of the DynamoDB table for deduplication |
Investigation Capabilities
The AI model can investigate alarms by executing Python code with pre-imported modules:
Security-First Import Control
The tool Lambda implements automatic import statement removal as a critical security and control feature. This ensures that only authorized, pre-imported modules can be used in the execution environment, preventing:
- Unauthorized module usage - No ability to import unapproved libraries
- Supply chain attacks - Cannot introduce external dependencies
- Resource exhaustion - Prevents importing resource-intensive modules
- Data exfiltration - Blocks attempts to import networking libraries not pre-approved
When the AI model includes import statements, the system automatically:
- Detects and removes all import statements using AST parsing
- Logs the removed imports for security audit trails
- Continues execution with only the pre-authorized modules
This security control also provides compatibility benefits, allowing the module to work reliably with various AI model tiers. The system transparently handles code like:
import boto3
import datetime
import json
# Your investigation code here...
The system automatically strips these imports and executes the code with only the pre-authorized modules, maintaining strict security boundaries while ensuring functionality.
Pre-Imported Modules Available
Core AWS & Data
- boto3 - AWS SDK for all AWS operations
- json - JSON encoding/decoding
- csv - CSV file operations
- base64 - Base64 encoding/decoding
Date & Time
- datetime - Full datetime module (use datetime.datetime, datetime.timedelta, etc.)
- time - Time-related functions
Text & Pattern Matching
- re - Regular expressions
- string - String constants and utilities
- textwrap - Text wrapping and filling
- difflib - Helpers for computing differences
- fnmatch - Unix-style pattern matching
- glob - Unix-style pathname pattern expansion
Data Structures & Algorithms
- collections - Counter, defaultdict, OrderedDict, etc.
- itertools - Functions for creating iterators
- functools - Higher-order functions
- operator - Standard operators as functions
- copy - Shallow and deep copy operations
Network & Security
- ipaddress - IP network/address manipulation
- hashlib - Secure hash algorithms
- urllib - URL handling modules
- uuid - UUID generation
Math & Statistics
- math - Mathematical functions
- statistics - Statistical functions
- random - Random number generation
- decimal - Decimal arithmetic
- fractions - Rational number arithmetic
System & Utility
- os - Operating system interface (limited in Lambda)
- sys - System-specific parameters
- platform - Platform identification
- traceback - Traceback utilities
- warnings - Warning control
- pprint - Pretty printer
Type Hints & Data Classes
- enum - Support for enumerations
- dataclasses - Data class support
- typing - Type hints support
I/O Operations
- StringIO - In-memory text streams
- BytesIO - In-memory byte streams
Compression
- gzip - Gzip compression
- zlib - Compression library
- tarfile - Tar archive access
- zipfile - ZIP archive access
Investigation Examples
- CloudWatch Logs Analysis - Filter and analyze application logs
- Metric Statistics - Review trends and anomalies
- IAM Permissions - Check roles and policies
- CloudTrail Events - Find recent API calls
- EC2 Instances - Examine infrastructure state
- Lambda Functions - Review configurations and errors
- Complex Analysis - Pattern detection, cost optimization, multi-resource correlation
Security Model
The tool Lambda uses:
-
AWS managed ReadOnlyAccess policy for comprehensive resource inspection
-
Explicit deny statements for sensitive data:
-
S3 object content (can list, cannot read)
-
DynamoDB data (can describe, cannot query)
-
Secrets Manager values (can list, cannot retrieve)
-
Parameter Store SecureString parameters (can list, cannot decrypt)
Example Investigation Output
When a Lambda function alarm triggers, the AI model might provide:
π¨ EXECUTIVE SUMMARY
Lambda function prod-api-handler experiencing 100% error rate due to missing DynamoDB table permissions. Immediate action required: Add dynamodb:GetItem permission to Lambda role.
π INVESTIGATION DETAILS
Python code executed:
CloudWatch Logs analysis:
logs = boto3.client('logs')
response = logs.filter_log_events(
logGroupName='/aws/lambda/prod-api-handler',
startTime=int((datetime.datetime.now() - datetime.timedelta(minutes=5)).timestamp() * 1000)
)
- Found 47 AccessDenied errors in past 5 minutes
- All errors: βUser: arn:aws:sts::123456789012:assumed-role/lambda-role/prod-api-handler is not authorized to perform: dynamodb:GetItem on resource: arn:aws:dynamodb:us-east-2:123456789012:table/UserDataβ
IAM role policy check:
iam = boto3.client('iam')
policies = iam.list_attached_role_policies(RoleName='lambda-role')
- Only AWSLambdaBasicExecutionRole attached
- No DynamoDB permissions found
DynamoDB table verification:
dynamodb = boto3.client('dynamodb')
table = dynamodb.describe_table(TableName='UserData')
- Table exists and is ACTIVE
- No resource-level restrictions
π ROOT CAUSE
Lambda role lacks dynamodb:GetItem permission for the UserData table. This occurred after the recent IAM policy update that removed the overly permissive * resource access.
π₯ IMPACT ASSESSMENT
- Affected Resources: prod-api-handler Lambda function
- Business Impact: API completely unavailable, all requests failing
- Severity Level: Critical
- Users Affected: All users (estimated 5,000+ active)
π§ IMMEDIATE ACTIONS
Add DynamoDB permissions to Lambda role:
aws iam attach-role-policy \
--role-name lambda-role \
--policy-arn arn:aws:iam::aws:policy/AmazonDynamoDBReadOnlyAccess
Time estimate: 2 minutes 1.
Verify function recovery:
- Monitor CloudWatch metrics for error rate drop
- Test API endpoints manually Time estimate: 5 minutes
π‘οΈ PREVENTION MEASURES
- Implement least-privilege IAM policies with explicit resource ARNs
- Add pre-deployment IAM policy validation
- Create Lambda function tests that verify DynamoDB access
π MONITORING RECOMMENDATIONS
- Set alarm threshold to 5 errors (current: 10)
- Add custom metric for DynamoDB throttling
- Create dashboard showing Lambda errors by error type
Advanced Configuration
Resource Naming
Control resource names with prefixes and suffixes:
module "alarm_triage" {
source = "github.com/wayneworkman/terraform-aws-module-cloudwatch-alarm-triage"
resource_prefix = "prod"
resource_suffix = "us-east-2"
# Creates: prod-triage-handler-us-east-2
sns_topic_arn = aws_sns_topic.alarms.arn
}
Lambda Configuration
Adjust Lambda resources based on your needs (defaults shown):
module "alarm_triage" {
source = "github.com/wayneworkman/terraform-aws-module-cloudwatch-alarm-triage"
# Orchestrator Lambda configuration
lambda_timeout = 900 # Default: 15 minutes (hard-coded maximum)
lambda_memory_size = 512 # Default: 512 MB
# Tool Lambda configuration
tool_lambda_timeout = 120 # Default: 2 minutes
tool_lambda_memory_size = 512 # Default: 512 MB
tool_lambda_reserved_concurrency = -1 # Default: no limit
sns_topic_arn = aws_sns_topic.alarms.arn
}
Deduplication Window
Control how often the same alarm is investigated:
module "alarm_triage" {
source = "github.com/wayneworkman/terraform-aws-module-cloudwatch-alarm-triage"
investigation_window_hours = 4 # Only investigate same alarm every 4 hours
sns_topic_arn = aws_sns_topic.alarms.arn
}
S3 Reports Configuration
Configure the S3 bucket for storing investigation reports:
module "alarm_triage" {
source = "github.com/wayneworkman/terraform-aws-module-cloudwatch-alarm-triage"
# Optional: Configure S3 access logging
reports_bucket_logging = {
target_bucket = "my-logging-bucket"
target_prefix = "alarm-reports/"
}
# Optional: Auto-delete old reports after 90 days
reports_bucket_lifecycle_days = 90
sns_topic_arn = aws_sns_topic.alarms.arn
}
Security Considerations
IAM Permissions
- Tool Lambda has read-only access to most AWS services
- Explicit deny policies prevent access to sensitive data
- No ability to modify resources or access secrets
Data Protection
- No customer data is stored beyond the investigation window
- DynamoDB entries auto-expire via TTL
- All logs respect CloudWatch retention policies
Network Security
- Lambdas run in AWS-managed VPC by default
- Can be configured for VPC deployment if needed
- All AWS API calls use TLS encryption
Compliance
- GDPR-compliant (no PII processing)
- SOC 2 compatible design
- Audit trail via CloudWatch Logs and CloudTrail
Troubleshooting
Common Issues
Module fails to deploy with Bedrock model error
- Bedrock models not available in your region
- Solution: Deploy in us-east-2 or another supported region
Alarm doesnβt trigger triage Lambda
- Check CloudWatch alarm configuration includes
module.triage.triage_lambda_arn - Verify Lambda resource policy allows CloudWatch invocation
- Check alarm state:
aws cloudwatch describe-alarms --alarm-names "your-alarm"
No email notifications received
- Confirm SNS subscription (check spam folder for confirmation)
- Verify SNS topic ARN is correct in module configuration
- Check orchestrator Lambda logs for SNS publish errors
Tool Lambda timing out
- Increase
tool_lambda_timeout(default: 60 seconds) - Check specific Python code execution in tool Lambda logs
- Some operations (large log queries, pagination) may need more time
Investigation seems incomplete
- Consider using a more capable model like Claude Opus
- Check if tool Lambda is hitting memory limits
- Increase lambda_timeout if investigations are timing out
Duplicate investigations occurring
- Check
investigation_window_hourssetting - Verify DynamoDB table TTL is enabled
- Look for multiple alarm evaluations in CloudWatch
Viewing Logs
Check CloudWatch Logs for debugging:
# Orchestrator Lambda logs
aws logs tail /aws/lambda/triage-handler --follow
# Tool Lambda logs
aws logs tail /aws/lambda/tool-lambda --follow
# Filter for specific alarm
aws logs filter-log-events \
--log-group-name /aws/lambda/triage-handler \
--filter-pattern "alarm-name"
Testing
Test the module with a manual alarm trigger:
# Manually put an alarm into ALARM state
aws cloudwatch set-alarm-state \
--alarm-name "test-alarm" \
--state-value ALARM \
--state-reason "Manual test"
Cost Optimization
Estimated Costs
Based on typical usage (100 alarms/month with default 512MB Lambda memory):
- Bedrock: ~$3-10/month (depends on investigation complexity)
- Lambda: <$1/month (optimized with 512MB default memory)
- DynamoDB: <$1/month
- CloudWatch Logs: <$1/month
- S3: <$1/month (for report storage)
- SNS: <$1/month
Cost Reduction Tips
- Use cost-effective models - Nova Premier available as alternative for cost optimization
- Increase deduplication window - Reduce duplicate investigations
- Use reserved concurrency - Prevent runaway Lambda costs
- Configure log retention - Reduce CloudWatch Logs storage
- Optimize Lambda memory - Adjust based on actual usage
- Set log_level to ERROR in production - Reduce CloudWatch Logs volume
Testing
The module includes comprehensive test coverage with 252 tests achieving 96% code coverage:
# Run all tests
python -m pytest tests/ -v
# Run unit tests only (245 tests)
python -m pytest tests/unit/ -v
# Run integration tests only (7 tests)
python -m pytest tests/integration/ -v
# Run with coverage report
python -m pytest tests/ --cov=lambda --cov=tool-lambda --cov-report=term-missing
Test Categories
- Deduplication & Formatting: DynamoDB deduplication logic, notification formatting
- Malformed Events: Edge cases, null values, invalid configurations
- Performance & Load: Concurrency, memory/CPU stress, high-volume operations
- Security Boundaries: IAM permissions, credential protection, injection prevention
- Integration: End-to-end alarm investigation workflow
Contributing
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Ensure all tests pass
- Submit a pull request
License
MIT License - see LICENSE file for details
Support
For issues or questions:
Acknowledgments
- Compatible with multiple AWS Bedrock models including Claude and Nova
- Powered by AWS Bedrock
- Terraform module best practices from HashiCorp