AI-Powered AWS CloudWatch Alarm Triage Terraform Module

CloudWatch Alarm Triage with AWS Bedrock

Overview

A reusable Terraform module that integrates AWS CloudWatch Alarms with AWS Bedrock models in agent mode to automatically investigate and triage alarms. The AI model operates as an agent with access to a Python execution tool that can run boto3 operations and analysis code with pre-imported libraries, providing deep investigation capabilities. This solution provides engineers with comprehensive contextual information and preliminary analysis before they respond to incidents, significantly reducing mean time to resolution (MTTR).

Model Compatibility:

Works with all AWS Bedrock foundational models (Claude, Nova, Titan, etc.)
Defaults to Claude Sonnet 4.5 (…

CloudWatch Alarm Triage with AWS Bedrock

Created by Wayne Workman

Overview

Model Compatibility:

Works with all AWS Bedrock foundational models (Claude, Nova, Titan, etc.)
Defaults to Claude Sonnet 4.5 (global.anthropic.claude-sonnet-4-5-20250929-v1:0) - currently the best performing model while also being fast and cost-effective
Smaller/cheaper models are less capable; newer/larger models are more capable
Use the global. or us. prefix to enable cross-region inference
Module uses the region of your calling Terraform provider

Key Features

System Inference Profile: Uses AWS-managed inference profile for reliable model invocation
Robust Error Handling: Automatic retries with exponential backoff for API timeouts
5-Minute Read Timeout: Extended timeout for handling complex investigations
100 Tool Call Iterations: Supports thorough multi-step investigations
DynamoDB Deduplication: Prevents duplicate investigations with configurable time window
Pre-imported Python Modules: Fast execution with 40+ pre-imported Python libraries
Concurrent Execution Control: Prevents overlapping investigations
Security-First Import Control: Automatic import statement removal ensures only authorized pre-imported modules are used
Enhanced Visibility: Full context tracking of all AI model interactions with detailed S3 reporting
Configurable Logging: Environment-based logging levels (ERROR/INFO/DEBUG) for production optimization

Architecture

Two-Lambda Design

This module creates two Lambda functions working together:

Orchestrator Lambda (Python 3.13)

Receives CloudWatch Alarm events
Invokes Bedrock model in agent mode
Sends investigation results to SNS
Minimal IAM permissions (Bedrock, SNS, Logs)

Tool Lambda (Python 3.13)

Called by the AI model as a tool during investigation
Executes Python code with pre-imported modules
Uses AWS managed ReadOnlyAccess policy with deny statements
Prevents access to sensitive data (S3 objects, DynamoDB data, secrets)
All standard library and AWS SDK modules pre-imported for performance

Workflow

CloudWatch Alarm (ALARM state)
↓
Orchestrator Lambda
↓
DynamoDB Deduplication Check
↓ (if not recently investigated)
Bedrock Model (agent mode)
↓ (multiple tool calls)
Tool Lambda (Python executor)
↓ (returns findings)
AI Analysis & Root Cause
↓
Save Report to S3 Bucket
↓
SNS Email Notification

Deduplication

The module uses DynamoDB to prevent duplicate investigations of the same alarm within a configurable time window (default: 1 hour). This prevents multiple emails when CloudWatch continuously evaluates an alarm in ALARM state. The DynamoDB entries automatically expire using TTL.

Investigation Reports Storage

All investigation reports are automatically saved to an S3 bucket with enhanced visibility features:

File Organization

Date-based structure: reports/YYYY/MM/DD/
Timestamp-first naming: YYYYMMDD_HHMMSS_UTC_{alarm_name}_{type}.{ext}
Three files per investigation:
*_report.txt - Human-readable investigation report
*_full_context.txt - Complete AI model conversation history
*.json - Structured data with all metadata

Storage Features

Encryption: Server-side encryption with AES256
Versioning: Enabled for audit trail
Access Control: Public access blocked with bucket ACLs
Optional Logging: Configure access logging to track report access
Optional Lifecycle: Auto-delete old reports after specified days
Naming Convention: {prefix}-alarm-reports-{random} for bucket uniqueness

Enhanced Visibility Data

Each investigation captures:

Iteration count: Total number of Bedrock API invocations
Tool call history: Complete record of all Python code executions
Full conversation context: Every interaction between orchestrator and AI model
Timestamps: Precise UTC timestamps for all operations
Alarm metadata: Complete CloudWatch alarm event data

Module Structure

cloudwatch-alarm-triage/
├── main.tf                 # Core module resources
├── variables.tf            # Module input variables
├── outputs.tf              # Module outputs
├── versions.tf             # Provider version constraints
├── lambda/
│   ├── triage_handler.py   # Main Lambda function
│   ├── bedrock_client.py   # Bedrock integration
│   ├── prompt_template.py  # AI prompt template
│   └── requirements.txt    # Python dependencies
├── tool-lambda/
│   └── tool_handler.py     # Python executor handler
├── tests/
│   ├── unit/               # Unit tests (204 tests)
│   ├── integration/        # Integration tests (7 tests)
│   └── conftest.py         # Pytest configuration
├── demo/                   # Complete working example
│   ├── main.tf             # Demo deployment
│   ├── failing_lambda.tf   # Intentionally failing Lambda
│   ├── alarms.tf           # CloudWatch alarm config
│   ├── lambda_code/        # Demo Lambda code
│   ├── README.md           # Demo documentation
│   └── DEMO_SUMMARY.md     # Technical summary
└── README.md               # This file

Quick Start

1. Deploy the Module

# Configure provider for region with Bedrock models
provider "aws" {
region = "us-east-2"  # Ensure Bedrock models are available in your region
}

# Create SNS topic for notifications
resource "aws_sns_topic" "alarm_notifications" {
name = "cloudwatch-alarm-investigations"
}

resource "aws_sns_topic_subscription" "email" {
topic_arn = aws_sns_topic.alarm_notifications.arn
protocol  = "email"
endpoint  = "your-email@example.com"
}

# Deploy the triage module
module "alarm_triage" {
source = "github.com/wayneworkman/terraform-aws-module-cloudwatch-alarm-triage"

sns_topic_arn = aws_sns_topic.alarm_notifications.arn

# Optional: Override the default model (Claude Opus 4.1 with cross-region inference)
# bedrock_model_id = "us.amazon.nova-premier-v1:0"  # For cost optimization

# Optional: Configure logging level
# log_level = "DEBUG"  # Options: ERROR, INFO (default), DEBUG

tags = {
Environment = "production"
ManagedBy   = "terraform"
}
}

2. Configure CloudWatch Alarms

Add the triage Lambda as an action on your CloudWatch alarms:

resource "aws_cloudwatch_metric_alarm" "lambda_errors" {
alarm_name          = "lambda-high-error-rate"
comparison_operator = "GreaterThanThreshold"
evaluation_periods  = "2"
metric_name         = "Errors"
namespace           = "AWS/Lambda"
period              = "60"
statistic           = "Sum"
threshold           = "10"
alarm_description   = "Triggers when Lambda errors exceed threshold"

dimensions = {
FunctionName = aws_lambda_function.my_function.function_name
}

# Add triage Lambda as alarm action
alarm_actions = [
module.alarm_triage.triage_lambda_arn
]
}

3. Confirm Email Subscription

Check your email and confirm the SNS subscription to receive investigation results.

Module Inputs

Variable	Description	Type	Default
`log_level`	Logging level for Lambda functions (ERROR, INFO, DEBUG)	`string`	`"INFO"`
`sns_topic_arn`	SNS topic ARN for sending investigation results	`string`	Required
`bedrock_model_id`	Bedrock model identifier	`string`	`"global.anthropic.claude-sonnet-4-5-20250929-v1:0"`
`lambda_timeout`	Timeout for orchestrator Lambda in seconds	`number`	`900`
`lambda_memory_size`	Memory for orchestrator Lambda in MB	`number`	`1024`
`tool_lambda_timeout`	Timeout for tool Lambda in seconds	`number`	`60`
`tool_lambda_memory_size`	Memory for tool Lambda in MB	`number`	`2048`
`tool_lambda_reserved_concurrency`	Reserved concurrent executions for tool Lambda	`number`	`-1` (unreserved)
`investigation_window_hours`	Hours before re-investigating same alarm	`number`	`1`
`resource_prefix`	Prefix for all created resources	`string`	`""`
`resource_suffix`	Suffix for all created resources	`string`	`""`
`tags`	Tags to apply to all resources	`map(string)`	`{}`

Module Outputs

Output	Description
`triage_lambda_arn`	ARN of the triage Lambda function
`triage_lambda_name`	Name of the triage Lambda function
`tool_lambda_arn`	ARN of the tool Lambda function
`tool_lambda_name`	Name of the tool Lambda function
`triage_lambda_log_group`	CloudWatch Logs group for the triage Lambda
`tool_lambda_log_group`	CloudWatch Logs group for the tool Lambda
`bedrock_model_id`	The Bedrock model ID being used
`dynamodb_table_name`	Name of the DynamoDB table for deduplication

Investigation Capabilities

The AI model can investigate alarms by executing Python code with pre-imported modules:

Security-First Import Control

The tool Lambda implements automatic import statement removal as a critical security and control feature. This ensures that only authorized, pre-imported modules can be used in the execution environment, preventing:

Unauthorized module usage - No ability to import unapproved libraries
Supply chain attacks - Cannot introduce external dependencies
Resource exhaustion - Prevents importing resource-intensive modules
Data exfiltration - Blocks attempts to import networking libraries not pre-approved

When the AI model includes import statements, the system automatically:

Detects and removes all import statements using AST parsing
Logs the removed imports for security audit trails
Continues execution with only the pre-authorized modules

This security control also provides compatibility benefits, allowing the module to work reliably with various AI model tiers. The system transparently handles code like:

import boto3
import datetime
import json

# Your investigation code here...

The system automatically strips these imports and executes the code with only the pre-authorized modules, maintaining strict security boundaries while ensuring functionality.

Pre-Imported Modules Available

Core AWS & Data

boto3 - AWS SDK for all AWS operations
json - JSON encoding/decoding
csv - CSV file operations
base64 - Base64 encoding/decoding

Date & Time

datetime - Full datetime module (use datetime.datetime, datetime.timedelta, etc.)
time - Time-related functions

Text & Pattern Matching

re - Regular expressions
string - String constants and utilities
textwrap - Text wrapping and filling
difflib - Helpers for computing differences
fnmatch - Unix-style pattern matching
glob - Unix-style pathname pattern expansion

Data Structures & Algorithms

collections - Counter, defaultdict, OrderedDict, etc.
itertools - Functions for creating iterators
functools - Higher-order functions
operator - Standard operators as functions
copy - Shallow and deep copy operations

Network & Security

ipaddress - IP network/address manipulation
hashlib - Secure hash algorithms
urllib - URL handling modules
uuid - UUID generation

Math & Statistics

math - Mathematical functions
statistics - Statistical functions
random - Random number generation
decimal - Decimal arithmetic
fractions - Rational number arithmetic

System & Utility

os - Operating system interface (limited in Lambda)
sys - System-specific parameters
platform - Platform identification
traceback - Traceback utilities
warnings - Warning control
pprint - Pretty printer

Type Hints & Data Classes

enum - Support for enumerations
dataclasses - Data class support
typing - Type hints support

I/O Operations

StringIO - In-memory text streams
BytesIO - In-memory byte streams

Compression

gzip - Gzip compression
zlib - Compression library
tarfile - Tar archive access
zipfile - ZIP archive access

Investigation Examples

CloudWatch Logs Analysis - Filter and analyze application logs
Metric Statistics - Review trends and anomalies
IAM Permissions - Check roles and policies
CloudTrail Events - Find recent API calls
EC2 Instances - Examine infrastructure state
Lambda Functions - Review configurations and errors
Complex Analysis - Pattern detection, cost optimization, multi-resource correlation

Security Model

The tool Lambda uses:

AWS managed ReadOnlyAccess policy for comprehensive resource inspection
Explicit deny statements for sensitive data:
S3 object content (can list, cannot read)
DynamoDB data (can describe, cannot query)
Secrets Manager values (can list, cannot retrieve)
Parameter Store SecureString parameters (can list, cannot decrypt)

Example Investigation Output

When a Lambda function alarm triggers, the AI model might provide:

🚨 EXECUTIVE SUMMARY

Lambda function prod-api-handler experiencing 100% error rate due to missing DynamoDB table permissions. Immediate action required: Add dynamodb:GetItem permission to Lambda role.

🔍 INVESTIGATION DETAILS

Python code executed:

CloudWatch Logs analysis:

logs = boto3.client('logs')
response = logs.filter_log_events(
logGroupName='/aws/lambda/prod-api-handler',
startTime=int((datetime.datetime.now() - datetime.timedelta(minutes=5)).timestamp() * 1000)
)

Found 47 AccessDenied errors in past 5 minutes
All errors: “User: arn:aws:sts::123456789012:assumed-role/lambda-role/prod-api-handler is not authorized to perform: dynamodb:GetItem on resource: arn:aws:dynamodb:us-east-2:123456789012:table/UserData”

IAM role policy check:

iam = boto3.client('iam')
policies = iam.list_attached_role_policies(RoleName='lambda-role')

Only AWSLambdaBasicExecutionRole attached
No DynamoDB permissions found

DynamoDB table verification:

dynamodb = boto3.client('dynamodb')
table = dynamodb.describe_table(TableName='UserData')

Table exists and is ACTIVE
No resource-level restrictions

📊 ROOT CAUSE

Lambda role lacks dynamodb:GetItem permission for the UserData table. This occurred after the recent IAM policy update that removed the overly permissive * resource access.

💥 IMPACT ASSESSMENT

Affected Resources: prod-api-handler Lambda function
Business Impact: API completely unavailable, all requests failing
Severity Level: Critical
Users Affected: All users (estimated 5,000+ active)

🔧 IMMEDIATE ACTIONS

Add DynamoDB permissions to Lambda role:

aws iam attach-role-policy \
--role-name lambda-role \
--policy-arn arn:aws:iam::aws:policy/AmazonDynamoDBReadOnlyAccess

Time estimate: 2 minutes 1.

Verify function recovery:

Monitor CloudWatch metrics for error rate drop
Test API endpoints manually Time estimate: 5 minutes

🛡️ PREVENTION MEASURES

Implement least-privilege IAM policies with explicit resource ARNs
Add pre-deployment IAM policy validation
Create Lambda function tests that verify DynamoDB access

📈 MONITORING RECOMMENDATIONS

Set alarm threshold to 5 errors (current: 10)
Add custom metric for DynamoDB throttling
Create dashboard showing Lambda errors by error type

Advanced Configuration

Resource Naming

Control resource names with prefixes and suffixes:

module "alarm_triage" {
source = "github.com/wayneworkman/terraform-aws-module-cloudwatch-alarm-triage"

resource_prefix = "prod"
resource_suffix = "us-east-2"
# Creates: prod-triage-handler-us-east-2

sns_topic_arn = aws_sns_topic.alarms.arn
}

Lambda Configuration

Adjust Lambda resources based on your needs (defaults shown):

module "alarm_triage" {
source = "github.com/wayneworkman/terraform-aws-module-cloudwatch-alarm-triage"

# Orchestrator Lambda configuration
lambda_timeout      = 900  # Default: 15 minutes (hard-coded maximum)
lambda_memory_size  = 512  # Default: 512 MB

# Tool Lambda configuration
tool_lambda_timeout              = 120  # Default: 2 minutes
tool_lambda_memory_size          = 512  # Default: 512 MB
tool_lambda_reserved_concurrency = -1   # Default: no limit

sns_topic_arn = aws_sns_topic.alarms.arn
}

Deduplication Window

Control how often the same alarm is investigated:

module "alarm_triage" {
source = "github.com/wayneworkman/terraform-aws-module-cloudwatch-alarm-triage"

investigation_window_hours = 4  # Only investigate same alarm every 4 hours

sns_topic_arn = aws_sns_topic.alarms.arn
}

S3 Reports Configuration

Configure the S3 bucket for storing investigation reports:

module "alarm_triage" {
source = "github.com/wayneworkman/terraform-aws-module-cloudwatch-alarm-triage"

# Optional: Configure S3 access logging
reports_bucket_logging = {
target_bucket = "my-logging-bucket"
target_prefix = "alarm-reports/"
}

# Optional: Auto-delete old reports after 90 days
reports_bucket_lifecycle_days = 90

sns_topic_arn = aws_sns_topic.alarms.arn
}

Security Considerations

IAM Permissions

Tool Lambda has read-only access to most AWS services
Explicit deny policies prevent access to sensitive data
No ability to modify resources or access secrets

Data Protection

No customer data is stored beyond the investigation window
DynamoDB entries auto-expire via TTL
All logs respect CloudWatch retention policies

Network Security

Lambdas run in AWS-managed VPC by default
Can be configured for VPC deployment if needed
All AWS API calls use TLS encryption

Compliance

GDPR-compliant (no PII processing)
SOC 2 compatible design
Audit trail via CloudWatch Logs and CloudTrail

Troubleshooting

Common Issues

Module fails to deploy with Bedrock model error

Bedrock models not available in your region
Solution: Deploy in us-east-2 or another supported region

Alarm doesn’t trigger triage Lambda

Check CloudWatch alarm configuration includes module.triage.triage_lambda_arn
Verify Lambda resource policy allows CloudWatch invocation
Check alarm state: aws cloudwatch describe-alarms --alarm-names "your-alarm"

No email notifications received

Confirm SNS subscription (check spam folder for confirmation)
Verify SNS topic ARN is correct in module configuration
Check orchestrator Lambda logs for SNS publish errors

Tool Lambda timing out

Increase tool_lambda_timeout (default: 60 seconds)
Check specific Python code execution in tool Lambda logs
Some operations (large log queries, pagination) may need more time

Investigation seems incomplete

Consider using a more capable model like Claude Opus
Check if tool Lambda is hitting memory limits
Increase lambda_timeout if investigations are timing out

Duplicate investigations occurring

Check investigation_window_hours setting
Verify DynamoDB table TTL is enabled
Look for multiple alarm evaluations in CloudWatch

Viewing Logs

Check CloudWatch Logs for debugging:

# Orchestrator Lambda logs
aws logs tail /aws/lambda/triage-handler --follow

# Tool Lambda logs
aws logs tail /aws/lambda/tool-lambda --follow

# Filter for specific alarm
aws logs filter-log-events \
--log-group-name /aws/lambda/triage-handler \
--filter-pattern "alarm-name"

Testing

Test the module with a manual alarm trigger:

# Manually put an alarm into ALARM state
aws cloudwatch set-alarm-state \
--alarm-name "test-alarm" \
--state-value ALARM \
--state-reason "Manual test"

Cost Optimization

Estimated Costs

Based on typical usage (100 alarms/month with default 512MB Lambda memory):

Bedrock: ~$3-10/month (depends on investigation complexity)
Lambda: <$1/month (optimized with 512MB default memory)
DynamoDB: <$1/month
CloudWatch Logs: <$1/month
S3: <$1/month (for report storage)
SNS: <$1/month

Cost Reduction Tips

Use cost-effective models - Nova Premier available as alternative for cost optimization
Increase deduplication window - Reduce duplicate investigations
Use reserved concurrency - Prevent runaway Lambda costs
Configure log retention - Reduce CloudWatch Logs storage
Optimize Lambda memory - Adjust based on actual usage
Set log_level to ERROR in production - Reduce CloudWatch Logs volume

Testing

The module includes comprehensive test coverage with 252 tests achieving 96% code coverage:

# Run all tests
python -m pytest tests/ -v

# Run unit tests only (245 tests)
python -m pytest tests/unit/ -v

# Run integration tests only (7 tests)
python -m pytest tests/integration/ -v

# Run with coverage report
python -m pytest tests/ --cov=lambda --cov=tool-lambda --cov-report=term-missing

Test Categories

Deduplication & Formatting: DynamoDB deduplication logic, notification formatting
Malformed Events: Edge cases, null values, invalid configurations
Performance & Load: Concurrency, memory/CPU stress, high-volume operations
Security Boundaries: IAM permissions, credential protection, injection prevention
Integration: End-to-end alarm investigation workflow

Contributing

Contributions are welcome! Please:

Fork the repository
Create a feature branch
Add tests for new functionality
Ensure all tests pass
Submit a pull request

License

MIT License - see LICENSE file for details

Support

For issues or questions:

Open an issue on GitHub
Contact via LinkedIn

Acknowledgments

Compatible with multiple AWS Bedrock models including Claude and Nova
Powered by AWS Bedrock
Terraform module best practices from HashiCorp

CloudWatch Alarm Triage with AWS Bedrock

Overview

CloudWatch Alarm Triage with AWS Bedrock

Overview

Key Features

Architecture

Two-Lambda Design

Workflow

Deduplication

Investigation Reports Storage

File Organization

Storage Features

Enhanced Visibility Data

Module Structure

Quick Start

1. Deploy the Module

2. Configure CloudWatch Alarms

3. Confirm Email Subscription

Module Inputs

Module Outputs

Investigation Capabilities

Security-First Import Control

Pre-Imported Modules Available

Core AWS & Data

Date & Time

Text & Pattern Matching

Data Structures & Algorithms

Network & Security

Math & Statistics

System & Utility

Type Hints & Data Classes

I/O Operations

Compression

Investigation Examples

Security Model

Example Investigation Output

🚨 EXECUTIVE SUMMARY

🔍 INVESTIGATION DETAILS

📊 ROOT CAUSE

💥 IMPACT ASSESSMENT

🔧 IMMEDIATE ACTIONS

🛡️ PREVENTION MEASURES

📈 MONITORING RECOMMENDATIONS

Advanced Configuration

Resource Naming

Lambda Configuration

Deduplication Window

S3 Reports Configuration

Security Considerations

IAM Permissions

Data Protection

Network Security

Compliance

Troubleshooting

Common Issues

Viewing Logs

Testing

Cost Optimization

Estimated Costs

Cost Reduction Tips

Testing

Test Categories

Contributing

License

Support

Acknowledgments

Similar Posts