AWS Intelligent Document Processing with Amazon Textract
A production-ready, serverless document processing system built with AWS AI services. Extract text, tables, and form data from documents automatically using Amazon Textract, Lambda, and DynamoDB.
Perfect for: learning AWS AI services, building a portfolio, preparing for AWS certifications, or implementing automated document workflows.
What This Project Does
- Automatic document processing - Upload PDFs or images to S3 and get structured data back
- AI-powered extraction - Uses Amazon Textract to extract text, tables, and form fields
- Serverless architecture - No servers to manage, scales automatically
- Complete infrastructure - One-command deployment using AWS SAM
- Production-ready - Includes error handling, monitoring, and notifications
- Cost-optimized - Pay only for what you use (about $16.65/month for 1,000 single-page documents)
Architecture
User Upload → S3 Bucket → Lambda (Textract) → DynamoDB → SNS Notification
AWS Services Used:
- Amazon S3 - Document storage with lifecycle policies
- AWS Lambda - Serverless compute for processing
- Amazon Textract - AI-powered document analysis
- Amazon DynamoDB - NoSQL database for extracted data
- Amazon SNS - Email notifications
- Amazon CloudWatch - Monitoring and alarms
- AWS X-Ray - Distributed tracing
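The flow above can be sketched as one function that takes its AWS clients as arguments. This is a minimal illustration, not the project's actual `app.py`: the function name `process_document`, the item attributes other than `document_id` and `upload_timestamp`, and the notification wording are all assumptions. The calls themselves (`analyze_document`, `put_item`, `publish`) are standard boto3 operations.

```python
from datetime import datetime, timezone


def process_document(bucket, key, textract, table, sns, topic_arn):
    """Run one S3 object through the Textract -> DynamoDB -> SNS steps."""
    # Ask Textract to analyze the object where it sits in S3.
    response = textract.analyze_document(
        Document={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=["TABLES", "FORMS"],
    )
    blocks = response.get("Blocks", [])

    # Hypothetical item layout: only the two key attribute names come from
    # this README's get-item example; the rest is illustrative.
    item = {
        "document_id": key.rsplit("/", 1)[-1],
        "upload_timestamp": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "source_bucket": bucket,
        "block_count": len(blocks),
    }
    table.put_item(Item=item)

    # Notify subscribers that the document is done.
    sns.publish(
        TopicArn=topic_arn,
        Subject="Document processed",
        Message=f"{key}: {len(blocks)} blocks extracted",
    )
    return item
```

Inside a Lambda function the arguments would typically be `boto3.client("textract")`, `boto3.resource("dynamodb").Table(table_name)`, and `boto3.client("sns")`. Passing them in keeps the function easy to test with stubs.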
Prerequisites
Before you begin, ensure you have:
- AWS Account - Create one here
- AWS CLI - Installation guide
- AWS SAM CLI - Installation guide
- Python 3.11+ - Download here
Configure AWS CLI
aws configure
# Enter your AWS Access Key ID
# Enter your AWS Secret Access Key
# Enter your default region (e.g., us-east-1)
# Enter your default output format (json)
Required IAM Permissions
Your IAM user needs the following AWS Managed Policies to deploy this application.
Recommended Approach: Use IAM User Groups
1. Create an IAM User Group (e.g., tania-builder-saml or sam-deployers)
- Go to IAM → User groups → Create group
- Give it a descriptive name
2. Attach the following managed policies to the group:
- AWSCloudFormationFullAccess - Create and manage CloudFormation stacks
- AWSLambda_FullAccess - Create and manage Lambda functions
- IAMFullAccess - Create IAM roles for Lambda execution
- AmazonS3FullAccess - Manage S3 buckets and objects
- AmazonDynamoDBFullAccess - Create and manage DynamoDB tables
- AmazonSNSFullAccess - Create and manage SNS topics
- AmazonSQSFullAccess - Create Dead Letter Queue
- CloudWatchLogsFullAccess - Create and manage log groups
- AWSXRayFullAccess - Enable X-Ray tracing
3. Add your IAM user to the group
- IAM → Users → [Your User] → Groups → Add user to groups
- Select the group you created
Why use groups? This is an AWS best practice: groups make it easier to manage permissions for multiple users and keep them consistent.
Alternative: You can attach these policies directly to your IAM user, but using groups is recommended for better permission management.
Quick Start
Option 1: Automated Setup (Recommended)
# Clone the repository
git clone https://github.com/Tetianamost/aws-intelligent-document-processing.git
cd aws-intelligent-document-processing
# Run the setup script
./setup.sh
The script will:
- Check all prerequisites
- Prompt for your email and configuration
- Build and deploy the application
- Display your S3 bucket name and usage commands
Option 2: Manual Setup
# 1. Build the application
sam build
# 2. Deploy with guided prompts
sam deploy --guided
# You'll be prompted for:
# - Stack name: e.g., doc-processing-dev
# - AWS Region: e.g., us-east-1
# - NotificationEmail: your email address
# - Confirm changes: Y
# - Allow SAM CLI IAM role creation: Y
# - Save arguments to config file: Y
3. Confirm SNS Email Subscription
Check your email and confirm the SNS subscription to receive notifications.
4. Get Your S3 Bucket Name
aws cloudformation describe-stacks \
--stack-name doc-processing-dev \
--query 'Stacks[0].Outputs[?OutputKey==`DocumentBucket`].OutputValue' \
--output text
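The `--query` expression above filters the stack's outputs by `OutputKey`. For scripts that already hold a `describe_stacks` response (for example from boto3), the same lookup can be sketched in Python. The helper name `get_stack_output` is hypothetical; the response shape (`Stacks`, `Outputs`, `OutputKey`, `OutputValue`) is the standard CloudFormation one.

```python
def get_stack_output(describe_stacks_response, output_key):
    """Return the OutputValue for output_key from the first stack, or None."""
    stacks = describe_stacks_response.get("Stacks", [])
    if not stacks:
        return None
    for output in stacks[0].get("Outputs", []):
        if output.get("OutputKey") == output_key:
            return output.get("OutputValue")
    return None


# Example with a hand-written response; the bucket name is made up.
sample = {
    "Stacks": [
        {
            "StackName": "doc-processing-dev",
            "Outputs": [
                {"OutputKey": "DocumentBucket", "OutputValue": "example-bucket-name"},
            ],
        }
    ]
}
bucket_name = get_stack_output(sample, "DocumentBucket")
```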
Sample Documents
The sample_documents/ directory contains example documents for testing:
- simple-invoice.png - Business invoice with itemized services and calculations (works perfectly)
- tax-form.png - IRS Form 1040 converted to PNG (works perfectly; 1,412 blocks extracted)
- sample-invoice.pdf - IRS Form 1040 as PDF (fails; PDF format incompatible)
- filled-invoice.pdf - PDF invoice generated with reportlab (fails; PDF format incompatible)
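The block count quoted for tax-form.png refers to the `Blocks` list in a Textract response. As a rough sketch of how such a response might be summarized, the helper below counts blocks by `BlockType` and joins the `LINE` text. The function name `summarize_blocks` is an assumption and is not taken from the project's `textract_parser.py`.

```python
from collections import Counter


def summarize_blocks(response):
    """Count Textract blocks by type and collect the text of LINE blocks."""
    blocks = response.get("Blocks", [])
    by_type = Counter(block.get("BlockType", "UNKNOWN") for block in blocks)
    lines = [
        block["Text"]
        for block in blocks
        if block.get("BlockType") == "LINE" and "Text" in block
    ]
    return {
        "total": len(blocks),
        "by_type": dict(by_type),
        "text": "\n".join(lines),
    }


# Tiny hand-written response for illustration only.
example = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Invoice #1001"},
        {"BlockType": "WORD", "Text": "Invoice"},
        {"BlockType": "WORD", "Text": "#1001"},
    ]
}
summary = summarize_blocks(example)
```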
Important: Document Format Compatibility
Textract works best with:
- PNG, JPG, TIFF images - Most reliable format
- Scanned PDFs - PDFs created from scanned documents
- Native PDFs - PDFs from tools like Adobe, Microsoft Office
May not work:
- PDFs generated by some Python libraries (reportlab, fpdf) - these may use unsupported internal formats
- Heavily compressed or encrypted PDFs
Recommendation: If you're programmatically generating documents for Textract, create them as PNG/JPG images rather than PDFs to ensure compatibility.
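One cheap pre-upload check is to look at a file's leading bytes and confirm it really is one of the formats listed above. The sketch below uses well-known file signatures; the name `sniff_format` is hypothetical. Passing this check does not guarantee that Textract will accept the file, since a PDF can start with a valid header and still use internals that Textract rejects.

```python
# Leading-byte signatures for the formats discussed above.
SIGNATURES = (
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"%PDF-", "pdf"),
    (b"II*\x00", "tiff"),  # little-endian TIFF
    (b"MM\x00*", "tiff"),  # big-endian TIFF
)


def sniff_format(data):
    """Return 'png', 'jpeg', 'pdf', or 'tiff' based on leading bytes, else None."""
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    return None


def sniff_file(path):
    """Read only the first few bytes of a file and sniff its format."""
    with open(path, "rb") as handle:
        return sniff_format(handle.read(8))
```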
Great document sources for testing:
- Your own receipts or invoices
- Bank statements
- Medical lab results
- Tax forms (W-2, 1040, etc.)
Usage Examples
Upload a Document for Processing
# Upload a PDF
aws s3 cp your-document.pdf s3://YOUR-BUCKET-NAME/incoming/
# Upload a PNG image
aws s3 cp your-image.png s3://YOUR-BUCKET-NAME/incoming/
What happens next:
- S3 triggers Lambda automatically
- Lambda calls Textract to analyze the document
- Extracted data is saved to DynamoDB
- Document is moved to the processed/ folder
- You receive an email notification
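Two details of these steps are easy to get wrong: object keys in S3 event notifications arrive URL-encoded, and the "move" is really a key rename from one prefix to another. The sketch below shows both under stated assumptions: the helper names are hypothetical, and the `incoming/` and `processed/` prefixes are taken from the commands in this README.

```python
from urllib.parse import unquote_plus


def parse_s3_record(record):
    """Extract (bucket, key) from one S3 event record, decoding the key."""
    bucket = record["s3"]["bucket"]["name"]
    # Keys in S3 events are URL-encoded, with spaces sent as '+'.
    key = unquote_plus(record["s3"]["object"]["key"])
    return bucket, key


def processed_key(key, src_prefix="incoming/", dst_prefix="processed/"):
    """Map an incoming/ key to its processed/ destination key."""
    if not key.startswith(src_prefix):
        raise ValueError(f"key is not under {src_prefix!r}: {key!r}")
    return dst_prefix + key[len(src_prefix):]


# Hand-written record containing only the fields used here.
record = {
    "s3": {
        "bucket": {"name": "example-bucket-name"},
        "object": {"key": "incoming/my+invoice%281%29.png"},
    }
}
bucket, key = parse_s3_record(record)
destination = processed_key(key)
```

S3 has no native move operation, so a handler would usually copy the object to the destination key and then delete the original (`copy_object` followed by `delete_object` in boto3).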
Query Extracted Data
# View up to 5 processed documents
aws dynamodb scan \
--table-name doc-processing-dev-documents \
--max-items 5
# Get one item by its full key (document ID and upload timestamp)
aws dynamodb get-item \
--table-name doc-processing-dev-documents \
--key '{"document_id": {"S": "your-doc-id"}, "upload_timestamp": {"S": "2024-01-15T10:30:00"}}'
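The CLI reads and returns items in DynamoDB's typed JSON form, where every value is wrapped in a type tag such as `S` or `N`. The sketch below builds the key used above and converts a typed item back into plain Python values. It handles only a subset of the type tags, and both helper names are hypothetical. boto3 ships a fuller `TypeDeserializer` in `boto3.dynamodb.types` if the SDK is available.

```python
import json
from decimal import Decimal


def build_key(document_id, upload_timestamp):
    """Build the typed key for the documents table used in this README."""
    return {
        "document_id": {"S": document_id},
        "upload_timestamp": {"S": upload_timestamp},
    }


def unmarshal(value):
    """Convert one DynamoDB typed value (S, N, BOOL, NULL, L, M) to Python."""
    tag, inner = next(iter(value.items()))
    if tag == "S":
        return inner
    if tag == "N":
        return Decimal(inner)  # numbers arrive as strings
    if tag == "BOOL":
        return inner
    if tag == "NULL":
        return None
    if tag == "L":
        return [unmarshal(element) for element in inner]
    if tag == "M":
        return {name: unmarshal(element) for name, element in inner.items()}
    raise ValueError(f"unsupported DynamoDB type tag: {tag}")


# JSON string suitable for the --key argument of get-item.
key_json = json.dumps(build_key("your-doc-id", "2024-01-15T10:30:00"))
```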
View Logs
# View Lambda logs
sam logs --stack-name doc-processing-dev --tail
# View specific function logs
aws logs tail /aws/lambda/doc-processing-dev-processor --follow
Project Structure
aws-intelligent-document-processing/
├── README.md                       # This file
├── template.yaml                   # SAM/CloudFormation template
├── setup.sh                        # Automated setup script
├── .gitignore                      # Git ignore rules
│
├── src/
│   └── document_processor/
│       ├── app.py                  # Main Lambda handler
│       ├── requirements.txt        # Python dependencies
│       └── utils/
│           ├── __init__.py
│           ├── textract_parser.py  # Textract response parser
│           └── dynamo_handler.py   # DynamoDB operations
│
├── sample_documents/               # Test documents
├── tests/                          # Unit and integration tests
└── docs/                           # Additional documentation
Cost Estimation
Costs for processing 1,000 documents per month in us-east-1:
| Service | Usage | Monthly Cost |
|---|---|---|
| Amazon S3 | 10GB storage + 1K PUT/GET | $0.25 |
| AWS Lambda | 1K invocations × 30s × 512MB | $0.10 |
| Amazon Textract | 1K pages analyzed | $15.00 |
| DynamoDB | 1K writes + 5K reads | $0.30 |
| Amazon SNS | 1K notifications | $0.50 |
| CloudWatch Logs | 1GB logs | $0.50 |
| TOTAL | | ~$16.65 |
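The total is the sum of the line items, and Textract accounts for most of it. The arithmetic can be checked with a short script that uses only the figures from the table; the variable and function names are illustrative.

```python
from decimal import Decimal

# Monthly line items copied from the table above (USD, 1,000 documents).
MONTHLY_COSTS = {
    "Amazon S3": Decimal("0.25"),
    "AWS Lambda": Decimal("0.10"),
    "Amazon Textract": Decimal("15.00"),
    "DynamoDB": Decimal("0.30"),
    "Amazon SNS": Decimal("0.50"),
    "CloudWatch Logs": Decimal("0.50"),
}


def monthly_total(costs):
    """Sum all line items."""
    return sum(costs.values(), Decimal("0"))


def cost_per_document(costs, documents):
    """Average cost per document at the given monthly volume."""
    return monthly_total(costs) / Decimal(documents)


def share_of_total(costs, service):
    """Fraction of the monthly total attributable to one service."""
    return costs[service] / monthly_total(costs)


total = monthly_total(MONTHLY_COSTS)
print(total)  # 16.65
```

At this volume that works out to under two cents per document, with Textract making up roughly 90% of the bill.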
Free Tier Benefits:
- Lambda: First 1M requests/month free
- DynamoDB: 25GB storage + 25 RCU/WCU free
- S3: First 5GB free
Troubleshooting
Issue: SAM build fails
# Make sure you have Python 3.11+
python3 --version
# Install SAM CLI
pip install aws-sam-cli
# Try building with verbose output
sam build --use-container --debug
Issue: Lambda timeout errors
Increase timeout in template.yaml:
Globals:
  Function:
    Timeout: 600  # 10 minutes
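Raising the timeout helps, but a handler can also stop cleanly before it runs out of time. The sketch below relies on the Lambda context method `get_remaining_time_in_millis()`. The function name `process_batch`, the `handle` callback, and the 10-second safety margin are assumptions for illustration.

```python
def process_batch(records, context, handle, safety_margin_ms=10_000):
    """Handle records until time runs low; return the ones left unprocessed."""
    remaining = list(records)
    while remaining:
        # Stop early rather than letting Lambda kill the invocation mid-record.
        if context.get_remaining_time_in_millis() <= safety_margin_ms:
            break
        handle(remaining.pop(0))
    return remaining
```

Any records returned here could be logged or re-queued by the caller; which of those fits depends on how the function is triggered.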
Issue: "Access Denied" errors
# Verify IAM permissions
aws sts get-caller-identity
# Ensure your AWS credentials have necessary permissions
Issue: Not receiving SNS notifications
- Check spam folder
- Confirm SNS subscription in AWS Console
- Verify that the email address in template.yaml is correct
Contributing
Contributions are welcome! Here's how you can help:
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
Learning Resources
- Amazon Textract Documentation
- AWS SAM Documentation
- AWS Lambda Best Practices
- DynamoDB Best Practices
License
This project is licensed under the MIT License.
Contact
- GitHub Issues: Report bugs or request features
- Discussions: Ask questions and share ideas
Found this helpful? Star the repo and share with others learning AWS!
Found a bug? Open an issue and let's fix it together!