In Part 1, we built a correct and production-oriented foundation for the transactional outbox pattern using Go, PostgreSQL, and RabbitMQ. We ensured that domain state changes and event creation were atomic, published events asynchronously via a poll-based outbox worker, and achieved effectively exactly-once processing on the consumer side by making handlers idempotent.
That baseline system already solves the hardest problem: correctness under partial failure.
But correctness alone isn’t enough in real-world systems.
Once this pattern hits production, new questions immediately surface:
- What happens when publishing to RabbitMQ fails repeatedly?
- How do we retry safely without flooding the broker or duplicating work?
- How do consumers handle poison messages?
- When should events be retried, and when should they be sent to a dead-letter queue?
- How do we observe failures early, before downstream systems silently degrade?
- At higher scale, is polling still the right choice, or should CDC-based approaches be considered?
This article answers those questions.
In Part 2, we extend the system from Part 1 with failure-aware behavior and operational guardrails, without changing the core model. The transactional outbox remains the source of truth; we simply make it resilient, observable, and easier to operate under load.
Specifically, we’ll cover:
- Retry strategies for the outbox publisher with exponential backoff
- Dead Letter Queues (DLQs) for events that cannot be published
- Consumer-side retries and dead-letter exchanges
- Operational metrics and dashboards that surface failure modes early
- A practical discussion of polling vs. CDC-based outbox implementations (Debezium + Kafka), including trade-offs and when each approach makes sense
This is not a rewrite of the system from Part 1. Instead, it’s an evolution: the kind that happens after the first real incident.
By the end of this article, you’ll understand how to take a correct outbox implementation and turn it into one that’s boring to operate, even when things go wrong.
Table of Contents
- Failure as a First-Class Design Constraint
- Retrying Outbox Publishing (Producer-Side Retries)
- Moving permanently failed events to a DLQ
- Consumer-Side Retries Using Dead-Letter Exchanges
- What To Do With Messages in the DLQ
- Retry & Failure Metrics
- Tracing & Outcome Visibility
- CDC-Based Outbox as an Alternative
- Conclusion