Transactional Outbox with RabbitMQ (Part 2): Handling Retries, Dead-Letter Queues, and Observability

In Part 1, we built a correct, production-oriented foundation for the transactional outbox pattern using Go, PostgreSQL, and RabbitMQ. We made domain state changes and event creation atomic, published events asynchronously via a poll-based outbox worker, and achieved effectively-once processing on the consumer side by making handlers idempotent (delivery itself remains at-least-once).

That baseline system already solves the hardest problem: correctness under partial failure.

But correctness alone isn’t enough in real-world systems.

Once this pattern hits production, new questions immediately surface:

What happens when publishing to RabbitMQ fails repeatedly?
How do we retry safely without flooding the broker or duplicating work?
How do consumers handle poison messages?
When should events be retried, and when should they be sent to a dead-letter queue?
How do we observe failures early, before downstream systems silently degrade?
At higher scale, is polling still the right choice, or should CDC-based approaches be considered?

This article answers those questions.

In Part 2, we extend the system from Part 1 with failure-aware behavior and operational guardrails, without changing the core model. The transactional outbox remains the source of truth; we simply make it resilient, observable, and easier to operate under load.

Specifically, we’ll cover:

Retry strategies for the outbox publisher with exponential backoff
Dead-letter queues (DLQs) for events that cannot be published
Consumer-side retries and dead-letter exchanges
Operational metrics and dashboards that surface failure modes early
A practical discussion of polling vs. CDC-based outbox implementations (Debezium + Kafka), including trade-offs and when each approach makes sense

This is not a rewrite of the system from Part 1. Instead, it’s an evolution, the kind that happens after the first real incident.

By the end of this article, you’ll understand how to take a correct outbox implementation and turn it into one that’s boring to operate, even when things go wrong.

Table of Contents

Failure as a First-Class Design Constraint
Retrying Outbox Publishing (Producer-Side Retries)
Moving Permanently Failed Events to a DLQ
Consumer-Side Retries Using Dead-Letter Exchanges
What To Do With Messages in the DLQ
Retry & Failure Metrics
Tracing & Outcome Visibility
CDC-Based Outbox as an Alternative
Conclusion
