TL;DR Treat CloudEvents as the envelope for versioned, traceable aggregate snapshots. Use the CloudEvents contract for transport and a versioned aggregate schema in data to make events reliable data products for analytics, ML, and agentic workloads.
This post builds on the Event as a Data Product pattern I coined. In short: treat domain aggregates (entities and value objects) as the canonical interface for your service, and expose those aggregates through APIs, events, and data products. That reduces impedance between channels and makes data easier to reuse.
Here I unpack why CloudEvents fits event-driven data products, practical guidance for implementing it, and how this pattern helps with LLMs and agentic workflows.
Related posts:
- Kappa domain model to data mesh: https://codesimple.blog/2022/03/06/kappa-domain-model-to-data-mesh/
- Events on the outside vs events on the inside: https://codesimple.blog/2021/03/14/events-on-the-outside-vs-events-on-the-inside/
- Events, fat or thin: https://codesimple.blog/2019/02/16/events-fat-or-thin/
CloudEvents spec: https://github.com/cloudevents/spec/blob/v1.0.2/cloudevents/spec.md
Why events
The problem with traditional data products
A common practice is to expose database extracts as data products, often treating them as source-aligned data products and building target-aligned layers on top. This is problematic because database structures may not directly reflect the domain model: they could be event-sourced logs, heavily normalized, or optimized for other concerns, which makes target-aligned views hard to replicate in SQL, and even where it is possible the domain logic ends up duplicated across languages. Using operational databases directly as data products is a poor practice: it creates tight coupling between consumers and internal schemas, leading to breakage when databases evolve, and it violates the “data on the outside” principle, where external interfaces should be independent of internal representations.
Events to the rescue
Events (we’ll use this term for business or domain events) are immutable records of significant occurrences in your domain, such as a customer placing an order or an insurance claim being approved. Unlike commands (which are requests to perform an action), events are statements of fact about what has already happened—they describe outcomes, not intentions. (See Events or Commands for the distinction.)
Events capture not just state but the why, when and who of a change. When you apply the “data on the outside” principle, events become a stream of point-in-time, versioned snapshots of your aggregates. That makes them a natural input for data products: they are timely, auditable, and carry contextual metadata.
Key benefits
- Traceability: consumers can follow an entity’s history and reconstruct state at any point in time.
- Loose coupling: event contracts (schemas) decouple producers and consumers, letting teams evolve independently.
- Rich context: events naturally carry reason, source, and causality information that helps downstream reasoning.
- Real-time reactivity and batch processing: Events enable real-time processing for immediate reaction to change, while the same stream, persisted as a data product, supports batch analytics and AI workflows. For consumers (services or AI agents), one schema serves both the real-time and batch sources, enabling consistent reasoning.
- Semantic data foundation: Events provide contextual, semantic-rich data that helps build semantic pipelines for AI. Semantics (meaning, intent, relationships) are more appropriate for AI reasoning than syntax alone (structure without context), as they offer deeper insights for decision-making and understanding. For example, events explain “why” something happened (e.g., a customer order failed due to payment issues), not just “what” (e.g., order status changed).
Data products built from events will increasingly be consumed by LLMs and AI agents. Context has many aspects, and one crucial one is the path we took to reach the current state of an entity. Simply keeping the current state without this historical context is lossy—it loses important data that enables better reasoning and decision-making.
It also reminds me of a brief discussion on X/Twitter with thought leaders Vaughn Vernon (author of Implementing Domain-Driven Design) and Greg Young (creator of CQRS & Event Sourcing), where I argued that certain events should capture full context at the moment they occur to preserve temporal accuracy for correct business outcomes, instead of pure data changes only. This insight is even more relevant today with AI and LLM flows.
This pattern has been validated through industry partnerships. For instance, I partnered with Google to architect and present a Kappa architecture implementation for building data products using Google’s eventing systems, following the “CloudEvents as a Data Product” pattern I introduced. I co-presented this at Google Cloud Next, one of the industry’s premier technical conferences, where I demonstrated a working implementation that could be deployed in under 5 minutes.
CloudEvents as the solution
CloudEvents enables this event-driven data product paradigm by providing a standardized, versioned envelope for events, decoupling producers and consumers while enabling reliable, queryable streams of aggregate snapshots.
Why CloudEvents?
CloudEvents is a focused, well-adopted specification for describing event envelopes. It provides:
- A small, consistent envelope (id, type, source, subject, time, datacontenttype, data, etc.).
- SDKs and tooling across many languages, lowering friction for adoption.
- Extensibility hooks so teams can add domain-specific metadata without breaking the common contract.
These properties make CloudEvents a solid basis for a data-product-grade event contract: predictable, discoverable, and toolable.
What CloudEvents are (short): CloudEvents is a vendor-neutral event envelope spec that standardises a small set of context attributes (the event “context”) and leaves the domain payload in data. It defines REQUIRED attributes (id, source, specversion, type), OPTIONAL attributes (datacontenttype, dataschema, subject, time), and allows extension attributes for domain or infrastructure metadata. (See the spec: https://github.com/cloudevents/spec/blob/v1.0.2/cloudevents/spec.md)
Example
The following example uses binary binding, which carries the CloudEvents attributes as headers and the data in the body, so metadata is accessible without unpacking a potentially encrypted payload.
Headers:
ce-specversion: 1.0
ce-id: b7f8e0c4-3c2a-4f2f-9c7a-1e6f4e2a7f8a
ce-source: /services/customer
ce-type: com.example.customer.updated.v1
ce-subject: customer-123
ce-time: 2025-11-01T12:34:56Z
ce-datacontenttype: application/json
ce-dataschema: https://schemas.example.com/customer/v1.0.json
ce-producer: customer-service
ce-team: billing
Body:
{
"schemaVersion": "v1",
"aggregate": {
"id": "customer-123",
"name": "Alice Example",
"status": "active",
"lastOrder": 9876
}
}
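If you use one of the official CloudEvents SDKs you don't have to hand-roll these headers. Here is a minimal sketch with the Python SDK (the cloudevents package), assuming an HTTP transport; the attribute values mirror the example above:

# Minimal sketch using the CloudEvents Python SDK (pip install cloudevents).
# to_binary() yields the ce-* headers and the serialized body shown above.
from cloudevents.http import CloudEvent, to_binary

attributes = {
    "type": "com.example.customer.updated.v1",
    "source": "/services/customer",
    "subject": "customer-123",
    "dataschema": "https://schemas.example.com/customer/v1.0.json",
    # extension attributes travel alongside the spec-defined ones
    "producer": "customer-service",
    "team": "billing",
}
data = {
    "schemaVersion": "v1",
    "aggregate": {"id": "customer-123", "name": "Alice Example", "status": "active", "lastOrder": 9876},
}

event = CloudEvent(attributes, data)  # id and time are generated when not supplied
headers, body = to_binary(event)      # POST body with these headers to your broker or endpoint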
CloudEvents as a queryable data product
CloudEvents can act as a primary data product: a log of versioned, point-in-time snapshots of your aggregates. When you publish aggregate snapshots (or well-defined deltas with rehydration rules) into a topic and persist them into a data warehouse (e.g. BigQuery), you get a queryable event log that can answer state-at-time and reconstruction queries.
CloudEvents attributes play a key role in filtering and routing events efficiently, allowing consumers to subscribe to specific event types, sources, or subjects without needing to inspect the payload.
- The event log is a time-ordered sequence of versioned snapshots. Each event carries subject (the aggregate id), time, type, dataschema, and data (the snapshot).
- To find the latest state of an aggregate, query by subject and take the latest event by time (or use offsets/sequences if available).
- To compute a point-in-time snapshot, filter events by time <= T and pick the latest per subject as of T.
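The same latest-per-subject logic applies when events are consumed from a topic rather than a warehouse. A small illustrative sketch in plain Python, assuming each event is a dict carrying the subject, time, and data attributes:

# Illustrative: reduce an event stream to per-aggregate snapshots in memory.
from datetime import datetime, timezone

events = [
    {"subject": "cart-123", "time": "2025-11-01T10:00:00Z", "data": {"total": 59.97}},
    {"subject": "cart-123", "time": "2025-11-01T11:30:00Z", "data": {"total": 79.97}},
]

def latest_snapshots(events, as_of=None):
    """Latest data per subject, optionally as of a point in time (as_of is a tz-aware datetime)."""
    snapshots = {}
    for event in sorted(events, key=lambda e: e["time"]):
        occurred = datetime.fromisoformat(event["time"].replace("Z", "+00:00"))
        if as_of is not None and occurred > as_of:
            continue  # ignore events after the requested point in time
        snapshots[event["subject"]] = event["data"]  # later events overwrite earlier ones
    return snapshots

print(latest_snapshots(events))  # current state per cart
print(latest_snapshots(events, as_of=datetime(2025, 11, 1, 10, 30, tzinfo=timezone.utc)))  # state as of 10:30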
Best practices and guidance
Below are practical recommendations for implementing CloudEvents as data products and for making events more useful to LLMs and agentic workflows, based on my experiences.
Attributes
- id: Unique identity of the event, used for deduplication and non-repudiation. It lets consumers process events idempotently and provides an immutable reference for auditing.
- source: Identifies the producing system. Per Conway’s Law, system boundaries often reflect team structures, so the solution space (systems) may not align perfectly with the problem space (domains). Use a consistent naming convention to map systems to domains where possible.
- type: Conveys what changed and why.
  - Syntax: com.<organisation>.<sub-domain>.<domain service>.<version>.<aggregate>.<event>
  - Example: com.example.sales.shoppingcart.v1.cart.updated
  - Alternative pattern: org.domain[.sub-domain].domain-service.aggregate.event
  - Examples:
    - com.example.insurance.claims.claim.ClaimCreated.v1
    - org.acme.billing.customer.CustomerUpdated.v1
  - Guidelines: use a reverse-DNS prefix to avoid collisions and include both the aggregate name and the domain event. Remember that events reflect occurrences: prefer naming that expresses the domain-level occurrence and ensure the event corresponds to an aggregate-level change.
  - Versioning: include a major-version marker in the type for breaking changes (e.g., com.example.cart.cart.CartCreated.v1). Use dataschema for minor revisions (e.g., https://schemas.example.com/cart/v1.2.json). Prefer major versions in type and minor versions in dataschema. (A small naming-helper sketch follows this list.)
- subject: Use the entity id of the aggregate root. The subject attribute is ideal for subscription filtering and fast routing when intermediaries cannot inspect data, and it acts as the primary key for entity-level queries, enabling efficient partitioning and indexing in data warehouses.
- time: When the event occurred. Critical for non-repudiation, auditability, and handling out-of-order events on the consumer side. For data products, it sequences events chronologically for accurate state reconstruction.
- data: Use a complete snapshot of the aggregate owned by the subdomain. Avoid including data not mastered by your domain; you won’t publish events when that external data changes, which undermines quality and accuracy. Ensure the payload is versioned and schema-compliant for reliable querying and AI consumption.
- Extensions (e.g., traceparent, tracestate): Include distributed tracing context for end-to-end observability. traceparent and tracestate follow the W3C Trace Context standard, allowing correlation across services. Additional extensions such as partitionkey can optimize message routing, and sequence supports ordered processing within a partition.
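To keep type values consistent across teams, a small helper can build and validate them. The following is a sketch for the alternative pattern above; the function and the regex are illustrative, not part of any SDK:

import re

# Illustrative helper for org.domain[.sub-domain].domain-service.aggregate.event plus a major version,
# e.g. com.example.insurance.claims.claim.ClaimCreated.v1
TYPE_PATTERN = re.compile(r"^[a-z0-9]+(\.[a-z0-9-]+)+\.[A-Za-z]+\.v\d+$")

def build_event_type(org: str, domain: str, service: str, aggregate: str, event: str, major: int) -> str:
    event_type = f"{org}.{domain}.{service}.{aggregate}.{event}.v{major}"
    if not TYPE_PATTERN.match(event_type):
        raise ValueError(f"type does not follow the naming convention: {event_type}")
    return event_type

print(build_event_type("com.example", "insurance", "claims", "claim", "ClaimCreated", 1))
# -> com.example.insurance.claims.claim.ClaimCreated.v1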
Distributed tracing
Link events to traces and spans. Add W3C-compatible tracing fields (for example traceparent and tracestate) as CloudEvents extension attributes or use a documented tracing extension. This allows you to correlate an event to a trace and surface richer execution context to agents or observability tools. Keep the tracing footprint small (ids only) and avoid embedding full trace dumps in events.
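With the CloudEvents Python SDK this amounts to adding the W3C fields as extension attributes; a minimal sketch, with illustrative values that would normally come from your tracing library:

from cloudevents.http import CloudEvent

# Tracing context as CloudEvents extension attributes: ids only, no span payloads.
attributes = {
    "type": "com.example.auth.Authenticated.v1",
    "source": "/services/auth",
    "subject": "user-456",
    "traceparent": "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01",
    "tracestate": "vendorname=opaque-value",
}
event = CloudEvent(attributes, {"sessionId": "sess-123"})
# Observability tooling can later join this event to its trace via the traceparent id.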
Data payload (aggregate events)
Follow the aggregate event model discussed in the “fat vs thin” post. Prefer publishing an aggregate snapshot (versioned) in data when the aggregate is the source of truth for the change. If you choose thin events (deltas), make sure their schema and rehydration rules are well documented. Important: avoid referencing data that your domain service does not master — if you publish references to other domains’ data that you don’t own, consumers may assume you will publish updates for those referenced fields (but you won’t), which creates confusion and brittle integrations.
Binding mode
Prefer binary binding over structured binding. Binary binding keeps the data payload opaque (e.g., encrypted), making all metadata (context attributes) accessible for routing and filtering without unpacking. This aligns with the “need to know” principle and facilitates least privilege access, as intermediaries can route based on envelope fields without decrypting sensitive data.
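Because binary binding keeps everything needed for routing in the headers, an intermediary can make decisions without ever touching the body. An illustrative sketch; the routing rules and destination names are hypothetical:

# Route on envelope headers only; the (possibly encrypted) body is never inspected.
def route(headers: dict) -> str:
    event_type = headers.get("ce-type", "")
    subject = headers.get("ce-subject", "")
    if event_type.startswith("com.example.cart."):
        return f"cart-events/{subject}"
    if event_type.startswith("com.example.auth."):
        return "auth-events"
    return "dead-letter"

print(route({"ce-type": "com.example.cart.CartUpdated.v1", "ce-subject": "cart-123"}))
# -> cart-events/cart-123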
Event semantics
An event should represent something that actually occurred in the domain and should map to a change at the aggregate level (the aggregate root). If you find yourself emitting events about loosely-related or derived entities frequently, consider whether the aggregate boundaries are correct or whether a curated/target-aligned product is a better interface.
Reliable event publishing
Ensure event publishing is reliable, as data quality is critical for data products. Consider these patterns:
Outbox Pattern
Store events in an outbox table within the same transaction as the business change, then use a separate process to publish them. This ensures events are persisted reliably and prevents loss during failures. For example, in a shopping cart service the outbox table has a column for each CloudEvents core attribute, with extensions stored as a JSON blob:
BEGIN TRANSACTION;
-- Business logic: Update the cart aggregate
UPDATE carts SET total = total + 10 WHERE id = 'cart-123';
-- Store the event in the outbox with CloudEvent attributes
INSERT INTO outbox (
id, source, specversion, type, datacontenttype, dataschema, subject, time, data, extensions
) VALUES (
'evt-456',
'/services/cart',
'1.0',
'com.example.cart.CartUpdated.v1',
'application/json',
'https://schemas.example.com/cart/v1.0.json',
'cart-123',
CURRENT_TIMESTAMP,
'{"aggregate": {"id": "cart-123", "total": 50}}',
'{"traceparent": "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01", "tracestate": "vendorname=opaque-value"}'
);
COMMIT;
A background process then reads from the outbox table, publishes the events to the message broker, and marks them as published.
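The relay can be a simple polling loop. A minimal sketch, assuming a publish(topic, headers, body) function for your broker, the outbox columns above, and a hypothetical published_at column used to mark progress:

import json

def relay_outbox(conn, publish, batch_size=100):
    """Read unpublished events, publish them, then mark them as published (at-least-once)."""
    rows = conn.execute(
        "SELECT id, source, specversion, type, datacontenttype, subject, time, data, extensions "
        "FROM outbox WHERE published_at IS NULL ORDER BY time LIMIT ?",  # '?' placeholders: DB-API/sqlite style
        (batch_size,),
    ).fetchall()
    for event_id, source, specversion, type_, contenttype, subject, occurred, data, extensions in rows:
        headers = {
            "ce-id": event_id, "ce-source": source, "ce-specversion": specversion,
            "ce-type": type_, "ce-subject": subject, "ce-time": str(occurred),
            "content-type": contenttype, **json.loads(extensions or "{}"),
        }
        publish("cart-events", headers, data)  # duplicates are tolerated; consumers dedupe on ce-id
        conn.execute("UPDATE outbox SET published_at = CURRENT_TIMESTAMP WHERE id = ?", (event_id,))
    conn.commit()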
Durable execution engines
Use frameworks like Temporal for event publishing workflows, providing built-in retries, timeouts, and state management to handle complex, reliable executions.
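As a sketch of what that looks like with Temporal's Python SDK (the activity body and names are illustrative; retries and timeouts come from the policy on the workflow):

from datetime import timedelta
from temporalio import activity, workflow
from temporalio.common import RetryPolicy

@activity.defn
async def publish_cloudevent(event: dict) -> None:
    ...  # hand the CloudEvent to your broker client here

@workflow.defn
class PublishEventWorkflow:
    @workflow.run
    async def run(self, event: dict) -> None:
        # Temporal persists workflow state and retries the activity on failure.
        await workflow.execute_activity(
            publish_cloudevent,
            event,
            start_to_close_timeout=timedelta(seconds=30),
            retry_policy=RetryPolicy(maximum_attempts=5),
        )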
End-to-End Picture
The following diagram illustrates the end-to-end flow outlined in the guidance above.
Shopping cart example
Assume you persist CloudEvents into a BigQuery table, referenced in the queries below as project.dataset.events_cart_events. Here is a sample CloudEvent in binary binding (attributes in headers, data in the body):
Headers:
ce-specversion: 1.0
ce-id: cart-update-456
ce-source: /services/cart
ce-type: com.example.cart.CartUpdated.v1
ce-subject: cart-123
ce-time: 2025-11-01T10:00:00Z
ce-datacontenttype: application/json
ce-dataschema: https://schemas.example.com/cart/v1.0.json
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
tracestate: vendorname=opaque-value
Body:
{
"aggregate": {
"id": "cart-123",
"items": [ { "sku": "sku-1", "qty": 2 }, { "sku": "sku-2", "qty": 1 } ],
"total": 59.97,
"currency": "USD"
}
}
Get the latest snapshot of every cart (BigQuery):
SELECT t.subject, t.data.aggregate as latest
FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY subject ORDER BY time DESC) as rn
FROM `project.dataset.events_cart_events`
) t
WHERE t.rn = 1;
Get the snapshot of a single cart (cart-123) as of a point in time TIMESTAMP('2025-10-01T12:00:00Z'):
SELECT data.aggregate
FROM `project.dataset.events_cart_events`
WHERE subject = 'cart-123'
AND time <= TIMESTAMP('2025-10-01T12:00:00Z')
ORDER BY time DESC
LIMIT 1;
Filter events by type (e.g., only ‘CartUpdated’ events) for targeted analysis:
SELECT * FROM `project.dataset.events_cart_events`
WHERE type = 'com.example.cart.CartUpdated.v1';
These queries assume events land in a single table with data stored as a JSON column. For large scale, consider partitioning by date and sharding by domain or source.
Login system example
Consider a login system publishing CloudEvents for authentication flows. Multiple events in a single login session share the same trace ID, allowing correlation across services.
Sample CloudEvents in binary binding:
Login attempt event:
Headers:
ce-specversion: 1.0
ce-id: login-attempt-789
ce-source: /services/auth
ce-type: com.example.auth.LoginAttempt.v1
ce-subject: user-456
ce-time: 2025-11-01T10:00:05Z
ce-datacontenttype: application/json
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
Body:
{
"username": "john.doe",
"ip": "192.168.1.1"
}
Authentication success event:
Headers:
ce-specversion: 1.0
ce-id: auth-success-790
ce-source: /services/auth
ce-type: com.example.auth.Authenticated.v1
ce-subject: user-456
ce-time: 2025-11-01T10:00:10Z
ce-datacontenttype: application/json
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
Body:
{
"sessionId": "sess-123",
"token": "jwt-token-here"
}
To associate authentication information with a login session, query events by trace ID and extract relevant data:
SELECT
type,
JSON_EXTRACT(data, '$.username') as username,
JSON_EXTRACT(data, '$.ip') as ip,
JSON_EXTRACT(data, '$.sessionId') as session_id,
JSON_EXTRACT(data, '$.token') as token
FROM `project.dataset.events_auth_events`
WHERE JSON_EXTRACT(extensions, '$.traceparent') = '00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01'
ORDER BY time ASC;
This query correlates all events in the authentication flow, associating the login attempt details with the successful authentication outcome.
To find cart updates associated with a login session, join cart events with authentication events using the shared trace ID:
SELECT
cart.subject as cart_id,
cart.time as cart_update_time,
JSON_EXTRACT(cart.data, '$.total') as cart_total,
auth.type as auth_event_type,
JSON_EXTRACT(auth.data, '$.username') as username,
JSON_EXTRACT(auth.data, '$.ip') as login_ip,
JSON_EXTRACT(auth.data, '$.sessionId') as session_id
FROM `project.dataset.events_cart_events` cart
JOIN `project.dataset.events_auth_events` auth
ON JSON_EXTRACT(cart.extensions, '$.traceparent') = JSON_EXTRACT(auth.extensions, '$.traceparent')
WHERE cart.type = 'com.example.cart.CartUpdated.v1'
AND auth.type = 'com.example.auth.Authenticated.v1'
ORDER BY cart.time ASC;
This join reveals which user (via username and IP) performed cart updates in a session, enabling audit trails and user behavior analysis across domains.
Design notes and implementation guidance
1) Schema versioning and registry
Keep your aggregate schema versioned (see schemaVersion above). Publish schemas to a registry or catalog so consumers can discover and validate payloads. Versioning lets producers evolve shape while giving consumers time to migrate.
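For the cart example, the document behind https://schemas.example.com/cart/v1.0.json might look like this (a hypothetical JSON Schema matching the sample payload above):

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://schemas.example.com/cart/v1.0.json",
  "title": "Cart aggregate snapshot",
  "type": "object",
  "required": ["aggregate"],
  "properties": {
    "aggregate": {
      "type": "object",
      "required": ["id", "items", "total", "currency"],
      "properties": {
        "id": { "type": "string" },
        "items": {
          "type": "array",
          "items": {
            "type": "object",
            "required": ["sku", "qty"],
            "properties": { "sku": { "type": "string" }, "qty": { "type": "integer" } }
          }
        },
        "total": { "type": "number" },
        "currency": { "type": "string" }
      }
    }
  }
}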
2) Idempotency and deduplication
Use the id and a durable offset/sequence in your consumer logic to ensure idempotent processing. CloudEvents provides the envelope fields required for reliable processing.
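A minimal consumer-side sketch of idempotent handling; the processed-id store and the projection function are illustrative (in production the store would be a durable table keyed by event id):

processed_ids: set[str] = set()  # in production: a durable store keyed by the CloudEvents id

def apply_to_read_model(body: dict) -> None:
    ...  # your projection / business logic

def handle_event(headers: dict, body: dict) -> None:
    event_id = headers["ce-id"]
    if event_id in processed_ids:
        return  # duplicate delivery: already processed, safe to ignore
    apply_to_read_model(body)
    processed_ids.add(event_id)  # record only after successful processing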
3) Contract as first-class artifact
Treat the CloudEvent envelope + data schema as the contract for a data product. Publish the contract in documentation, automated tests, and CI checks. Provide client libraries or templates to generate events consistently where useful.
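A contract check in CI can be as small as validating sample events against the published schema. A sketch using the jsonschema library; the file paths are illustrative:

import json
from jsonschema import validate, ValidationError  # pip install jsonschema

# Validate a sample event payload against the published dataschema as part of CI.
with open("schemas/cart/v1.0.json") as f:
    schema = json.load(f)
with open("samples/cart_updated.json") as f:
    sample = json.load(f)

try:
    validate(instance=sample["data"], schema=schema)
except ValidationError as err:
    raise SystemExit(f"Contract check failed: {err.message}")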
4) Observability and lineage
Emit metadata (traces, producer and pipeline ids) so downstream systems can attribute, debug, and compute lineage. This is essential for quality scoring and governance.
5) Tooling and ergonomics
Use CloudEvents SDKs and platform integrations (Kafka, Pub/Sub, HTTP) to reduce boilerplate. A small upfront investment in tooling pays off in adoption.
Building guiderails: SDKs, pipelines and conventions
To make CloudEvents a reliable data product across teams, provide guiderails: SDKs, conventions, and a small ingestion pipeline.
- SDKs: recommend and publish the official CloudEvents SDKs for your key runtimes (e.g., .NET, Java, Node.js, Python) so teams publish consistent envelopes. Provide lightweight wrappers/templates for your standard fields (source, team, dataschema, trace ids).
- Ingestion pipeline: run a common pipeline that consumes events from messaging topics and writes them to your data mesh (for example, BigQuery datasets). Options:
  - Use a managed connector (e.g., a Google Cloud Pub/Sub -> BigQuery Dataflow template) to stream events directly into BigQuery for quick onboarding.
  - Or run a custom Dataflow/Beam/Flink pipeline that consumes any topic, validates CloudEvent envelopes and dataschema, and writes to domain-specific datasets/tables.
- Conventions for dataset/table layout:
  - Use source (or a normalized domain identifier derived from source) as the dataset or schema name to group events by producer/domain.
  - Use subject as the logical entity id (aggregate root) and partition or cluster on it for efficient queries.
  - Use type and dataschema to manage schema evolution: type carries major-version semantics, dataschema points to the concrete schema (minor versions).
- Routing rules: implement a small router that reads topic messages, validates the CloudEvent context, and writes them to the appropriate dataset/table based on source and type (see the sketch after this list). Emit logs and metrics for schema validation failures and unknown types so teams can onboard.
- Tests and CI: include a small schema validation job in CI that checks sample events against the published dataschema (JSON Schema or Avro). Fail builds when breaking changes are introduced without a new major type.
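A minimal router sketch tying these conventions together; the destination naming, write_row, and report_failure hooks are illustrative:

import re

def dataset_for(source: str) -> str:
    # normalise e.g. "/services/cart" -> "services_cart"
    return re.sub(r"[^a-zA-Z0-9]+", "_", source).strip("_").lower()

def table_for(event_type: str) -> str:
    # e.g. "com.example.cart.CartUpdated.v1" -> "cartupdated_v1"
    parts = event_type.split(".")
    return f"{parts[-2].lower()}_{parts[-1]}"

def route_event(headers: dict, body: bytes, write_row, report_failure) -> None:
    required = ("ce-id", "ce-source", "ce-type", "ce-subject", "ce-time")
    if not all(h in headers for h in required):
        report_failure("invalid-envelope", headers)  # emit metrics/logs so teams can onboard
        return
    destination = f"{dataset_for(headers['ce-source'])}.{table_for(headers['ce-type'])}"
    write_row(destination, headers, body)  # e.g. a streaming insert into BigQuery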
Putting it together, your pipeline can consume any event topic and, using the CloudEvent envelope conventions (source, subject, type, dataschema), place events into domain datasets and make them queryable as first-class data products.
How this helps LLMs and agentic workflows
LLMs and agentic systems perform better when they operate over grounded, traceable, and semantically rich data.
- Grounding and context: Events provide the why/when/where for a fact — improving retrieval relevance and reducing hallucination risk.
- RAG pipelines: instead of dumping everything into a vector store, expose curated data products (with embeddings, semantic indexes, and metadata) that RAG pipelines can consult with confidence.
- Agents as producers and consumers: Agents will create derived artifacts (summaries, embeddings, datasets) and also query domain products. A data-mesh-like model (domain-owned products with contracts and policies) fits naturally.
Practical example: answering a business question
When an agent asks, “What’s the average claim size for auto insurance this quarter?”, it should query a domain-owned data product (or a materialised view derived from CloudEvents) with lineage and a quality score — not an opaque aggregated dump.
Challenges and considerations
While CloudEvents as data products offer significant benefits, there are challenges to consider:
- Adoption friction: Teams accustomed to traditional data products may need to shift mindsets toward event-driven approaches. Training and tooling can help.
- Performance overhead: Persisting and querying large volumes of events requires robust infrastructure. Use partitioning, indexing, and data lifecycle management.
- Schema evolution: Even with versioning, evolving schemas across producers and consumers can be complex. Invest in schema registries and automated testing.
- Security and compliance: Events may contain sensitive data; ensure encryption, access controls, and compliance with regulations like GDPR.
Addressing these proactively ensures successful implementation.
Closing thoughts
CloudEvents is not a silver bullet, but it is a pragmatic, low-friction envelope that makes events easier to adopt as first-class data products. Combine a simple envelope, a clear data contract, schema versioning, and a mindset of domain ownership, and your events become powerful, reusable building blocks for analytics, ML, and agentic systems.
References
- CloudEvents spec: https://cloudevents.io/
- CloudEvents v1.0.2 spec (detailed): https://github.com/cloudevents/spec/blob/v1.0.2/cloudevents/spec.md
- Events on the outside vs events on the inside: https://codesimple.blog/2021/03/14/events-on-the-outside-vs-events-on-the-inside/
- Events — fat or thin: https://codesimple.blog/2019/02/16/events-fat-or-thin/
Glossary / Terminology
- Aggregate: A cluster of domain objects that can be treated as a single unit for data changes. The aggregate root is the single entry point for consistency and transactional boundaries.
- Aggregate root: The entity within an aggregate that is responsible for maintaining invariants and is the authoritative identity for the aggregate.
- Entity: An object with an identity that persists over time.
- Value Object: An immutable object described by its attributes rather than an identity.
- Domain Event: A statement that something of significance happened in the domain (expressed as an event). In DDD, domain events represent facts about the domain and are often the source for integration events.
- Bounded Context: A boundary in which a particular domain model applies. Names and semantics inside a bounded context are consistent; different contexts may use the same terms differently.
- Ubiquitous Language: The shared language used by domain experts and developers within a bounded context.
- Data Product: A discoverable, documented, and governed data interface (in this case, an event stream or materialised view) provided by a domain team.
- Schema Registry / Catalog: A place to publish and discover dataschema URIs and schema versions.
- RAG: Retrieval-Augmented Generation, an approach for LLMs that combines retrieval of context with generative models.
- Data Mesh: An organisational approach that treats datasets (or data products) as domain-owned products with contracts and governance.