Networks were previously used only to connect computers and servers (and a few peripherals), until the Internet of Things (IoT) emerged. Now we can connect sensors, appliances, vehicles, and wearables. This shift created a high demand for continuous, real-time communication driven by event data. As IoT matured, complex event processing (CEP) frameworks became more accessible and gradually drove adoption across healthcare, manufacturing, and agriculture.
Today, the challenge isn’t capturing data but choosing the right tools to extract business value as efficiently as possible. ClickHouse handles real-time analysis of short-lived event data well, while Snowflake provides scalable, durable cloud storage, making it ideal for historical reporting and ad hoc querying.
However, for compliance-sensitive workloads, you might be better off reversing this setup. Snowflake’s enterprise-level data governance makes it well-suited for streaming access logs and financial triggers, while ClickHouse’s compression capabilities are great for long-term storage and analysis.
This post serves as your guide to using Redpanda and Redpanda Connect as the intermediary between ClickHouse and Snowflake, giving you a solid foundation for building robust data pipelines for IoT and event data.
The importance of reliable real-time IoT and event data streaming
IoT systems are now deeply entrenched in modern technology and business, and many industries rely on them as a fundamental part of their real-time operations. For instance, in healthcare, wearables and smart medical equipment are used for remote patient monitoring, where any hiccups or delays in event streaming could negatively impact patient outcomes.
Other examples of real-time IoT and event data streaming include:
- Logistics: Companies use event data streaming to produce real-time insights about shipments. This can be critical in tracking and managing the routing of temperature-sensitive goods, such as pharmaceuticals, food, and other perishable products.
- Manufacturing: From predictive maintenance to advanced monitoring systems, to the control of robots and machines on the factory floor, manufacturers use IoT in many aspects of their operations. For example, some operational triggers can activate production lines when inventory levels drop beneath a certain threshold. When IoT is deployed to manufacturing environments, it’s often referred to as the Industrial Internet of Things (IIoT).
- Smart cities: IoT facilitates vital infrastructure and service delivery functions, such as waste, energy, and traffic management. Many (if not all) of these services rely on IoT-facilitated monitoring, anomaly detection, and operational triggers.
Because the event data that drives these systems is only valuable when it’s fresh, IoT architecture must be built in a way that prevents data from becoming stale. This requires data pipelines that optimally process, integrate, and deliver crucial information.
Freshness is the time between when an event occurs and when it is written to a system and available for querying. Different use cases have different freshness requirements; in IoT, they can be as lax as a few minutes or as strict as subsecond latencies.
What a modern streaming pipeline looks like
Let’s take a closer look at how modern pipelines are structured, along with the services and platforms that drive them.
The average modern streaming/data pipeline consists of five parts:
- Data source(s): Consists of event data producers, such as sensors, applications, and social media feeds
- Ingestion layer: Collects, ingests, and buffers the event data
- Processing layer: Performs tasks such as filtering, aggregating, enriching, and transforming the data
- Storage layer: Pools and stores the processed data in a system that is optimized for real-time access
- Output and integration: An analytics tool, dashboard, business application, or integration tool queries the processed data and passes it downstream

Structure of a modern streaming pipeline
Some tools can be used in more than one phase of your pipeline. For example, engineers often use Snowflake as part of a storage layer. However, it can also act as an event data source through its event tables feature. You could query this data, stream it using Redpanda, and then save it for long-term storage in your ClickHouse data warehouse.
Using ClickHouse and Snowflake as two parts of your storage layer and synchronizing data across both keeps the implementation straightforward and gives your pipeline some modularity. Instead of using one to feed the other, ClickHouse can archive data on a local machine or in persistent cloud storage (for experimentation, debugging, or troubleshooting), while Snowflake powers dashboards and real-time alerts.
For this to work, you must consider how your system will fetch and ingest the data from its event source. This involves setting up reliable ingestion pipelines and integrations that let Snowflake and ClickHouse work together to support your real-world use cases.
How to stream IoT and event data into Snowflake and ClickHouse
Let’s say you’re building a system for a spin class instructor who needs real-time exertion or health data to ensure participants aren’t going too hard (or easy). Or perhaps an app for a factory foreman or supervisor responsible for ensuring that automated factory machinery operates as intended.
These use cases might look pretty different on the surface, but they can be built on the same basic IoT blueprint. Let’s walk through how an actual implementable solution would function.
Step 1. Ingest IoT data with Redpanda
Regardless of your IoT use case, you can use Redpanda to ingest all the necessary device data and transform it into something useful. Redpanda is built to handle massive amounts of data flowing simultaneously from numerous sensors or devices without slowing down or bottlenecking. It won’t have trouble streaming, ingesting, and handling second-to-second heart rate data from multiple fitness trackers or sensors at once.
Because Redpanda’s low-latency architecture is optimized for speed, data is ingested and made available for processing almost instantly. This means the spin class instructor will be able to quickly identify anyone who is struggling (or may be suffering health complications) in their class.
Redpanda is compatible with Kafka’s API, allowing you to easily integrate it into systems and applications already built for Apache Kafka®. This also enables you to effortlessly migrate from Kafka to Redpanda without needing to modify your existing Kafka-based pipelines or applications first.
It also significantly lowers the learning curve because you can reuse and transfer the approaches and libraries you used for Kafka development to your Redpanda development workflow. For instance, a Java producer that uses the kafka-clients library would work with Redpanda.
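As a rough illustration, here’s a minimal kafka-clients producer pointed at a Redpanda broker; the broker address, topic name, and payload are placeholder values you’d swap for your own:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class HeartRateProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Point the standard Kafka client at a Redpanda broker -- no code changes needed
        props.put("bootstrap.servers", "localhost:9092"); // assumed Redpanda address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Illustrative payload: a heart rate reading keyed by the device ID
            producer.send(new ProducerRecord<>("my-topic", "tracker-42", "{\"heart_rate\": 128}"));
            producer.flush();
        }
    }
}
```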
Step 2. Integrate sources with Redpanda Connect
You can use Redpanda to stream and ingest event data from the devices, but you still need a way to organize and output it into your storage layer. Redpanda Connect provides a streamlined and intuitive way to develop scalable, high-performance data pipelines. It offers a wide range of prebuilt connectors (and lets you build custom ones), which means you’re not limited to Redpanda or Apache Kafka as inputs.
Redpanda Connect also supports parallel outputs using a broker or multiplexing. This allows you to stream to multiple outputs simultaneously or route individual messages to specific outputs based on rules or conditions you’ve set.
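For example, here’s a minimal fan-out sketch (the broker address, topic, and outputs are placeholders) that delivers every message to two outputs in parallel; a switch output would let you route messages conditionally instead:

```yaml
input:
  kafka_franz:
    seed_brokers: [ "localhost:9092" ] # Replace with your Redpanda broker
    topics: [ "iot-telemetry" ]        # Replace with your topic

output:
  broker:
    pattern: fan_out # deliver each message to every output listed below
    outputs:
      - stdout: {} # e.g. for local debugging
      - file:
          path: ./iot-telemetry.jsonl # e.g. a simple archival sink; swap in any supported output
          codec: lines
```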
Step 3. Configure Snowflake for IoT data
Redpanda Connect supports Snowflake integration through its snowflake_streaming connector. It uses Snowpipe Streaming to facilitate continuous data ingestion, loading data into Snowflake tables in microbatches. But before setting that up, you’ll need to create a Redpanda topic as the source of your data stream.
A barebones Redpanda Connect template for this pipeline would look like this:
input:
  kafka_franz:
    seed_brokers:
      - localhost:9092 # Replace with your Redpanda/Kafka host and port
    topics:
      - my-topic # Replace with your topic
pipeline:
  processors:
    - mapping: |
        root.message = this
        root.meta.timestamp = now()
output:
  snowflake_streaming:
    account: "ORG-ACCOUNT" # Replace with your Snowflake account identifier
    user: "STREAMING_USER" # Replace with your Snowflake user
    role: "ACCOUNTADMIN" # Replace with the appropriate role
    database: "MYDATABASE" # Replace with your Snowflake database name
    schema: "PUBLIC" # Replace with your Snowflake schema
    table: "MYTABLE" # Replace with your Snowflake table name
    schema_evolution:
      enabled: true # Enable schema evolution for dynamic updates
    batching:
      count: 1000 # Number of records per batch
      byte_size: 0 # Byte threshold at which a batch is flushed (0 disables size-based flushing)
      period: "10s" # Batch interval
Note: You can use the kafka_franz (or kafka) and redpanda input connectors interchangeably for Redpanda streams.
With this template, you first define the Redpanda connection details and topic to listen to under the input field. This is essentially where your IoT sensors and devices will stream data to. The pipeline portion of the configuration file allows you to define processors and map some of the data from the Redpanda stream. You can also use this section to transform some of the data before publishing it to the output. For instance, you can filter, reformat, or change the case of text in the data.
You can consult the Redpanda Connect component documentation to learn more about all the available configuration fields and their purpose.
The final section (output) defines the Snowflake (Snowpipe) settings. If you’re familiar with database management tools like Microsoft SQL Server, you’ll likely recognize many of these options (account, user, role, etc.).
The schema_evolution and batching properties require the most attention as they’ll have the biggest impact on the output database stream’s performance. Because you’re using Snowflake for time-series data, you’ll need to adjust these fields accordingly.
Step 4. Optimize your schema design for time-series data
Like most data storage platforms, Snowflake allows you to alter its performance profile through its schema design. Whereas older relational databases (MySQL, PostgreSQL) and data warehouses required explicit schema modifications, Snowflake can automatically adjust table structures to match changes in incoming streams without manual intervention.
This is known as schema evolution (schema_evolution). It allows Snowflake to do things like automatically add columns or drop constraints, which reduces downtime and effort required from administrators.
However, if your data’s structure is unlikely to change, turning this feature off may serve you better. It forces schema validation, helping you catch errors or inconsistencies in the incoming data before they are stored. Because continuous schema changes can add overhead during ingestion or query processing (especially with large-scale data pipelines), schema evolution may not be ideal in time-series contexts, where performance and retrieval speeds are critical.
Snowpipe streams new data in batches (as opposed to individually) to optimize resource usage. Smaller batches (500 to 1,000 records) are ideal for low-latency streaming, while larger batches (over 10,000 records) are best suited for bulk processing. If you’re using Snowflake for time-series data, it’s best to set count to 1,000 at most. byte_size defines the maximum batch size in bytes; while Snowflake recommends 100–250 MB for data files, setting this property to 0 turns off size-based batching and simplifies your configuration.
The period should range from ten to thirty seconds for real-time ingestion for dashboards and analytics. You can extend it to one to five minutes for less frequent updates. Other properties you can add include the following (a tuned example follows this list):
- max_in_flight: Defines the maximum number of messages in transit at any given time
- build_options: Provides options to optimize the build of Snowflake output data, mainly the maximum number of parallel processes and the chunk size for processing
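Putting these recommendations together, a time-series-oriented output section might look roughly like the following sketch; the values are illustrative starting points rather than hard rules:

```yaml
output:
  snowflake_streaming:
    # ...same connection fields as in the earlier template...
    max_in_flight: 4   # allow a few batches to be delivered in parallel
    batching:
      count: 1000      # flush after at most 1,000 records
      byte_size: 0     # disable size-based flushing
      period: "10s"    # flush at least every 10 seconds for near-real-time dashboards
```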
Note that even with these enhancements, ClickHouse remains a better fit for latency-sensitive workloads such as gaming, high-frequency trading, or fraud detection.
Step 5. Use ClickHouse for long-term retention
ClickHouse allows you to build high-performance, columnar OLAP databases. Its fast query speeds make it ideal for handling time-series data and event-driven architecture. However, it can also act as a high-efficiency, long-term storage engine that balances compression, retention, and queryability. Unlike traditional cold storage systems that require rehydration or ETL to access historical data, ClickHouse keeps archived data online and instantly queryable, even as your row counts grow into the millions.
ClickHouse’s table partitioning is especially useful for storing and cataloging historical data. It enables fast pruning and detached archival, where old partitions can be physically detached from the active dataset and moved to external storage (like S3 or NAS). You access table partitioning natively through the PARTITION BY clause. For example, PARTITION BY toYYYYMM(timestamp) or toDate(timestamp) segments data monthly or daily:
CREATE TABLE telemetry_events (
    timestamp DateTime,
    device_id String,
    value Float64
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (device_id, timestamp);
The above code creates a table named telemetry_events (using ClickHouse’s MergeTree table engine, which requires an ORDER BY sorting key) that groups its rows into monthly partitions like 202501 or 202502. You can then detach old partitions using the following statement:
ALTER TABLE telemetry_events DETACH PARTITION '202501';
ClickHouse also supports native TTL policies, which allow the automatic deletion or movement of old data. For instance, if you wanted to delete rows older than 12 months, you would use this statement:
ALTER TABLE telemetry_events
MODIFY TTL timestamp + INTERVAL 12 MONTH DELETE;
With columnar compression, you can pack massive datasets into far less disk space. Just use the CODEC clause at the column level when you’re creating or altering your table:
ALTER TABLE telemetry_events
MODIFY COLUMN value Float64 CODEC(ZSTD);
Pro tip: Use ClickHouse’s S3-based hybrid storage to optimize performance while keeping costs low. Your most recent data can be compressed using a lightweight codec, such as LZ4, and stored on local SSDs, while older data can be aggressively compressed and offloaded to S3-backed storage. Keep in mind that higher compression levels reduce storage footprint but increase data access latency. For this reason, they’re best reserved for long-term archival tiers where data is infrequently queried.
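As a sketch, that kind of tiering can be expressed with a multi-level TTL, assuming a storage policy (here called hot_and_cold) with an S3-backed cold volume is already defined in your ClickHouse storage configuration:

```sql
CREATE TABLE telemetry_events_tiered (
    timestamp DateTime,
    device_id String,
    value Float64 CODEC(LZ4)  -- lightweight codec for recent, frequently queried data
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (device_id, timestamp)
TTL timestamp + INTERVAL 3 MONTH TO VOLUME 'cold', -- move older parts to S3-backed storage
    timestamp + INTERVAL 12 MONTH DELETE           -- drop data entirely after a year
SETTINGS storage_policy = 'hot_and_cold';          -- assumed policy from your storage configuration
```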
Together, these features show how ClickHouse approaches schema and data lifecycle management. Its model is designed for append-only, analytical workloads where schema changes are infrequent but necessary for long-term flexibility and scaling.
Once you’ve optimized your ClickHouse database for long-term data storage, it has a wide range of uses:
- Historical sensor trends: You can design a schema that captures time-series data in ClickHouse and then execute a set of SQL commands to aggregate, filter, and visualize patterns across days, months, or years (see the example query after this list).
- ML training data: You can use ClickHouse to store raw or preprocessed features in a columnar table, then use SQL to extract labeled datasets, perform aggregations, and join across sources. It’s best to partition the table by time and order it by the entity ID or timestamp column.
- Audit logs: ClickHouse allows you to build strong tamper-resistance into your audit log or telemetry pipeline by combining its append-only storage model with external integrity controls. Since ClickHouse doesn’t support row-level updates or deletes (except via TTL or mutations), it’s naturally resistant to silent data changes. For stronger guarantees, you can implement cryptographic hashing on inserted data, store those hashes in a separate immutable store, and monitor for divergence.
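For example, a daily trend query over the telemetry_events table defined earlier (an illustrative schema) might look like this:

```sql
-- Average and peak readings per device per day over the past year
SELECT
    device_id,
    toDate(timestamp) AS day,
    avg(value) AS avg_value,
    max(value) AS peak_value
FROM telemetry_events
WHERE timestamp >= now() - INTERVAL 1 YEAR
GROUP BY device_id, day
ORDER BY device_id, day;
```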
While there isn’t a dedicated ClickHouse connector for Redpanda Connect yet, the sql_raw and sql_insert components let you execute SQL statements against ClickHouse or stream data into it from Redpanda. They’re available as input, processor, and output types, so you’ve got flexibility in how you wire things up.
You can easily set up your pipeline to stream data into both Snowflake and ClickHouse simultaneously, without a broker. You just need to pair the sql_raw or sql_insert processor with the snowflake_streaming output connector.
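A rough sketch of that combination, with placeholder connection details and the illustrative telemetry_events schema from earlier, might look like this:

```yaml
# Input section omitted; it's identical to the kafka_franz input in the earlier template
pipeline:
  processors:
    - sql_insert:
        driver: clickhouse
        dsn: "clickhouse://default:@localhost:9000/default" # Replace with your ClickHouse DSN
        table: telemetry_events
        columns: [ timestamp, device_id, value ]
        args_mapping: 'root = [ this.timestamp, this.device_id, this.value ]'

output:
  snowflake_streaming:
    # ...same Snowflake connection, schema_evolution, and batching fields as in the earlier template...
    account: "ORG-ACCOUNT"
    user: "STREAMING_USER"
    role: "ACCOUNTADMIN"
    database: "MYDATABASE"
    schema: "PUBLIC"
    table: "MYTABLE"
```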
Step 6. Stream and analyze event data in real time
Using a combination of Redpanda, Snowflake, and ClickHouse gives you multiple options for sending and analyzing event data in real time. As in your IoT streaming workflow, Redpanda acts as the source of your event data, while both Snowflake and ClickHouse provide the means to process it in real time. You could also use either solution for long-term storage.
How this architecture differs from the one explored earlier is that, instead of a continuous data stream, you could use Redpanda to send event data as messages. These would then be persisted in your storage layer.
Step 7. Redpanda’s role in event data streaming
Just like Apache Kafka, Redpanda can act as a message broker. The key difference here is that Redpanda offers a more lightweight deployment. It’s packaged as a single binary, eliminating the need for additional components, like ZooKeeper. This simplifies deployments and reduces operational overhead, which in turn reduces overall hardware requirements and makes it compatible with more devices—particularly those in IoT and edge computing.
Redpanda’s support for idempotent producers and transactional messaging ensures that messages are delivered exactly once and in the correct order. It’s optimized for high-throughput, low-latency streaming, allowing it to scale easily for high-volume event streams.
Step 8. Process real-time events with Snowflake
Snowflake’s Snowpipe Streaming enables querying event streams and messages within seconds of ingestion. It can ingest data row by row directly into Snowflake tables, bypassing the need for staging files, while Snowflake’s change data capture (CDC) feature, streams, records alterations (inserts, deletes, and updates) to these tables so you can work with changes as soon as they land.
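For instance, here’s a minimal CDC sketch using a Snowflake stream; the database, schema, and table names are the placeholders from the earlier template:

```sql
-- Track changes on the table that Redpanda Connect writes into
CREATE OR REPLACE STREAM mytable_changes ON TABLE MYDATABASE.PUBLIC.MYTABLE;

-- Read only the rows added since the stream was last consumed
SELECT *
FROM mytable_changes
WHERE METADATA$ACTION = 'INSERT';
```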
Snowflake comes with a host of useful SQL tools. For example, the MATCH_RECOGNIZE clause allows you to identify sequences of rows that match a specific pattern. You can also use this clause for sessionization by defining patterns that represent session boundaries.
For instance, let’s say you were writing an implementation to track clicks on a website. Each click triggers a Redpanda data stream or message exchange that’s then sent and stored in a Snowflake table. Your goal is to track how engaging the website is, so you use sessionization to group consecutive clicks. In this scenario, clicks are considered part of the same session if the time difference between them is less than or equal to ten seconds and they’re performed from the same IP address. Your query would resemble this:
SELECT
    ip,
    session_id,
    start_time,
    no_of_events,
    duration
FROM
    clicks
    MATCH_RECOGNIZE (
        PARTITION BY ip
        ORDER BY click_timestamp
        MEASURES
            MATCH_NUMBER() AS session_id,
            COUNT(*) AS no_of_events,
            FIRST(click_timestamp) AS start_time,
            DATEDIFF('second', FIRST(click_timestamp), LAST(click_timestamp)) AS duration
        ONE ROW PER MATCH
        PATTERN (b s*)
        DEFINE
            s AS DATEDIFF('second', LAG(click_timestamp), click_timestamp) <= 10
    );
You can further alter and refine queries like the above using Snowflake’s wide range of SQL aggregation functions, such as SUM, AVG, COUNT, and MAX.
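For instance, assuming you’ve saved the sessionized results above as a view named click_sessions (a hypothetical name), you could roll them up per IP address:

```sql
-- Summarize sessionized clicks per IP
SELECT
    ip,
    COUNT(*) AS sessions,
    AVG(no_of_events) AS avg_clicks_per_session,
    MAX(duration) AS longest_session_seconds
FROM click_sessions
GROUP BY ip;
```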
Step 9. Historical event analysis with ClickHouse
Through a combination of smart storage design, efficient indexing, and aggressive pruning, ClickHouse allows you to query across months or years of data with sub-second latency. It uses SIMD (Single Instruction, Multiple Data) to process data in batches. This speeds up filtering, aggregation, and joins. ClickHouse is also optimized for CPU cache locality, making it ideal for analytical queries.
Its rich collection of metadata features can help you understand, manage, and optimize your tables, partitions, columns, and query performance. These features are handy for long-term storage, schema evolution, and historical analysis. Internal ClickHouse metadata is exposed through system tables, which you can use to find historical query metadata, column types and codecs, currently running queries, and information on which parts of your datasets were moved to cold storage.
ClickHouse stores data in parts, each containing granules (blocks of rows). For each granule, it stores the value of the ORDER BY key at the start of the granule (its minimum, since data is sorted) in a sparse primary index. When you execute a query with a filter, ClickHouse uses this index to skip granules that don’t match. This is MergeTree indexing, and it’s why ClickHouse can query across months or years of telemetry, logs, or CI/CD events in milliseconds.
You can control how often index entries are written with the index_granularity setting, which defines the number of rows per granule. Smaller granules mean finer skipping but more index overhead.
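As a sketch, tightening the granularity on a telemetry table like the one defined earlier (the default is 8,192 rows per granule) would look like this:

```sql
-- Smaller granules give finer-grained skipping at the cost of a larger primary index
CREATE TABLE telemetry_events_fine (
    timestamp DateTime,
    device_id String,
    value Float64
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (device_id, timestamp)
SETTINGS index_granularity = 1024;
```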
Conclusion
This guide explored how to stream IoT and event data into Snowflake and ClickHouse using Redpanda. It examined how event data or continuous streams could be ingested using Redpanda and then processed using Redpanda Connect before being stored for short-term (Snowflake) and long-term analysis (ClickHouse).
Although the guide offers repeatable, real-world suggestions and pathways for implementing such a system yourself, these tools are flexible enough to foster some creativity. For instance, you could use Redpanda Connect to easily build a linear data pipeline or multiplex data across multiple storage platforms.
However, the journey doesn’t begin (or even end) at the storage layer. It begins at the event source. Your IoT and event streaming architecture deserves a solid foundation, so try Redpanda today.