About this series
This time, I’m taking on a 30-day challenge to focus on streaming pipelines. Following the idea of “knowing not just what works, but why it works,” we’ll start by building a simple streaming framework from scratch. This will help us understand core concepts like event-driven processing, state management, and windowing. From there, we’ll explore industry-standard systems such as Flink and RisingWave.
Think of it as “building a small wheel before driving a big truck.” The goal isn’t just to know how to use the technology, but to truly grasp why it matters. Thirty days may not be enough to write an epic, but it’s enough to build a solid mental model of streaming.
Why focus on “knowing not just what, but why”?

We often rely on powerful frameworks to solve problems, but most of the time we only know what works (how to use the tool), not why it works (the principles underneath). By writing basic implementations ourselves, we uncover the logic behind the tools. Once you understand the principles, you’ll not only use frameworks more effectively but also debug them with greater confidence. That’s the core motivation of this series: start from simple code and, step by step, uncover the design ideas behind complex systems.

With the rise of AI, data-driven no longer just means “lots of data”; it means “data available instantly.” Use cases like recommendation systems, fraud detection, and AI decision engines all require data to be usable the moment it arrives. That’s why Stream Processing has gone from a “nice-to-have” to a core capability of modern data platforms. Compared to traditional Batch Processing, it delivers results with far lower latency, because events are handled continuously instead of waiting for the next scheduled batch run.

What is a streaming pipeline?

A Streaming Pipeline is the complete architecture for implementing Stream Processing. It covers the full flow, from ingesting data, to processing it, to delivering outputs, so we can build reliable real-time data systems. Its defining trait is that events are processed continuously as they arrive, rather than being collected first and processed later in bulk.

Typical streaming pipeline architecture

Core components: data sources, a message queue, a stream processing engine, output sinks, and a state store.

┌─────────────┐     ┌──────────────┐
│ Data Sources│────▶│ Message Queue│
│             │     │              │
│ • App Logs  │     │ • Kafka      │
│ • IoT Events│     │ • Pulsar     │
│ • User Acts │     │ • RabbitMQ   │
│ • DB Changes│     │              │
└─────────────┘     └───────┬──────┘
                            │
                            ▼
                    ┌──────────────────┐     ┌────────────┐
                    │Stream Processing │────▶│Output Sinks│
                    │Engine            │     │            │
                    │• Flink           │     │• Database  │
                    │• RisingWave      │     │• Dashboard │
                    │• Spark Streaming │     │• Warehouse │
                    │                  │     │• Alerts    │
                    └──────┬───────────┘     └────────────┘
                           │
                           ▼
                    ┌────────────┐
                    │State Store │
                    │            │
                    │• RocksDB   │
                    │• Memory    │
                    │• Checkpoint│
                    └────────────┘

The evolution of stream processing engines

Over these 30 days, I’ll focus on the stream processing engine itself, sharing hands-on experience with different systems and their strengths, weaknesses, and trade-offs: from handwritten consumers, to Flink, to RisingWave.

Starting point: Handwritten consumer (Python)

Our team’s very first “real-time report” solution was simply a custom Kafka consumer. Strictly speaking, this wasn’t a Stream Processing Engine at all; it was just basic message consumption.

import json
from kafka import KafkaConsumer  # kafka-python client

consumer = KafkaConsumer(
    'orders',
    bootstrap_servers='localhost:9092',
)

order_count = {}  # running count per merchant, kept only in memory

for msg in consumer:
    order = json.loads(msg.value)  # parse the raw Kafka message
    merchant = order['merchant_id']
    order_count[merchant] = order_count.get(merchant, 0) + 1  # update the in-memory count
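The loop above is only the happy path. The moment you need committed offsets, bad-record handling, or time windows, you end up hand-rolling them. The sketch below is not from our original setup, just an illustration of where this leads, assuming the kafka-python client; the group id, window size, and helper names (WINDOW_SECONDS, flush_window) are made up for the example.

# A sketch (not from the original post) of where the hand-rolled approach ends up
# once you add manual offset commits and a crude one-minute tumbling window.
# Assumes the kafka-python client; group id, window size, and helper names are illustrative.
import json
import time

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'orders',
    bootstrap_servers='localhost:9092',
    group_id='order-counter',            # needed so we can commit offsets ourselves
    enable_auto_commit=False,            # we now own offset management
    value_deserializer=lambda v: json.loads(v),
)

WINDOW_SECONDS = 60
window_start = time.time()
window_counts = {}                       # in-memory state: gone if the process crashes


def flush_window(start, counts):
    """Emit one window's aggregates; a real system would persist or forward them."""
    print(f"window starting at {time.ctime(start)}: {counts}")


for msg in consumer:
    try:
        merchant = msg.value['merchant_id']
    except (KeyError, TypeError):
        continue                         # ad-hoc error handling: silently drop bad events

    window_counts[merchant] = window_counts.get(merchant, 0) + 1

    # Close the window on processing time only: no event time, no late-data handling.
    now = time.time()
    if now - window_start >= WINDOW_SECONDS:
        flush_window(window_start, window_counts)
        window_counts = {}
        window_start = now
        consumer.commit()                # manual commit; a crash before this reprocesses events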
It worked, but Kafka offsets, error handling, and state all had to be managed by hand, and it lacked core Stream Processing Engine features such as fault tolerance, durable state management, and windowing.

Next step: Flink (PyFlink + Flink SQL)

When we introduced Flink, a mature Stream Processing Engine, things became much more professional. Flink took over Kafka offsets, watermarks, checkpoints, and more, making large-scale streaming much more reliable.

from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(
    EnvironmentSettings.in_streaming_mode()
)

t_env.execute_sql("""
    CREATE TABLE orders (
        merchant_id STRING
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'orders',
        'properties.bootstrap.servers' = 'localhost:9092',
        'format' = 'json'
    )
""")

t_env.execute_sql("""
    CREATE TABLE order_summary
    WITH ('connector' = 'print') AS
    SELECT merchant_id, COUNT(*) AS order_count
    FROM orders GROUP BY merchant_id
""")
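To make the contrast with the hand-rolled window explicit, here is a sketch (not from the original post) of a per-minute count per merchant using Flink SQL’s windowing TVF. It continues from the t_env above and assumes the orders table also declared an event-time column with a watermark, e.g. order_time TIMESTAMP(3) plus WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND; that column and the sink table name are assumptions for the example.

# Sketch only: assumes the orders source also defined an event-time column
# order_time with a watermark (not present in the snippet above).
t_env.execute_sql("""
    CREATE TABLE order_counts_1m
    WITH ('connector' = 'print') AS
    SELECT
        window_start,
        merchant_id,
        COUNT(*) AS order_count
    FROM TABLE(
        TUMBLE(TABLE orders, DESCRIPTOR(order_time), INTERVAL '1' MINUTES)
    )
    GROUP BY window_start, window_end, merchant_id
""")

Offsets, watermarks, window state, and recovery via checkpoints are all handled by the engine; the code only states what to compute.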
However, Flink comes with a steep learning curve: performance tuning, cluster deployment, and state backend configuration all require significant expertise.

Latest stop: RisingWave (Kafka Source + Materialized View)

More recently, I’ve been using RisingWave, a new-generation Stream Processing Engine, and found that it makes things much simpler. Complex streaming logic can be expressed directly in SQL, while RisingWave manages checkpoints, state, and the underlying infrastructure automatically. First, declare the Kafka topic as a source; then define the aggregation as a materialized view that RisingWave keeps continuously up to date.

CREATE SOURCE orders (
    merchant_id VARCHAR
) WITH (
    connector = 'kafka',
    topic = 'orders',
    properties.bootstrap.servers = 'localhost:9092',
    format = 'json'
);

CREATE MATERIALIZED VIEW order_summary AS
SELECT merchant_id, COUNT(*) AS order_count
FROM orders
GROUP BY merchant_id;
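Because the materialized view is maintained incrementally, reading the latest counts is just a regular query. RisingWave speaks the PostgreSQL wire protocol, so any Postgres client works; the sketch below (not from the original post) uses psycopg2 with placeholder connection settings for a local instance.

# Sketch: query the continuously maintained view over the Postgres wire protocol.
# Connection settings are placeholders for a local RisingWave instance; adjust for your deployment.
import psycopg2

conn = psycopg2.connect(host='localhost', port=4566, user='root', dbname='dev')
conn.autocommit = True

with conn.cursor() as cur:
    # A point-in-time read; RisingWave keeps order_summary up to date as events arrive.
    cur.execute(
        "SELECT merchant_id, order_count FROM order_summary ORDER BY order_count DESC"
    )
    for merchant_id, order_count in cur.fetchall():
        print(merchant_id, order_count)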
Compared to Flink, RisingWave’s advantage is clear: for SQL-savvy data teams, the barrier to entry is much lower.

Summary

The journey from handwritten consumers → Flink → RisingWave mirrors the overall evolution of stream processing engines: from hand-rolled consumption logic, to powerful but operationally demanding frameworks, to SQL-first engines that hide most of the infrastructure. In the AI era, stream processing engines are becoming the backbone of modern data platforms, and the trend is toward lower barriers and faster adoption.

In the upcoming posts, we’ll dive deeper into the technical details, trade-offs, best practices, and future trends of Stream Processing Engines. Stay tuned!

About the author

Shih-min (Stan) Hsu — Team Lead, Data Engineering at SHOPLINE | Expert in Batch and Streaming Data Pipelines.