About this series
This time, I'm taking on a 30-day challenge to focus on streaming pipelines. Following the idea of "knowing not just what works, but why it works," we'll start by building a simple streaming framework from scratch. This will help us understand core concepts like event-driven processing, state management, and windowing. From there, we'll explore industry-standard systems such as Flink and RisingWave.

Think of it as "building a small wheel before driving a big truck." The goal isn't just to know how to use the technology, but to truly grasp why it matters. Thirty days may not be enough to write an epic, but it's enough to gain a complete mindset for streaming.
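To make that concrete, here is the flavor of toy framework we will grow over the series. This is only a minimal preview sketch (the TinyStream name, the event shape, and the 60-second window are placeholders, not the series' actual code), but it already shows the three ideas above: events are processed one at a time, counts are kept as state, and that state is keyed by a tumbling window.

from collections import defaultdict

class TinyStream:
    """Toy stream processor: event-driven, per-key state, tumbling windows."""

    def __init__(self, window_seconds=60):
        self.window_seconds = window_seconds
        # State lives here: (window_start, merchant_id) -> running count.
        self.state = defaultdict(int)

    def on_event(self, event):
        # Event-driven: called once for every incoming event.
        window_start = event["ts"] - event["ts"] % self.window_seconds
        self.state[(window_start, event["merchant_id"])] += 1

    def results(self):
        # Windowed, stateful output: counts per (window, merchant).
        return dict(self.state)

stream = TinyStream(window_seconds=60)
stream.on_event({"ts": 100, "merchant_id": "m1"})
stream.on_event({"ts": 110, "merchant_id": "m1"})
stream.on_event({"ts": 200, "merchant_id": "m2"})
print(stream.results())   # {(60, 'm1'): 2, (180, 'm2'): 1}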
Why focus on "knowing not just what, but why"?

We often rely on powerful frameworks to solve problems, but have you ever wondered how they actually work? Most of the time, we only know what works (how to use it), not why it works (the underlying principles). By writing basic implementations ourselves, we uncover the logic behind the tools. Once you understand the principles, you'll not only use frameworks more effectively but also debug with greater confidence. That's the core motivation of this series: starting from simple code and, step by step, uncovering the design ideas behind complex systems.

With the rise of AI, data-driven no longer just means "lots of data"; it means "data available instantly." Use cases like recommendation systems, fraud detection, and AI decision engines all require data to be usable the moment it arrives. That's why stream processing has gone from a "nice-to-have" to a core capability of modern data platforms. Compared to traditional batch processing, which computes results on a schedule, stream processing keeps results continuously up to date as data arrives.

What is a streaming pipeline?

A streaming pipeline is the complete architecture for implementing stream processing. It covers the full flow, from ingesting data, to processing it, to delivering outputs, so we can build reliable real-time data systems. Its key traits are exactly the concepts this series will keep returning to: event-driven processing, stateful computation, and windowing over unbounded streams.

Typical streaming pipeline architecture

┌──────────────┐     ┌───────────────┐     ┌───────────────────┐     ┌─────────────┐
│ Data Sources │────▶│ Message Queue │────▶│ Stream Processing │────▶│ Output Sinks│
│              │     │               │     │ Engine            │     │             │
│ • App Logs   │     │ • Kafka       │     │ • Flink           │     │ • Database  │
│ • IoT Events │     │ • Pulsar      │     │ • RisingWave      │     │ • Dashboard │
│ • User Acts  │     │ • RabbitMQ    │     │ • Spark Streaming │     │ • Warehouse │
│ • DB Changes │     │               │     │                   │     │ • Alerts    │
└──────────────┘     └───────────────┘     └─────────┬─────────┘     └─────────────┘
                                                      ▼
                                              ┌──────────────┐
                                              │ State Store  │
                                              │              │
                                              │ • RocksDB    │
                                              │ • Memory     │
                                              │ • Checkpoint │
                                              └──────────────┘

Core components: data sources (app logs, IoT events, user actions, database changes), a message queue (Kafka, Pulsar, RabbitMQ), a stream processing engine (Flink, RisingWave, Spark Streaming) backed by a state store (RocksDB, memory, checkpoints), and output sinks (databases, dashboards, warehouses, alerts).

The evolution of stream processing engines

Over these 30 days, I'll focus on the stream processing engine itself, sharing hands-on experience with different systems, their strengths, weaknesses, and trade-offs: from handwritten consumers, to Flink, to RisingWave.

Starting point: Handwritten consumer (Python)

Our team's very first "real-time report" solution was simply a custom Kafka consumer. Strictly speaking, this wasn't a stream processing engine at all; it was just basic message consumption.

import json
from kafka import KafkaConsumer

# Consume the 'orders' topic and count orders per merchant in memory.
consumer = KafkaConsumer(
    'orders',
    bootstrap_servers='localhost:9092',
)
order_count = {}

for msg in consumer:
    order = json.loads(msg.value)
    merchant = order['merchant_id']
    order_count[merchant] = order_count.get(merchant, 0) + 1
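And this minimal version never stayed minimal for long. Here is a rough, purely illustrative sketch of what it turns into once you bolt on manual offset commits and a crude one-minute window yourself (the enable_auto_commit setting and the 60-second cutoff are my illustrative choices, not our exact production settings):

import json
import time
from kafka import KafkaConsumer

# Illustrative sketch: hand-rolled offset management and windowing.
consumer = KafkaConsumer(
    'orders',
    bootstrap_servers='localhost:9092',
    enable_auto_commit=False,   # we now own offset management ourselves
)

order_count = {}
window_start = time.time()

for msg in consumer:
    order = json.loads(msg.value)
    merchant = order['merchant_id']
    order_count[merchant] = order_count.get(merchant, 0) + 1

    # "Windowing": flush and reset the counters every 60 seconds.
    if time.time() - window_start >= 60:
        print(order_count)
        order_count = {}
        window_start = time.time()

    # "Fault tolerance": commit the offset only after the message is processed.
    consumer.commit()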
It worked, but Kafka offsets, error handling, and state all had to be managed by hand, and it lacked the core features of a real stream processing engine: fault tolerance, state management, and windowing.

Next step: Flink (PyFlink + Flink SQL)

When we introduced Flink, a mature stream processing engine, things became much more professional. Flink handled Kafka offsets, watermarks, checkpoints, and more, making large-scale streaming far more reliable.

from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(
    EnvironmentSettings.in_streaming_mode()
)

t_env.execute_sql("""
    CREATE TABLE orders (
        merchant_id STRING
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'orders',
        'properties.bootstrap.servers' = 'localhost:9092',
        'format' = 'json'
    )
""")

t_env.execute_sql("""
    CREATE TABLE order_summary
    WITH ('connector' = 'print') AS
    SELECT merchant_id, COUNT(*)
    FROM orders GROUP BY merchant_id
""")
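To show what "watermarks and checkpoints" actually look like in this style of code, here is a hedged sketch of how they are typically added with the PyFlink Table API. The orders_with_time table, its event_time column, and the 30-second/5-second intervals are hypothetical illustrations for this post, not part of the original job:

# Hypothetical sketch: periodic checkpoints plus an event-time watermark.
t_env.get_config().set("execution.checkpointing.interval", "30 s")

t_env.execute_sql("""
    CREATE TABLE orders_with_time (
        merchant_id STRING,
        event_time TIMESTAMP(3),
        WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'orders',
        'properties.bootstrap.servers' = 'localhost:9092',
        'format' = 'json'
    )
""")

# With a time attribute in place, a tumbling-window count is plain SQL.
t_env.execute_sql("""
    SELECT window_start, merchant_id, COUNT(*) AS order_count
    FROM TABLE(TUMBLE(TABLE orders_with_time, DESCRIPTOR(event_time), INTERVAL '1' MINUTE))
    GROUP BY window_start, window_end, merchant_id
""").print()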
However, Flink comes with a steep learning curve: performance tuning, cluster deployment, and state backend configuration all require significant expertise.

Latest stop: RisingWave (Kafka Source + Materialized View)

More recently, I've been using RisingWave, a new-generation stream processing engine, and found that it makes things much simpler. With RisingWave, complex streaming logic can be expressed directly in SQL, while it automatically manages checkpoints, state, and other infrastructure details. First declare the Kafka topic as a source, then define the aggregation as a materialized view:

CREATE SOURCE orders (
    merchant_id VARCHAR
) WITH (
    connector = 'kafka',
    topic = 'orders',
    properties.bootstrap.server = 'localhost:9092'
) FORMAT PLAIN ENCODE JSON;

CREATE MATERIALIZED VIEW order_summary AS
SELECT merchant_id, COUNT(*) AS order_count
FROM orders
GROUP BY merchant_id;
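Since RisingWave speaks the PostgreSQL wire protocol, the continuously maintained result can then be read with an ordinary query from psql or any Postgres client:

SELECT merchant_id, order_count
FROM order_summary
ORDER BY order_count DESC;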
Compared to Flink, RisingWave's advantages are clear: for SQL-savvy data teams, the barrier to entry is much lower.

Summary

The journey from handwritten consumers → Flink → RisingWave reflects the overall evolution of stream processing engines. In the AI era, stream processing engines are becoming the backbone of modern data platforms, and the trend is moving toward lower barriers and faster adoption. In the upcoming posts, we'll dive deeper into the technical details, trade-offs, best practices, and future trends of stream processing engines. Stay tuned!

About the author

Shih-min (Stan) Hsu, Team Lead, Data Engineering at SHOPLINE | Expert in Batch and Streaming Data Pipelines.