About this series
This time, I'm taking on a 30-day challenge to focus on streaming pipelines. Following the idea of "knowing not just what works, but why it works," we'll start by building a simple streaming framework from scratch. This will help us understand core concepts like event-driven processing, state management, and windowing. From there, we'll explore industry-standard systems such as Flink and RisingWave.

Think of it as "building a small wheel before driving a big truck." The goal isn't just to know how to use the technology, but to truly grasp why it matters. Thirty days may not be enough to write an epic, but it's enough to gain a complete mindset for streaming.
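To make that concrete, here is the flavor of toy framework we will grow over the series. This is only a minimal preview sketch (the TinyStream name, the event shape, and the 60-second window are placeholders, not the series' actual code), but it already shows the three ideas above: events are processed one at a time, counts are kept as state, and that state is keyed by a tumbling window.

from collections import defaultdict

class TinyStream:
    """Toy stream processor: event-driven, per-key state, tumbling windows."""

    def __init__(self, window_seconds=60):
        self.window_seconds = window_seconds
        # State lives here: (window_start, merchant_id) -> running count.
        self.state = defaultdict(int)

    def on_event(self, event):
        # Event-driven: called once for every incoming event.
        window_start = event["ts"] - event["ts"] % self.window_seconds
        self.state[(window_start, event["merchant_id"])] += 1

    def results(self):
        # Windowed, stateful output: counts per (window, merchant).
        return dict(self.state)

stream = TinyStream(window_seconds=60)
stream.on_event({"ts": 100, "merchant_id": "m1"})
stream.on_event({"ts": 110, "merchant_id": "m1"})
stream.on_event({"ts": 200, "merchant_id": "m2"})
print(stream.results())   # {(60, 'm1'): 2, (180, 'm2'): 1}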
Why focus on "knowing not just what, but why"?

We often rely on powerful frameworks to solve problems, but have you ever wondered how they actually work? Most of the time, we only know what works (how to use it), not why it works (the underlying principles). By writing basic implementations ourselves, we uncover the logic behind the tools. Once you understand the principles, you'll not only use frameworks more effectively but also debug with greater confidence. That's the core motivation of this series: starting from simple code and, step by step, uncovering the design ideas behind complex systems.

With the rise of AI, data-driven no longer just means "lots of data"; it means "data available instantly." Use cases like recommendation systems, fraud detection, and AI decision engines all require data to be usable the moment it arrives. That's why stream processing has gone from a "nice-to-have" to a core capability of modern data platforms. Compared to traditional batch processing, which computes results on a schedule, stream processing keeps results continuously up to date as data arrives.

What is a streaming pipeline?

A streaming pipeline is the complete architecture for implementing stream processing. It covers the full flow, from ingesting data, to processing it, to delivering outputs, so we can build reliable real-time data systems. Its key traits are exactly the concepts this series will keep returning to: event-driven processing, stateful computation, and windowing over unbounded streams.

Typical streaming pipeline architecture

┌──────────────┐     ┌───────────────┐     ┌───────────────────┐     ┌─────────────┐
│ Data Sources │────▶│ Message Queue │────▶│ Stream Processing │────▶│ Output Sinks│
│              │     │               │     │ Engine            │     │             │
│ • App Logs   │     │ • Kafka       │     │ • Flink           │     │ • Database  │
│ • IoT Events │     │ • Pulsar      │     │ • RisingWave      │     │ • Dashboard │
│ • User Acts  │     │ • RabbitMQ    │     │ • Spark Streaming │     │ • Warehouse │
│ • DB Changes │     │               │     │                   │     │ • Alerts    │
└──────────────┘     └───────────────┘     └─────────┬─────────┘     └─────────────┘
                                                      ▼
                                              ┌──────────────┐
                                              │ State Store  │
                                              │              │
                                              │ • RocksDB    │
                                              │ • Memory     │
                                              │ • Checkpoint │
                                              └──────────────┘

Core components: data sources (app logs, IoT events, user actions, database changes), a message queue (Kafka, Pulsar, RabbitMQ), a stream processing engine (Flink, RisingWave, Spark Streaming) backed by a state store (RocksDB, memory, checkpoints), and output sinks (databases, dashboards, warehouses, alerts).

The evolution of stream processing engines

Over these 30 days, I'll focus on the stream processing engine itself, sharing hands-on experience with different systems, their strengths, weaknesses, and trade-offs: from handwritten consumers, to Flink, to RisingWave.

Starting point: Handwritten consumer (Python)

Our team's very first "real-time report" solution was simply a custom Kafka consumer. Strictly speaking, this wasn't a stream processing engine at all; it was just basic message consumption.

import json
from kafka import KafkaConsumer

# Consume the 'orders' topic and count orders per merchant in memory.
consumer = KafkaConsumer(
    'orders',
    bootstrap_servers='localhost:9092',
)
order_count = {}

for msg in consumer:
    order = json.loads(msg.value)
    merchant = order['merchant_id']
    order_count[merchant] = order_count.get(merchant, 0) + 1
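And this minimal version never stayed minimal for long. Here is a rough, purely illustrative sketch of what it turns into once you bolt on manual offset commits and a crude one-minute window yourself (the enable_auto_commit setting and the 60-second cutoff are my illustrative choices, not our exact production settings):

import json
import time
from kafka import KafkaConsumer

# Illustrative sketch: hand-rolled offset management and windowing.
consumer = KafkaConsumer(
    'orders',
    bootstrap_servers='localhost:9092',
    enable_auto_commit=False,   # we now own offset management ourselves
)

order_count = {}
window_start = time.time()

for msg in consumer:
    order = json.loads(msg.value)
    merchant = order['merchant_id']
    order_count[merchant] = order_count.get(merchant, 0) + 1

    # "Windowing": flush and reset the counters every 60 seconds.
    if time.time() - window_start >= 60:
        print(order_count)
        order_count = {}
        window_start = time.time()

    # "Fault tolerance": commit the offset only after the message is processed.
    consumer.commit()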
It worked, but Kafka offsets, error handling, and state all had to be managed by hand, and it lacked the core features of a real stream processing engine: fault tolerance, state management, and windowing.

Next step: Flink (PyFlink + Flink SQL)

When we introduced Flink, a mature stream processing engine, things became much more professional. Flink handled Kafka offsets, watermarks, checkpoints, and more, making large-scale streaming far more reliable.

from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(
    EnvironmentSettings.in_streaming_mode()
)

t_env.execute_sql("""
    CREATE TABLE orders (
        merchant_id STRING
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'orders',
        'properties.bootstrap.servers' = 'localhost:9092',
        'format' = 'json'
    )
""")

t_env.execute_sql("""
    CREATE TABLE order_summary
    WITH ('connector' = 'print') AS
    SELECT merchant_id, COUNT(*)
    FROM orders GROUP BY merchant_id
""")
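To show what "watermarks and checkpoints" actually look like in this style of code, here is a hedged sketch of how they are typically added with the PyFlink Table API. The orders_with_time table, its event_time column, and the 30-second/5-second intervals are hypothetical illustrations for this post, not part of the original job:

# Hypothetical sketch: periodic checkpoints plus an event-time watermark.
t_env.get_config().set("execution.checkpointing.interval", "30 s")

t_env.execute_sql("""
    CREATE TABLE orders_with_time (
        merchant_id STRING,
        event_time TIMESTAMP(3),
        WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'orders',
        'properties.bootstrap.servers' = 'localhost:9092',
        'format' = 'json'
    )
""")

# With a time attribute in place, a tumbling-window count is plain SQL.
t_env.execute_sql("""
    SELECT window_start, merchant_id, COUNT(*) AS order_count
    FROM TABLE(TUMBLE(TABLE orders_with_time, DESCRIPTOR(event_time), INTERVAL '1' MINUTE))
    GROUP BY window_start, window_end, merchant_id
""").print()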
However, Flink comes with a steep learning curve: performance tuning, cluster deployment, and state backend configuration all require significant expertise.

Latest stop: RisingWave (Kafka Source + Materialized View)

More recently, I've been using RisingWave, a new-generation stream processing engine, and found that it makes things much simpler. With RisingWave, complex streaming logic can be expressed directly in SQL, while it automatically manages checkpoints, state, and other infrastructure details. First declare the Kafka topic as a source, then define the aggregation as a materialized view:

CREATE SOURCE orders (
    merchant_id VARCHAR
) WITH (
    connector = 'kafka',
    topic = 'orders',
    properties.bootstrap.server = 'localhost:9092'
) FORMAT PLAIN ENCODE JSON;

CREATE MATERIALIZED VIEW order_summary AS
SELECT merchant_id, COUNT(*) AS order_count
FROM orders
GROUP BY merchant_id;
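Since RisingWave speaks the PostgreSQL wire protocol, the continuously maintained result can then be read with an ordinary query from psql or any Postgres client:

SELECT merchant_id, order_count
FROM order_summary
ORDER BY order_count DESC;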
Compared to Flink, RisingWave's advantages are clear: for SQL-savvy data teams, the barrier to entry is much lower.

Summary

The journey from handwritten consumers → Flink → RisingWave reflects the overall evolution of stream processing engines. In the AI era, stream processing engines are becoming the backbone of modern data platforms, and the trend is moving toward lower barriers and faster adoption. In the upcoming posts, we'll dive deeper into the technical details, trade-offs, best practices, and future trends of stream processing engines. Stay tuned!

About the author

Shih-min (Stan) Hsu, Team Lead, Data Engineering at SHOPLINE | Expert in Batch and Streaming Data Pipelines.