Authored by: Venkatesh Wagh
You’ve optimized everything.
Indexes are in place. Queries are tuned. Caches are warm.
And yet… something still feels off.
Network graphs keep climbing. Stored data keeps growing. Payload sizes quietly become the new bottleneck.
Someone whispers, “Can we compress it?” Someone else asks: “Can we represent it more compactly?”
Optimized Ecosystem with bloated Network
And that’s when it hits: we don’t really know what our data looks like once it leaves memory.
Have you ever stopped to really think about what happens when a Java application sends data to Kafka?
Or how a Java REST service hands a List in a request to a Python Flask app, and it somehow becomes a Python object on the other side?
What exactly travels over the wire? And how does the receiving system magically rebuild meaning from it?
Enter serialization
Serialization Basics
Serialization is the invisible boundary most systems quietly cross.
The moment data leaves memory, it stops being an object.
It stops being a List, a Map, or a class instance tied to your language runtime.
To travel across networks, between services, or into storage, data has to become something every system can understand: bytes.
That’s where serialization comes in.
Serialization takes an in-memory representation — a Java object, a Python structure, a map, a list — and converts it into a sequence of bytes that can be:
- stored
- transmitted
- reconstructed elsewhere
On the receiving side, those bytes are interpreted and transformed back into an in-memory structure. This reverse step is called deserialization.
Textbook definition:
In simple terms, serialization is the process of converting data into a format that can be stored or transmitted.
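To make the round trip concrete, here is a minimal sketch using the JDK's built-in serialization. The Employee record and its values are illustrative, borrowed from the example used later in this article; treat it as a sketch, not a prescribed implementation.

```java
import java.io.*;

// Illustrative Employee type matching the example used later in the article.
record Employee(int id, String name, String department, double salary, boolean active)
        implements Serializable {}

public class SerializationRoundTrip {
    public static void main(String[] args) throws Exception {
        Employee original = new Employee(1, "Alice", "Engineering", 120000.0, true);

        // Serialization: in-memory object -> sequence of bytes
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(original);
        }
        byte[] bytes = buffer.toByteArray();   // these bytes can be stored or transmitted

        // Deserialization: bytes -> in-memory object again
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            Employee rebuilt = (Employee) in.readObject();
            System.out.println(rebuilt);       // Employee[id=1, name=Alice, ...]
        }
    }
}
```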
Serialization sits at one of the most critical boundaries in any system:
- Memory ↔ Disk
- Service ↔ Service
- Different programming languages
- Today’s code ↔ Tomorrow’s changes
Most frameworks hide this boundary. But as systems scale, or as data outlives the code that produced it, this boundary becomes visible — and the choices you make in serialization start to matter.
serialization = language-neutral representation
Reiterating: an object can be serialized into many formats: text-based formats like JSON or XML, or more compact binary formats.
Enter JSON
As soon as data needs to leave your application and be understood by someone — or something — else, engineers reach for JSON.
Why? Because it just works.
- Language neutral — Python, Java, JavaScript, Ruby: it doesn’t care.
- Human-readable — you can open a file or a network packet and actually make sense of it.
- Easy to debug — no compiler required to see what went wrong.
- Everywhere — from web apps to Kafka, from MongoDB to Oracle, JSON is already there.
It doesn’t matter if your system is:
- a browser rendering a web page
- a Java backend service crunching numbers
- a Python microservice processing events
- a Kafka consumer handling millions of messages
- a database storing documents
- or a REST API exchanging payloads with the world
JSON fits. Always.
Over time, engineers stopped asking questions and started assuming it: JSON became the universal language of data.
- Web request/response flows? JSON.
- Kafka messages? Often JSON.
- Configuration files, logs, documents? JSON again.
It’s verbose. It’s readable. It is everywhere, and that ubiquity makes it hard to ignore.
Have you ever wondered how JSON is sent over the wire? (UTF-8 encoding)
UTF-8 converts each character in the JSON text into a sequence of bytes. For simple ASCII characters (letters, digits, punctuation), 1 character = 1 byte.
Example byte mapping for part of the JSON:
UTF-8 Conversion for each character (7B 22 69 64 22 3A 31 2C 22 6E 61…)
JSON is text, not bytes.
What is UTF-8?
UTF-8 is a character encoding.
In simple terms, it defines how text characters are converted into bytes.
Computers don’t understand letters or numbers. They understand bytes. UTF-8 is a rulebook that tells a computer how to represent characters like A, 7, or 3.14 as sequences of bytes. UTF-8 is the encoding: it turns human-readable characters into machine-readable bytes.
Bytes are what actually travel over the network or get written to disk.
Any system that understands UTF-8 can reconstruct the same JSON text.
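A small sketch of that rulebook in Java; the shortened JSON string here is an assumption for illustration, and the hex output matches the byte mapping in the caption above.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Bytes {
    public static void main(String[] args) {
        String json = "{\"id\":1,\"name\":\"Alice\"}";

        // UTF-8 turns each character of the JSON text into one or more bytes.
        byte[] bytes = json.getBytes(StandardCharsets.UTF_8);

        // Prints: 7B 22 69 64 22 3A 31 ... ('{', '"', 'i', 'd', '"', ':', '1', ...)
        for (byte b : bytes) {
            System.out.printf("%02X ", b);
        }
        System.out.println();

        // Any UTF-8-aware system can rebuild the exact same JSON text from those bytes.
        String rebuilt = new String(bytes, StandardCharsets.UTF_8);
        System.out.println(rebuilt.equals(json)); // true
    }
}
```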
Comparing sizes: Java serialization vs JSON
Java serialized form (approximate)
- Binary
- Includes metadata
- Not human-readable
~250 bytes (varies by JVM and class structure)
JSON text
The equivalent JSON string is roughly:
{"id":1,"name":"Alice","department":"Engineering","salary":120000.0,"active":true}
That’s about 95–110 bytes in UTF-8.
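A rough way to check these numbers yourself; this is a sketch, not a benchmark: it hand-writes the JSON string instead of using a JSON library, and the exact Java-serialized size will vary by JVM and class structure.

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

public class SizeComparison {
    record Employee(int id, String name, String department, double salary, boolean active)
            implements Serializable {}

    public static void main(String[] args) throws Exception {
        Employee e = new Employee(1, "Alice", "Engineering", 120000.0, true);

        // Java's built-in serialized form: binary, carries class metadata, not human-readable.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(e);
        }
        System.out.println("Java serialization: " + buffer.size() + " bytes");

        // The same data as JSON text, encoded to UTF-8 bytes.
        String json = "{\"id\":1,\"name\":\"Alice\",\"department\":\"Engineering\","
                + "\"salary\":120000.0,\"active\":true}";
        System.out.println("JSON in UTF-8: "
                + json.getBytes(StandardCharsets.UTF_8).length + " bytes");
    }
}
```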
JSON Limitations
JSON is amazing — human-readable, language-neutral, and ubiquitous.
However, it comes with hidden costs:
Verbosity
Take our Employee example:
{ "id": 1, "name": "Alice", "dept": "Engineering", "salary": 120000.0, "active": true}
- The field names (id, name, dept, salary, active) are repeated for every object.
- Even though the actual data is tiny (numbers, booleans, strings), JSON still uses ~95–110 bytes for a single object, so a list of 1 million employees comes to roughly 95 MB.
- For large datasets or millions of messages in Kafka, this adds up quickly.
Text-based, not binary
- JSON stores numbers as text: 120000.0 takes 8 bytes as characters, whereas a binary representation could take 4–8 bytes.
- Booleans and small integers also consume more space than strictly necessary.
- Parsing text is slower than parsing binary — especially for high-throughput systems.
No strict typing or schema
- JSON doesn’t enforce types: "salary": "120000.0" could accidentally be a string instead of a number.
- Consumers have to validate data at runtime, adding overhead.
- Evolving JSON structures (adding/removing fields) can be tricky to manage consistently.
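Here is a short sketch of what that runtime validation looks like, assuming Jackson's ObjectMapper is on the classpath; the payload and field names are illustrative.

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.Map;

public class RuntimeTypeCheck {
    public static void main(String[] args) throws Exception {
        // Nothing in JSON itself stops "salary" from arriving as a string.
        String payload = "{\"id\":1,\"salary\":\"120000.0\"}";

        Map<?, ?> parsed = new ObjectMapper().readValue(payload, Map.class);
        Object salary = parsed.get("salary");

        // The consumer only finds out at runtime and has to validate by hand.
        if (salary instanceof Number n) {
            System.out.println("salary = " + n.doubleValue());
        } else {
            System.out.println("Unexpected type for salary: "
                    + salary.getClass().getSimpleName()); // String
        }
    }
}
```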
Human-readability vs efficiency
- Human-readable text is nice for debugging, but it comes at a storage and network cost.
- In high-performance systems (Kafka streams, microservices, or large databases), text-based payloads dominate bandwidth and storage.
And that’s exactly where binary serialization formats like Avro, Protobuf, or MessagePack enter the picture.
Enter Binary Formats
JSON vs Binary Formats
A binary format representation of a Java object is a compact, minimized encoding of that object, where all unnecessary fluff is stripped away.
Instead of repeating field names and describing structure over and over again, binary formats focus on what actually matters:
- Field names are not repeated
- Only values are stored
- Numbers are written as raw numbers, not text
- Types are strictly enforced by a schema
What you end up with is a byte layout that is designed for machines, not humans.
This simple but powerful transformation turns large, verbose payloads into much smaller and faster-to-process binary data.
Most popular binary serialization formats — Avro, Thrift, and Protobuf — follow this exact principle: agree on the structure once, then transmit only the essential data.
Schemas
JSON Representation and Schema Representation
A schema:
- Is a language-neutral blueprint
- Lives outside the code (or is compiled into code for Protobuf/Thrift)
- Defines structure and types only, no behavior
- Is focused on serialization and deserialization
- Can evolve independently of application code
Schemas are the backbone of compactness in binary representations.
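For example, an Avro schema (.avsc) for the Employee record from earlier might look like this; the field names and types are assumptions based on the JSON example above.

```json
{
  "type": "record",
  "name": "Employee",
  "fields": [
    { "name": "id",     "type": "int" },
    { "name": "name",   "type": "string" },
    { "name": "dept",   "type": "string" },
    { "name": "salary", "type": "double" },
    { "name": "active", "type": "boolean" }
  ]
}
```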
Avro represents the same JSON as below:
Compact — Avro Record
At roughly 32 bytes, this is about a third of the size of the original JSON, because only the values are stored.
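Here is a minimal sketch of producing that compact record with Apache Avro's Java library (an assumed dependency); it inlines the schema shown above as a string and prints the encoded size.

```java
import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class AvroCompactness {
    private static final String SCHEMA_JSON = """
        {"type":"record","name":"Employee","fields":[
          {"name":"id","type":"int"},
          {"name":"name","type":"string"},
          {"name":"dept","type":"string"},
          {"name":"salary","type":"double"},
          {"name":"active","type":"boolean"}]}""";

    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(SCHEMA_JSON);

        // Only values go into the record; field names live in the schema, not the payload.
        GenericRecord employee = new GenericData.Record(schema);
        employee.put("id", 1);
        employee.put("name", "Alice");
        employee.put("dept", "Engineering");
        employee.put("salary", 120000.0);
        employee.put("active", true);

        // Encode to Avro's binary format.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(employee, encoder);
        encoder.flush();

        System.out.println("Avro binary size: " + out.size() + " bytes");
    }
}
```

The printed size lands in the same ballpark as the figure above, because nothing but the field values is written to the stream.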
There are nuances between Protobuf and Avro, including how and where they store schemas.
Protobuf compiles the schema into code; Avro stores the schema in a registry accessible to both producer and consumer.
Protobuf is more centric to RPC frameworks, while Avro is more centric to big-data pipelines and streaming systems such as Kafka and Hadoop.
Network Impact
With binary serialization, message sizes reduce drastically. Instead of 95 MB for 1 million employee records in JSON, you might send only 32 MB with Avro.
This makes the network leaner. It reduces network traffic. It reduces the size of stored data.
The difference becomes even more pronounced at scale. If you’re processing billions of events per day through Kafka, reducing each message from 100 bytes to 30 bytes means:
- ~70% less network traffic
- ~70% less disk I/O
- ~70% lower storage costs
At one billion messages a day, that is roughly 100 GB of raw payload shrinking to about 30 GB.
That’s not optimization. That’s transformation.
The Bottom Line
JSON is amazing for APIs, debugging, and human-readable configuration. But once you cross into high-throughput systems, large-scale data pipelines, or storage-heavy applications, the hidden costs of text-based serialization become the bottleneck you can no longer ignore.
Binary formats like Avro and Protobuf aren’t just optimizations — they’re fundamental architecture decisions that change how your system scales, how your network performs, and how much your infrastructure costs.
The question isn’t whether to use binary serialization; it’s when. Make the switch before JSON’s convenience becomes your most expensive technical debt.
This article is inspired by and references concepts from Martin Kleppmann’s work on schema evolution: https://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html
Next, we’ll dive into Avro hands-on: writing schemas, serializing data, and seeing how your network payloads shrink dramatically.
Until Then, Cheers 🥂