Authored by: Venkatesh Wagh
You’ve optimized everything.
Indexes are in place. Queries are tuned. Caches are warm.
And yet… something still feels off.
Network graphs keep climbing. Stored data keeps growing. Payload sizes quietly become the new bottleneck.
Someone whispers, “Can we compress it?” Someone else asks: “Can we represent it more compactly?”
Optimized Ecosystem with bloated Network
And that’s when it hits: we don’t really know what our data looks like once it leaves memory.
Have you ever stopped to really think about what happens when a Java application sends data to Kafka?
Or how a Java REST service hands a List in a request to a Python Flask app, and it somehow becomes a Python object on the other side?
What exactly travels over the wire? And how does the receiving system magically rebuild meaning from it?
Enter serialization
Serialization Basics
Serialization is the invisible boundary most systems quietly cross.
The moment data leaves memory, it stops being an object.
It stops being a List, a Map, or a class instance tied to your language runtime.
To travel across networks, between services, or into storage, data has to become something every system can understand: bytes.
That’s where serialization comes in.
Serialization takes an in-memory representation — a Java object, a Python structure, a map, a list — and converts it into a sequence of bytes that can be:
- stored
- transmitted
- reconstructed elsewhere
On the receiving side, those bytes are interpreted and transformed back into an in-memory structure. This reverse step is called deserialization.
Textbook definition:
In simple terms, serialization is the process of converting data into a format that can be stored or transmitted.
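To make the round trip concrete, here is a minimal sketch using the JDK's built-in serialization. The Employee record and its values are illustrative, borrowed from the example used later in this article; treat it as a sketch, not a prescribed implementation.

```java
import java.io.*;

// Illustrative Employee type matching the example used later in the article.
record Employee(int id, String name, String department, double salary, boolean active)
        implements Serializable {}

public class SerializationRoundTrip {
    public static void main(String[] args) throws Exception {
        Employee original = new Employee(1, "Alice", "Engineering", 120000.0, true);

        // Serialization: in-memory object -> sequence of bytes
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(original);
        }
        byte[] bytes = buffer.toByteArray();   // these bytes can be stored or transmitted

        // Deserialization: bytes -> in-memory object again
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            Employee rebuilt = (Employee) in.readObject();
            System.out.println(rebuilt);       // Employee[id=1, name=Alice, ...]
        }
    }
}
```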
Serialization sits at one of the most critical boundaries in any system:
- Memory ↔ Disk
- Service ↔ Service
- Different programming languages
- Today’s code ↔ Tomorrow’s changes
Most frameworks hide this boundary. But as systems scale, or as data outlives the code that produced it, this boundary becomes visible — and the choices you make in serialization start to matter.
serialization = language-neutral representation
Reiterating: an object can be serialized into many formats: text-based formats like JSON or XML, or more compact binary formats.
Enter JSON
As soon as data needs to leave your application and be understood by someone — or something — else, engineers reach for JSON.
Why? Because it just works.
- Language neutral — Python, Java, JavaScript, Ruby: it doesn’t care.
- Human-readable — you can open a file or a network packet and actually make sense of it.
- Easy to debug — no compiler required to see what went wrong.
- Everywhere — from web apps to Kafka, from MongoDB to Oracle, JSON is already there.
It doesn’t matter if your system is:
- a browser rendering a web page
- a Java backend service crunching numbers
- a Python microservice processing events
- a Kafka consumer handling millions of messages
- a database storing documents
- or a REST API exchanging payloads with the world
JSON fits. Always.
Over time, engineers stopped asking questions and started assuming it: JSON became the universal language of data.
- Web request/response flows? JSON.
- Kafka messages? Often JSON.
- Configuration files, logs, documents? JSON again.
It’s verbose. It’s readable. It is everywhere, and that ubiquity makes it hard to ignore.
Have you ever wondered how JSON is sent over the wire? (UTF-8 encoding)
UTF-8 converts each character in the JSON text into a sequence of bytes. For simple ASCII characters (letters, digits, punctuation), 1 character = 1 byte.
Example byte mapping for part of the JSON:
UTF-8 Conversion for each character (7B 22 69 64 22 3A 31 2C 22 6E 61…)
JSON is text, not bytes.
What is UTF-8?
UTF-8 is a character encoding.
In simple terms, it defines how text characters are converted into bytes.
Computers don’t understand letters or numbers. They understand bytes. UTF-8 is a rulebook that tells a computer how to represent characters like A, 7, or 3.14 as sequences of bytes. UTF-8 is the encoding: it turns human-readable characters into machine-readable bytes.
Bytes are what actually travel over the network or get written to disk.
Any system that understands UTF-8 can reconstruct the same JSON text.
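A small sketch of that rulebook in Java; the shortened JSON string here is an assumption for illustration, and the hex output matches the byte mapping in the caption above.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Bytes {
    public static void main(String[] args) {
        String json = "{\"id\":1,\"name\":\"Alice\"}";

        // UTF-8 turns each character of the JSON text into one or more bytes.
        byte[] bytes = json.getBytes(StandardCharsets.UTF_8);

        // Prints: 7B 22 69 64 22 3A 31 ... ('{', '"', 'i', 'd', '"', ':', '1', ...)
        for (byte b : bytes) {
            System.out.printf("%02X ", b);
        }
        System.out.println();

        // Any UTF-8-aware system can rebuild the exact same JSON text from those bytes.
        String rebuilt = new String(bytes, StandardCharsets.UTF_8);
        System.out.println(rebuilt.equals(json)); // true
    }
}
```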
Comparing sizes: Java serialization vs JSON
Java serialized form (approximate)
- Binary
- Includes metadata
- Not human-readable
~250 bytes (varies by JVM and class structure)
JSON text
The equivalent JSON string is roughly:
{"id":1,"name":"Alice","department":"Engineering","salary":120000.0,"active":true}
That’s about 95–110 bytes in UTF-8.
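A rough way to check these numbers yourself; this is a sketch, not a benchmark: it hand-writes the JSON string instead of using a JSON library, and the exact Java-serialized size will vary by JVM and class structure.

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

public class SizeComparison {
    record Employee(int id, String name, String department, double salary, boolean active)
            implements Serializable {}

    public static void main(String[] args) throws Exception {
        Employee e = new Employee(1, "Alice", "Engineering", 120000.0, true);

        // Java's built-in serialized form: binary, carries class metadata, not human-readable.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(e);
        }
        System.out.println("Java serialization: " + buffer.size() + " bytes");

        // The same data as JSON text, encoded to UTF-8 bytes.
        String json = "{\"id\":1,\"name\":\"Alice\",\"department\":\"Engineering\","
                + "\"salary\":120000.0,\"active\":true}";
        System.out.println("JSON in UTF-8: "
                + json.getBytes(StandardCharsets.UTF_8).length + " bytes");
    }
}
```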
JSON Limitations
JSON is amazing — human-readable, language-neutral, and ubiquitous.
However, it comes with hidden costs:
Verbosity
Take our Employee example:
{ "id": 1, "name": "Alice", "dept": "Engineering", "salary": 120000.0, "active": true}
- The field names (id, name, dept, salary, active) are repeated for every object.
- Even though the actual data is tiny (numbers, booleans, strings), JSON still uses ~95–110 bytes for a single object, so a list of 1 million employees comes to roughly 95 MB.
- For large datasets or millions of messages in Kafka, this adds up quickly.
Text-based, not binary
- JSON stores numbers as text: 120000.0 takes 8 bytes as characters, whereas a binary representation could take 4–8 bytes.
- Booleans and small integers also consume more space than strictly necessary.
- Parsing text is slower than parsing binary — especially for high-throughput systems.
No strict typing or schema
- JSON doesn’t enforce types: "salary": "120000.0" could accidentally be a string instead of a number.
- Consumers have to validate data at runtime, adding overhead.
- Evolving JSON structures (adding/removing fields) can be tricky to manage consistently.
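Here is a short sketch of what that runtime validation looks like, assuming Jackson's ObjectMapper is on the classpath; the payload and field names are illustrative.

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.Map;

public class RuntimeTypeCheck {
    public static void main(String[] args) throws Exception {
        // Nothing in JSON itself stops "salary" from arriving as a string.
        String payload = "{\"id\":1,\"salary\":\"120000.0\"}";

        Map<?, ?> parsed = new ObjectMapper().readValue(payload, Map.class);
        Object salary = parsed.get("salary");

        // The consumer only finds out at runtime and has to validate by hand.
        if (salary instanceof Number n) {
            System.out.println("salary = " + n.doubleValue());
        } else {
            System.out.println("Unexpected type for salary: "
                    + salary.getClass().getSimpleName()); // String
        }
    }
}
```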
Human-readability vs efficiency
- Human-readable text is nice for debugging, but it comes at a storage and network cost.
- In high-performance systems (Kafka streams, microservices, or large databases), text-based payloads dominate bandwidth and storage.
And that’s exactly where binary serialization formats like Avro, Protobuf, or MessagePack enter the picture.
Enter Binary Formats
JSON vs Binary Formats
A binary format representation of a Java object is a compact, minimized encoding of that object, where all unnecessary fluff is stripped away.
Instead of repeating field names and describing structure over and over again, binary formats focus on what actually matters:
- Field names are not repeated
- Only values are stored
- Numbers are written as raw numbers, not text
- Types are strictly enforced by a schema
What you end up with is a byte layout that is designed for machines, not humans.
This simple but powerful transformation turns large, verbose payloads into much smaller and faster-to-process binary data.
Most popular binary serialization formats — Avro, Thrift, and Protobuf — follow this exact principle: agree on the structure once, then transmit only the essential data.
Schemas
JSON Representation and Schema Representation
A schema:
- Is a language-neutral blueprint
- Lives outside the code (or is compiled into code for Protobuf/Thrift)
- Defines structure and types only, no behavior
- Is focused on serialization and deserialization
- Can evolve independently of application code
Schemas are the backbone of compactness in binary representations.
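For example, an Avro schema (.avsc) for the Employee record from earlier might look like this; the field names and types are assumptions based on the JSON example above.

```json
{
  "type": "record",
  "name": "Employee",
  "fields": [
    { "name": "id",     "type": "int" },
    { "name": "name",   "type": "string" },
    { "name": "dept",   "type": "string" },
    { "name": "salary", "type": "double" },
    { "name": "active", "type": "boolean" }
  ]
}
```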
Avro represents the same JSON as below:
Compact — Avro Record
At roughly 32 bytes, this is about a third of the size of the original JSON, because only the values are stored.
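Here is a minimal sketch of producing that compact record with Apache Avro's Java library (an assumed dependency); it inlines the schema shown above as a string and prints the encoded size.

```java
import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class AvroCompactness {
    private static final String SCHEMA_JSON = """
        {"type":"record","name":"Employee","fields":[
          {"name":"id","type":"int"},
          {"name":"name","type":"string"},
          {"name":"dept","type":"string"},
          {"name":"salary","type":"double"},
          {"name":"active","type":"boolean"}]}""";

    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(SCHEMA_JSON);

        // Only values go into the record; field names live in the schema, not the payload.
        GenericRecord employee = new GenericData.Record(schema);
        employee.put("id", 1);
        employee.put("name", "Alice");
        employee.put("dept", "Engineering");
        employee.put("salary", 120000.0);
        employee.put("active", true);

        // Encode to Avro's binary format.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(employee, encoder);
        encoder.flush();

        System.out.println("Avro binary size: " + out.size() + " bytes");
    }
}
```

The printed size lands in the same ballpark as the figure above, because nothing but the field values is written to the stream.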
There are nuances between Protobuf and Avro, including how and where they store schemas.
Protobuf compiles the schema into code; Avro stores the schema in a registry accessible to both producer and consumer.
Protobuf is more centric to RPC frameworks, while Avro is more centric to big-data pipelines and streaming systems such as Kafka and Hadoop.
Network Impact
With binary serialization, message sizes reduce drastically. Instead of 95 MB for 1 million employee records in JSON, you might send only 32 MB with Avro.
This makes the network leaner. It reduces network traffic. It reduces the size of stored data.
The difference becomes even more pronounced at scale. If you’re processing billions of events per day through Kafka, reducing each message from 100 bytes to 30 bytes means:
- ~70% less network traffic
- ~70% less disk I/O
- ~70% lower storage costs
At one billion messages a day, that is roughly 100 GB of raw payload shrinking to about 30 GB.
That’s not optimization. That’s transformation.
The Bottom Line
JSON is amazing for APIs, debugging, and human-readable configuration. But once you cross into high-throughput systems, large-scale data pipelines, or storage-heavy applications, the hidden costs of text-based serialization become the bottleneck you can no longer ignore.
Binary formats like Avro and Protobuf aren’t just optimizations — they’re fundamental architecture decisions that change how your system scales, how your network performs, and how much your infrastructure costs.
The question isn’t whether to use binary serialization; it’s when. Make the switch before JSON’s convenience becomes your most expensive technical debt.
This article is inspired by and references concepts from Martin Kleppmann’s work on schema evolution: https://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html
Next, we’ll dive into Avro hands-on: writing schemas, serializing data, and seeing how your network payloads shrink dramatically.
Until Then, Cheers 🥂