When working with data in cloud systems or analytics projects, the format you store your data in can make a huge difference in performance, scalability, and compatibility.
Different data formats are designed for different purposes — some are easy to read, while others are optimized for large-scale analytics.
In this article, we’ll explore six widely used data formats in analytics: CSV, SQL, JSON, Parquet, XML, and Avro, using a simple student dataset as an example.
🎯 Sample Dataset
Throughout this article we'll use the same small dataset: four students, each described by a Name, Register_No, Subject, and Marks.
1️⃣ CSV (Comma-Separated Values)
CSV is the simplest and most familiar data format. Each record is stored as a row, with commas separating the values. 📄 Example:
Name,Register_No,Subject,Marks
Abi,201,Statistics,100
Mano,250,Computer Science,99
Priya,260,English,95
Riya,265,Maths,100
✅ Pros: Easy to create, read, and share; supported by virtually every tool.
⚠️ Cons: No data types or schema; struggles with nested or complex data.
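A minimal sketch of parsing this CSV with Python's built-in csv module (the data is embedded here so the snippet is self-contained; in practice you would read from a students.csv file):

```python
import csv
from io import StringIO

# The sample dataset as CSV text
csv_text = """Name,Register_No,Subject,Marks
Abi,201,Statistics,100
Mano,250,Computer Science,99
Priya,260,English,95
Riya,265,Maths,100"""

# DictReader maps each row to a dict keyed by the header line
rows = list(csv.DictReader(StringIO(csv_text)))
print(rows[0]["Name"], rows[0]["Marks"])
```

Note that every value comes back as a string: CSV carries no type information, which is exactly the schema limitation mentioned above.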
2️⃣ SQL (Structured Query Language)
SQL databases store data in relational tables, where records are organized into rows and columns with a defined schema.
📄 Example:
CREATE TABLE Students (
    Name VARCHAR(50),
    Register_No INT,
    Subject VARCHAR(50),
    Marks INT
);

INSERT INTO Students (Name, Register_No, Subject, Marks) VALUES
    ('Abi', 201, 'Statistics', 100),
    ('Mano', 250, 'Computer Science', 99),
    ('Priya', 260, 'English', 95),
    ('Riya', 265, 'Maths', 100);
✅ Pros: Enforces a schema and data types; easy to query and manage structured data.
⚠️ Cons: Not suitable for unstructured or hierarchical data.
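To try the table above without setting up a database server, here's a sketch using Python's built-in sqlite3 module with an in-memory database:

```python
import sqlite3

# In-memory SQLite database: nothing is written to disk
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Students (
    Name VARCHAR(50), Register_No INT, Subject VARCHAR(50), Marks INT)""")
conn.executemany(
    "INSERT INTO Students VALUES (?, ?, ?, ?)",
    [("Abi", 201, "Statistics", 100),
     ("Mano", 250, "Computer Science", 99),
     ("Priya", 260, "English", 95),
     ("Riya", 265, "Maths", 100)])

# The schema makes filtering and sorting straightforward
top = conn.execute(
    "SELECT Name, Marks FROM Students WHERE Marks = 100 ORDER BY Name").fetchall()
print(top)
```

The query returns only the students who scored full marks, something that would take manual looping with a plain CSV file.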
3️⃣ JSON (JavaScript Object Notation)
JSON stores data in key-value pairs. It’s lightweight, flexible, and widely used for APIs and NoSQL databases.
📄 Example:
[
  {
    "Name": "Abi",
    "Register_No": 201,
    "Subject": "Statistics",
    "Marks": 100
  },
  {
    "Name": "Mano",
    "Register_No": 250,
    "Subject": "Computer Science",
    "Marks": 99
  },
  {
    "Name": "Priya",
    "Register_No": 260,
    "Subject": "English",
    "Marks": 95
  },
  {
    "Name": "Riya",
    "Register_No": 265,
    "Subject": "Maths",
    "Marks": 100
  }
]
✅ Pros: Human-readable, easy to share, and handles nested structures naturally.
⚠️ Cons: Takes more space than binary formats; slower to parse at scale.
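A quick sketch of round-tripping these records with Python's built-in json module (the text is embedded here; in practice it might come from a file or an API response):

```python
import json

# Two of the sample records as JSON text
json_text = """[
  {"Name": "Abi", "Register_No": 201, "Subject": "Statistics", "Marks": 100},
  {"Name": "Mano", "Register_No": 250, "Subject": "Computer Science", "Marks": 99}
]"""

# loads parses JSON text into Python lists and dicts
students = json.loads(json_text)

# Unlike CSV, JSON preserves numeric types without extra parsing
print(type(students[0]["Marks"]))

# dumps serializes Python objects back to JSON text
round_tripped = json.loads(json.dumps(students))
```

The round trip is lossless, which is one reason JSON is the default interchange format for APIs.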
4️⃣ Parquet (Columnar Storage Format)
Parquet is a binary, columnar format created for efficient data analytics. It’s highly optimized for tools like Apache Spark, Hadoop, AWS Athena, and BigQuery.
📄 Example:
import pandas as pd

# Build the sample dataset as a DataFrame
data = {
    "Name": ["Abi", "Mano", "Priya", "Riya"],
    "Register_No": [201, 250, 260, 265],
    "Subject": ["Statistics", "Computer Science", "English", "Maths"],
    "Marks": [100, 99, 95, 100]
}
df = pd.DataFrame(data)

# Requires the pyarrow (or fastparquet) engine to be installed
df.to_parquet("students.parquet", engine="pyarrow", index=False)
print("✅ Parquet file created successfully!")
⚡ Parquet files are not human-readable — they store compressed binary data for faster processing.
✅ Pros: Excellent compression and query performance for column-oriented analytics.
⚠️ Cons: Needs special libraries to read/write; not ideal for simple text sharing.
5️⃣ XML (Extensible Markup Language)
XML represents data using a tag-based structure, making it hierarchical and self-descriptive.
📄 Example:
<Students>
  <Student>
    <Name>Abi</Name>
    <Register_No>201</Register_No>
    <Subject>Statistics</Subject>
    <Marks>100</Marks>
  </Student>
  <Student>
    <Name>Mano</Name>
    <Register_No>250</Register_No>
    <Subject>Computer Science</Subject>
    <Marks>99</Marks>
  </Student>
  <Student>
    <Name>Priya</Name>
    <Register_No>260</Register_No>
    <Subject>English</Subject>
    <Marks>95</Marks>
  </Student>
  <Student>
    <Name>Riya</Name>
    <Register_No>265</Register_No>
    <Subject>Maths</Subject>
    <Marks>100</Marks>
  </Student>
</Students>
✅ Pros: Self-descriptive and structured; ideal for hierarchical data.
⚠️ Cons: Verbose and storage-heavy; slower to parse than JSON.
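A sketch of parsing the XML above with Python's built-in xml.etree module (two of the four records are embedded here for brevity):

```python
import xml.etree.ElementTree as ET

xml_text = """<Students>
  <Student>
    <Name>Abi</Name><Register_No>201</Register_No>
    <Subject>Statistics</Subject><Marks>100</Marks>
  </Student>
  <Student>
    <Name>Mano</Name><Register_No>250</Register_No>
    <Subject>Computer Science</Subject><Marks>99</Marks>
  </Student>
</Students>"""

root = ET.fromstring(xml_text)
# Tags are self-descriptive, but every value is text and must be converted by hand
students = [
    {"Name": s.findtext("Name"), "Marks": int(s.findtext("Marks"))}
    for s in root.findall("Student")
]
print(students)
```

Notice the explicit int() conversion: like CSV, XML stores everything as text, so types live in your parsing code rather than in the format.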
6️⃣ Avro (Row-based Storage Format)
Avro is a row-based binary format designed for fast data serialization. It’s schema-based and often used in Apache Kafka and Hadoop ecosystems.
📄 Schema (students.avsc):
{
  "type": "record",
  "name": "Student",
  "fields": [
    {"name": "Name", "type": "string"},
    {"name": "Register_No", "type": "int"},
    {"name": "Subject", "type": "string"},
    {"name": "Marks", "type": "int"}
  ]
}
📄 Example:
from fastavro import writer

# The same schema as students.avsc above
schema = {
    "type": "record",
    "name": "Student",
    "fields": [
        {"name": "Name", "type": "string"},
        {"name": "Register_No", "type": "int"},
        {"name": "Subject", "type": "string"},
        {"name": "Marks", "type": "int"}
    ]
}

records = [
    {"Name": "Abi", "Register_No": 201, "Subject": "Statistics", "Marks": 100},
    {"Name": "Mano", "Register_No": 250, "Subject": "Computer Science", "Marks": 99},
    {"Name": "Priya", "Register_No": 260, "Subject": "English", "Marks": 95},
    {"Name": "Riya", "Register_No": 265, "Subject": "Maths", "Marks": 100}
]

# Serialize the records to a binary Avro file; the schema is embedded in the file
with open("students.avro", "wb") as out:
    writer(out, schema, records)

print("✅ Avro file created successfully!")
✅ Pros: Schema-based and compact; well suited to streaming pipelines and schema evolution.
⚠️ Cons: Binary format (not human-readable); requires Avro libraries to parse.
Each of these formats trades human readability against performance and scale. Knowing when to use each helps you build efficient, scalable, and cloud-ready data pipelines. 🌥️