Understanding Apache Spark Architecture from the Ground Up
I’ve previously written two blog posts about Spark — one as an introduction to my Scala series, and the other on a practical implementation using PySpark.
In this post, I’ll cover Spark’s architecture and the key concepts related to it.
Definition
Let’s begin with the definition.
Apache Spark is a unified computing engine and a set of libraries for parallel data processing on computer clusters.
Unified computing engine
Spark provides a single engine that can handle many types of big data workloads.
Instead of using separate systems for each workload (like Hadoop MapReduce for batch, Storm for streaming, Hive for SQL), Spark offers one platform that can do all of them with consistent APIs and performance characteristics.
You do not have to stitch multiple systems together. This is why we call it unified.
Computing engine
A “computing engine” refers to the component that actually executes computations across a cluster:
Computing Engine. Illustrated with ChatGPT.
Set of libraries
Spark includes high-level libraries built on top of the core engine.
Spark libraries. Source
All these libraries share the same execution engine and work together seamlessly.
Parallel data processing
Spark processes data in parallel using the resources of a distributed cluster.
Parallelization. Source
It breaks large datasets into partitions and executes tasks on them concurrently across multiple machines.
On computer clusters
Spark is designed for environments where you have multiple computers working together:
- Most machines act as worker nodes
- One machine hosts the driver node (which controls the job)
- A cluster manager allocates resources (YARN, Kubernetes, Mesos, standalone)
Spark can also run on a single machine for development, but its real power emerges on a cluster.
Spark gives you one powerful engine that can run many types of data workloads, using parallel processing across many machines, supported by rich libraries for SQL, streaming, ML, and graph analytics.
Where We Can Use Apache Spark
We can use Spark on all of the major platforms:
- Databricks
- Amazon EMR
- Google Cloud Dataproc
- Microsoft Azure (Databricks or Azure Synapse Analytics)
- IBM Cloud (Analytics Engine or Watson)
Spark can run on container orchestration platforms:
- Kubernetes (via Spark-on-K8s integration)
- OpenShift
Spark can be deployed in your own data center:
- Standalone Spark clusters
- Hadoop YARN clusters
- Apache Mesos clusters
- Bare-metal servers
- VM-based clusters
Spark can also run locally for development and testing:
- On a laptop (local mode)
- In Docker containers
- With a local PySpark or spark-shell setup
Core Execution Architecture
Suppose we have written an application in Python, Scala, SQL, or Java.
When we run that code, Spark creates a SparkSession, which acts as the entry point to the Spark engine.
High level architecture.
Driver / Master Node
The driver is the brain of the Spark application.
It receives the application and determines what we want to do with the data. Once it identifies the most efficient way to execute the application, it splits the work into tasks that are distributed to the executors.
The driver plans the jobs and ensures they are executed. If an executor fails, the driver assigns the task to another executor. The driver almost never interacts with the data directly.
It is a JVM process responsible for:
- Running the Spark application
- Holding the SparkSession
- Holding the SparkContext
- Creating the query plans
- Sending tasks to executors
- Collecting results
SparkSession → User-facing entry point
SparkContext → Low-level engine interface
Spark Driver → The process that runs SparkSession + SparkContext and controls your job
They are not the same, but they live inside each other:
- The driver hosts the SparkSession
- The SparkSession contains the SparkContext
- The SparkContext talks to the cluster manager
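A minimal PySpark sketch of this nesting, assuming a local session created purely for illustration:

from pyspark.sql import SparkSession

# The SparkSession is the user-facing entry point.
spark = SparkSession.builder.appName("demo").master("local[*]").getOrCreate()

# The SparkContext lives inside the SparkSession.
sc = spark.sparkContext
print(sc.appName)             # "demo"
print(sc.defaultParallelism)  # number of cores available locally

spark.stop()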
SparkContext
SparkContext is the core engine interface in Spark.
It is the component that connects your application (the driver) to the cluster manager and executors.
Functions of SparkContext in Spark. Source
- Get the Current Status of Application — Allows you to query the real-time running state, active jobs, and progress of your Spark application.
- Set the Configuration — Applies Spark settings (memory, executors, serialization, etc.) that control how the application runs on the cluster.
- Cancel a Job — Stops a running job immediately to free resources or interrupt long computations.
- Access Various Services — Provides access to internal Spark services such as the block manager, environment info, and cluster manager.
- Closure Cleaning in Spark — Automatically removes unnecessary variables from closures before sending them to executors to optimize serialization.
- Programmable Dynamic Allocation — Adjusts the number of executors up or down based on workload demand.
- Register Spark Listener — Lets you attach custom listeners to track events like job start, job end, task failures, and executor changes.
- Unpersist RDDs — Removes cached RDDs from memory or disk to free up storage.
- Cancel a Stage — Aborts only a specific stage within a job without stopping the entire job.
- Access Persistent RDD — Provides a list of all currently cached RDDs along with their storage details.
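A small sketch of a few of these capabilities in PySpark. The methods below are standard SparkContext APIs; the job-group name is just an illustrative placeholder.

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
sc = spark.sparkContext

print(sc.getConf().getAll())                   # current configuration
print(sc.statusTracker().getActiveJobsIds())   # status of running jobs

sc.setJobGroup("nightly-etl", "illustrative job group")  # tag subsequent jobs
# sc.cancelJobGroup("nightly-etl")                       # ...so they can be cancelled later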
Cluster Manager
This is the component that Spark uses to request resources (CPU, RAM, machines).
Depending on where you run:
- YARN (Hadoop)
- Kubernetes
- Standalone Spark
- Mesos
The cluster manager decides:
- How many executors to allocate
- How much memory and CPU per executor
- Which machines will run them
It does not run tasks; it simply orchestrates resources.
Executors / Worker Nodes
Executors are JVM processes that run on worker nodes in the cluster (or locally in local mode).
Each executor has a configured number of cores (spark.executor.cores) that determines how many tasks it can run concurrently.
Executors have memory allocated (spark.executor.memory) for:
- Storage memory (cached RDDs/DataFrames)
- Execution memory (shuffles, joins, sorts)
- User memory (user data structures)
The driver’s scheduler assigns tasks to executors, which then execute them and report results back.
Executors are launched when the application starts and remain alive until the application completes (unless they fail or are dynamically removed). This is in contrast to MapReduce, where containers are created per job.
The driver and each executor run as separate JVM processes, usually on different worker nodes (which may be bare metal, VMs, or containers). In local mode, they can all run on the same machine.
A worker node can host more than one executor.
An executor has as many task slots as it has CPU cores.
An executor can run as many tasks in parallel as it has slots. A task is the smallest unit of work.
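As a rough sketch, executor resources are usually set through configuration when the session (or spark-submit job) is created. The numbers below are illustrative, not a recommendation:

from pyspark.sql import SparkSession

# Hypothetical sizing: 4 executors, each with 4 cores and 8 GB of heap,
# so up to 4 * 4 = 16 tasks can run in parallel across the cluster.
spark = (
    SparkSession.builder
    .appName("executor-sizing-sketch")
    .config("spark.executor.instances", "4")
    .config("spark.executor.cores", "4")
    .config("spark.executor.memory", "8g")
    .getOrCreate()
)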
Task Execution Flow
Driver sends tasks → Executors
The driver divides your job into stages, then into tasks:
- Each task handles one data partition
- Tasks are sent to executors for actual computation
Executors run tasks
Inside each executor:
- every slot runs one task at a time
- tasks load the relevant data partition
- tasks perform the transformation (map, reduce, join, etc.)
Executors work in parallel across the cluster.
Executors return results → Driver
After finishing their computations, executors send:
- task results
- metrics
- logs
- shuffle outputs
back to the driver, or write them to disk.
Data Shuffling
Executors sometimes need to exchange data with each other — especially for:
- join operations
- groupBy
- reduceByKey
- sorting
This is called a shuffle. It is often the costliest operation in Spark because it:
- moves data across the network
- writes intermediate files
- reroutes partitions
- requires writing data to disk, sending it over the network, and reading it back
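A quick way to see this locally. In this minimal sketch, filter is a narrow transformation while groupBy forces a shuffle:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()

df = spark.range(1_000_000).withColumn("key", F.col("id") % 10)

# Narrow: each partition is filtered independently, no shuffle.
narrow = df.filter(F.col("id") > 100)

# Wide: rows with the same key must be brought together, so Spark shuffles.
wide = df.groupBy("key").count()

wide.explain()  # the physical plan contains an Exchange (shuffle) operator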
Deployment Modes
- Local Mode
- Client Mode
- Cluster Mode
Local Mode
Driver and executors both run on your computer.
- Everything happens on your machine: driver + execution threads.
- No actual cluster is used.
- Used for development, testing, and learning Spark.
Client Mode (Cluster-Client Mode)
Driver runs on your machine, executors run in the cluster.
- You submit the job from your laptop or edge node.
- The driver process stays on your machine, so logs and UI are local.
- Executors run remotely on the cluster nodes.
- If the submitter machine dies → the job dies, because the driver is local.
- Used interactively in notebooks, development environments, Databricks notebooks, etc.
Cluster Mode
Both driver and executors run in the cluster.
- The driver is launched inside the cluster.
- Your laptop only submits the job and disconnects.
- The job continues even if your machine shuts down.
- Best for production, scheduled jobs, pipelines.
- Usually used with YARN, Kubernetes, Dataproc, EMR, Spark Standalone, etc.
Abstractions on Data
Apache Spark provides multiple abstraction layers for working with distributed data. Each layer gives you a different balance of control, performance, and developer productivity.
👉 RDD = the old way
👉 DataFrames / Datasets = the modern, optimized way
History of Spark APIs. Source
RDD (Resilient Distributed Dataset) — Low-Level API
RDDs are the foundational, low-level data structure in Spark: the original API, introduced in version 1.x.
They represent an immutable, distributed collection of objects.
They are considered the old approach:
- RDDs do not use the Catalyst optimizer
- Developers must control everything
- Most modern Spark workloads should use the higher-level Structured APIs (DataFrames / Datasets). The RDD API still exists for low-level and special-use cases, but is no longer the recommended default.
But RDD is NOT deprecated — just lower-level.
Structured API / Spark SQL (High-Level API)
This layer includes:
- DataFrames
- Datasets
- SQL queries
These APIs are declarative: You describe what you want, and Spark decides how to execute it.
Spark Structured API. Source
- Catalyst Optimizer automatically optimizes your query plan
- Produces faster, more efficient execution
- Safer, easier, more stable APIs
- Best for almost all practical ETL, analytics, and ML workflows
DataFrames are built on top of RDDs. They organize data into named columns, like a relational table, and support SQL queries, DataFrame transformations, and automatically optimized execution plans.
Datasets are a typed API available in Scala and Java; they are not available in PySpark.
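To make the contrast concrete, here is a minimal sketch of the same aggregation written against both APIs (PySpark, local mode assumed):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
sc = spark.sparkContext

# Low-level RDD API: you spell out how to compute the result.
rdd_result = (
    sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
    .reduceByKey(lambda x, y: x + y)
    .collect()
)

# Structured API: you declare what you want; Catalyst plans the execution.
df = spark.createDataFrame([("a", 1), ("b", 2), ("a", 3)], ["key", "value"])
df_result = df.groupBy("key").sum("value").collect()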
Operations
Spark operations come in two types: **Transformations** and **Actions**.
Spark operations. Source
Transformations
Transformations describe what you want to do to the data.
They are lazy and do not run until Spark sees an action. Spark only builds a logical plan during transformations.
Narrow transformations
Data stays within the same partition → no shuffle
Wide transformations
Require data to move across partitions → shuffle
Actions
Actions are the operations that make Spark actually run the computation and return results.
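A minimal sketch of this laziness in PySpark: the transformations below only build a plan, and nothing executes until the action at the end.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()

df = spark.range(100)

# Transformations: lazy, Spark only records them in a logical plan.
doubled = df.withColumn("double", F.col("id") * 2)
filtered = doubled.filter(F.col("double") > 50)

# Action: triggers a job that actually runs the plan and returns a result.
print(filtered.count())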
Transformation vs Actions. Source
Jobs, Stages, and Tasks
When the driver receives an action (like count(), show(), save()), it builds a physical execution plan.
Execution plan. Source
Job
A job corresponds to one action you call. It is the entire computation Spark needs to run to deliver the result of that action.
Examples that create jobs:
- count()
- collect()
- show()
- write.format(...).save()
If your script contains 3 actions, Spark will create 3 different jobs.
Stage
Spark splits a job into stages, based on where shuffles are required.
- Narrow transformations stay in the same stage.
- Wide transformations require a shuffle, so Spark splits into another stage.
Task
Each stage splits into many tasks: one task per partition.
Executors run tasks in parallel.
If your DataFrame has 8 partitions → you get 8 tasks in that stage.
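A small sketch of that relationship (local mode assumed; job and stage boundaries are visible in the Spark UI):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

df = spark.range(1_000_000).repartition(8)
print(df.rdd.getNumPartitions())   # 8 partitions -> 8 tasks in that stage

# Each action below triggers its own job.
df.count()
df.filter("id % 2 = 0").count()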
DAG
A DAG (Directed Acyclic Graph) is the blueprint Spark creates to understand how your job should be executed, step-by-step, across the cluster.
Example DAG. Source
Stage 0
Inside Stage 0, Spark reads data using:
- newAPIHadoopFile
- then applies map
This is a narrow transformation chain, meaning Spark can run these transformations without shuffling data.
- The vertical arrows show operation dependencies
- Stage 0 produces intermediate data used in Stage 2
Stage 1
Stage 1 starts with a different input:
- newLogsRDD
- followed by filter, filter, map
It operates on a different input source and contains only narrow transformations.
Output of Stage 1 feeds into Stage 2’s cogroup.
Stage 2
This is the most complex part of the DAG.
- cogroup
- mapValues
- flatMap
- map
- union
And Stage 2 also reads a third input via newAPIHadoopFile, then applies:
- filter
- map
- then union
Stage 2 contains wide transformations, which require shuffling data across partitions:
- cogroup always creates a shuffle
- joins, cogroup, and groupBy always break stages
Stage 2 cannot start until Stage 0 and Stage 1 are finished.
Stage 3
Stage 3 contains:
- reduceByKey
- coalesce
This is the final step → the output of reduceByKey becomes the result of the job.
Optimization
Optimizing Spark is a two-sided responsibility:
- **Spark** automatically performs deep internal optimizations using Catalyst, Tungsten, and AQE.
- **The user** still has important decisions to make to avoid bottlenecks and ensure efficiency.
What the User Can Do to Optimize Spark
These are data engineering best practices — choices you control that dramatically influence performance.
- Use efficient file formats
- Avoid expensive operations like sorts
- Minimize data volume
- Cache/persist only when beneficial
- Repartition or coalesce appropriately
- Avoid UDFs when built-in functions can do the job
- Partition / index / bucket the data → Improves skipping of irrelevant data and speeds up joins.
- Optimize the cluster
User optimizations help Spark avoid unnecessary work.
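A hedged sketch that combines several of these practices. The file paths and column names are made up for illustration:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Hypothetical dataset stored in an efficient columnar format.
orders = spark.read.parquet("/data/orders")

# Minimize data volume early: filter rows and prune columns before heavy work.
recent = (
    orders
    .filter(F.col("order_date") >= "2024-01-01")
    .select("order_id", "customer_id", "amount")
)

recent.cache()  # cache only because the result is reused below

recent.groupBy("customer_id").agg(F.sum("amount").alias("total")).show()
recent.coalesce(8).write.mode("overwrite").parquet("/data/orders_recent")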
Catalyst Optimizer
Catalyst is Spark’s brain that takes a SQL/DataFrame query and turns it into the most efficient execution plan.
Catalyst Optimizer. Source
1️⃣ You start with: SQL Query / DataFrame / Dataset
2️⃣ Unresolved Logical Plan
Spark parses your query and creates an initial logical plan, but at this point:
- table names may not be verified
- columns may not be resolved yet
- no optimizations have happened
This is why it is called unresolved.
3️⃣ Analysis Phase — Using the Catalog
Catalyst now uses Spark’s Catalog (metadata store) to:
- check if tables exist
- check column names and data types
- resolve references
After resolving all attributes, Spark produces a Logical Plan.
This plan represents what needs to be done, not how.
4️⃣ Logical Optimization Phase
Catalyst applies a series of rule-based optimizations, such as:
- constant folding → calculate constant expressions upfront
- predicate pushdown → push WHERE filters as close to the data source as possible
- projection pruning → remove unused columns
- null propagation
- simplifying expressions
The result is the Optimized Logical Plan.
Still logical — no execution strategy yet.
5️⃣ Physical Planning Phase
Given the optimized logical plan, Spark generates multiple possible physical plans.
Examples of physical strategies:
- Hash Join
- Sort-Merge Join
- Broadcast Join
- ParquetScan vs FileScan
- Hash Aggregate
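For example, you can nudge the planner toward one strategy and inspect the plan it selects. A sketch using the standard broadcast hint:

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.master("local[*]").getOrCreate()

large = spark.range(1_000_000).withColumnRenamed("id", "key")
small = spark.range(100).withColumnRenamed("id", "key")

# The broadcast hint steers Spark toward a BroadcastHashJoin
# instead of a shuffle-based Sort-Merge Join.
joined = large.join(broadcast(small), "key")
joined.explain()  # the physical plan shows the selected join strategy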
6️⃣ Cost Model — Choosing the Best Plan
Spark evaluates each physical plan using its cost model, which estimates:
- shuffle cost
- scan cost
- memory/CPU usage
- broadcast feasibility
- data size
Based on these estimations, Spark selects the lowest cost plan.
This becomes the Selected Physical Plan.
7️⃣ Code Generation (Whole Stage Codegen)
Now the physical plan is translated into efficient JVM bytecode:
- loops are fused
- operators run as compiled code
- unnecessary memory allocation is avoided
The output of this phase is execution-ready RDD code.
This is what Tungsten runs on the executors.
8️⃣ Execution on RDDs
Finally, Spark sends the RDD execution graph to executors, partition by partition.
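You can watch this whole pipeline from PySpark: explain with the extended mode prints the parsed, analyzed, and optimized logical plans plus the selected physical plan. A minimal sketch:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()

df = spark.range(1000).withColumn("key", F.col("id") % 10)
query = df.filter(F.col("key") == 1).select("id")

# Prints: Parsed Logical Plan, Analyzed Logical Plan,
# Optimized Logical Plan, and the Physical Plan chosen by Catalyst.
query.explain(mode="extended")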
Tungsten Engine
Tungsten is Spark’s low-level execution engine whose primary goal is to make Spark run as efficiently as possible — almost as fast as hand-written native code.
It focuses on memory and CPU efficiency.
Tungsten rewrites Spark’s internal engine to use off-heap memory and CPU-efficient processing.
Catalyst decides WHAT operations Spark should perform. But Catalyst does not execute anything.
Tungsten takes the plan from Catalyst and decides HOW to actually execute it efficiently.
Spark takes entire sections of the execution plan and compiles them into one tight Java function.
Instead of running operators one by one, Spark fuses them:
for (row in data) {
  if (row meets filter) {
    result = compute something...
    write output
  }
}
- No virtual function call
- No intermediate objects
- No per-row overhead
This is similar to how a C++ program might run. That’s why Tungsten is often described as “closer to bare metal.”
Everything is optimized for speed + efficiency.
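You can inspect the generated code yourself: explain's codegen mode prints the fused Java function produced by whole-stage code generation. A small sketch:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()

df = (
    spark.range(1_000_000)
    .filter(F.col("id") % 2 == 0)
    .select((F.col("id") * 2).alias("doubled"))
)

# Prints the whole-stage-codegen Java source for this operator chain.
df.explain(mode="codegen")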
Catalyst → High-level optimizer
- Optimizes logical planning (SQL-level).
Tungsten → Low-level execution engine
- Optimizes actual machine operations on memory and CPU.
Adaptive Query Execution (AQE)
AQE is Spark’s ability to re-optimize a query while it is running.
Traditional Catalyst optimization happens before execution, using estimated statistics.
AQE in Spark. Source
AQE happens during execution, using actual runtime data.
The estimates Spark uses before execution (table sizes, partition sizes, join cardinalities, and so on) are often inaccurate.
AQE fixes this by monitoring runtime statistics and dynamically adjusting the execution plan.
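AQE is enabled by default in recent Spark versions; the configuration keys below are shown only to make the relevant knobs explicit (a sketch, not tuning advice):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")
    .config("spark.sql.adaptive.enabled", "true")                      # re-optimize at runtime
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")   # merge small shuffle partitions
    .config("spark.sql.adaptive.skewJoin.enabled", "true")             # split skewed partitions
    .getOrCreate()
)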
How Spark Solves MapReduce Problems
MapReduce was designed for large batch processing, but it has several fundamental limitations.
MapReduce vs Spark. Source
After every map and reduce step, MapReduce writes intermediate results to disk.
A workflow like:
Step 1: Filter → write to disk
Step 2: Join → write to disk
Step 3: Aggregate → write to disk
Each step flushes to HDFS → High latency.
Map ----> Disk ----> Reduce ----> Disk ----> Next Job ----> Disk
MapReduce does not keep data in memory between stages.
This makes iterative algorithms extremely slow (ML, graph, ETL).
MapReduce forces everything into map() and reduce() functions.
Spark was built specifically to overcome the limitations above.
Spark keeps intermediate results in RAM using RDDs/DataFrames, avoiding disk writes.
Transform ----> Transform ----> Transform (All in memory, lazy evaluation)
Spark can be up to 100x faster than MapReduce for in-memory workloads.
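The iterative pattern that is painful in MapReduce becomes natural in Spark once the data is cached in memory. A minimal sketch:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()

data = spark.range(1_000_000).withColumn("value", F.rand())
data.cache()  # keep the dataset in memory across iterations

# Each iteration reuses the in-memory data instead of re-reading from disk,
# which is exactly the pattern MapReduce handles poorly.
for _ in range(5):
    data.agg(F.avg("value")).show()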
MapReduce’s Problems:
- Writes to disk after every step (slow).
- No memory caching → terrible for iterative jobs.
- Limited programming model (map + reduce only).
- High latency → not suitable for real-time analytics.
Spark’s Solutions:
- In-memory computing → 100x faster.
- DAG execution engine → optimized multi-step pipelines.
- Caching → ideal for ML and iterative algorithms.
- High-level APIs + Catalyst + Tungsten + AQE → automatic optimization.
- Supports batch, streaming, SQL, ML — all unified.
PySpark
Spark is written in Scala, which runs on the JVM. But when you write PySpark code, you’re writing Python, not Scala.
Python code cannot talk directly to Spark’s Scala engine. PySpark solves this using wrappers and a communication bridge.
This is your Python script:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

This runs as a Python process. The Python driver must communicate with the real Spark driver (written in Scala/Java).
So inside the same container:
Python main()      →      JVM main()
(PySpark driver)           (Spark driver)
PySpark sends commands to the JVM driver through a bridge library called Py4J.
Since Spark is written in Scala, Spark exposes Java-compatible APIs.
Scala ↔ Java are interoperable.
What really happens:
- spark.read.csv() → a Python wrapper
- It calls the equivalent JVM method via Py4J
- The JVM creates a DataFrame object in the Java world
- Python stores only a reference pointing to the JVM DataFrame
For most DataFrame/SQL operations, the data is processed in the JVM on executors, and Python only holds references. However, data is actually brought into Python for actions like collect()/toPandas() and for Python/pandas UDFs, where batches are serialized between the JVM and Python worker processes.
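A small sketch of the reference-vs-data distinction. The _jdf attribute is an internal Py4J handle, shown only for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

df = spark.range(10)          # built and held in the JVM
print(type(df._jdf))          # Py4J JavaObject: Python only holds a reference

rows = df.collect()           # an action: now data is serialized into Python
print(rows[:3])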
Memory
An executor container consists of:
🟩 On-heap memory (managed by Spark & the JVM)
🟪 Off-heap memory (optional, managed outside the JVM)
🟥 Overhead memory (extra safety buffer)
Inside on-heap memory, Spark further splits memory into:
- Execution Memory
- Storage Memory
- User Memory
- Reserved Memory
Memory. Source
Spark Executor Container
This is the total memory allocated per executor.
On-Heap Memory
This is the main block where the JVM stores Spark objects.
- Execution Memory is used for runtime operations.
- Storage Memory is used to cache or persist.
- User Memory is used for Python objects, UDF-created objects, etc.
- Reserved Memory is a default 300MB region Spark keeps for itself.
Off-Heap Memory
Used for:
- Tungsten binary format
- UnsafeRow storage
- Shuffle memory
- External memory allocations outside the JVM
- Reducing GC pressure
Off-heap memory is used to reduce GC overhead and can be faster for Spark workloads, but it’s not ‘safer’ than heap memory.
Misconfigured off-heap allocations can actually be more dangerous because they bypass the JVM’s GC protection.
Overhead Memory
Allocated automatically.
Default:
- max(384 MB, 10% of executor memory)
It’s used for:
- Python worker processes (PySpark)
- Native libraries
- JVM overhead
- Off-heap allocations, if not explicitly set
Without overhead memory, executors would crash with Out of Memory errors.
On-Heap Memory (spark.executor.memory)
├── Execution Memory → joins, shuffles, sorts
├── Storage Memory → caching RDD/DataFrame
├── User Memory → Python objects, UDFs
└── Reserved Memory → 300 MB safety zone

Off-Heap Memory → Tungsten, off-heap caching
Overhead Memory → JVM overhead, Python workers
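A hedged configuration sketch tying these regions to their settings (values are illustrative only):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("memory-config-sketch")
    .config("spark.executor.memory", "8g")           # on-heap (execution + storage + user + reserved)
    .config("spark.executor.memoryOverhead", "1g")   # overhead: default is max(384 MB, 10% of executor memory)
    .config("spark.memory.offHeap.enabled", "true")  # opt in to off-heap memory
    .config("spark.memory.offHeap.size", "2g")
    .getOrCreate()
)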
Sources
https://www.youtube.com/watch?v=iXVIPQEGZ9Y
https://www.youtube.com/watch?v=z4Owc8RRApg
https://www2.stat.duke.edu/courses/Fall20/sta523/slides/lecture/lec_23.html#1
https://www.simplilearn.com/tutorials/big-data-tutorial/spark-parallelize
https://data-flair.training/blogs/learn-apache-spark-sparkcontext/
https://techgurudehradun.in/blog/?p=77
https://www.databricks.com/glossary/catalyst-optimizer
https://k21academy.com/oracle/spark-vs-mapreduce/
https://proedu.co/spark/understanding-apache-spark-architecture/