Understanding Apache Spark Architecture from the Ground Up
I’ve previously written two blog posts about Spark — one as an introduction to my Scala series, and the other on a practical implementation using PySpark.
In this post, I’ll cover Spark’s architecture and the key concepts related to it.
Definition
Let’s begin with the definition.
Apache Spark is a unified computing engine and a set of libraries for parallel data processing on computer clusters.
Unified computing engine
Spark provides a single engine that can handle many types of big data workloads.
Instead of using separate systems for each workload (like Hadoop MapReduce for batch, Storm for streaming, Hive for SQL), Spark offers one platform that can do all of them with consistent APIs and performance characteristics.
You do not have to stitch multiple systems together. This is why we call it unified.
Computing engine
A “computing engine” refers to the component that actually executes computations across a cluster:
Computing Engine. Illustrated with ChatGPT.
Set of libraries
Spark includes high-level libraries built on top of the core engine.
Spark libraries. Source
All these libraries share the same execution engine and work together seamlessly.
Parallel data processing
Spark processes data in parallel using the resources of a distributed cluster.
Parallelization. Source
It breaks large datasets into partitions and executes tasks on them concurrently across multiple machines.
On computer clusters
Spark is designed for environments where you have multiple computers working together:
- Most machines act as worker nodes
- One machine hosts the driver node (which controls the job)
- A cluster manager allocates resources (YARN, Kubernetes, Mesos, standalone)
Spark can also run on a single machine for development, but its real power emerges on a cluster.
Spark gives you one powerful engine that can run many types of data workloads, using parallel processing across many machines, supported by rich libraries for SQL, streaming, ML, and graph analytics.
Where We Can Use Apache Spark
We can use Spark on all of the major platforms:
- Databricks
- Amazon EMR
- Google Cloud Dataproc
- Microsoft Azure (Databricks or Azure Synapse Analytics)
- IBM Cloud (Analytics Engine or Watson)
Spark can run on container orchestration platforms:
- Kubernetes (via Spark-on-K8s integration)
- OpenShift
Spark can be deployed in your own data center:
- Standalone Spark clusters
- Hadoop YARN clusters
- Apache Mesos clusters
- Bare-metal servers
- VM-based clusters
Spark can also run locally for development and testing:
- On a laptop (local mode)
- In Docker containers
- With a local PySpark or spark-shell setup
Core Execution Architecture
Suppose we have written an application in Python, Scala, SQL, or Java.
When we run that code, Spark creates a SparkSession, which acts as the entry point to the Spark engine.
High level architecture.
Driver / Master Node
The driver is the brain of the Spark application.
It receives the application and determines what we want to do with the data. Once it identifies the most efficient way to execute the application, it splits the work into tasks that are distributed to the executors.
The driver plans the jobs and ensures they are executed. If an executor fails, the driver assigns the task to another executor. The driver almost never interacts with the data directly.
It is a JVM process responsible for:
- Running the Spark application
- Holding the SparkSession
- Holding the SparkContext
- Creating the query plans
- Sending tasks to executors
- Collecting results
SparkSession → User-facing entry point
SparkContext → Low-level engine interface
Spark Driver → The process that runs SparkSession + SparkContext and controls your job
They are not the same, but they live inside each other:
- The driver hosts the SparkSession
- The SparkSession contains the SparkContext
- The SparkContext talks to the cluster manager
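A minimal PySpark sketch of this nesting, assuming a local session created purely for illustration:

from pyspark.sql import SparkSession

# The SparkSession is the user-facing entry point.
spark = SparkSession.builder.appName("demo").master("local[*]").getOrCreate()

# The SparkContext lives inside the SparkSession.
sc = spark.sparkContext
print(sc.appName)             # "demo"
print(sc.defaultParallelism)  # number of cores available locally

spark.stop()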
SparkContext
SparkContext is the core engine interface in Spark.
It is the component that connects your application (the driver) to the cluster manager and executors.
Functions of SparkContext in Spark. Source
- Get the Current Status of Application — Allows you to query the real-time running state, active jobs, and progress of your Spark application.
- Set the Configuration — Applies Spark settings (memory, executors, serialization, etc.) that control how the application runs on the cluster.
- Cancel a Job — Stops a running job immediately to free resources or interrupt long computations.
- Access Various Services — Provides access to internal Spark services such as the block manager, environment info, and cluster manager.
- Closure Cleaning in Spark — Automatically removes unnecessary variables from closures before sending them to executors to optimize serialization.
- Programmable Dynamic Allocation — Adjusts the number of executors up or down based on workload demand.
- Register Spark Listener — Lets you attach custom listeners to track events like job start, job end, task failures, and executor changes.
- Unpersist RDDs — Removes cached RDDs from memory or disk to free up storage.
- Cancel a Stage — Aborts only a specific stage within a job without stopping the entire job.
- Access Persistent RDD — Provides a list of all currently cached RDDs along with their storage details.
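A small sketch of a few of these capabilities in PySpark. The methods below are standard SparkContext APIs; the job-group name is just an illustrative placeholder.

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
sc = spark.sparkContext

print(sc.getConf().getAll())                   # current configuration
print(sc.statusTracker().getActiveJobsIds())   # status of running jobs

sc.setJobGroup("nightly-etl", "illustrative job group")  # tag subsequent jobs
# sc.cancelJobGroup("nightly-etl")                       # ...so they can be cancelled later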
Cluster Manager
This is the component that Spark uses to request resources (CPU, RAM, machines).
Depending on where you run:
- YARN (Hadoop)
- Kubernetes
- Standalone Spark
- Mesos
The cluster manager decides:
- How many executors to allocate
- How much memory and CPU per executor
- Which machines will run them
It does not run tasks; it simply orchestrates resources.
Executors / Worker Nodes
Executors are JVM processes that run on worker nodes in the cluster (or locally in local mode).
Each executor has a configured number of cores (spark.executor.cores) that determines how many tasks it can run concurrently.
Executors have memory allocated (spark.executor.memory) for:
- Storage memory (cached RDDs/DataFrames)
- Execution memory (shuffles, joins, sorts)
- User memory (user data structures)
The driver’s scheduler assigns tasks to executors, which then execute them and report results back.
Executors are launched when the application starts and remain alive until the application completes (unless they fail or are dynamically removed). This is in contrast to MapReduce, where containers are created per job.
The driver and each executor run as separate JVM processes, usually on different worker nodes (which may be bare metal, VMs, or containers). In local mode, they can all run on the same machine.
A worker node can host more than one executor.
An executor has as many task slots as it has CPU cores.
An executor can run as many tasks in parallel as it has slots. A task is the smallest unit of work.
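As a rough sketch, executor resources are usually set through configuration when the session (or spark-submit job) is created. The numbers below are illustrative, not a recommendation:

from pyspark.sql import SparkSession

# Hypothetical sizing: 4 executors, each with 4 cores and 8 GB of heap,
# so up to 4 * 4 = 16 tasks can run in parallel across the cluster.
spark = (
    SparkSession.builder
    .appName("executor-sizing-sketch")
    .config("spark.executor.instances", "4")
    .config("spark.executor.cores", "4")
    .config("spark.executor.memory", "8g")
    .getOrCreate()
)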
Task Execution Flow
Driver sends tasks → Executors
The driver divides your job into stages, then into tasks:
- Each task handles one data partition
- Tasks are sent to executors for actual computation
Executors run tasks
Inside each executor:
- every slot runs one task at a time
- tasks load the relevant data partition
- tasks perform the transformation (map, reduce, join, etc.)
Executors work in parallel across the cluster.
Executors return results → Driver
After finishing their computations, executors send:
- task results
- metrics
- logs
- shuffle outputs
back to the driver, or write them to disk.
Data Shuffling
Executors sometimes need to exchange data with each other — especially for:
- join operations
- groupBy
- reduceByKey
- sorting
This is called a shuffle. It is often the costliest operation in Spark because it:
- moves data across the network
- writes intermediate files
- reroutes partitions
- requires writing data to disk, sending it over the network, and reading it back
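A quick way to see this locally. In this minimal sketch, filter is a narrow transformation while groupBy forces a shuffle:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()

df = spark.range(1_000_000).withColumn("key", F.col("id") % 10)

# Narrow: each partition is filtered independently, no shuffle.
narrow = df.filter(F.col("id") > 100)

# Wide: rows with the same key must be brought together, so Spark shuffles.
wide = df.groupBy("key").count()

wide.explain()  # the physical plan contains an Exchange (shuffle) operator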
Deployment Modes
- Local Mode
- Client Mode
- Cluster Mode
Local Mode
Driver and executors both run on your computer.
- Everything happens on your machine: driver + execution threads.
- No actual cluster is used.
- Used for development, testing, and learning Spark.
Client Mode (Cluster-Client Mode)
Driver runs on your machine, executors run in the cluster.
- You submit the job from your laptop or edge node.
- The driver process stays on your machine, so logs and UI are local.
- Executors run remotely on the cluster nodes.
- If the submitter machine dies → the job dies, because the driver is local.
- Used interactively in notebooks, development environments, Databricks notebooks, etc.
Cluster Mode
Both driver and executors run in the cluster.
- The driver is launched inside the cluster.
- Your laptop only submits the job and disconnects.
- The job continues even if your machine shuts down.
- Best for production, scheduled jobs, pipelines.
- Usually used with YARN, Kubernetes, Dataproc, EMR, Spark Standalone, etc.
Abstractions on Data
Apache Spark provides multiple abstraction layers for working with distributed data. Each layer gives you a different balance of control, performance, and developer productivity.
👉 RDD = the old way
👉 DataFrames / Datasets = the modern, optimized way
History of Spark APIs. Source
RDD (Resilient Distributed Dataset) — Low-Level API
RDDs are the foundational, low-level data structure in Spark: the original API, introduced in version 1.x.
They represent an immutable, distributed collection of objects.
They are considered the old approach:
- RDDs do not use the Catalyst optimizer
- Developers must control everything
- Most modern Spark workloads should use the higher-level Structured APIs (DataFrames / Datasets). The RDD API still exists for low-level and special-use cases, but is no longer the recommended default.
But RDD is NOT deprecated — just lower-level.
Structured API / Spark SQL (High-Level API)
This layer includes:
- DataFrames
- Datasets
- SQL queries
These APIs are declarative: You describe what you want, and Spark decides how to execute it.
Spark Structured API. Source
- Catalyst Optimizer automatically optimizes your query plan
- Produces faster, more efficient execution
- Safer, easier, more stable APIs
- Best for almost all practical ETL, analytics, and ML workflows
DataFrames are built on top of RDDs. They organize data into named columns, like a relational table, and support SQL queries, DataFrame transformations, and automatically optimized execution plans.
Datasets are a typed API available in Scala and Java; they are not available in PySpark.
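To make the contrast concrete, here is a minimal sketch of the same aggregation written against both APIs (PySpark, local mode assumed):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
sc = spark.sparkContext

# Low-level RDD API: you spell out how to compute the result.
rdd_result = (
    sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
    .reduceByKey(lambda x, y: x + y)
    .collect()
)

# Structured API: you declare what you want; Catalyst plans the execution.
df = spark.createDataFrame([("a", 1), ("b", 2), ("a", 3)], ["key", "value"])
df_result = df.groupBy("key").sum("value").collect()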
Operations
Spark operations come in two types: **Transformations** and **Actions**.
Spark operations. Source
Transformations
Transformations describe what you want to do to the data.
They are lazy and do not run until Spark sees an action. Spark only builds a logical plan during transformations.
Narrow transformations
Data stays within the same partition → no shuffle
Wide transformations
Require data to move across partitions → shuffle
Actions
Actions are the operations that make Spark actually run the computation and return results.
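A minimal sketch of this laziness in PySpark: the transformations below only build a plan, and nothing executes until the action at the end.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()

df = spark.range(100)

# Transformations: lazy, Spark only records them in a logical plan.
doubled = df.withColumn("double", F.col("id") * 2)
filtered = doubled.filter(F.col("double") > 50)

# Action: triggers a job that actually runs the plan and returns a result.
print(filtered.count())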
Transformation vs Actions. Source
Jobs, Stages, and Tasks
When the driver receives an action (like count(), show(), save()), it builds a physical execution plan.
Execution plan. Source
Job
A job corresponds to one action you call. It is the entire computation Spark needs to run to deliver the result of that action.
Examples that create jobs:
- count()
- collect()
- show()
- write.format(...).save()
If your script contains 3 actions, Spark will create 3 different jobs.
Stage
Spark splits a job into stages, based on where shuffles are required.
- Narrow transformations stay in the same stage.
- Wide transformations require a shuffle, so Spark splits into another stage.
Task
Each stage splits into many tasks: one task per partition.
Executors run tasks in parallel.
If your DataFrame has 8 partitions → you get 8 tasks in that stage.
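A small sketch of that relationship (local mode assumed; job and stage boundaries are visible in the Spark UI):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

df = spark.range(1_000_000).repartition(8)
print(df.rdd.getNumPartitions())   # 8 partitions -> 8 tasks in that stage

# Each action below triggers its own job.
df.count()
df.filter("id % 2 = 0").count()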
DAG
A DAG (Directed Acyclic Graph) is the blueprint Spark creates to understand how your job should be executed, step-by-step, across the cluster.
Example DAG. Source
Stage 0
Inside Stage 0, Spark reads data using:
- newAPIHadoopFile
- then applies map
This is a narrow transformation chain, meaning Spark can run these transformations without shuffling data.
- The vertical arrows show operation dependencies
- Stage 0 produces intermediate data used in Stage 2
Stage 1
Stage 1 starts with a different input:
- newLogsRDD
- followed by filter, filter, map
It operates on a different input source and contains only narrow transformations.
Output of Stage 1 feeds into Stage 2’s cogroup.
Stage 2
This is the most complex part of the DAG.
- cogroup
- mapValues
- flatMap
- map
- union
And Stage 2 also reads a third input via newAPIHadoopFile, then applies:
- filter
- map
- then union
Stage 2 contains wide transformations, which require shuffling data across partitions:
- cogroup always creates a shuffle
- joins, cogroup, and groupBy always break stages
Stage 2 cannot start until Stage 0 and Stage 1 are finished.
Stage 3
Stage 3 contains:
- reduceByKey
- coalesce
This is the final step → the output of reduceByKey becomes the result of the job.
Optimization
Optimizing Spark is a two-sided responsibility:
- **Spark** automatically performs deep internal optimizations using Catalyst, Tungsten, and AQE.
- **The user** still has important decisions to make to avoid bottlenecks and ensure efficiency.
What the User Can Do to Optimize Spark
These are data engineering best practices — choices you control that dramatically influence performance.
- Use efficient file formats
- Avoid expensive operations like sorts
- Minimize data volume
- Cache/persist only when beneficial
- Repartition or coalesce appropriately
- Avoid UDFs when built-in functions can do the job
- Partition / index / bucket the data → Improves skipping of irrelevant data and speeds up joins.
- Optimize the cluster
User optimizations help Spark avoid unnecessary work.
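A hedged sketch that combines several of these practices. The file paths and column names are made up for illustration:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Hypothetical dataset stored in an efficient columnar format.
orders = spark.read.parquet("/data/orders")

# Minimize data volume early: filter rows and prune columns before heavy work.
recent = (
    orders
    .filter(F.col("order_date") >= "2024-01-01")
    .select("order_id", "customer_id", "amount")
)

recent.cache()  # cache only because the result is reused below

recent.groupBy("customer_id").agg(F.sum("amount").alias("total")).show()
recent.coalesce(8).write.mode("overwrite").parquet("/data/orders_recent")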
Catalyst Optimizer
Catalyst is Spark’s brain that takes a SQL/DataFrame query and turns it into the most efficient execution plan.
Catalyst Optimizer. Source
1️⃣ You start with: SQL Query / DataFrame / Dataset
2️⃣ Unresolved Logical Plan
Spark parses your query and creates an initial logical plan, but at this point:
- table names may not be verified
- columns may not be resolved yet
- no optimizations have happened
This is why it is called unresolved.
3️⃣ Analysis Phase — Using the Catalog
Catalyst now uses Spark’s Catalog (metadata store) to:
- check if tables exist
- check column names and data types
- resolve references
After resolving all attributes, Spark produces a Logical Plan.
This plan represents what needs to be done, not how.
4️⃣ Logical Optimization Phase
Catalyst applies a series of rule-based optimizations, such as:
- constant folding → calculate constant expressions upfront
- predicate pushdown → push WHERE filters as close to the data source as possible
- projection pruning → remove unused columns
- null propagation
- simplifying expressions
The result is the Optimized Logical Plan.
Still logical — no execution strategy yet.
5️⃣ Physical Planning Phase
Given the optimized logical plan, Spark generates multiple possible physical plans.
Examples of physical strategies:
- Hash Join
- Sort-Merge Join
- Broadcast Join
- ParquetScan vs FileScan
- Hash Aggregate
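For example, you can nudge the planner toward one strategy and inspect the plan it selects. A sketch using the standard broadcast hint:

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.master("local[*]").getOrCreate()

large = spark.range(1_000_000).withColumnRenamed("id", "key")
small = spark.range(100).withColumnRenamed("id", "key")

# The broadcast hint steers Spark toward a BroadcastHashJoin
# instead of a shuffle-based Sort-Merge Join.
joined = large.join(broadcast(small), "key")
joined.explain()  # the physical plan shows the selected join strategy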
6️⃣ Cost Model — Choosing the Best Plan
Spark evaluates each physical plan using its cost model, which estimates:
- shuffle cost
- scan cost
- memory/CPU usage
- broadcast feasibility
- data size
Based on these estimations, Spark selects the lowest cost plan.
This becomes the Selected Physical Plan.
7️⃣ Code Generation (Whole Stage Codegen)
Now the physical plan is translated into efficient JVM bytecode:
- loops are fused
- operators run as compiled code
- unnecessary memory allocation is avoided
The output of this phase is execution-ready RDD code.
This is what Tungsten runs on the executors.
8️⃣ Execution on RDDs
Finally, Spark sends the RDD execution graph to executors, partition by partition.
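You can watch this whole pipeline from PySpark: explain with the extended mode prints the parsed, analyzed, and optimized logical plans plus the selected physical plan. A minimal sketch:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()

df = spark.range(1000).withColumn("key", F.col("id") % 10)
query = df.filter(F.col("key") == 1).select("id")

# Prints: Parsed Logical Plan, Analyzed Logical Plan,
# Optimized Logical Plan, and the Physical Plan chosen by Catalyst.
query.explain(mode="extended")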
Tungsten Engine
Tungsten is Spark’s low-level execution engine whose primary goal is to make Spark run as efficiently as possible — almost as fast as hand-written native code.
It focuses on memory and CPU efficiency.
Tungsten rewrites Spark’s internal engine to use off-heap memory and CPU-efficient processing.
Catalyst decides WHAT operations Spark should perform. But Catalyst does not execute anything.
Tungsten takes the plan from Catalyst and decides HOW to actually execute it efficiently.
Spark takes entire sections of the execution plan and compiles them into one tight Java function.
Instead of running operators one by one, Spark fuses them:
for (row in data) {
  if (row meets filter) {
    result = compute something...
    write output
  }
}
- No virtual function call
- No intermediate objects
- No per-row overhead
This is similar to how a C++ program might run. That’s why Tungsten is often described as “closer to bare metal.”
Everything is optimized for speed + efficiency.
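You can inspect the generated code yourself: explain's codegen mode prints the fused Java function produced by whole-stage code generation. A small sketch:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()

df = (
    spark.range(1_000_000)
    .filter(F.col("id") % 2 == 0)
    .select((F.col("id") * 2).alias("doubled"))
)

# Prints the whole-stage-codegen Java source for this operator chain.
df.explain(mode="codegen")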
Catalyst → High-level optimizer
- Optimizes logical planning (SQL-level).
Tungsten → Low-level execution engine
- Optimizes actual machine operations on memory and CPU.
Adaptive Query Execution (AQE)
AQE is Spark’s ability to re-optimize a query while it is running.
Traditional Catalyst optimization happens before execution, using estimated statistics.
AQE in Spark. Source
AQE happens during execution, using actual runtime data.
The estimates Spark uses before execution (table sizes, partition sizes, join cardinalities, and so on) are often inaccurate.
AQE fixes this by monitoring runtime statistics and dynamically adjusting the execution plan.
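AQE is enabled by default in recent Spark versions; the configuration keys below are shown only to make the relevant knobs explicit (a sketch, not tuning advice):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")
    .config("spark.sql.adaptive.enabled", "true")                      # re-optimize at runtime
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")   # merge small shuffle partitions
    .config("spark.sql.adaptive.skewJoin.enabled", "true")             # split skewed partitions
    .getOrCreate()
)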
How Spark Solves MapReduce Problems
MapReduce was designed for large batch processing, but it has several fundamental limitations.
MapReduce vs Spark. Source
After every map and reduce step, MapReduce writes intermediate results to disk.
A workflow like:
Step 1: Filter → write to disk
Step 2: Join → write to disk
Step 3: Aggregate → write to disk
Each step flushes to HDFS → High latency.
Map ----> Disk ----> Reduce ----> Disk ----> Next Job ----> Disk
MapReduce does not keep data in memory between stages.
This makes iterative algorithms extremely slow (ML, graph, ETL).
MapReduce forces everything into map() and reduce() functions.
Spark was built specifically to overcome the limitations above.
Spark keeps intermediate results in RAM using RDDs/DataFrames, avoiding disk writes.
Transform ----> Transform ----> Transform (All in memory, lazy evaluation)
Spark can be up to 100x faster than MapReduce for in-memory workloads.
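The iterative pattern that is painful in MapReduce becomes natural in Spark once the data is cached in memory. A minimal sketch:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()

data = spark.range(1_000_000).withColumn("value", F.rand())
data.cache()  # keep the dataset in memory across iterations

# Each iteration reuses the in-memory data instead of re-reading from disk,
# which is exactly the pattern MapReduce handles poorly.
for _ in range(5):
    data.agg(F.avg("value")).show()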
MapReduce’s Problems:
- Writes to disk after every step (slow).
- No memory caching → terrible for iterative jobs.
- Limited programming model (map + reduce only).
- High latency → not suitable for real-time analytics.
Spark’s Solutions:
- In-memory computing → 100x faster.
- DAG execution engine → optimized multi-step pipelines.
- Caching → ideal for ML and iterative algorithms.
- High-level APIs + Catalyst + Tungsten + AQE → automatic optimization.
- Supports batch, streaming, SQL, ML — all unified.
PySpark
Spark is written in Scala, which runs on the JVM. But when you write PySpark code, you’re writing Python, not Scala.
Python code cannot talk directly to Spark’s Scala engine. PySpark solves this using wrappers and a communication bridge.
This is your Python script:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

This runs as a Python process. The Python driver must communicate with the real Spark driver (written in Scala/Java).
So inside the same container:
Python main()      →      JVM main()
(PySpark driver)           (Spark driver)
PySpark sends commands to the JVM driver through a bridge library called Py4J.
Since Spark is written in Scala, Spark exposes Java-compatible APIs.
Scala ↔ Java are interoperable.
What really happens:
- spark.read.csv() → a Python wrapper
- It calls the equivalent JVM method via Py4J
- The JVM creates a DataFrame object in the Java world
- Python stores only a reference pointing to the JVM DataFrame
For most DataFrame/SQL operations, the data is processed in the JVM on executors, and Python only holds references. However, data is actually brought into Python for actions like collect()/toPandas() and for Python/pandas UDFs, where batches are serialized between the JVM and Python worker processes.
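A small sketch of the reference-vs-data distinction. The _jdf attribute is an internal Py4J handle, shown only for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

df = spark.range(10)          # built and held in the JVM
print(type(df._jdf))          # Py4J JavaObject: Python only holds a reference

rows = df.collect()           # an action: now data is serialized into Python
print(rows[:3])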
Memory
An executor container consists of:
🟩 On-heap memory (managed by Spark & the JVM)
🟪 Off-heap memory (optional, managed outside the JVM)
🟥 Overhead memory (extra safety buffer)
Inside on-heap memory, Spark further splits memory into:
- Execution Memory
- Storage Memory
- User Memory
- Reserved Memory
Memory. Source
Spark Executor Container
This is the total memory allocated per executor.
On-Heap Memory
This is the main block where the JVM stores Spark objects.
- Execution Memory is used for runtime operations.
- Storage Memory is used to cache or persist.
- User Memory is used for Python objects, UDF-created objects, etc.
- Reserved Memory is a default 300MB region Spark keeps for itself.
Off-Heap Memory
Used for:
- Tungsten binary format
- UnsafeRow storage
- Shuffle memory
- External memory allocations outside the JVM
- Reducing GC pressure
Off-heap memory is used to reduce GC overhead and can be faster for Spark workloads, but it’s not ‘safer’ than heap memory.
Misconfigured off-heap allocations can actually be more dangerous because they bypass the JVM’s GC protection.
Overhead Memory
Allocated automatically.
Default:
- max(384 MB, 10% of executor memory)
It’s used for:
- Python worker processes (PySpark)
- Native libraries
- JVM overhead
- Off-heap allocations, if not explicitly set
Without overhead memory, executors would crash with Out of Memory errors.
On-Heap Memory (spark.executor.memory)
├── Execution Memory → joins, shuffles, sorts
├── Storage Memory → caching RDD/DataFrame
├── User Memory → Python objects, UDFs
└── Reserved Memory → 300 MB safety zone

Off-Heap Memory → Tungsten, off-heap caching
Overhead Memory → JVM overhead, Python workers
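A hedged configuration sketch tying these regions to their settings (values are illustrative only):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("memory-config-sketch")
    .config("spark.executor.memory", "8g")           # on-heap (execution + storage + user + reserved)
    .config("spark.executor.memoryOverhead", "1g")   # overhead: default is max(384 MB, 10% of executor memory)
    .config("spark.memory.offHeap.enabled", "true")  # opt in to off-heap memory
    .config("spark.memory.offHeap.size", "2g")
    .getOrCreate()
)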
Sources
https://www.youtube.com/watch?v=iXVIPQEGZ9Y
https://www.youtube.com/watch?v=z4Owc8RRApg
https://www2.stat.duke.edu/courses/Fall20/sta523/slides/lecture/lec_23.html#1
https://www.simplilearn.com/tutorials/big-data-tutorial/spark-parallelize
https://data-flair.training/blogs/learn-apache-spark-sparkcontext/
https://techgurudehradun.in/blog/?p=77
https://www.databricks.com/glossary/catalyst-optimizer
https://k21academy.com/oracle/spark-vs-mapreduce/
https://proedu.co/spark/understanding-apache-spark-architecture/