Apache Spark often feels magical when we first start using it. We write a few lines of PySpark code, hit run, and suddenly terabytes of data are being processed in seconds. But behind this simplicity lies a powerful and beautifully engineered distributed system. Understanding Spark’s architecture is the key to writing efficient code, optimizing queries, and making the most of Databricks.
Image by Author
We will explore what Spark is actually doing behind the scenes, how it runs our code, how clusters are organized, why lazy evaluation matters, and what makes Spark so fast. By the end, Spark will feel less like a black box and more like a system we fully understand.
Understanding the Spark Execution Architecture
A Spark application does not run on a single machine. Instead, it runs on a cluster, a group of machines working together in parallel. To make this distributed execution possible, Spark follows a well-structured architecture centered around three key elements: the driver, the executors, and the cluster manager. Each plays a distinct role, and together they form the backbone of Spark’s distributed computation model.
The Spark Driver
Every Spark program begins with the driver. We can think of the driver as the brain of our entire application. It runs our main program, creates the SparkSession, analyzes the tasks our code needs to perform, and constructs a plan for how those tasks should be executed on the cluster. The driver keeps track of metadata, manages the overall workflow, and collects results once the work is complete.
When writing PySpark code inside Databricks, our notebook is continuously communicating with the driver. The notebook sends instructions, the driver interprets them, converts them into execution plans, and decides how to distribute the work across the cluster.
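As a minimal sketch, this is what the driver’s entry point looks like in a standalone PySpark script (on Databricks a ready-made session called spark already exists, so we would not create one ourselves):

```python
from pyspark.sql import SparkSession

# Creating the SparkSession starts the driver's work: it is the entry point
# that builds execution plans and coordinates the executors.
spark = (
    SparkSession.builder
    .appName("architecture-demo")  # hypothetical application name
    .getOrCreate()
)

# Every operation on this DataFrame is planned and tracked by the driver.
df = spark.range(10)
print(df.count())
```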
Executors
While the driver is the brain, the executors are the workforce. Each executor runs on a separate machine in the cluster. Executors are responsible for executing the tasks assigned by the driver, storing intermediate data in memory for caching, and returning results back to the driver.
The more executors our cluster has, the more tasks can be executed in parallel, leading to faster performance and the ability to handle larger datasets.
Each executor has its own slice of memory and CPU cores. This isolation is important because it means that one executor crashing does not necessarily bring down the entire application. Spark automatically handles such failures.
Cluster Managers
Spark does not manage machines by itself. It relies on an external system called a cluster manager to allocate resources such as CPUs, RAM, and machines. In traditional Spark environments, this cluster manager might be Standalone, YARN, or Mesos.
Standalone mode is Spark’s built-in option, often used for testing or small deployments. YARN, commonly found in Hadoop-based enterprises, supports massive clusters and is widely used in production. Mesos is more flexible but is less common today.
Databricks simplifies this entire layer. You do not choose YARN, Standalone, or anything else. Databricks automatically provisions, tunes, scales, and manages the clusters behind the scenes, letting you focus solely on the code.
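Outside Databricks, we typically tell Spark which cluster manager to use and how many resources each executor should get. A hedged sketch, with illustrative values rather than recommendations:

```python
from pyspark.sql import SparkSession

# Assumed setup: submitting to a YARN cluster; swap .master("local[4]")
# to run everything on a single machine for testing.
spark = (
    SparkSession.builder
    .appName("cluster-config-demo")              # hypothetical name
    .master("yarn")                              # which cluster manager to talk to
    .config("spark.executor.instances", "4")     # how many executors to request
    .config("spark.executor.cores", "4")         # CPU cores per executor
    .config("spark.executor.memory", "8g")       # memory per executor
    .getOrCreate()
)
```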
Image by Author
RDD vs DataFrame vs Dataset: What’s the Difference?
To truly understand Spark, it helps to understand how it represents data internally and how that representation has evolved. Spark began with RDDs (Resilient Distributed Datasets), low-level distributed collections of objects. RDDs offered immense control but required manual optimization: they were powerful but verbose and inefficient for complex analytical tasks.
Spark then introduced DataFrames, which provide a higher level, table like interface with columns and data types. DataFrames enable Spark to analyze our transformations and automatically optimize them through its Catalyst optimizer. This makes DataFrames far faster and significantly easier to work with than raw RDDs.
A third abstraction, Datasets, combines the type safety of RDDs with the optimizations of DataFrames, but only exists in Scala and Java. In Python, the modern Spark workflow is almost entirely centered around DataFrames and SQL.
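To make the contrast concrete, here is a hedged sketch of the same aggregation written twice, once against an RDD and once against a DataFrame (the data and column names are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-vs-dataframe").getOrCreate()

data = [("NY", 10), ("NY", 20), ("LA", 5)]  # illustrative (city, amount) pairs

# RDD version: we spell out *how* to compute, and Spark runs it as written.
rdd = spark.sparkContext.parallelize(data)
rdd_totals = rdd.reduceByKey(lambda a, b: a + b).collect()

# DataFrame version: we declare *what* we want, and Catalyst decides how.
df = spark.createDataFrame(data, ["city", "amount"])
df_totals = df.groupBy("city").sum("amount").collect()
```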
Lazy Evaluation: Spark’s Secret Weapon
One of the reasons Apache Spark feels so fast and efficient is because it doesn’t rush to execute every line of code. Instead, Spark waits intentionally. This idea is called lazy evaluation, and it completely changes how our code runs.
When you write transformations like
df = df.filter(df.age > 18).groupBy("city").count()
Spark does not immediately filter, group, or count anything. Instead, Spark quietly builds a **logical plan**, a blueprint of what needs to be done.
Let’s say we run this:

df = spark.read.csv("sales.csv", header=True, inferSchema=True).select("city", "age", "amount")
df2 = df.filter(df.age > 18)
df3 = df2.groupBy("city").sum("amount")
df3.show()
We just wrote four lines, but Spark actually executes only when .show() is called.
1. Spark Builds a Logical Plan (Unoptimized)
Think of this as Spark’s first draft.
Read CSV → Select columns city, age, amount → Filter rows where age > 18 → Group by city → Sum(amount) → Show
This is a raw plan. No optimization yet.
2. Spark Uses Catalyst to Create an Optimized Logical Plan
Catalyst looks for improvements:
- push filters down closer to the data source
- remove unnecessary steps
- prune unused columns
- rearrange operations for efficiency
Optimized plan:
Read CSV (only city, age, amount — column pruning) → Filter age > 18 (pushed down) → Group by city → Sum(amount) → Show
3. Spark Builds a Physical Plan
Now Spark decides how to execute this across the cluster.
Stage 1:
- Read CSV in parallel
- Apply filter on executors
- Map city/amount pairs

Stage 2 (Shuffle):
- Move data so all rows of the same city end up together
- Perform aggregation

Stage 3:
- Display final result to driver (show)
This is the real plan that runs on executors.
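We do not have to take these plans on faith; Spark will print them for us. A quick sketch (the exact output layout varies by Spark version):

```python
# Shows the parsed, analyzed, and optimized logical plans plus the physical plan.
df3.explain(extended=True)

# Spark 3.0+ also supports a more readable layout.
df3.explain(mode="formatted")
```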
Spark only performs computation when you call an action such as:
- show()
- collect()
- count()
- write()
This delayed execution allows Spark to:
- Optimize the entire workflow
- Remove unnecessary steps
- Combine operations
- Reduce shuffles and I/O
Lazy evaluation allows Spark to act like a smart planner:
“Before I do any work, let me figure out the quickest, cheapest, and most efficient way to get this done.”
This plan-before-executing approach is a major reason Spark can process terabytes of data so quickly.
DAG: How Spark Organizes Your Computation
When an action is triggered, Spark creates a **DAG**, a Directed Acyclic Graph. The DAG represents the flow of your transformations, step by step.
Here’s what happens:
- Our DataFrame code is parsed into a logical plan
- Spark optimizes it using Catalyst
- The optimized plan becomes a DAG of stages
- Each stage contains tasks
- Each task runs on an executor
If you think of your job as a recipe, the DAG is the list of step-by-step instructions, with ingredients, flow, and dependencies.
Understanding DAGs helps you debug, optimize performance, and understand why Spark behaves a certain way.
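One practical detail worth knowing: each stage is broken into one task per data partition, so the partition count largely determines how much parallelism a stage gets. A small hedged check:

```python
# Roughly how many tasks the first stage of this DataFrame's job will get:
# one task per partition. The Spark UI's Jobs and Stages tabs show the
# actual DAG, its stages, and the tasks inside each stage.
print(df.rdd.getNumPartitions())
```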
Transformations vs Actions
Every Spark operation falls into one of two categories. Transformations such as filter, select, join, or groupBy do not execute immediately. They simply add steps to the execution plan. Actions, on the other hand, trigger actual execution. When an action is called Spark constructs the DAG, schedules the tasks, sends work to executors, and performs the computation.
This **separation between planning and execution** is what makes Spark both flexible and fast, allowing it to optimize work right before execution.
Transformations
These return a new DataFrame and are lazy. Examples include:
- filter
- select
- withColumn
- join
- groupBy
They simply build the plan.
Actions
These trigger execution. Examples include:
- show
- write
- count
- collect
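A small sketch of the difference in practice, reusing the df from the earlier example:

```python
# Transformations: these return new DataFrames immediately and do no work yet.
adults = df.filter(df.age > 18)
by_city = adults.groupBy("city").count()

# Action: only now does Spark build the DAG, ship tasks to the executors,
# and actually read and aggregate the data.
by_city.show()
```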
Understanding Shuffle: The Costliest Spark Operation
One of the most important concepts in Spark performance tuning is the shuffle. A shuffle happens when Spark needs to redistribute data across machines, for example during a join, a groupBy, a distinct, an orderBy, or a repartition. Imagine each executor is holding different chunks of data. A shuffle forces data to be moved around the cluster so that all the data belonging to the same key lands on the same executor.
Shuffles are expensive because they involve network transfer, disk I/O, data sorting, and high memory usage. Inefficient code often triggers multiple unnecessary shuffles, slowing everything down.
This usually occurs during:
- groupBy
- join
- distinct
- orderBy
- repartition
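One common way to sidestep a shuffle on a join is to broadcast a small lookup table so the large table never has to move. A hedged sketch; the table names are illustrative, and the broadcast side must be small enough to fit in each executor’s memory:

```python
from pyspark.sql.functions import broadcast

# Broadcast join: the small city_lookup_df is copied to every executor,
# so the large sales_df stays put instead of being shuffled by key.
joined = sales_df.join(broadcast(city_lookup_df), "city")

# The number of partitions produced by a shuffle is also tunable.
spark.conf.set("spark.sql.shuffle.partitions", "200")  # 200 is the default
```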
Wrapping Up
Apache Spark’s architecture is a blend of thoughtful design and powerful engineering. Once we understand how drivers, executors, cluster managers, DAGs, and shuffles work together, we can write PySpark code that is not only correct but optimized and production ready.