Spark SQL for Practitioners: The APIs You Use, The Engine You Don't See

I've been writing about Apache Spark recently, and my latest piece covers Spark's architecture. You can check it out at the link below.

That chapter was mostly an abstract, technical overview of Spark's internal mechanisms. In this chapter, we'll take a more practical approach by exploring Spark SQL. Throughout the article, we will work with the same dataset (credit_risk_dataset.csv from Kaggle).

Spark SQL is Spark's structured data processing engine. It consists of multiple APIs (SQL strings, the DataFrame API, programmatic expressions). A common misconception is to think of Spark SQL as a syntax; it is really an engine. A SQL query written as a string and a PySpark DataFrame transformation chain are different expressions of the same underlying computation.

Spark SQL exposes a handful of core abstractions. These classes form the foundation of everything you will do in PySpark SQL. If you browse the official Spark documentation, you'll see dozens of entries covering execution, planning, expressions, encoders, catalogs, stateful processing, and more. The core classes:

- Define how you work with Spark SQL
- Represent how queries are described
- Act as the bridge between user code and Spark's internal execution engine

Under the hood, Spark SQL involves analysis, logical optimization, physical planning, and code generation. But as a PySpark user, you rarely touch those directly. You work with them indirectly through a small, stable set of APIs.

Among the many classes in Spark SQL, the following form the conceptual backbone of everyday PySpark usage:

- SparkSession: Entry point and coordinator
- DataFrame: Distributed, structured dataset
- Column: Declarative expression
- Functions API: Expression construction
- Schema / DataTypes: Type system

If Spark SQL had a control center, it would be SparkSession. No Spark SQL code runs outside of it. In PySpark, a SparkSession is typically created once per application.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("spark-sql-example")
    .getOrCreate()  # If a session already exists, Spark returns it
)                   # instead of creating a new one.

Conceptually, SparkSession acts as:

- Coordinator for Spark SQL
- Gateway to metadata and configuration
- Execution trigger for SQL queries

The most obvious Spark SQL interface is the ability to run SQL queries directly.

spark.sql("""
    SELECT loan_intent, AVG(loan_amnt) AS avg_loan_amount
    FROM credit_risk
    GROUP BY loan_intent
""")

SparkSession is also the entry point for DataFrame creation:

df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("credit_risk_dataset.csv")
)

The catalog is Spark SQL's metadata layer. Through the SparkSession, you can:

- List databases and tables
- Inspect schemas programmatically

When we register a DataFrame as a view:

df.createOrReplaceTempView("credit_risk")

spark.catalog.listTables()
# [Table(name='credit_risk', catalog=None, namespace=[], description=None, tableType='TEMPORARY', isTemporary=True)]

The view:

- Lives only in this SparkSession
- Is accessible via both SQL and DataFrame APIs
- Is not persisted unless explicitly saved

Spark SQL behavior is heavily influenced by configuration. You can inspect configs via:

spark.conf.get("spark.sql.shuffle.partitions")

And set them (if mutable at runtime):

spark.conf.set("spark.sql.shuffle.partitions", 200)

Keep in mind:

- Not all configurations are dynamic
- Some must be set before session creation
- Changing a config does not retroactively affect already-built plans

Best Practices for SparkSession Lifecycle

- Create one session per application
- Stop the session explicitly: spark.stop()
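Putting those two practices together, here is a minimal lifecycle sketch. The try/finally wrapper is one reasonable way to guarantee cleanup, not the only one, and it assumes the Kaggle CSV sits in the working directory:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("spark-sql-example")
    .getOrCreate()
)

try:
    # All Spark SQL work for this application happens inside this block.
    df = spark.read.option("header", "true").csv("credit_risk_dataset.csv")
    df.show(5)
finally:
    # Release driver and executor resources when the application is done.
    spark.stop()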
Spark SQL is not a fixed engine with fixed behavior. Its execution model, performance characteristics, and even query plans are heavily influenced by configuration.

Configuration can be set at multiple levels:

- SparkSession builder (startup time)
- Runtime configuration via spark.conf
- Cluster / deployment defaults

For Spark SQL, the most relevant configs usually start with spark.sql.

Setting Configuration at Session Creation

Some Spark SQL configurations must be set before the SparkSession is created.

spark = (
    SparkSession.builder
    .appName("spark-sql-example")
    .config("spark.sql.shuffle.partitions", "200")
    .config("spark.sql.adaptive.enabled", "true")
    .getOrCreate()
)

If a config affects planning, set it before creating the session.

Some configurations can be modified during runtime:

spark.conf.set("spark.sql.shuffle.partitions", 100)

However, runtime mutability does not guarantee impact:

- Existing DataFrames may already have logical plans
- Cached plans or DataFrames may ignore new values

This often surprises users.

Best Practices for Spark SQL Configuration

- Set planning-sensitive configs early
- Prefer fewer, intentional changes
- Treat defaults as assumptions, not truth
- Document configuration choices

Minimal Spark SQL Configuration Checklist

The following is a safe baseline for most Spark SQL jobs.

Must-Review (Almost Always)

- spark.sql.shuffle.partitions: Controls how many partitions Spark creates after shuffle operations such as groupBy, join, distinct, and window functions.
- spark.sql.adaptive.enabled: Enables Adaptive Query Execution (AQE), allowing Spark to optimize queries at runtime instead of relying solely on static plans.
- spark.sql.autoBroadcastJoinThreshold: Defines the maximum size (in bytes) of a table that Spark will automatically broadcast during joins.

Nice-to-Have (Case Dependent)

- spark.sql.caseSensitive: Determines whether column names are treated as case-sensitive.
- spark.sql.execution.arrow.pyspark.enabled: Enables Apache Arrow for optimized data transfer between Spark and Python.
- spark.sql.ansi.enabled: Enforces ANSI SQL compliance for behavior like division by zero, invalid casts, and overflow handling.

Spark SQL lives and dies by how data enters and leaves the system. In Spark SQL, reading data always starts from the SparkSession.

df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("credit_risk_dataset.csv")
)

CSV files do not carry schema information. Spark must either:

- infer the schema by scanning the data, or
- be given an explicit schema.
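To see what that choice means in practice, here is a small check (a sketch; it assumes the same CSV file and reuses the session created earlier). The dtype comments reflect the schema shown later in this article:

# Without inference, every column arrives as a string.
df_raw = spark.read.option("header", "true").csv("credit_risk_dataset.csv")

# With inference, Spark scans the file and guesses numeric types.
df_inferred = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("credit_risk_dataset.csv")
)

print(df_raw.dtypes)       # [('person_age', 'string'), ('person_income', 'string'), ...]
print(df_inferred.dtypes)  # [('person_age', 'int'), ('person_income', 'int'), ...]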
For exploratory analysis, inference is acceptable. For production pipelines, it is often not. Defining a schema makes behavior predictable and faster.

from pyspark.sql.types import *

schema = StructType([
    StructField("person_age", IntegerType(), True),
    StructField("person_income", IntegerType(), True),
    StructField("person_home_ownership", StringType(), True),
    StructField("person_emp_length", DoubleType(), True),
    StructField("loan_intent", StringType(), True),
    StructField("loan_grade", StringType(), True),
    StructField("loan_amnt", IntegerType(), True),
    StructField("loan_int_rate", DoubleType(), True),
    StructField("loan_status", IntegerType(), True),
    StructField("loan_percent_income", DoubleType(), True),
    StructField("cb_person_default_on_file", StringType(), True),
    StructField("cb_person_cred_hist_length", IntegerType(), True),
])

df = spark.read.csv(
    "credit_risk_dataset.csv",
    schema=schema,
    header=True,
)

Writing data is also declarative.

df.write.mode("overwrite").parquet("credit_risk_parquet")

Partitioning is one of the most powerful I/O optimizations.

df.write \
    .partitionBy("loan_intent") \
    .parquet("credit_risk_partitioned")

Common CSV Pitfalls in Spark SQL

- Strings interpreted as numbers incorrectly
- Empty values silently becoming null
- Header mismatches across files

These issues are why serious Spark SQL pipelines rarely keep CSV as the final storage format. A common real-world pattern:

- Ingest the raw CSV once
- Convert it to Parquet
- Perform all analytics on Parquet

The DataFrame is the central abstraction in Spark SQL. Almost everything you do revolves around DataFrames. Yet the most common mistake is thinking a DataFrame is "data". A Spark DataFrame is a plan, not a collection of rows in memory.

df = spark.read.csv("credit_risk_dataset.csv", header=True, inferSchema=True)

When you write this, Spark builds a logical plan that describes where the data comes from and which transformations will follow. This plan can later be optimized, rewritten, and executed in parallel. This is why Spark SQL can scale: it works with plans, not immediate results.

Transformations vs Actions

Transformations build or modify the logical plan:

df.select("person_age", "person_income")
df.filter(df.person_income > 50000)
df.groupBy("loan_intent").avg("loan_amnt")

You can chain transformations indefinitely at almost no cost.

Actions move data out of Spark's lazy world:

df.count()
df.show()
df.write.parquet("output")

If the DataFrame is the table, the Column is the language you use to describe computations inside that table. Almost every Spark SQL transformation is ultimately built from Column expressions.

A Spark Column is an expression, not a value. When you write df.person_income, you are not accessing data. You are referencing a column expression that says: "When this plan runs, read the person_income field."

Columns are immutable. Every column operation returns a new Column.

income_k = df.person_income / 1000

This does not modify person_income.
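A quick way to confirm that in a session (a sketch; person_income_k is just an illustrative name):

df2 = df.withColumn("person_income_k", df.person_income / 1000)

print("person_income_k" in df.columns)   # False: the original DataFrame is untouched
print("person_income_k" in df2.columns)  # True: a new DataFrame with a new plan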
Instead, the operation creates a new Column expression that Spark can later optimize.

Column expressions can be freely composed.

from pyspark.sql.functions import col

df.filter(
    (col("loan_int_rate") > 15) & (col("loan_percent_income") > 0.3)
)

This builds a single expression tree, not multiple computations. Use & / | / ~ for boolean logic on Columns, never Python's and / or / not:

# Correct
(df.loan_status == 1) & (df.person_income > 50000)

# Wrong: Python's `and` cannot operate on Column objects
(df.loan_status == 1) and (df.person_income > 50000)

Derived columns should always be named explicitly.

from pyspark.sql.functions import avg

df.groupBy("loan_grade") \
    .agg(avg("loan_int_rate").alias("avg_interest_rate"))

Spark SQL is a strongly typed engine, even though PySpark is a dynamically typed API. It has its own logical type system, independent of Python's types. When data is loaded into a DataFrame:

- Each column is assigned a Spark SQL data type
- All expressions are validated against those types
- Invalid operations fail at analysis or execution time

df.printSchema()

root
 |-- person_age: integer (nullable = true)
 |-- person_income: integer (nullable = true)
 |-- person_home_ownership: string (nullable = true)
 |-- person_emp_length: double (nullable = true)
 |-- loan_intent: string (nullable = true)
 |-- loan_grade: string (nullable = true)
 |-- loan_amnt: integer (nullable = true)
 |-- loan_int_rate: double (nullable = true)
 |-- loan_status: integer (nullable = true)
 |-- loan_percent_income: double (nullable = true)
 |-- cb_person_default_on_file: string (nullable = true)
 |-- cb_person_cred_hist_length: integer (nullable = true)

Spark SQL types fall into two broad categories. Simple types represent single, atomic values. Complex types represent nested or multi-valued structures:

- StructType: nested records
- ArrayType: ordered collections
- MapType: key-value pairs

A StructType represents a nested object.

from pyspark.sql.functions import struct

df_with_struct = df.withColumn(
    "person_profile",
    struct("person_age", "person_income", "person_home_ownership"),
)

Arrays represent ordered collections of values.

from pyspark.sql.functions import array

df_with_array = df.withColumn(
    "risk_factors",
    array("loan_intent", "loan_grade", "cb_person_default_on_file"),
)

Maps store key-value pairs.

from pyspark.sql.functions import create_map, lit

df_with_map = df.withColumn(
    "loan_summary",
    create_map(
        lit("amount"), df.loan_amnt,
        lit("rate"), df.loan_int_rate,
    ),
)

Spark SQL does not automatically "fix" type mismatches. You must cast explicitly when needed.

from pyspark.sql.functions import col

df.withColumn(
    "loan_percent_income",
    col("loan_percent_income").cast("double"),
)

You can also cast using SQL syntax:

spark.sql("""
    SELECT CAST(loan_int_rate AS DOUBLE) AS loan_int_rate
    FROM credit_risk
""")
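Complex columns are only useful if you can read values back out of them. Field access and getItem cover the common cases; a short sketch reusing the df_with_struct, df_with_array, and df_with_map DataFrames built above:

from pyspark.sql.functions import col

df_with_struct.select(col("person_profile.person_age")).show(3)   # struct field access
df_with_array.select(col("risk_factors").getItem(0)).show(3)      # first array element
df_with_map.select(col("loan_summary").getItem("rate")).show(3)   # map lookup by key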
The Row object is the least important abstraction you will regularly interact with in Spark SQL. A Row is:

- A single, materialized record
- Living in Python memory, not Spark's execution engine

Rows are not part of the logical plan. They are the result of executing one.

rows = df.take(5)

Rows become a problem when:

- You apply Python logic row by row
- You collect large datasets

Example of a common anti-pattern:

for row in df.collect():
    if row.person_income > 50000:
        ...  # per-row Python logic

This forces full materialization and moves computation to the driver. It defeats the purpose of Spark SQL.

Spark SQL functions are the building blocks of Spark SQL expressions. If Column objects describe what to compute, functions describe how to compute it, in a way Spark can understand, optimize, and execute efficiently across a cluster.

One of the most important rules in Spark SQL is: if a built-in function exists, use it instead of Python logic. Built-in functions are executed inside the Spark engine. They operate on Column objects and return new Column expressions.

from pyspark.sql.functions import col, avg

avg(col("loan_int_rate"))

Numeric functions are used for arithmetic and numeric transformations.

from pyspark.sql.functions import abs, round

df.select(
    round(col("loan_int_rate"), 2).alias("rate_rounded"),
    abs(col("loan_percent_income")).alias("abs_ratio"),
)
# DataFrame[rate_rounded: double, abs_ratio: double]

String functions are used for text manipulation.

from pyspark.sql.functions import lower, trim

df.select(
    lower(col("loan_intent")).alias("loan_intent_norm"),
    trim(col("person_home_ownership")),
)
# DataFrame[loan_intent_norm: string, trim(person_home_ownership): string]

Spark SQL provides a rich date-time API.

from pyspark.sql.functions import year, month

# If dates existed in the dataset
year(col("application_date"))
# Column<'year(application_date)'>

Conditional logic belongs in expressions, not Python if statements.

from pyspark.sql.functions import when

df.withColumn(
    "risk_flag",
    when(col("loan_int_rate") > 15, "high")
    .otherwise("normal"),
)

Aggregate functions summarize groups of rows.

from pyspark.sql.functions import avg, count

df.groupBy("loan_grade") \
    .agg(
        avg("loan_int_rate").alias("avg_rate"),
        count("*").alias("count"),
    )

Spark SQL functions explicitly handle nulls.

from pyspark.sql.functions import coalesce, lit

df.withColumn(
    "emp_length_filled",
    coalesce(col("person_emp_length"), lit(0)),
)

Why Functions Are Central to Spark SQL

Functions are the bridge between the logic you want to express and the engine that optimizes and executes it. If your Spark SQL code feels slow or fragile, look at:

- How many Python operations you're using
- Whether equivalent built-in functions exist

Most performance gains come from rewriting logic using functions, not tuning clusters.

Top 10 Spark SQL Functions You Actually Need

- col(): Creates a column reference explicitly.
- when() / otherwise(): Implements conditional logic, replacing Python if logic. Example: when(col("loan_int_rate") > 15, "high").otherwise("normal")
- coalesce(): Returns the first non-null value. Example: coalesce(col("person_emp_length"), lit(0))
- avg(): Computes averages in aggregations.
- count(): Counts rows or non-null values.
- sum(): Aggregates numeric values.
- round(): Controls numeric precision. Example: round(col("loan_int_rate"), 2)
- lower(): Normalizes string values. Example: lower(col("loan_intent"))
- regexp_replace(): Performs regex-based string cleanup. Example: regexp_replace(col("loan_intent"), "_", " ")
- lit(): Creates a literal value as a Column.

If groupBy answers questions like "What is the average loan amount per loan grade?", window functions answer a different class of questions: "How does this row compare to other rows in the same group?"

from pyspark.sql.window import Window

window_by_grade = Window.partitionBy("loan_grade")

This window specification means: "For each row, consider all rows with the same loan grade." Window functions are used with .over(window). Example: average interest rate per loan grade, attached to every row.

from pyspark.sql.functions import avg

df.withColumn(
    "avg_rate_by_grade",
    avg("loan_int_rate").over(window_by_grade),
)
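Averages are only one use; ranking rows within their group is another common one. A sketch (the rank column name is purely illustrative):

from pyspark.sql.functions import row_number, col
from pyspark.sql.window import Window

w_by_amount = Window.partitionBy("loan_grade").orderBy(col("loan_amnt").desc())

ranked = df.withColumn("amount_rank_in_grade", row_number().over(w_by_amount))
ranked.select("loan_grade", "loan_amnt", "amount_rank_in_grade").show(5)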
By default, Spark uses a frame equivalent to "all rows in the partition". You can restrict it explicitly:

window_frame = (
    Window
    .partitionBy("loan_grade")
    .orderBy("loan_int_rate")
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)
)

This defines a window frame that includes all rows from the start of the partition up to the current row, enabling cumulative (running) calculations within each partition.

Window functions are expensive because they:

- Shuffle data so all rows for a partition key end up together
- Often sort rows within each partition
- Keep every input row in the output instead of reducing it

The script below contrasts groupBy and window aggregations on the credit risk dataset end to end.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, avg, count
from pyspark.sql.window import Window

# --- 1) SparkSession ---
spark = SparkSession.builder.appName("groupby-vs-window").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")  # cleaner output

# --- 2) Load dataset ---
df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("credit_risk_dataset.csv")
)

# Keep only columns we need for the demo
base = df.select("loan_grade", "loan_int_rate", "loan_amnt", "loan_status")

print("\n--- Original (sample rows) ---")
base.show(10, truncate=False)

# --- 3) GROUP BY (reduces rows: one row per grade) ---
gb = (
    base.groupBy("loan_grade")
    .agg(
        count("*").alias("n_rows_gb"),
        avg("loan_int_rate").alias("avg_int_rate_gb"),
        avg("loan_amnt").alias("avg_loan_amnt_gb"),
    )
)

print("\n--- GROUP BY result (one row per grade) ---")
gb.show(100, truncate=False)

# --- 4) WINDOW (preserves rows: adds per-grade stats to each row) ---
w = Window.partitionBy("loan_grade")

win = (
    base.withColumn("n_rows_win", count("*").over(w))
    .withColumn("avg_int_rate_win", avg("loan_int_rate").over(w))
    .withColumn("avg_loan_amnt_win", avg("loan_amnt").over(w))
    # keep null rates from dominating the sample output
    .orderBy(col("loan_grade"), col("loan_int_rate").asc_nulls_last())
)

print("\n--- WINDOW result (same rows, enriched; NULL rates sorted last) ---")
win.show(20, truncate=False)

# --- 5) Side-by-side comparison (join groupBy back to base) ---
compare = (
    base.join(gb.select("loan_grade", "avg_int_rate_gb"), on="loan_grade", how="left")
    .withColumn("avg_int_rate_win", avg("loan_int_rate").over(w))
    .select(
        "loan_grade",
        "loan_int_rate",
        "avg_int_rate_gb",
        "avg_int_rate_win",
        "loan_amnt",
        "loan_status",
    )
    .orderBy(col("loan_grade"), col("loan_int_rate").asc_nulls_last())
)

print("\n--- Compare: groupBy metric joined back vs window metric (should match per grade) ---")
compare.show(30, truncate=False)

--- Original (sample rows) ---
+----------+-------------+---------+-----------+
|loan_grade|loan_int_rate|loan_amnt|loan_status|
+----------+-------------+---------+-----------+
|D         |16.02        |35000    |1          |
|B         |11.14        |1000     |0          |
|C         |12.87        |5500     |1          |
|C         |15.23        |35000    |1          |
|C         |14.27        |35000    |1          |
|A         |7.14         |2500     |1          |
|B         |12.42        |35000    |1          |
|B         |11.11        |35000    |1          |
|A         |8.9          |35000    |1          |
|D         |14.74        |1600     |1          |
+----------+-------------+---------+-----------+

--- GROUP BY result (one row per grade) ---
+----------+---------+------------------+------------------+
|loan_grade|n_rows_gb|avg_int_rate_gb   |avg_loan_amnt_gb  |
+----------+---------+------------------+------------------+
|A         |10777    |7.327650910578976 |8539.273452723392 |
|B         |10451    |10.995555082491098|9995.483685771696 |
|C         |6458     |13.463541523678883|9213.862650975534 |
|D         |3626     |15.361448400724077|10849.241588527302|
|E         |964      |17.009455164585738|12915.845435684647|
|F         |241      |18.609158878504676|14717.323651452281|
|G         |64       |20.251525423728815|17195.703125      |
+----------+---------+------------------+------------------+

--- WINDOW result (same rows, enriched; NULL rates sorted last) ---
+----------+-------------+---------+-----------+----------+-----------------+-----------------+
|loan_grade|loan_int_rate|loan_amnt|loan_status|n_rows_win|avg_int_rate_win |avg_loan_amnt_win|
+----------+-------------+---------+-----------+----------+-----------------+-----------------+
|A         |5.42         |8000     |0          |10777     |7.327650910578976|8539.273452723392|
|A         |5.42         |1350     |0          |10777     |7.327650910578976|8539.273452723392|
|A         |5.42         |5400     |0          |10777     |7.327650910578976|8539.273452723392|
|A         |5.42         |2000     |1          |10777     |7.327650910578976|8539.273452723392|
|A         |5.42         |4025     |0          |10777     |7.327650910578976|8539.273452723392|
|A         |5.42         |8000     |0          |10777     |7.327650910578976|8539.273452723392|
|...       |...          |...      |...        |...       |...              |...              |
+----------+-------------+---------+-----------+----------+-----------------+-----------------+
only showing top 20 rows

--- Compare: groupBy metric joined back vs window metric (should match per grade) ---
+----------+-------------+-----------------+-----------------+---------+-----------+
|loan_grade|loan_int_rate|avg_int_rate_gb  |avg_int_rate_win |loan_amnt|loan_status|
+----------+-------------+-----------------+-----------------+---------+-----------+
|A         |5.42         |7.327650910578976|7.327650910578976|2400     |0          |
|A         |5.42         |7.327650910578976|7.327650910578976|4000     |0          |
|A         |5.42         |7.327650910578976|7.327650910578976|7500     |0          |
|A         |5.42         |7.327650910578976|7.327650910578976|2000     |1          |
|A         |5.42         |7.327650910578976|7.327650910578976|5000     |0          |
|A         |5.42         |7.327650910578976|7.327650910578976|8000     |0          |
|...       |...          |...              |...              |...      |...        |
+----------+-------------+-----------------+-----------------+---------+-----------+
only showing top 30 rows
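If you are curious how differently Spark plans these two shapes, printing the physical plans is a cheap check (plan output omitted here; the comments describe roughly what to expect):

gb.explain()   # an aggregate after a shuffle on loan_grade
win.explain()  # a Window operator over partitions sorted by loan_grade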
Grouping is the aggregation backbone of Spark SQL. A groupBy:

- Partitions data by key(s)
- Aggregates rows within each partition
- Reduces the number of rows in the result

df.groupBy("loan_grade").count().show()

+----------+-----+
|loan_grade|count|
+----------+-----+
|         E|  964|
|         D| 3626|
|         A|10777|
|         B|10451|
|         C| 6458|
|         F|  241|
|         G|   64|
+----------+-----+

Grouping in Spark SQL has two conceptual phases:

- Shuffle: rows are redistributed by group keys
- Aggregate: aggregation functions are applied per group

from pyspark.sql.functions import avg, count

df.groupBy("loan_grade").agg(
    count("*").alias("num_loans"),
    avg("loan_int_rate").alias("avg_interest_rate"),
).show()

+----------+---------+------------------+
|loan_grade|num_loans| avg_interest_rate|
+----------+---------+------------------+
|         F|      241|18.609158878504676|
|         E|      964|17.009455164585738|
|         B|    10451|10.995555082491098|
|         D|     3626|15.361448400724077|
|         C|     6458|13.463541523678883|
|         A|    10777| 7.327650910578976|
|         G|       64|20.251525423728815|
+----------+---------+------------------+
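Note that groups come back in whatever order the shuffle produced. If order matters, sort the aggregated result explicitly; a small sketch using the avg imported above:

(
    df.groupBy("loan_grade")
    .agg(avg("loan_int_rate").alias("avg_interest_rate"))
    .orderBy("loan_grade")  # A, B, C, ... instead of shuffle order
    .show()
)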
You can group by more than one key:

intent_grade_stats = (
    df.groupBy("loan_intent", "loan_grade")
    .count()
)

intent_grade_stats.show()

+-----------------+----------+-----+
|      loan_intent|loan_grade|count|
+-----------------+----------+-----+
|          VENTURE|         G|   14|
|        EDUCATION|         B| 2026|
|  HOMEIMPROVEMENT|         B| 1115|
|DEBTCONSOLIDATION|         G|   10|
|          MEDICAL|         C| 1217|
|          MEDICAL|         G|   12|
|          VENTURE|         A| 1933|
|          MEDICAL|         E|  167|
|  HOMEIMPROVEMENT|         E|  143|
|          VENTURE|         B| 1871|
|         PERSONAL|         A| 1832|
|          VENTURE|         C| 1104|
|         PERSONAL|         D|  643|
|DEBTCONSOLIDATION|         B| 1620|
|  HOMEIMPROVEMENT|         C|  693|
|DEBTCONSOLIDATION|         F|   43|
|          VENTURE|         E|  178|
|        EDUCATION|         C| 1356|
|        EDUCATION|         A| 2174|
|DEBTCONSOLIDATION|         E|  144|
+-----------------+----------+-----+
only showing top 20 rows

Grouping is almost always paired with aggregations such as count, avg, sum, min, and max.

The Spark SQL Catalog is the metadata layer that allows Spark to reason about tables, views, schemas, and databases. It:

- Tracks tables, views, functions, and databases
- Acts as the bridge between SQL queries and DataFrames

The most common interaction with the catalog is registering a DataFrame as a temporary view.

df.createOrReplaceTempView("credit_risk")

When you do this:

- The DataFrame is addressable via SQL
- Only metadata is registered

spark.sql("SELECT COUNT(*) FROM credit_risk").show()

This query and a DataFrame-based df.count() share the same execution engine.

You can inspect catalog contents programmatically.

# List tables
spark.catalog.listTables()

# Check if a table exists
spark.catalog.tableExists("credit_risk")

Avro is a schema-aware binary data format designed for data exchange, not ad-hoc analytics. Avro was designed to solve a specific problem: how do we safely exchange structured data between systems without losing schema information?

Spark SQL supports Avro as a native data source.

# Reading
df_avro = spark.read.format("avro").load("credit_risk.avro")

# Writing
df.write.mode("overwrite").format("avro").save("credit_risk_avro")

A common real-world pattern:

- Ingest raw data (CSV / JSON)
- Validate and normalize schema
- Write to Avro for safe interchange
- Convert to Parquet for analytics

Avro acts as the interchange format, Parquet as the analytics format.

Observation is one of Spark SQL's most underused but powerful features. It allows you to collect metrics during query execution without triggering extra actions, duplicating work, or breaking lazy evaluation. Observation is a mechanism that lets Spark SQL:

- Attach metrics to a query
- Compute them as part of the same execution
- Expose the results after the action completes

In PySpark, you create an Observation object and attach it to a DataFrame.

from pyspark.sql import Observation
from pyspark.sql.functions import count, sum, col

obs = Observation("credit_risk_metrics")

observed_df = df.observe(
    obs,
    count("*").alias("row_count"),
    sum(col("loan_int_rate").isNull().cast("int")).alias("null_interest_rate_count"),
)

At this point, the metrics are defined, not computed. Observation metrics are computed when an action runs.

observed_df.write.mode("overwrite").parquet("output")

After the action completes, you can access the results:

metrics = obs.get
print(metrics)
# {'row_count': 32581, 'null_interest_rate_count': 3116}

Observation is about measurement, not transformation. Typical uses are sanity metrics such as row counts and null counts. If you need to know something about your data but don't need to transform it, Observation is usually the right tool.
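One natural extension is to turn those observed metrics into a lightweight data-quality gate after the write; a sketch (the 10% threshold is purely illustrative):

metrics = obs.get
null_ratio = metrics["null_interest_rate_count"] / metrics["row_count"]

if null_ratio > 0.10:  # illustrative threshold, tune for your pipeline
    raise ValueError(f"Too many null interest rates: {null_ratio:.1%}")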
User-Defined Functions (UDFs) allow you to execute custom logic inside Spark SQL when built-in functions are not sufficient. They are also one of the easiest ways to accidentally destroy Spark SQL performance.

With a UDF you can:

- Write custom logic (usually in Python)
- Apply it to DataFrame columns
- Use it in expressions like built-in functions

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def risk_label(rate):
    if rate is None:
        return "unknown"
    return "high" if rate > 15 else "normal"

risk_udf = udf(risk_label, StringType())

df.withColumn("risk_label", risk_udf("loan_int_rate"))
# DataFrame[person_age: int, person_income: int, person_home_ownership: string, person_emp_length: double, loan_intent: string, loan_grade: string, loan_amnt: int, loan_int_rate: double, loan_status: int, loan_percent_income: double, cb_person_default_on_file: string, cb_person_cred_hist_length: int, risk_label: string]

Why UDFs Are Different from Built-in Functions

Built-in Spark SQL functions:

- Participate in query planning
- Support predicate pushdown and code generation

Python UDFs:

- Require serialization/deserialization
- Are opaque to the optimizer
- Break many Spark SQL optimizations

Spark treats UDFs as black boxes.

What the Optimizer Can't See

- Spark cannot reason about its logic
- Filters can't be pushed down
- Expressions can't be simplified
- Execution plans become more conservative

A UDF is justified when:

- No built-in function can express the logic
- Performance impact is acceptable

Typical examples are proprietary scoring logic and specialized text processing. Even then, you should pause and ask: can this be expressed using existing Spark SQL functions?

Pandas UDFs (Vectorized UDFs)

Pandas UDFs are a better option when Python logic is unavoidable. They:

- Operate on batches of data
- Reduce serialization overhead
- Perform significantly better than row-wise UDFs

from pyspark.sql.functions import pandas_udf

@pandas_udf(StringType())
def risk_label_udf(rate_series):
    return rate_series.apply(lambda r: "high" if r and r > 15 else "normal")

Still not as optimizable as built-ins, but much faster than classic UDFs.

You can register a UDF for SQL usage:

spark.udf.register("risk_label_udf", risk_udf)

SELECT risk_label_udf(loan_int_rate) FROM credit_risk

This does not change the performance characteristics.

Common UDF anti-patterns:

- Using UDFs for simple arithmetic
- Replacing when / case when
- Applying UDFs before filtering

These patterns usually indicate missed Spark SQL functionality.

A Simple Rule That Saves You

If your logic can be written using Spark SQL expressions, write it that way. Reach for a UDF only when:

- You've exhausted built-in options
- You understand the performance cost

UDFs are a last resort, not a default tool.
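As a concrete illustration of that rule, the risk_label UDF above can be expressed entirely with built-in expressions, which keeps the logic visible to the optimizer (a sketch mirroring the UDF, including its "unknown" branch for null rates):

from pyspark.sql.functions import when, col

df_labeled = df.withColumn(
    "risk_label",
    when(col("loan_int_rate").isNull(), "unknown")
    .when(col("loan_int_rate") > 15, "high")
    .otherwise("normal"),
)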
User-Defined Table Functions (UDTFs) extend Spark SQL beyond "one input → one output" logic. Unlike UDFs, which return a single value per row, UDTFs can return multiple rows for a single input row. This makes them useful, but also dangerous, because they can change the shape of your data. A UDTF is a programmable explode.

VariantVal is Spark SQL's answer to a long-standing pain point: semi-structured data that doesn't fit neatly into a fixed schema. It is a flexible, schema-aware container for semi-structured values that lets you:

- Preserve structure where possible
- Delay full schema resolution

Protocol Buffers (Protobuf) are a strongly typed, binary serialization format designed for efficiency and cross-language compatibility. In Spark SQL, Protobuf is used when:

- Data originates from strongly typed systems
- Schema correctness is non-negotiable
- Performance and compactness matter

Spark SQL supports Python-based data sources, allowing data to be read into Spark using Python logic instead of native JVM connectors. A Python data source:

- Uses Python code to fetch or generate data
- Converts it into a Spark DataFrame
- Bypasses native JVM readers

Stateful processing introduces state across rows and batches into Spark SQL computations. Instead of treating each row independently, Spark can:

- Remember information between records
- Compute results over time

It is commonly associated with Structured Streaming.

By this point, we've covered a wide surface area of Spark SQL, from its core abstractions to advanced features that only appear in mature pipelines. If you remember only a few things from this article, remember these:

- DataFrames are plans, not data
- Columns are expressions, not values
- Transformations describe, actions execute
- Grouping reduces, windows annotate
- Built-in functions beat Python logic
- Observation is for measurement, not transformation
- Configuration affects planning, not magic

These principles explain most Spark SQL behavior you'll encounter in practice.