doesn't need to be that complicated. In this article, I'll show you how to develop a basic, "starter" one that uses an Iceberg table on AWS S3 storage. Once the table is registered using AWS Glue, you'll be able to query and mutate it from Amazon Athena, including using:
- Time travel queries
- Merging, updating and deleting data
- Optimising and vacuuming your tables
I'll also show you how to inspect the same tables locally from **DuckDB**, and we'll see how to use Glue/Spark to insert more table data.
Our example might be basic, but it'll showcase the setup, the different tools and the processes you can put in place to build up a more extensive data store. All modern cloud providers have equivalents of the AWS services I'm discussing in this article, so it should be fairly straightforward to replicate what I discuss here on Azure, Google Cloud, and others.
To make sure we're all on the same page, here is a brief explanation of some of the key technologies we'll be using.
AWS Glue/Spark
AWS Glue is a fully managed, serverless ETL service from Amazon that streamlines data preparation and integration for analytics and machine learning. It automatically detects and catalogues metadata from various sources, such as S3, into the centralised Glue Data Catalogue. Additionally, it can create customisable Python-based Spark ETL scripts to execute these tasks on a scalable, serverless Apache Spark platform. This makes it great for building data lakes on Amazon S3, loading data into data warehouses like Amazon Redshift, and performing data cleaning and transformation, all without managing infrastructure.
AWS Athena
AWS Athena is an interactive query service that simplifies data analysis directly in Amazon S3 using standard SQL. As a serverless platform, there's no need to manage or provision servers; just point Athena at your S3 data, define your schema (usually with AWS Glue), and begin running SQL queries. It's frequently utilised for ad hoc analysis, reporting, and exploration of large datasets in formats such as CSV, JSON, ORC, or Parquet.
Iceberg tables
Iceberg tables are an open table format for datasets that provides database-like capabilities for data stored in data lakes, such as Amazon S3 object storage. Traditionally, on S3, you can create, read, and delete objects (files), but you cannot update them in place. The Iceberg format addresses that limitation while also offering other benefits, including ACID transactions, schema evolution, hidden partitioning, and time-travel features.
DuckDB
DuckDB is an in-process analytical database written in C++ and designed for analytical SQL workloads. Since its release a few years ago, it has grown in popularity and is now one of the premier data processing tools used by data engineers and scientists, thanks to its grounding in SQL, performance, and versatility.
Scenario overview
Let's say you have been tasked with building a small "warehouse-lite" analytics table for order events, but you don't want to adopt a heavyweight platform just yet. You need:
- Safe writes (no broken readers, no partial commits)
- Row-level changes (UPDATE/DELETE/MERGE, not only append)
- Point-in-time reads (for audits and debugging)
- Local analytics against production-accurate data for quick checks
What we'll build
- Create an Iceberg table in Glue & S3 via Athena
- Load and mutate rows (INSERT/UPDATE/DELETE/MERGE)
- Time travel to prior snapshots (by timestamp and by snapshot ID)
- Keep it fast with OPTIMIZE and VACUUM
- Read the same table locally from DuckDB (S3 access via DuckDB Secrets)
- See how to add new records to our table using Glue Spark code
So, in a nutshell, we'll be using:
- S3 for data storage
- Glue Catalogue for table metadata/discovery
- Athena for serverless SQL reads and writes
- DuckDB for cheap, local analytics against the same Iceberg table
- **Spark** for processing grunt
The key takeaway from our perspective is that by using the above technologies, we will be able to perform database-like queries on object storage.
Setting up our development environment
I prefer to isolate local tooling in a separate environment. Use any tool you like to do this; I'll use conda since that's what I usually do. For demo purposes, I'll be running all the code within a Jupyter Notebook environment.
# create and activate a local env
conda create -n iceberg-demo python=3.11 -y
conda activate iceberg-demo
# install duckdb CLI + Python package and awscli for quick tests
pip install duckdb awscli jupyter
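If you want a quick check that the notebook can see the new environment, this one-liner (run inside Jupyter, assuming you launched it from the iceberg-demo env) is enough:
# Quick sanity check: confirm the DuckDB package imports and report its version.
import duckdb
print(duckdb.__version__)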
Prerequisites
As we'll be using AWS services, you'll need an AWS account, plus the following:
- An S3 bucket for the data lake (e.g., s3://my-demo-lake/warehouse/)
- A Glue database (we'll create one)
- Athena engine version 3 in your workgroup
- An IAM role or user for Athena with S3 + Glue permissions
1) Athena setup
Once you've signed into AWS, open Athena in the console and set your workgroup, engine version and S3 output location (for query results). To do this, look for the hamburger-style menu icon at the top left of the Athena home screen. Click on it to bring up a new menu block on the left. In there, you should see an Administration -> Workgroups link. You will automatically be assigned to the primary workgroup. You can stick with this or create a new one if you like. Whichever option you choose, edit it and ensure that the following options are selected:
- Analytics engine: Athena SQL. Manually set the engine version to 3.
- Select customer-managed query result configuration and enter the required bucket and account information.
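If you prefer scripting this setup to clicking through the console, the sketch below does roughly the same thing with boto3. It isn't part of the walkthrough above: the workgroup name, results bucket and region are placeholders you'd swap for your own.
import boto3
# Minimal sketch: create a dedicated Athena workgroup with engine version 3
# and a customer-managed query result location. All names here are placeholders.
athena = boto3.client("athena", region_name="eu-west-1")
athena.create_work_group(
    Name="iceberg-demo",
    Configuration={
        "ResultConfiguration": {"OutputLocation": "s3://my-demo-lake/athena-results/"},
        "EngineVersion": {"SelectedEngineVersion": "Athena engine version 3"},
        "EnforceWorkGroupConfiguration": True,
    },
    Description="Workgroup for the Iceberg lakehouse demo",
)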
2) Create an Iceberg table in Athena
We'll store order events and let Iceberg manage partitioning transparently. I'll use a "hidden" partition on the day of the timestamp to spread writes/reads. Go back to the Athena home page and launch the Trino SQL query editor. Your screen should look like this.
Image from AWS website
Type in and run the following SQL. Change bucket/table names to suit.
-- This automatically creates a Glue database
-- if you don't have one already
CREATE DATABASE IF NOT EXISTS analytics;
CREATE TABLE analytics.sales_iceberg (
order_id bigint,
customer_id bigint,
ts timestamp,
status string,
amount_usd double
)
PARTITIONED BY (day(ts))
LOCATION 's3://your_bucket/warehouse/sales_iceberg/'
TBLPROPERTIES (
'table_type' = 'ICEBERG',
'format' = 'parquet',
'write_compression' = 'snappy'
)
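Because Athena registers the table in the Glue Data Catalogue, you can sanity-check the registration with a quick boto3 lookup. A minimal sketch, assuming your local credentials can call Glue and that you adjust the region to your own:
import boto3
# Confirm the new Iceberg table is visible in the Glue Data Catalogue.
glue = boto3.client("glue", region_name="eu-west-1")  # placeholder region
table = glue.get_table(DatabaseName="analytics", Name="sales_iceberg")["Table"]
print(table["Parameters"].get("table_type"))   # should report ICEBERG
print(table["StorageDescriptor"]["Location"])  # the S3 location we set above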
3) Load and mutate data (INSERT / UPDATE / DELETE / MERGE)
Athena supports real Iceberg DML, allowing you to insert rows, update and delete records, and upsert using the MERGE statement. Under the hood, Iceberg uses snapshot-based ACID with delete files; readers stay consistent while writers work in parallel.
Seed a few rows.
INSERT INTO analytics.sales_iceberg VALUES
(101, 1, timestamp '2025-08-01 10:00:00', 'created', 120.00),
(102, 2, timestamp '2025-08-01 10:05:00', 'created', 75.50),
(103, 2, timestamp '2025-08-02 09:12:00', 'created', 49.99),
(104, 3, timestamp '2025-08-02 11:47:00', 'created', 250.00);
A quick sanity check.
SELECT * FROM analytics.sales_iceberg ORDER BY order_id;
order_id | customer_id | ts | status | amount_usd
----------+-------------+-----------------------+----------+-----------
101 | 1 | 2025-08-01 10:00:00 | created | 120.00
102 | 2 | 2025-08-01 10:05:00 | created | 75.50
103 | 2 | 2025-08-02 09:12:00 | created | 49.99
104 | 3 | 2025-08-02 11:47:00 | created | 250.00
Update and delete.
UPDATE analytics.sales_iceberg
SET status = 'paid'
WHERE order_id IN (101, 102);

-- removes order 103
DELETE FROM analytics.sales_iceberg
WHERE status = 'created' AND amount_usd < 60;
Idempotent upserts with MERGE
Let's treat order 104 as refunded and create a new order 105.
MERGE INTO analytics.sales_iceberg AS t
USING (
VALUES
(104, 3, timestamp '2025-08-02 11:47:00', 'refunded', 250.00),
(105, 4, timestamp '2025-08-03 08:30:00', 'created', 35.00)
) AS s(order_id, customer_id, ts, status, amount_usd)
ON s.order_id = t.order_id
WHEN MATCHED THEN
UPDATE SET
customer_id = s.customer_id,
ts = s.ts,
status = s.status,
amount_usd = s.amount_usd
WHEN NOT MATCHED THEN
INSERT (order_id, customer_id, ts, status, amount_usd)
VALUES (s.order_id, s.customer_id, s.ts, s.status, s.amount_usd);
You can now re-query to see: 101/102 → paid, 103 deleted, 104 → refunded, and 105 → created. (If you're running this in a "real" account, you'll notice the S3 object count ticking up; more on maintenance shortly.)
SELECT * FROM analytics.sales_iceberg ORDER BY order_id
# order_id customer_id ts status amount_usd
1 101 1 2025-08-01 10:00:00.000000 paid 120.0
2 105 4 2025-08-03 08:30:00.000000 created 35.0
3 102 2 2025-08-01 10:05:00.000000 paid 75.5
4 104 3 2025-08-02 11:47:00.000000 refunded 250.0
4) Time travel (and version travel)
This is where the real value of using Iceberg shines. You can query the table as it looked at a moment in time or by a specific snapshot ID. In Athena, use this syntax:
-- Time travel to noon on Aug 2 (UTC)
SELECT order_id, status, amount_usd
FROM analytics.sales_iceberg
FOR TIMESTAMP AS OF TIMESTAMP '2025-08-02 12:00:00 UTC'
ORDER BY order_id;
-- Or Version travel (replace the id with an actual snapshot id from your table)
SELECT *
FROM analytics.sales_iceberg
FOR VERSION AS OF 949530903748831860;
To get the various version (snapshot) IDs associated with a particular table, use this query.
SELECT * FROM "analytics"."sales_iceberg$snapshots"
ORDER BY committed_at DESC;
5) Keeping your data healthy: OPTIMIZE and VACUUM
Row-level writes (UPDATE/DELETE/MERGE) create many delete files and can fragment data. Two statements keep things fast and storage-friendly:
- OPTIMIZE … REWRITE DATA USING BIN_PACK: compacts small/fragmented files and folds deletes into the data files
- VACUUM: expires old snapshots and cleans up orphan files
-- compact "hot" data (yesterday) and merge deletes
OPTIMIZE analytics.sales_iceberg
REWRITE DATA USING BIN_PACK
WHERE ts >= date_trunc('day', current_timestamp - interval '1' day);
-- expire old snapshots and remove orphan files
VACUUM analytics.sales_iceberg;
6) Local analytics with DuckDB (read-only)
It's great to be able to sanity-check production tables from a laptop without having to run a cluster. DuckDB's httpfs + iceberg extensions make this simple.
6.1 Install & load extensions
Open your Jupyter notebook and type in the following.
# httpfs gives S3 support; iceberg adds Iceberg readers.
import duckdb as db
db.sql("install httpfs; load httpfs;")
db.sql("install iceberg; load iceberg;")
6.2 Provide S3 credentials to DuckDB the "right" way (Secrets)
DuckDB has a small but powerful secrets manager. The most robust setup in AWS is the credential chain provider, which reuses whatever the AWS SDK can find (environment variables, IAM role, etc.). Therefore, you will need to ensure that, for instance, your AWS CLI credentials are configured.
db.sql("""CREATE SECRET ( TYPE s3, PROVIDER credential_chain )""")
After that, any s3://… reads in this DuckDB session will use the credentials from that secret.
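If you're running somewhere without a configured AWS profile or role (a throwaway container, say), DuckDB also accepts explicit keys in the secret. The values below are placeholders, not credentials used elsewhere in this article:
# Alternative: an S3 secret with explicit (placeholder) credentials instead of
# the credential chain. Useful where no AWS profile or IAM role is available.
db.sql("""
CREATE SECRET my_s3_secret (
    TYPE s3,
    KEY_ID 'AKIA...REPLACE_ME',
    SECRET 'REPLACE_ME',
    REGION 'eu-west-1'
)
""")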
6.3 Point DuckDB at the Iceberg table's metadata
The most explicit way is to reference a concrete metadata file (e.g., the latest one in your table's metadata/ folder). To get a list of those, use this query:
result = db.sql("""
SELECT *
FROM glob('s3://your_bucket/warehouse/**')
ORDER BY file
""")
print(result)
...
...
s3://your_bucket_name/warehouse/sales_iceberg/metadata/00000-942a25ce-24e5-45f8-ae86-b70d8239e3bb.metadata.json
s3://your_bucket_name/warehouse/sales_iceberg/metadata/00001-fa2d9997-590e-4231-93ab-642c0da83f19.metadata.json
s3://your_bucket_name/warehouse/sales_iceberg/metadata/00002-0da3a4af-64af-4e46-bea2-0ac450bf1786.metadata.json
s3://your_bucket_name/warehouse/sales_iceberg/metadata/00003-eae21a3d-1bf3-4ed1-b64e-1562faa445d0.metadata.json
s3://your_bucket_name/warehouse/sales_iceberg/metadata/00004-4a2cff23-2bf6-4c69-8edc-6d74c02f4c0e.metadata.json
...
...
...
Look for the metadata.json file whose name starts with the highest number (00004 in my case). You can then use that file in a query like this to retrieve the latest state of your underlying table.
# Use the highest numbered metadata file (00004 appears to be the latest in my case)
result = db.sql("""
SELECT *
FROM iceberg_scan('s3://your_bucket/warehouse/sales_iceberg/metadata/00004-4a2cff23-2bf6-4c69-8edc-6d74c02f4c0e.metadata.json')
LIMIT 10
""")
print(result)
┌──────────┬─────────────┬─────────────────────┬──────────┬────────────┐
│ order_id │ customer_id │         ts          │  status  │ amount_usd │
│  int64   │    int64    │      timestamp      │ varchar  │   double   │
├──────────┼─────────────┼─────────────────────┼──────────┼────────────┤
│      105 │           4 │ 2025-08-03 08:30:00 │ created  │       35.0 │
│      104 │           3 │ 2025-08-02 11:47:00 │ refunded │      250.0 │
│      101 │           1 │ 2025-08-01 10:00:00 │ paid     │      120.0 │
│      102 │           2 │ 2025-08-01 10:05:00 │ paid     │       75.5 │
└──────────┴─────────────┴─────────────────────┴──────────┴────────────┘
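Copying the file name by hand works, but you can also let DuckDB pick the newest metadata file for you. A small sketch, assuming the same bucket layout as above; it relies on the zero-padded sequence number at the start of each file name, so the lexical maximum is the latest:
# Find the most recent metadata.json for the table and scan it in one go.
latest_metadata = db.sql("""
SELECT max(file)
FROM glob('s3://your_bucket/warehouse/sales_iceberg/metadata/*.metadata.json')
""").fetchone()[0]
print("Using:", latest_metadata)
print(db.sql(f"SELECT count(*) FROM iceberg_scan('{latest_metadata}')"))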
Want a specific snapshot? Use this to get a list.
result = db.sql("""
SELECT *
FROM iceberg_snapshots('s3://your_bucket/warehouse/sales_iceberg/metadata/00004-4a2cff23-2bf6-4c69-8edc-6d74c02f4c0e.metadata.json')
""")
print("Available Snapshots:")
print(result)
Available Snapshots:
 sequence_number | snapshot_id         | timestamp_ms            | manifest_list
 uint64          | uint64              | timestamp               | varchar
-----------------+---------------------+-------------------------+--------------------------------------------------------------
 1               | 5665457382547658217 | 2025-09-09 10:58:44.225 | s3://your_bucket/warehouse/sales_iceberg/metadata/snap-5665457382547658217-1-bb7d0497-0f97-4483-98e2-8bd26ddcf879.avro
 3               | 8808557756756599285 | 2025-09-09 11:19:24.422 | s3://your_bucket/warehouse/sales_iceberg/metadata/snap-8808557756756599285-1-f83d407d-ec31-49d6-900e-25bc8d19049c.avro
 2               | 31637314992569797   | 2025-09-09 11:08:08.805 | s3://your_bucket/warehouse/sales_iceberg/metadata/snap-31637314992569797-1-000a2e8f-b016-4d91-9942-72fe9ddadccc.avro
 4               | 4009826928128589775 | 2025-09-09 11:43:18.117 | s3://your_bucket/warehouse/sales_iceberg/metadata/snap-4009826928128589775-1-cd184303-38ab-4736-90da-52e0cf102abf.avro
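With a snapshot_id in hand, you can ask iceberg_scan for that specific snapshot rather than the latest state. A hedged sketch: the snapshot_from_id named parameter comes from the duckdb-iceberg extension, and its behaviour may vary slightly between extension versions.
# Read the table as of one specific snapshot (an id taken from the list above).
result = db.sql("""
SELECT *
FROM iceberg_scan(
    's3://your_bucket/warehouse/sales_iceberg/metadata/00004-4a2cff23-2bf6-4c69-8edc-6d74c02f4c0e.metadata.json',
    snapshot_from_id := 5665457382547658217
)
ORDER BY order_id
""")
print(result)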
7) Optional extra: Writing from Spark/Glue
If you prefer Spark for larger batch writes, Glue can read/write Iceberg tables registered in the Glue Catalogue. Youβll probably still want to use Athena for ad-hoc SQL, time travel, and maintenance, but large CTAS/ETL can come via Glue jobs. (Just be aware that version compatibility and AWS LakeFormation permissions can bite, as Glue and Athena may lag slightly on Iceberg versions.)
Here's an example of some Glue Spark code that inserts a few new data rows, starting at order_id = 110, into our existing table. Before running this, you should add the following Glue job parameter (under Glue Job Details -> Advanced Parameters -> Job parameters):
Key: --conf
Value: spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
import sys
import random
from datetime import datetime
from pyspark.context import SparkContext
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql import Row
# --------------------------------------------------------
# Init Glue job
# --------------------------------------------------------
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
# --------------------------------------------------------
# Force Iceberg + Glue catalog configs (dynamic only)
# --------------------------------------------------------
spark.conf.set("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
spark.conf.set("spark.sql.catalog.glue_catalog.warehouse", "s3://your_bucket/warehouse/")
spark.conf.set("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
spark.conf.set("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
spark.conf.set("spark.sql.defaultCatalog", "glue_catalog")
# --------------------------------------------------------
# Debug: list catalogs to confirm glue_catalog is registered
# --------------------------------------------------------
print("Current catalogs available:")
spark.sql("SHOW CATALOGS").show(truncate=False)
# --------------------------------------------------------
# Read existing Iceberg table (optional)
# --------------------------------------------------------
existing_table_df = glueContext.create_data_frame.from_catalog(
database="analytics",
table_name="sales_iceberg"
)
print("Existing table schema:")
existing_table_df.printSchema()
# --------------------------------------------------------
# Create 5 new records
# --------------------------------------------------------
new_records_data = []
for i in range(5):
order_id = 110 + i
record = {
"order_id": order_id,
"customer_id": 1000 + (i % 10),
"price": round(random.uniform(10.0, 500.0), 2),
"created_at": datetime.now(),
"status": "completed"
}
new_records_data.append(record)
new_records_df = spark.createDataFrame([Row(**r) for r in new_records_data])
print(f"Creating {new_records_df.count()} new records:")
new_records_df.show()
# Register temp view for SQL insert
new_records_df.createOrReplaceTempView("new_records_temp")
# --------------------------------------------------------
# Insert into Iceberg table (alias columns as needed)
# --------------------------------------------------------
spark.sql("""
INSERT INTO analytics.sales_iceberg (order_id, customer_id, ts, status, amount_usd)
SELECT order_id,
customer_id,
created_at AS ts,
status,
price AS amount_usd
FROM new_records_temp
""")
print(" Sccessfully added 5 new records to analytics.sales_iceberg")
# --------------------------------------------------------
# Commit Glue job
# --------------------------------------------------------
job.commit()
Double-check with Athena.
select * from analytics.sales_iceberg
order by order_id
# order_id customer_id ts status amount_usd
1 101 1 2025-08-01 10:00:00.000000 paid 120.0
2 102 2 2025-08-01 10:05:00.000000 paid 75.5
3 104 3 2025-08-02 11:47:00.000000 refunded 250.0
4 105 4 2025-08-03 08:30:00.000000 created 35.0
5 110 1000 2025-09-10 16:06:45.505935 completed 248.64
6 111 1001 2025-09-10 16:06:45.505947 completed 453.76
7 112 1002 2025-09-10 16:06:45.505955 completed 467.79
8 113 1003 2025-09-10 16:06:45.505963 completed 359.9
9 114 1004 2025-09-10 16:06:45.506059 completed 398.52
Future Steps
From here, you could:
- Create more tables with data
- Add schema evolution
- Experiment with partition evolution (e.g., change the table partition from day → hour as volumes grow)
- Add scheduled maintenance. For example, EventBridge, Step Functions, and Lambda could be used to run OPTIMIZE/VACUUM on a scheduled cadence (see the sketch below).
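As a taste of that last item, here is a hedged sketch of what a scheduled maintenance Lambda might look like: an EventBridge rule invokes it nightly, and it submits OPTIMIZE and VACUUM to Athena through boto3. The workgroup and results bucket are placeholders, not resources created earlier in this article.
import boto3
# Sketch of a Lambda handler that EventBridge could invoke on a schedule to run
# Iceberg table maintenance through Athena. Workgroup/output location are placeholders.
athena = boto3.client("athena")
MAINTENANCE_QUERIES = [
    "OPTIMIZE analytics.sales_iceberg REWRITE DATA USING BIN_PACK",
    "VACUUM analytics.sales_iceberg",
]
def lambda_handler(event, context):
    query_ids = []
    for sql in MAINTENANCE_QUERIES:
        # A production job would poll get_query_execution and wait for each
        # statement to finish before submitting the next one.
        resp = athena.start_query_execution(
            QueryString=sql,
            QueryExecutionContext={"Database": "analytics"},
            WorkGroup="iceberg-demo",
            ResultConfiguration={"OutputLocation": "s3://my-demo-lake/athena-results/"},
        )
        query_ids.append(resp["QueryExecutionId"])
    return {"submitted": query_ids}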
Summary
In this article, I've tried to provide a clear path for building an Iceberg data lakehouse on AWS. It should serve as a guide for data engineers who want to bridge the gap between simple object storage and complex enterprise data warehouses.
Hopefully, I've shown that building a Data Lakehouse (a system that combines the low cost of data lakes with the transactional integrity of warehouses) doesn't necessarily require extensive infrastructure deployment. And while creating a full lakehouse is something that evolves over a long time, I hope I've convinced you that you really can make the bones of one in an afternoon.
By leveraging Apache Iceberg on a cloud storage system like Amazon S3, I demonstrated how to transform static files into dynamic, managed tables capable of ACID transactions, row-level mutations (MERGE, UPDATE, DELETE), and time travel, all without provisioning a single server.
I also showed that by using newer analytic tools such as DuckDB, it's possible to read small-to-medium data lakes locally. And when your data volumes grow too big for local processing, I showed how easy it is to step up to an enterprise-class data processing platform like Spark.