doesn't need to be that complicated. In this article, I'll show you how to develop a basic, "starter" one that uses an Iceberg table on AWS S3 storage. Once the table is registered using AWS Glue, you'll be able to query and mutate it from Amazon Athena, including using:
- Time travel queries
- Merging, updating and deleting data
- Optimising and vacuuming your tables
I'll also show you how to inspect the same tables locally from **DuckDB**, and we'll see how to use Glue/Spark to insert more table data.
Our example might be basic, but it'll showcase the setup, the different tools and the processes you can put in place to build up a more extensive data store. All modern cloud providers have equivalents of the AWS services I'm discussing in this article, so it should be fairly straightforward to replicate what I discuss here on Azure, Google Cloud, and others.
To make sure we're all on the same page, here is a brief explanation of some of the key technologies we'll be using.
AWS Glue/Spark
AWS Glue is a fully managed, serverless ETL service from Amazon that streamlines data preparation and integration for analytics and machine learning. It automatically detects and catalogues metadata from various sources, such as S3, into the centralised Glue Data Catalogue. Additionally, it can create customisable Python-based Spark ETL scripts to execute these tasks on a scalable, serverless Apache Spark platform. This makes it great for building data lakes on Amazon S3, loading data into data warehouses like Amazon Redshift, and performing data cleaning and transformation, all without managing infrastructure.
AWS Athena
AWS Athena is an interactive query service that simplifies data analysis directly in Amazon S3 using standard SQL. As a serverless platform, there's no need to manage or provision servers; just point Athena at your S3 data, define your schema (usually with AWS Glue), and begin running SQL queries. It's frequently utilised for ad hoc analysis, reporting, and exploration of large datasets in formats such as CSV, JSON, ORC, or Parquet.
Iceberg tables
Iceberg tables are an open table format for datasets that provides database-like capabilities for data stored in data lakes, such as Amazon S3 object storage. Traditionally, on S3, you can create, read, and delete objects (files), but you cannot update them in place. The Iceberg format addresses that limitation while also offering other benefits, including ACID transactions, schema evolution, hidden partitioning, and time-travel features.
DuckDB
DuckDB is an in-process analytical database written in C++ and designed for analytical SQL workloads. Since its release a few years ago, it has grown in popularity and is now one of the premier data processing tools used by data engineers and scientists, thanks to its grounding in SQL, performance, and versatility.
Scenario overview
Let's say you have been tasked with building a small "warehouse-lite" analytics table for order events, but you don't want to adopt a heavyweight platform just yet. You need:
- Safe writes (no broken readers, no partial commits)
- Row-level changes (UPDATE/DELETE/MERGE, not only append)
- Point-in-time reads (for audits and debugging)
- Local analytics against production-accurate data for quick checks
What we'll build
- Create an Iceberg table in Glue & S3 via Athena
- Load and mutate rows (INSERT/UPDATE/DELETE/MERGE)
- Time travel to prior snapshots (by timestamp and by snapshot ID)
- Keep it fast with OPTIMIZE and VACUUM
- Read the same table locally from DuckDB (S3 access via DuckDB Secrets)
- See how to add new records to our table using Glue Spark code
So, in a nutshell, we'll be using:
- S3 for data storage
- Glue Catalogue for table metadata/discovery
- Athena for serverless SQL reads and writes
- DuckDB for cheap, local analytics against the same Iceberg table
- **Spark** for processing grunt
The key takeaway from our perspective is that by using the above technologies, we will be able to perform database-like queries on object storage.
Setting up our development environment
I prefer to isolate local tooling in a separate environment. Use any tool you like to do this; I'll use conda since that's what I usually do. For demo purposes, I'll be running all the code within a Jupyter Notebook environment.
# create and activate a local env
conda create -n iceberg-demo python=3.11 -y
conda activate iceberg-demo
# install duckdb CLI + Python package and awscli for quick tests
pip install duckdb awscli jupyter
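If you want a quick check that the notebook can see the new environment, this one-liner (run inside Jupyter, assuming you launched it from the iceberg-demo env) is enough:
# Quick sanity check: confirm the DuckDB package imports and report its version.
import duckdb
print(duckdb.__version__)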
Prerequisites
As we'll be using AWS services, you'll need an AWS account, plus the following:
- An S3 bucket for the data lake (e.g., s3://my-demo-lake/warehouse/)
- A Glue database (we'll create one)
- Athena engine version 3 in your workgroup
- An IAM role or user for Athena with S3 + Glue permissions
1) Athena setup
Once you've signed into AWS, open Athena in the console and set your workgroup, engine version and S3 output location (for query results). To do this, look for the hamburger-style menu icon at the top left of the Athena home screen. Click on it to bring up a new menu block on the left. In there, you should see an Administration -> Workgroups link. You will automatically be assigned to the primary workgroup. You can stick with this or create a new one if you like. Whichever option you choose, edit it and ensure that the following options are selected:
- Analytics engine: Athena SQL. Manually set the engine version to 3.
- Select customer-managed query result configuration and enter the required bucket and account information.
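If you prefer scripting this setup to clicking through the console, the sketch below does roughly the same thing with boto3. It isn't part of the walkthrough above: the workgroup name, results bucket and region are placeholders you'd swap for your own.
import boto3
# Minimal sketch: create a dedicated Athena workgroup with engine version 3
# and a customer-managed query result location. All names here are placeholders.
athena = boto3.client("athena", region_name="eu-west-1")
athena.create_work_group(
    Name="iceberg-demo",
    Configuration={
        "ResultConfiguration": {"OutputLocation": "s3://my-demo-lake/athena-results/"},
        "EngineVersion": {"SelectedEngineVersion": "Athena engine version 3"},
        "EnforceWorkGroupConfiguration": True,
    },
    Description="Workgroup for the Iceberg lakehouse demo",
)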
2) Create an Iceberg table in Athena
We'll store order events and let Iceberg manage partitioning transparently. I'll use a "hidden" partition on the day of the timestamp to spread writes/reads. Go back to the Athena home page and launch the Trino SQL query editor. Your screen should look like this.
Image from AWS website
Type in and run the following SQL. Change bucket/table names to suit.
-- This automatically creates a Glue database
-- if you don't have one already
CREATE DATABASE IF NOT EXISTS analytics;
CREATE TABLE analytics.sales_iceberg (
order_id bigint,
customer_id bigint,
ts timestamp,
status string,
amount_usd double
)
PARTITIONED BY (day(ts))
LOCATION 's3://your_bucket/warehouse/sales_iceberg/'
TBLPROPERTIES (
'table_type' = 'ICEBERG',
'format' = 'parquet',
'write_compression' = 'snappy'
)
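Because Athena registers the table in the Glue Data Catalogue, you can sanity-check the registration with a quick boto3 lookup. A minimal sketch, assuming your local credentials can call Glue and that you adjust the region to your own:
import boto3
# Confirm the new Iceberg table is visible in the Glue Data Catalogue.
glue = boto3.client("glue", region_name="eu-west-1")  # placeholder region
table = glue.get_table(DatabaseName="analytics", Name="sales_iceberg")["Table"]
print(table["Parameters"].get("table_type"))   # should report ICEBERG
print(table["StorageDescriptor"]["Location"])  # the S3 location we set above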
3) Load and mutate data (INSERT / UPDATE / DELETE / MERGE)
Athena supports real Iceberg DML, allowing you to insert rows, update and delete records, and upsert using the MERGE statement. Under the hood, Iceberg uses snapshot-based ACID with delete files; readers stay consistent while writers work in parallel.
Seed a few rows.
INSERT INTO analytics.sales_iceberg VALUES
(101, 1, timestamp '2025-08-01 10:00:00', 'created', 120.00),
(102, 2, timestamp '2025-08-01 10:05:00', 'created', 75.50),
(103, 2, timestamp '2025-08-02 09:12:00', 'created', 49.99),
(104, 3, timestamp '2025-08-02 11:47:00', 'created', 250.00);
A quick sanity check.
SELECT * FROM analytics.sales_iceberg ORDER BY order_id;
order_id | customer_id | ts | status | amount_usd
----------+-------------+-----------------------+----------+-----------
101 | 1 | 2025-08-01 10:00:00 | created | 120.00
102 | 2 | 2025-08-01 10:05:00 | created | 75.50
103 | 2 | 2025-08-02 09:12:00 | created | 49.99
104 | 3 | 2025-08-02 11:47:00 | created | 250.00
Update and delete.
UPDATE analytics.sales_iceberg
SET status = 'paid'
WHERE order_id IN (101, 102);

-- removes order 103
DELETE FROM analytics.sales_iceberg
WHERE status = 'created' AND amount_usd < 60;
Idempotent upserts with MERGE
Let's treat order 104 as refunded and create a new order 105.
MERGE INTO analytics.sales_iceberg AS t
USING (
VALUES
(104, 3, timestamp '2025-08-02 11:47:00', 'refunded', 250.00),
(105, 4, timestamp '2025-08-03 08:30:00', 'created', 35.00)
) AS s(order_id, customer_id, ts, status, amount_usd)
ON s.order_id = t.order_id
WHEN MATCHED THEN
UPDATE SET
customer_id = s.customer_id,
ts = s.ts,
status = s.status,
amount_usd = s.amount_usd
WHEN NOT MATCHED THEN
INSERT (order_id, customer_id, ts, status, amount_usd)
VALUES (s.order_id, s.customer_id, s.ts, s.status, s.amount_usd);
You can now re-query to see: 101/102 → paid, 103 deleted, 104 → refunded, and 105 → created. (If you're running this in a "real" account, you'll notice the S3 object count ticking up; more on maintenance shortly.)
SELECT * FROM analytics.sales_iceberg ORDER BY order_id
# order_id customer_id ts status amount_usd
1 101 1 2025-08-01 10:00:00.000000 paid 120.0
2 105 4 2025-08-03 08:30:00.000000 created 35.0
3 102 2 2025-08-01 10:05:00.000000 paid 75.5
4 104 3 2025-08-02 11:47:00.000000 refunded 250.0
4) Time travel (and version travel)
This is where the real value of using Iceberg shines. You can query the table as it looked at a moment in time or by a specific snapshot ID. In Athena, use this syntax:
-- Time travel to noon on Aug 2 (UTC)
SELECT order_id, status, amount_usd
FROM analytics.sales_iceberg
FOR TIMESTAMP AS OF TIMESTAMP '2025-08-02 12:00:00 UTC'
ORDER BY order_id;
-- Or Version travel (replace the id with an actual snapshot id from your table)
SELECT *
FROM analytics.sales_iceberg
FOR VERSION AS OF 949530903748831860;
To get the various version (snapshot) IDs associated with a particular table, use this query.
SELECT * FROM "analytics"."sales_iceberg$snapshots"
ORDER BY committed_at DESC;
5) Keeping your data healthy: OPTIMIZE and VACUUM
Row-level writes (UPDATE/DELETE/MERGE) create many delete files and can fragment data. Two statements keep things fast and storage-friendly:
- OPTIMIZE … REWRITE DATA USING BIN_PACK: compacts small/fragmented files and folds deletes into the data files
- VACUUM: expires old snapshots and cleans up orphan files
-- compact "hot" data (yesterday) and merge deletes
OPTIMIZE analytics.sales_iceberg
REWRITE DATA USING BIN_PACK
WHERE ts >= date_trunc('day', current_timestamp - interval '1' day);
-- expire old snapshots and remove orphan files
VACUUM analytics.sales_iceberg;
6) Local analytics with DuckDB (read-only)
It's great to be able to sanity-check production tables from a laptop without having to run a cluster. DuckDB's httpfs + iceberg extensions make this simple.
6.1 Install & load extensions
Open your Jupyter notebook and type in the following.
# httpfs gives S3 support; iceberg adds Iceberg readers.
import duckdb as db
db.sql("install httpfs; load httpfs;")
db.sql("install iceberg; load iceberg;")
6.2 Provide S3 credentials to DuckDB the "right" way (Secrets)
DuckDB has a small but powerful secrets manager. The most robust setup in AWS is the credential chain provider, which reuses whatever the AWS SDK can find (environment variables, IAM role, etc.). Therefore, you will need to ensure that, for instance, your AWS CLI credentials are configured.
db.sql("""CREATE SECRET ( TYPE s3, PROVIDER credential_chain )""")
After that, any s3://… reads in this DuckDB session will use the credentials from that secret.
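If you're running somewhere without a configured AWS profile or role (a throwaway container, say), DuckDB also accepts explicit keys in the secret. The values below are placeholders, not credentials used elsewhere in this article:
# Alternative: an S3 secret with explicit (placeholder) credentials instead of
# the credential chain. Useful where no AWS profile or IAM role is available.
db.sql("""
CREATE SECRET my_s3_secret (
    TYPE s3,
    KEY_ID 'AKIA...REPLACE_ME',
    SECRET 'REPLACE_ME',
    REGION 'eu-west-1'
)
""")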
6.3 Point DuckDB at the Iceberg table's metadata
The most explicit way is to reference a concrete metadata file (e.g., the latest one in your table's metadata/ folder). To get a list of those, use this query:
result = db.sql("""
SELECT *
FROM glob('s3://your_bucket/warehouse/**')
ORDER BY file
""")
print(result)
...
...
s3://your_bucket_name/warehouse/sales_iceberg/metadata/00000-942a25ce-24e5-45f8-ae86-b70d8239e3bb.metadata.json
s3://your_bucket_name/warehouse/sales_iceberg/metadata/00001-fa2d9997-590e-4231-93ab-642c0da83f19.metadata.json
s3://your_bucket_name/warehouse/sales_iceberg/metadata/00002-0da3a4af-64af-4e46-bea2-0ac450bf1786.metadata.json
s3://your_bucket_name/warehouse/sales_iceberg/metadata/00003-eae21a3d-1bf3-4ed1-b64e-1562faa445d0.metadata.json
s3://your_bucket_name/warehouse/sales_iceberg/metadata/00004-4a2cff23-2bf6-4c69-8edc-6d74c02f4c0e.metadata.json
...
...
...
Look for the metadata.json file whose name starts with the highest number (00004 in my case). You can then use that file in a query like this to retrieve the latest state of your underlying table.
# Use the highest numbered metadata file (00004 appears to be the latest in my case)
result = db.sql("""
SELECT *
FROM iceberg_scan('s3://your_bucket/warehouse/sales_iceberg/metadata/00004-4a2cff23-2bf6-4c69-8edc-6d74c02f4c0e.metadata.json')
LIMIT 10
""")
print(result)
┌──────────┬─────────────┬─────────────────────┬──────────┬────────────┐
│ order_id │ customer_id │         ts          │  status  │ amount_usd │
│  int64   │    int64    │      timestamp      │ varchar  │   double   │
├──────────┼─────────────┼─────────────────────┼──────────┼────────────┤
│      105 │           4 │ 2025-08-03 08:30:00 │ created  │       35.0 │
│      104 │           3 │ 2025-08-02 11:47:00 │ refunded │      250.0 │
│      101 │           1 │ 2025-08-01 10:00:00 │ paid     │      120.0 │
│      102 │           2 │ 2025-08-01 10:05:00 │ paid     │       75.5 │
└──────────┴─────────────┴─────────────────────┴──────────┴────────────┘
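Copying the file name by hand works, but you can also let DuckDB pick the newest metadata file for you. A small sketch, assuming the same bucket layout as above; it relies on the zero-padded sequence number at the start of each file name, so the lexical maximum is the latest:
# Find the most recent metadata.json for the table and scan it in one go.
latest_metadata = db.sql("""
SELECT max(file)
FROM glob('s3://your_bucket/warehouse/sales_iceberg/metadata/*.metadata.json')
""").fetchone()[0]
print("Using:", latest_metadata)
print(db.sql(f"SELECT count(*) FROM iceberg_scan('{latest_metadata}')"))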
Want a specific snapshot? Use this to get a list.
result = db.sql("""
SELECT *
FROM iceberg_snapshots('s3://your_bucket/warehouse/sales_iceberg/metadata/00004-4a2cff23-2bf6-4c69-8edc-6d74c02f4c0e.metadata.json')
""")
print("Available Snapshots:")
print(result)
Available Snapshots:
 sequence_number | snapshot_id         | timestamp_ms            | manifest_list
 uint64          | uint64              | timestamp               | varchar
-----------------+---------------------+-------------------------+--------------------------------------------------------------
 1               | 5665457382547658217 | 2025-09-09 10:58:44.225 | s3://your_bucket/warehouse/sales_iceberg/metadata/snap-5665457382547658217-1-bb7d0497-0f97-4483-98e2-8bd26ddcf879.avro
 3               | 8808557756756599285 | 2025-09-09 11:19:24.422 | s3://your_bucket/warehouse/sales_iceberg/metadata/snap-8808557756756599285-1-f83d407d-ec31-49d6-900e-25bc8d19049c.avro
 2               | 31637314992569797   | 2025-09-09 11:08:08.805 | s3://your_bucket/warehouse/sales_iceberg/metadata/snap-31637314992569797-1-000a2e8f-b016-4d91-9942-72fe9ddadccc.avro
 4               | 4009826928128589775 | 2025-09-09 11:43:18.117 | s3://your_bucket/warehouse/sales_iceberg/metadata/snap-4009826928128589775-1-cd184303-38ab-4736-90da-52e0cf102abf.avro
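With a snapshot_id in hand, you can ask iceberg_scan for that specific snapshot rather than the latest state. A hedged sketch: the snapshot_from_id named parameter comes from the duckdb-iceberg extension, and its behaviour may vary slightly between extension versions.
# Read the table as of one specific snapshot (an id taken from the list above).
result = db.sql("""
SELECT *
FROM iceberg_scan(
    's3://your_bucket/warehouse/sales_iceberg/metadata/00004-4a2cff23-2bf6-4c69-8edc-6d74c02f4c0e.metadata.json',
    snapshot_from_id := 5665457382547658217
)
ORDER BY order_id
""")
print(result)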
7) Optional extra: Writing from Spark/Glue
If you prefer Spark for larger batch writes, Glue can read/write Iceberg tables registered in the Glue Catalogue. Youβll probably still want to use Athena for ad-hoc SQL, time travel, and maintenance, but large CTAS/ETL can come via Glue jobs. (Just be aware that version compatibility and AWS LakeFormation permissions can bite, as Glue and Athena may lag slightly on Iceberg versions.)
Here's an example of some Glue Spark code that inserts a few new data rows, starting at order_id = 110, into our existing table. Before running this, you should add the following Glue job parameter (under Glue Job Details -> Advanced Parameters -> Job parameters):
Key: --conf
Value: spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
import sys
import random
from datetime import datetime
from pyspark.context import SparkContext
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql import Row
# --------------------------------------------------------
# Init Glue job
# --------------------------------------------------------
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
# --------------------------------------------------------
# Force Iceberg + Glue catalog configs (dynamic only)
# --------------------------------------------------------
spark.conf.set("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
spark.conf.set("spark.sql.catalog.glue_catalog.warehouse", "s3://your_bucket/warehouse/")
spark.conf.set("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
spark.conf.set("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
spark.conf.set("spark.sql.defaultCatalog", "glue_catalog")
# --------------------------------------------------------
# Debug: list catalogs to confirm glue_catalog is registered
# --------------------------------------------------------
print("Current catalogs available:")
spark.sql("SHOW CATALOGS").show(truncate=False)
# --------------------------------------------------------
# Read existing Iceberg table (optional)
# --------------------------------------------------------
existing_table_df = glueContext.create_data_frame.from_catalog(
database="analytics",
table_name="sales_iceberg"
)
print("Existing table schema:")
existing_table_df.printSchema()
# --------------------------------------------------------
# Create 5 new records
# --------------------------------------------------------
new_records_data = []
for i in range(5):
order_id = 110 + i
record = {
"order_id": order_id,
"customer_id": 1000 + (i % 10),
"price": round(random.uniform(10.0, 500.0), 2),
"created_at": datetime.now(),
"status": "completed"
}
new_records_data.append(record)
new_records_df = spark.createDataFrame([Row(**r) for r in new_records_data])
print(f"Creating {new_records_df.count()} new records:")
new_records_df.show()
# Register temp view for SQL insert
new_records_df.createOrReplaceTempView("new_records_temp")
# --------------------------------------------------------
# Insert into Iceberg table (alias columns as needed)
# --------------------------------------------------------
spark.sql("""
INSERT INTO analytics.sales_iceberg (order_id, customer_id, ts, status, amount_usd)
SELECT order_id,
customer_id,
created_at AS ts,
status,
price AS amount_usd
FROM new_records_temp
""")
print(" Sccessfully added 5 new records to analytics.sales_iceberg")
# --------------------------------------------------------
# Commit Glue job
# --------------------------------------------------------
job.commit()
Double-check with Athena.
select * from analytics.sales_iceberg
order by order_id
# order_id customer_id ts status amount_usd
1 101 1 2025-08-01 10:00:00.000000 paid 120.0
2 102 2 2025-08-01 10:05:00.000000 paid 75.5
3 104 3 2025-08-02 11:47:00.000000 refunded 250.0
4 105 4 2025-08-03 08:30:00.000000 created 35.0
5 110 1000 2025-09-10 16:06:45.505935 completed 248.64
6 111 1001 2025-09-10 16:06:45.505947 completed 453.76
7 112 1002 2025-09-10 16:06:45.505955 completed 467.79
8 113 1003 2025-09-10 16:06:45.505963 completed 359.9
9 114 1004 2025-09-10 16:06:45.506059 completed 398.52
Future Steps
From here, you could:
- Create more tables with data
- Add schema evolution
- Experiment with partition evolution (e.g., change the table partition from day → hour as volumes grow)
- Add scheduled maintenance. For example, EventBridge, Step Functions, and Lambda could be used to run OPTIMIZE/VACUUM on a scheduled cadence (see the sketch below).
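As a taste of that last item, here is a hedged sketch of what a scheduled maintenance Lambda might look like: an EventBridge rule invokes it nightly, and it submits OPTIMIZE and VACUUM to Athena through boto3. The workgroup and results bucket are placeholders, not resources created earlier in this article.
import boto3
# Sketch of a Lambda handler that EventBridge could invoke on a schedule to run
# Iceberg table maintenance through Athena. Workgroup/output location are placeholders.
athena = boto3.client("athena")
MAINTENANCE_QUERIES = [
    "OPTIMIZE analytics.sales_iceberg REWRITE DATA USING BIN_PACK",
    "VACUUM analytics.sales_iceberg",
]
def lambda_handler(event, context):
    query_ids = []
    for sql in MAINTENANCE_QUERIES:
        # A production job would poll get_query_execution and wait for each
        # statement to finish before submitting the next one.
        resp = athena.start_query_execution(
            QueryString=sql,
            QueryExecutionContext={"Database": "analytics"},
            WorkGroup="iceberg-demo",
            ResultConfiguration={"OutputLocation": "s3://my-demo-lake/athena-results/"},
        )
        query_ids.append(resp["QueryExecutionId"])
    return {"submitted": query_ids}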
Summary
In this article, I've tried to provide a clear path for building an Iceberg data lakehouse on AWS. It should serve as a guide for data engineers who want to bridge the gap between simple object storage and complex enterprise data warehouses.
Hopefully, I've shown that building a Data Lakehouse (a system that combines the low cost of data lakes with the transactional integrity of warehouses) doesn't necessarily require extensive infrastructure deployment. And while creating a full lakehouse is something that evolves over a long time, I hope I've convinced you that you really can make the bones of one in an afternoon.
By leveraging Apache Iceberg on a cloud storage system like Amazon S3, I demonstrated how to transform static files into dynamic, managed tables capable of ACID transactions, row-level mutations (MERGE, UPDATE, DELETE), and time travel, all without provisioning a single server.
I also showed that by using newer analytic tools such as DuckDB, it's possible to read small-to-medium data lakes locally. And when your data volumes grow too big for local processing, I showed how easy it is to step up to an enterprise-class data processing platform like Spark.