TL;DR: I built apache-iceberg-fileio, a custom FileIO implementation that stores Iceberg metadata in a database instead of object storage. The current implementation uses PostgreSQL and provides consistent low-latency metadata access, especially useful for streaming ingestion workloads.
Apache Iceberg has become the de facto standard for large-scale analytics tables, offering features like schema evolution, time travel, and partition evolution. But one aspect that often gets overlooked is where Iceberg stores its metadata files.
By default, most Iceberg deployments store metadata files (metadata.json, manifest lists, and manifest files) in object storage like S3, GCS, or Azure Blob Storage. While this works, it comes with a hidden cost that many teams discover only after running Iceberg in production.
The Problem with Object Storage for Metadata
Iceberg generates numerous small metadata files. Every commit creates new metadata files, and query planning requires reading multiple manifest files. When these files live in S3 or similar object storage, you’re at the mercy of highly variable latencies.
I’ve seen S3 GET requests range anywhere from 20ms to 500ms+ for the same file size. When query planning requires reading dozens of small metadata files sequentially, these latency spikes compound into noticeable query delays.
This variability isn’t a bug — it’s the nature of distributed object storage systems. But for metadata access patterns (frequent reads of small files), it’s far from ideal.
What About Iceberg’s Built-in Caching?
You might wonder: doesn’t Iceberg have caching to mitigate this? Yes, but with important limitations.
CachingCatalog caches Table objects in memory — not the underlying metadata files. When you call catalog.loadTable(), it can return a cached Table reference. But here’s the catch: the cached Table still needs to refresh its metadata. Each refresh() call re-reads the metadata.json file from storage via FileIO. The Table object is cached; the metadata files are not.
The most resource-intensive part of loading an Iceberg table is fetching metadata from object storage and parsing it. The caching layer tries to reduce this overhead, but it can’t help when the metadata files themselves keep changing.
ContentCache provides file-level caching for manifest files, which helps with repeated reads of the same manifests. However, it’s disabled by default (via [CatalogProperties\.IO\_MANIFEST\_CACHE\_ENABLED](https://iceberg.apache.org/javadoc/1.10.0/org/apache/iceberg/CatalogProperties.html#IO_MANIFEST_CACHE_ENABLED)), and even when enabled, it doesn’t cache the metadata.json files that change on every commit.
BaseMetastoreTableOperations caches TableMetadata in memory, but this cache is invalidated whenever the table is refreshed — which happens frequently in active tables.
The Streaming Ingestion Problem
This caching gap becomes particularly painful in streaming ingestion systems. When you’re making frequent, small commits (common in Kafka-to-Iceberg pipelines), every commit:
- Writes a new metadata.json file (S3 PUT)
- Writes a new manifest list file (S3 PUT)
- Writes new manifest files (S3 PUT)
These new files don’t exist in any cache — they’re brand new. With S3’s variable latency, a commit that should complete in milliseconds can spike to seconds. String enough of these together in a streaming pipeline, and you’ve got a throughput bottleneck.
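Back-of-the-envelope numbers make that ceiling concrete. This is my own illustrative sketch, assuming three sequential PUTs per commit and ignoring catalog round-trips; the latencies are hypothetical inputs, not benchmarks:

```java
public class CommitThroughput {
    // Upper bound on commit rate if each commit must finish its writes
    // before the next one starts.
    static double maxCommitsPerSec(int putsPerCommit, double putLatencyMs) {
        return 1000.0 / (putsPerCommit * putLatencyMs);
    }

    public static void main(String[] args) {
        // Assumption for illustration: three sequential S3 PUTs per commit
        // (metadata.json, manifest list, one manifest file).
        for (double putMs : new double[] {20, 200, 500}) {
            System.out.printf("PUT %3.0f ms -> at most %.1f commits/sec%n",
                    putMs, maxCommitsPerSec(3, putMs));
        }
    }
}
```

At the slow end of the latency range, the pipeline is capped well below one commit per second before any data is even written.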
Reads suffer too. When consumers query the latest data (common in streaming scenarios), they must:
- Read the latest metadata.json to get the current snapshot (S3 GET)
- Read the manifest list file (S3 GET)
- Read manifest files for query planning (S3 GET)
Every table refresh to see fresh data hits S3 with the same variable latency. In near real-time analytics, where consumers constantly poll for new data, these repeated S3 reads compound the latency problem.
The fundamental issue: Iceberg’s caching helps with repeated reads of stable data, but streaming workloads constantly generate new metadata files that bypass the cache entirely.
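To see why file-level caching cannot help here, consider a toy model (my own sketch, not Iceberg's actual cache): a cache keyed by file path only pays off for paths it has seen before, and every streaming commit produces paths it has not.

```java
import java.util.HashMap;
import java.util.Map;

// Toy path-keyed file cache: a brand-new metadata path is a guaranteed miss.
class ToyFileCache {
    private final Map<String, byte[]> cache = new HashMap<>();
    int hits = 0, misses = 0;

    byte[] read(String path) {
        byte[] cached = cache.get(path);
        if (cached != null) { hits++; return cached; }
        misses++;
        byte[] fetched = ("contents of " + path).getBytes(); // stands in for an S3 GET
        cache.put(path, fetched);
        return fetched;
    }
}

public class StreamingCacheMiss {
    public static void main(String[] args) {
        ToyFileCache cache = new ToyFileCache();
        // Each streaming commit writes a fresh metadata.json; readers chasing
        // the latest snapshot never revisit an old path.
        for (int commit = 1; commit <= 5; commit++) {
            cache.read("metadata/v" + commit + ".metadata.json");
        }
        System.out.println("hits=" + cache.hits + " misses=" + cache.misses); // hits=0 misses=5
    }
}
```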
A Different Approach: Database-Backed FileIO
What if we stored Iceberg metadata in a database instead? Databases like PostgreSQL are optimized for exactly this workload: consistent, low-latency access to small pieces of data.
I built apache-iceberg-fileio, a custom FileIO implementation that stores Iceberg metadata files directly in PostgreSQL. The results have been promising:
- Consistent latency: PostgreSQL queries typically complete in single-digit or low double-digit milliseconds, with far less variance than object storage
- Simplified operations: If you’re already using JDBC Catalog with PostgreSQL, your entire metadata layer now lives in one system
- Standard backup/restore: pg_dump and pg_restore capture your complete catalog state
Architecture
The implementation follows a clean separation of concerns:
┌─────────────────────────────────────────────────────────┐
│                     Apache Iceberg                      │
│                        (FileIO)                         │
└─────────────────────────┬───────────────────────────────┘
                          │
┌─────────────────────────▼───────────────────────────────┐
│                       BlobFileIO                        │
│             (core module - storage agnostic)            │
└─────────────────────────┬───────────────────────────────┘
                          │
┌─────────────────────────▼───────────────────────────────┐
│                    BlobStorageClient                    │
│                       (interface)                       │
└─────────────────────────┬───────────────────────────────┘
                          │
         ┌────────────────┼────────────────┐
         ▼                ▼                ▼
  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐
  │ PostgreSQL  │  │    MySQL    │  │   Custom    │
  │  (MyBatis)  │  │  (future)   │  │   Backend   │
  └─────────────┘  └─────────────┘  └─────────────┘
The BlobStorageClient interface is intentionally simple:
public interface BlobStorageClient extends AutoCloseable {
    boolean exists(String path);
    FileEntry getFile(String path);
    void putFile(FileEntry entry, boolean overwrite);
    void deleteFile(String path);
}
This makes it straightforward to add support for other databases or storage backends.
Usage
Getting started is simple. Add the Maven dependency:
<dependency>
    <groupId>io.github.udaysagar2177</groupId>
    <artifactId>apache-iceberg-fileio-postgres-mybatis</artifactId>
    <version>0.0.1</version>
</dependency>
Then initialize the FileIO:
Map<String, String> properties = new HashMap<>();
properties.put("iceberg.fileio.postgres.url", "jdbc:postgresql://localhost:5432/my_catalog");
properties.put("iceberg.fileio.postgres.user", "postgres");
properties.put("iceberg.fileio.postgres.password", "password");

// Optional: enable Snappy compression
properties.put("iceberg.fileio.postgres.compression.enabled", "true");

MyBatisPostgresClient client = new MyBatisPostgresClient();
client.initialize(properties);

BlobFileIO fileIO = new BlobFileIO(client);
fileIO.initialize(properties);
The implementation automatically creates the required table on first use.
Compression Support
Iceberg metadata files compress well. The implementation supports optional Snappy compression, which typically achieves 60–80% reduction in storage size for JSON metadata files. This keeps your database lean and reduces I/O.
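As a rough illustration of why metadata compresses so well, here is a sketch that deflates a synthetic, repetitive metadata-like payload. To be clear about assumptions: the project uses Snappy, but java.util.zip.Deflater is shown here only because it ships with the JDK, and both the payload and the resulting ratio are illustrative, not measurements:

```java
import java.util.zip.Deflater;

public class MetadataCompression {
    // Deflate a byte[] and return the compressed length.
    static int compressedSize(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        byte[] out = new byte[raw.length + 64]; // ample for compressible input
        int n = deflater.deflate(out);
        deflater.end();
        return n;
    }

    public static void main(String[] args) {
        // Synthetic metadata.json-like payload: repetitive keys and paths.
        StringBuilder json = new StringBuilder("{\"snapshots\":[");
        for (int i = 0; i < 200; i++) {
            json.append("{\"snapshot-id\":").append(i)
                .append(",\"manifest-list\":\"s3://bucket/metadata/snap-")
                .append(i).append(".avro\"},");
        }
        json.append("]}");
        byte[] raw = json.toString().getBytes();
        int compressed = compressedSize(raw);
        System.out.printf("raw=%d bytes, compressed=%d bytes (%.0f%% smaller)%n",
                raw.length, compressed,
                100.0 * (raw.length - compressed) / raw.length);
    }
}
```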
Is This Right For You?
This approach isn’t for everyone. The sweet spot is streaming ingestion workloads where you’re making frequent small commits and consistent latency matters more than absolute scale.
When to Use This
- Small to medium-sized tables: Moderate partition counts (hundreds to low thousands)
- Streaming ingestion: Frequent small commits (Kafka-to-Iceberg pipelines, CDC streams)
- Latency-sensitive workloads: Query planning and commit performance matter
- JDBC Catalog users: Already running PostgreSQL for your catalog
- Operational simplicity: Want unified backup/restore with pg_dump and pg_restore
When NOT to Use This
- Very large tables: Millions of partitions or thousands of manifests per snapshot
- Large metadata files: Tables regularly exceeding 100MB+ metadata files
- Enterprise scale: Multi-petabyte tables with extreme partition counts
PostgreSQL’s BYTEA type has practical limits (~500MB-1GB per file). Large enterprises with massive tables should continue using object storage. Iceberg chose object storage as the default for good reason — it scales to extreme table sizes that databases cannot handle efficiently.
Other Considerations
Not for data files: This FileIO is designed for metadata only. Your actual data files (Parquet, ORC, Avro) should still live in object storage or HDFS.
Database dependency: Your metadata availability depends on PostgreSQL uptime. For teams already running JDBC Catalog on PostgreSQL, this isn’t a new dependency. PostgreSQL’s replication and high-availability features are mature and well-understood.
Extending to Other Backends
The architecture is intentionally extensible. Implementing support for MySQL, CockroachDB, or any other storage backend requires implementing the BlobStorageClient interface — four methods plus initialization and cleanup.
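As a sketch of what a new backend involves, here is a toy in-memory implementation. The FileEntry stand-in and the reproduced interface are assumptions made so the snippet compiles on its own; the real library types may carry more than a path and its bytes:

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for the library's FileEntry type (an assumption for this sketch).
record FileEntry(String path, byte[] data) {}

// The four-method interface from the post, reproduced so the sketch is
// self-contained.
interface BlobStorageClient extends AutoCloseable {
    boolean exists(String path);
    FileEntry getFile(String path);
    void putFile(FileEntry entry, boolean overwrite);
    void deleteFile(String path);
}

// A toy in-memory backend showing the shape a MySQL or CockroachDB client
// would take: each method maps onto a single point read or write.
class InMemoryBlobStorageClient implements BlobStorageClient {
    private final Map<String, FileEntry> store = new HashMap<>();

    @Override public boolean exists(String path) { return store.containsKey(path); }

    @Override public FileEntry getFile(String path) {
        FileEntry entry = store.get(path);
        if (entry == null) throw new IllegalArgumentException("no such file: " + path);
        return entry;
    }

    @Override public void putFile(FileEntry entry, boolean overwrite) {
        if (!overwrite && store.containsKey(entry.path()))
            throw new IllegalStateException("file exists: " + entry.path());
        store.put(entry.path(), entry);
    }

    @Override public void deleteFile(String path) { store.remove(path); }

    @Override public void close() { store.clear(); }
}
```

Swapping PostgreSQL for another store is then a matter of mapping these four calls onto that store's point reads and writes.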
Try It Out
The project is available on GitHub: https://github.com/udaysagar2177/apache-iceberg-fileio
The Maven artifacts are published to Maven Central under io.github.udaysagar2177.
If you’re running Iceberg in production and have experienced the pain of variable S3 latencies during query planning, give this approach a try. I’d love to hear about your experience.