There’s a disconnect at the center of most AI and analytics infrastructure. Whether it’s GPU-accelerated model training or petabyte-scale Spark and Presto queries, modern workloads have pushed compute, networking, and orchestration forward. But data storage often remains anchored to architectures designed decades ago for different workloads. The result: state-of-the-art infrastructure bottlenecked by storage systems that can’t keep pace with the concurrency, metadata scale, and throughput these workloads demand.
To understand those tradeoffs, start with the data itself: AI training datasets, model checkpoints, Parquet files, Iceberg tables, embeddings, logs. All of it is immutable, written once, read many times, and accessed by hundreds or thousands of concurrent processes. It doesn’t require in-place modification or hierarchical directory structures. It requires atomic operations, consistent metadata, and lock-free concurrency across billions of objects. Object storage was purpose-built for exactly this. File systems were not.
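The access pattern described above can be made concrete with a toy model. This is an illustrative sketch, not any real storage client: a minimal in-memory store with a flat keyspace, atomic put-if-absent writes, and lock-free reads. The class and key names are hypothetical.

```python
# Sketch of write-once, read-many object semantics over a flat keyspace.
# Hypothetical names throughout; not a real client library.
from threading import Lock

class FlatObjectStore:
    def __init__(self):
        self._objects = {}   # flat key -> immutable bytes (no directory tree)
        self._lock = Lock()  # guards only the single atomic insert

    def put_once(self, key: str, data: bytes) -> bool:
        """Atomically create an object; it is immutable after the first write."""
        with self._lock:
            if key in self._objects:
                return False  # write-once: a second writer loses, object unchanged
            self._objects[key] = data
            return True

    def get(self, key: str) -> bytes:
        # Reads need no coordination: objects never change after creation,
        # so thousands of concurrent readers proceed without locks.
        return self._objects[key]

    def list(self, prefix: str = "") -> list[str]:
        # LIST is a scan over a flat index, not a directory traversal.
        return sorted(k for k in self._objects if k.startswith(prefix))

store = FlatObjectStore()
store.put_once("training/run-1/checkpoint-0001.pt", b"...")
store.put_once("training/run-1/checkpoint-0001.pt", b"late writer")  # rejected
```

The point of the sketch: because objects are immutable, the only operation that needs any locking at all is creation, and listing reduces to a prefix scan over one index.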
Where File System Architecture Breaks Down
So what happens when organizations run these workloads on file-based architectures anyway? The symptoms are predictable. Training jobs that completed reliably now stall. Spark queries time out. LIST operations slow from milliseconds to seconds. Engineers trace the problem through the application, the network, the orchestration layer — everything checks out. The bottleneck is in storage, but not in capacity or bandwidth. Something deeper.
Here’s what’s actually happening. Every request to a file system triggers a sequence of operations: path resolution through directories, inode lookups, lock coordination, metadata updates. A LIST doesn’t scan a flat index. Instead, it traverses a hierarchy, touching metadata at every level. At scale, these operations compound. Locks serialize work that should run in parallel. Directory traversals add latency to every metadata call. The bottleneck isn’t a component. It’s the architecture.
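The contrast between the two metadata designs can be sketched in a few lines. This is a toy model under stated assumptions, not any real file system's implementation: path resolution walks every component of a nested structure and takes a per-directory lock at each level, while an object-store LIST is a single prefix scan over one flat index. All names here are hypothetical.

```python
# Toy comparison of hierarchical path resolution vs. a flat metadata index.
from threading import Lock

# File-system style: nested directories; every lookup walks each path
# component and takes that directory's lock along the way.
class Directory:
    def __init__(self):
        self.lock = Lock()
        self.entries = {}  # name -> child Directory or inode number

def fs_resolve(root: Directory, path: str) -> int:
    """Resolve a path, counting one locked metadata touch per component."""
    node, hops = root, 0
    for component in path.strip("/").split("/"):
        with node.lock:  # lock coordination at every level of the hierarchy
            node = node.entries[component]
        hops += 1
    return hops

# Object-store style: one flat index; LIST is a single prefix scan.
flat_index = {
    "datasets/train/part-0001.parquet": 101,
    "datasets/train/part-0002.parquet": 102,
}

def object_list(prefix: str) -> list[str]:
    return sorted(k for k in flat_index if k.startswith(prefix))

# Build the equivalent hierarchy and compare metadata touches.
root = Directory()
d1, d2 = Directory(), Directory()
root.entries["datasets"] = d1
d1.entries["train"] = d2
d2.entries["part-0001.parquet"] = 101

hops = fs_resolve(root, "/datasets/train/part-0001.parquet")  # 3 locked hops
keys = object_list("datasets/train/")                         # 1 flat scan
```

Three locked hops for a three-component path is trivial in isolation; multiplied across billions of objects and thousands of concurrent clients, it is exactly the compounding the paragraph describes.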
The Multi-Protocol Promise Doesn’t Deliver
The pitch is compelling. One platform that supports object, file, and sometimes even block. S3 for modern AI workloads, NFS for legacy applications. Vendors like NetApp, Pure Storage, and Dell Technologies market this as the best of both worlds. But the reality is the opposite: you inherit the limitations of both. Scale makes those limitations severe, but the right architecture matters at any size. Translation layers add latency. Gateways add complexity. And troubleshooting becomes harder the moment you’re debugging across two paradigms bolted together.