🦄 Making great presentations more accessible. This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.
Overview
📖 AWS re:Invent 2025 - High-performance storage for AI/ML, analytics, and HPC workloads (STG336)
In this video, AWS product managers Aditi and Manish, along with Principal Engineer Mark Roper, discuss high-performance storage solutions for AI/ML, analytics, and HPC workloads. They explain how storage bottlenecks prevent compute resources from scaling linearly, wasting 90-95% of workload spend on underutilized CPUs and GPUs. The session covers two main approaches: FSx for Lustre for file-based workloads, delivering over 1 TB/s throughput with sub-millisecond latencies and new FSx Intelligent Tiering for automatic data management; and Amazon S3 Express One Zone for S3-native applications, providing 10x faster access and scaling to 2 million transactions per second. Real-world examples include Shell achieving 100% GPU utilization, Meta FAIR sustaining 140 Tbps with 1 million TPS, and Tavily cutting costs by 6x. The presentation also demonstrates how FSx for Lustre can link to S3 buckets, achieving 83% performance improvement in subsequent training runs.
; This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
Introduction: The Critical Challenge of High-Performance Storage for Compute-Intensive Workloads
Hello, everyone. My name is Aditi. I’m a Senior Product Manager for Amazon FSx. I’m joined by Manish Talreja, who’s the Principal Product Manager for Amazon S3, and Mark Roper, who is the Principal Engineer for Amazon FSx. Mark will join Manish and me for Q&A after the talk.
All three of us have spent years working directly with customers running high-performance workloads and pushing the boundaries of what’s possible with storage. We’re going to share a little bit about what we’ve learned about high-performance storage for AI/ML, analytics, and HPC workloads. We’re going to walk through some real customer use cases and real-world examples, dive into technical capabilities that enable performance at scale, and introduce exciting new features along the way.
Let’s start with the fundamental question: Who needs high-performance storage? The answer is a lot of different workloads across a wide range of industries. Think about machine learning teams running large language model training on massive datasets, data analysts querying petabytes of data interactively, or researchers running weather simulations or drug discovery with tens of thousands of cores, and the list goes on. What ties all of these together are two things. First, they’re compute-intensive: they require hundreds or thousands of cores of CPU or GPU resources. Second, they’re all data-intensive: they depend on fast, reliable access to data at massive scale. And that second part is where things really get interesting.
Let me show you how. So we frequently hear from our customers that they love that with AWS they can spin up compute clusters, run their workload faster than ever before, and then spin them down once their workload’s done and stop paying for those resources. This is the magic of the cloud. But ideally you want this: as you add more compute, more CPU, more GPU resources, you get proportionally more work done. Ideal linear scaling, beautiful.
Here’s the problem though. What if you have a storage solution that cannot keep up with the performance requirements of your workload? In that case, the work done, your throughput, plateaus. You can keep throwing CPU and GPU resources at it, but performance will not scale linearly, because all those compute instances are now competing for access to the same data store, and that data store has become the bottleneck. This is especially painful because we see that 90 to 95% of spend on these workloads is compute. So when your compute is sitting underutilized waiting for data, your time to results gets longer and your cost goes up. In an ideal world, when you’re architecting your solution, you want your storage to scale linearly with your compute so that it never becomes a bottleneck.
Why File Systems Remain Essential for High-Performance Computing
Before diving right into solutions of what we’ve built at AWS to address that very bottleneck, I want to acknowledge that customers come to AWS from two different paths. On the one hand, we have the customers that are running HPC, ML, and AI workloads on-premises for years. Their workloads are based on file-based access, and they want to maintain that paradigm and move to the cloud, gaining the benefits of the cloud. And on the other hand, we have the customers who started in the cloud from day one. So that means that they have their data stored in Amazon S3 and their applications are built around S3 APIs.
So just with a show of hands, who here is using file-based applications today? Awesome. And who here is using Amazon S3 as their data lake? That’s pretty good. That’s a great mix of customers, and this is also a very accurate representation of what we see with our customers in production today. So we’re going to cover each one of these, but let me start with lift-and-shift file system customers.
So the question remains, why do file systems remain the preferred choice for so many of these high-performance workloads? There are a couple of reasons, but the primary one is the familiar interface.
Researchers, data scientists, and developers know how to work with files and directories. It’s just more intuitive. The second reason is POSIX permissions. File systems give you granular access control. When you have multiple users accessing the same data, you want to make sure that you’re controlling who gets access to those files and who can write and execute those files.
The third reason is consistent data access. If all of your users are accessing the same file system, you want to make sure that the data they’re reading is consistent: consistency is guaranteed and there are no stale reads, and that is where file systems shine. Back in 2018, we found that many customers wanted all of these benefits of file systems, along with the ease of use and scalability of the cloud. That is when we launched FSx for Lustre.
FSx for Lustre: Fully Managed and Elastic Cloud File System with Intelligent Tiering
FSx for Lustre is built on the open-source Lustre file system. So, any of you who’ve heard of Lustre before? Okay, that’s pretty good. Lustre is the open-source, high-performance file system. It is one of the most popular high-performance file systems. It is used by national labs across the world. It is used by machine learning, AI, and HPC applications on-premises as well. Applications range from training and inference all the way to weather modeling and genomic analysis.
What we’ve done is taken Lustre, which is really powerful but also notoriously complex to manage, and combined all the benefits of the fast and scalable Lustre file system with the management and ease of use of a cloud solution. We’ve offered a fully managed, fully elastic, and fast FSx for Lustre file system. Let me walk you through all of these, one by one, starting with fully managed.
What does fully managed mean? It means this file system is tested and operated at an unprecedented scale, and we continuously monitor all the hardware resources underpinning it. If one of your servers runs into a hardware failure, we automatically detect it and replace it with a healthy server to keep your file system healthy at all times. We’ve also done the heavy lifting of ensuring that your file system is built on the latest and greatest technologies from AWS and from the open-source Lustre community, and we make those available to you with full API support.
Now I want to talk about elasticity. We’ve made Lustre fully elastic for the first time in the cloud. We were hearing growing pains from customers as they scaled their workloads in the cloud, one being that data does not grow linearly. For instance, machine learning training runs generate terabytes of checkpoints, and then you clean them up. Your simulations spike, and then you wrap those projects, so your data requirements are always going up and down. There’s a constant battle: you never want to run out of storage, but at the same time, you don’t want to pay for unused capacity.
The second thing is that not all your data is active data, so not all of it needs to sit on the fast SSD tier. You need SSD for your hot data, but your terabytes of checkpoints and results still need to be stored, just not on the fast SSD. If they stay on the faster SSD tier, it gets real expensive real fast, especially as you scale your dataset to petabyte scale. At petabyte scale it gets really hard to manage operationally, and the economics don’t scale either. That is why we launched FSx Intelligent Tiering earlier this year.
With FSx Intelligent Tiering, we offer virtually unlimited storage capacity, which means that the storage capacity grows and shrinks automatically based on your usage. You don’t have to worry about it. You never run out of storage, and you never have to pay for storage capacity that you’re not using. The second point is intelligent tiering between storage tiers.
We keep all your active data on your fast SSD and automatically tier your less frequently accessed data to the colder, low-cost tiers, and we manage all of this. You don’t have to worry about it; we do it intelligently based on your access patterns. Finally, the economics are quite compelling too. For the colder tier, as an example, you pay 0.5 cents per gigabyte per month. To put this in perspective, the overall solution works out to be 34% more price performant compared to HDD solutions on premises.
Delivering Maximum Speed: Throughput, Metadata Performance, and Per-Instance Optimization
Now, let’s talk about speed. These are the performance levels we deliver in production today. FSx for Lustre is the fastest storage for GPUs in the cloud, and it delivers the lowest latencies in the cloud. These performance levels are what keep your compute fully utilized. Let me walk you through how FSx for Lustre delivers this performance so that you can customize your file system based on your workload requirements. You can tune each of the dimensions shown here independently, and that’s one of the ways you can make it even more price performant.
Here’s a simplified view. On your left, you’ve got your compute instance that’s running your workload, and on the right, you’ve got your storage, your file system. The file system has two kinds of servers: the metadata server and the storage server. Your metadata server handles all your metadata requests. Think about creating files, listing directories, managing permissions; your metadata server takes care of all of that. Then we have the storage server, called the object storage server in Lustre terms, which handles your actual data reads and writes to and from the file system. Both of these servers are backed by storage disks, and your compute instance talks to the metadata server and the storage server over the network.
Now I’m going to show you how each of these components comes into play when we talk about performance and how you can change those based on your workload requirements. Let’s start with throughput. When you’re using multiple client instances, you’re running thousands of training processes, simulation jobs, and they’re all accessing the same file system simultaneously. This means that you need more throughput, and by throughput I mean how many bytes per second you can read or write to the file system.
To deliver this throughput, FSx for Lustre file system basically scales out with multiple storage servers working in parallel to serve your requests. This parallel distribution across hundreds of storage servers is how we deliver over 1 terabyte per second of aggregate throughput. When all of your compute instances are talking to the file system, your requests are going to not one but multiple servers in parallel, and you’re getting the power of those multiple servers every time. That’s how you get over 1 terabyte per second of aggregate throughput. This enables fast parallel data access to hundreds of thousands of cores.
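To make this concrete, here is a minimal boto3 sketch, with placeholder values, that provisions a Persistent-2 FSx for Lustre file system. Aggregate throughput scales with your storage capacity times the per-unit throughput you choose, and metadata IOPS (covered next) can be provisioned independently of both; treat this as an illustrative sketch rather than a sizing recommendation.

```python
import boto3

fsx = boto3.client("fsx", region_name="us-east-1")

# Hypothetical sizing: ~48 TiB of SSD at 500 MB/s per TiB gives roughly
# 24 GB/s of aggregate throughput; metadata IOPS are dialed separately.
response = fsx.create_file_system(
    FileSystemType="LUSTRE",
    StorageCapacity=48000,                    # GiB of SSD storage
    SubnetIds=["subnet-0123456789abcdef0"],   # placeholder subnet
    LustreConfiguration={
        "DeploymentType": "PERSISTENT_2",
        "PerUnitStorageThroughput": 500,      # MB/s per TiB of storage
        "MetadataConfiguration": {            # metadata servers scale independently
            "Mode": "USER_PROVISIONED",
            "Iops": 12000,
        },
    },
)
print(response["FileSystem"]["FileSystemId"])
```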
Now, most high-performance workloads are heavy on throughput and relatively light on metadata, but recently we started noticing that customers were using FSx for Lustre for more metadata-intensive workloads. You can think about home directories, user research workstations, and interactive applications where you’re constantly listing directories, opening files, checking permissions. Last year we launched a capability to scale out your metadata servers very similar to your storage servers. Your metadata servers can be scaled out independent of your storage servers and your storage capacity.
This gives you up to 15 times higher metadata IOPS compared to before, when you could not scale out. We’re further doubling down to make sure that you don’t run into metadata bottlenecks: recently, we made an update to the Lustre software which enables 5 times higher directory listing performance. Now I’ve walked you through aggregate throughput and the metadata IOPS of the file system as a whole. Next I want to talk about what happens when you have a single compute instance interacting with the file system. Why am I even talking about a single compute instance?
We have seen that EC2 instances, and compute instances across the board, have dramatically improved their network bandwidth. We’ve gone from 25 gigabits per second to 200, and now to 3,200 gigabits per second with P5 instances. All of this network performance is great, but it doesn’t help if you cannot leverage it when talking to storage.
A traditional file system generally talks to your compute instances over a TCP network and only interacts with one NIC on the compute instance, which was true for FSx for Lustre as well two years back. When you’re only using one NIC, your throughput gets limited to 100 gigabits per second, which means you’re leaving a lot of throughput on the table. So we launched EFA support. EFA is the Elastic Fabric Adapter, which is built on Amazon’s SRD protocol. It lets the file system communicate with the compute instance across multiple NICs, and it bypasses the operating system’s network stack completely.
That helps because you’re no longer spending CPU cycles managing which NIC to communicate over, and it’s much faster; it can scale up to 700 gigabits per second. Knowing the power of GPUs, we also wanted to make sure we’re utilizing them at the most optimal levels, so we support NVIDIA GPUDirect Storage as well. How many of you have heard about NVIDIA GPUDirect Storage? All right, we’ve got a lot of storage folks here. I love that.
With GPUDirect Storage, the file system can bypass CPU memory and transfer data directly into GPU memory. Think about the general data path going from your storage to your CPU memory to your GPU memory; that’s multiple hops that need to be managed. GPUDirect allows the file system to communicate directly with your GPU memory, giving you a throughput of 3,200 gigabits per second per client. So 12x higher.
Achieving Sub-Millisecond Latencies and Real-World Results with Shell
All right, finally, let’s talk about latencies. When we built FSx for Lustre, we architected it to offer the lowest possible latencies on AWS. By latency, I mean how long a small operation takes to complete, so it affects the overall responsiveness of your storage solution. A lot of workloads we work with involve large compute clusters processing data while, at the same time, end users or researchers are also manipulating that data interactively.
For these human-in-the-loop use cases, having fast responsive storage really matters because it should not feel like you’re sitting for seconds or minutes just to wait for your file to open or do an operation. That really matters for a researcher. Now how do we get sub-millisecond latencies beyond just having an SSD disk for your active training data? Three things.
First of all, FSx for Lustre is a zonal storage offering, so you can keep your file system in the same Availability Zone as your compute. That means that, at a very basic level, the distance the data has to travel is minimized, and that helps improve latencies. Second, we do point-to-point communication, which is a single network round trip: your client can talk directly to the file system server without multiple network hops.
What you would have seen with most storage solutions is there are multiple network hops through the server or there are load balancers. While they’re there for very good reasons, they do have an impact on the latency of the storage offering. The third one is client-side read and write caching. So with FSx for Lustre, I like to say that we consider the client as part of the file system.
Anytime you read and write, if you’re reading and writing similar files over and over again, FSx for Lustre can cache those on your client. In that case you’re basically having no network hops at all and even lower latencies. So FSx for Lustre allows you to get about zero to one network hops and sub-millisecond latencies.
I’ve thrown a lot of theoretical information at you, and now I want to share how this happens, how this works in production. Shell is a great example. Shell had a GPU-based on-premises environment where they were running into infrastructure bottlenecks, so they decided to burst into the cloud with FSx for Lustre and EC2.
They wanted to leverage the ability to scale compute and storage up and down based on their user requirements. This helped them increase their GPU utilization from under 90% to 100% in the cloud. So with FSx for Lustre, Shell was able to run its compute resources at full throttle. And that 10 to 11 percentage point difference might not seem like a lot, but those of you who are using P5 instances know that it adds up really fast.
Amazon S3 Express One Zone: Optimizing Latency and Transactions for Data Lake Workloads
And that concludes the file system portion of the presentation. Now I want to hand over to Manish to share the options and optimizations for S3 data lake customers. Thanks, Aditi. Where are my data lake customers? Great. How many of you have heard of Express One Zone? Awesome. A few.
So we’ll talk about how you reduce storage bottlenecks with cloud object storage. Many of you already have petabytes of data stored in S3 data lakes, because you love the durability, the scalability, and the cost effectiveness. That is why S3 provides purpose-built storage classes: S3 Standard for your frequently accessed data, Glacier for your long-term archival, and Intelligent-Tiering when you have changing data access patterns. This makes it very easy for you to source your data directly from S3.
But here’s the critical point: your storage must keep up with your compute, and as Aditi mentioned earlier, compute is your most expensive resource. When storage cannot keep up, you waste money on idle CPUs or GPUs. With that said, I want you to focus your attention on one particular storage class on the left: Amazon S3 Express One Zone. This is our fastest cloud object storage, built specifically for performance-critical applications. It gives you 10 times faster access and 80% lower request costs than S3 Standard. Now, let’s look at what makes it different.
First, Express One Zone is a single Availability Zone storage class, and we built it specifically to deliver consistent single-digit millisecond access for your most frequently accessed data. Now it instantly scales to hundreds of thousands of transactions per second with a new bucket type called directory buckets. Now for the rest of the talk we will focus on S3 Express One Zone, but some of the techniques that we talk about will apply to S3 Standard as well.
All right, so before we dive into optimization techniques, let’s think about a mental model that we will use. So when your client sends a request to S3, there are two parts to the request. The part on the left that’s in white is called overhead or time to first byte. This is where your client sets up the connection, sets up the authentication. There’s the round trip time. S3 is locating where your data is and then serving it up. No data is being transferred during this time. And then there’s the second part in pink. This is where the actual data transfer happens. We’re going to use this mental model for how we can evaluate your workloads.
Now there are two workloads on the screen right now, the top and bottom. They have the same overhead. This is the part in white. But notice the difference in the data payload. On the top, you will see a much smaller data payload. When your transfers look like this, your workload is latency sensitive. So think of key-value caching, your shuffle and sorting workloads, your real-time inference serving, and this is where your overhead matters a lot more. And on the bottom you’ll see a larger data payload compared to the overhead. Now when your transfers look like this, you care about the total transfer time or your throughput.
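To make this mental model concrete, here is a small Python sketch (bucket and key are placeholders) that separates the time spent before the first byte arrives from the time spent transferring the payload for a single GET.

```python
import time
import boto3

s3 = boto3.client("s3")

start = time.perf_counter()
# get_object returns once the response headers and the start of the body
# are available, so this roughly captures the "overhead" portion:
# connection setup, authentication, and request routing.
resp = s3.get_object(Bucket="example-bucket", Key="example/object.bin")
time_to_first_byte = time.perf_counter() - start

# Reading the body is the actual data transfer (the "payload" portion).
payload = resp["Body"].read()
total_time = time.perf_counter() - start

print(f"overhead ~{time_to_first_byte * 1000:.1f} ms, "
      f"total {total_time * 1000:.1f} ms for {len(payload)} bytes")
```

If the first number dominates, your workload is latency sensitive; if the second dominates, you should be optimizing for throughput.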
So now that we have a mental model, let’s talk about how we can optimize for different parameters, different performance parameters. Let’s start with latency. And we’re going to talk about two different techniques that you can use with S3 Express One Zone. The first one is co-locating your compute instances with S3 Express One Zone directory buckets. When you do that, your data travels shorter distances. You have much fewer network hops, and what that lets you do is reduce the latency that you have to the storage.
Now, co-locating might not always be possible. In these situations, you can still take advantage of the latency optimization, but you might pay a little bit of a penalty when you go across availability zones. So if your compute is in an adjacent availability zone from your directory bucket, there will be some impact.
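As a sketch of what co-location looks like in practice (bucket name and Availability Zone ID are hypothetical), a directory bucket is created in a specific Availability Zone and carries that zone’s ID in its name, so you can place it in the same zone as your compute:

```python
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

# Directory bucket names embed the AZ ID and end with "--x-s3".
az_id = "use1-az4"  # the AZ ID where your compute instances run
bucket_name = f"my-training-cache--{az_id}--x-s3"

s3.create_bucket(
    Bucket=bucket_name,
    CreateBucketConfiguration={
        "Location": {"Type": "AvailabilityZone", "Name": az_id},
        "Bucket": {"Type": "Directory", "DataRedundancy": "SingleAvailabilityZone"},
    },
)
```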
The second technique is session-based authentication. This is a new technique that we introduced with directory buckets. Traditionally with S3 standard buckets, what you would use is IAM authentication. With S3 directory buckets, you would use something called session-based authentication. You would call an API called CreateSession, which will give you a token. Your application will cache this for a period of time and reuse that so that you don’t incur a latency hit when you are authorizing with a directory bucket.
Now we understand that not every application can change overnight, so you can still start with IAM authentication and then move over to session authentication when you want to optimize for latency. And whenever you’re diagnosing high latency, we always ask customers to look under the hood in their libraries and make sure that session authentication is enabled.
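Under the hood, session-based authentication is a single API call. The sketch below (bucket name is a placeholder) shows its shape with boto3; in practice the AWS SDKs create, cache, and refresh these sessions for you automatically when you talk to directory buckets.

```python
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

# CreateSession returns short-lived credentials scoped to one directory
# bucket; clients cache and reuse them instead of authorizing every
# request individually, which removes that latency from the request path.
session = s3.create_session(Bucket="my-training-cache--use1-az4--x-s3")
print(session["Credentials"]["Expiration"])
```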
Okay, so latency often goes hand in hand with transactions per second. A lot of the workloads that need low latency are also request-intensive; think of large-scale analytics or shuffle and sort. With S3 general purpose buckets, if you’re familiar with how to optimize for higher TPS, you would do that through prefix management. S3 scales per prefix, adding TPS capacity when you have sustained load, but that scale-up takes time.
With Express One Zone, you don’t need to do any of that. We handle the partitioning under the hood. And for each directory bucket, we’ll give you 200,000 reads per second out of the box. And of course, we’re continually innovating on your behalf, so this year we added optimizations to go up to 2 million transactions per second.
Now, a couple of quick tips when you’re working with directory buckets. First, keep your directories dense. What does that mean? Flatten your directory structure and put more objects in the leaf directories. Second, don’t add entropy. We typically advise you to add entropy in your prefixes for S3 standard buckets; here, that will actually slow you down. All right, let me share with you how this really works in practice.
Tavily is an AI infrastructure company, and they’re building a web access layer for agents and large language models. They managed a two-tier caching system, and their costs were rising. They would have had to double their investment for their hot cache layer. Instead, they chose S3 Express One Zone because it delivered the low latency and the cost efficiency they were looking for. This is for their hot cache layer. In addition, it just scaled automatically as their data grew.
Maximizing Throughput with Parallelization and Multi-NIC Support
You can see the results. They were able to cut costs by six times while improving performance and reliability across millions of user requests. So this is latency and transactions per second in play. Now let’s talk about high throughput.
Many of you are running workloads like large-scale ML training or research workloads. This is where data loading and checkpointing becomes important. You need massive throughput so that you keep your compute fully utilized. So one of the techniques we’re going to talk about is parallelization.
Now, since S3 scales horizontally, the way to increase throughput is by breaking up your requests across multiple connections. In this example over here, you can see that each connection can achieve 100 megabytes per second. Now, if you need gigabytes per second or terabytes per second, you open up multiple connections. So how do you accomplish this with APIs? When you’re downloading, use byte range gets. This is where you will download parts of the object, and then you’ll have to assemble that on your compute node.
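Here is a minimal sketch of that download pattern, with placeholder bucket, key, and part size: split the object into byte ranges, fetch the ranges over parallel connections, and reassemble them in order on the compute node.

```python
from concurrent.futures import ThreadPoolExecutor
import boto3

s3 = boto3.client("s3")
BUCKET, KEY = "example-bucket", "datasets/shard-0001.bin"
PART_SIZE = 8 * 1024 * 1024  # 8 MiB per ranged GET; tune for your workload

size = s3.head_object(Bucket=BUCKET, Key=KEY)["ContentLength"]
ranges = [(start, min(start + PART_SIZE, size) - 1)
          for start in range(0, size, PART_SIZE)]

def fetch(byte_range):
    first, last = byte_range
    resp = s3.get_object(Bucket=BUCKET, Key=KEY, Range=f"bytes={first}-{last}")
    return resp["Body"].read()

# Each ranged GET uses its own connection, so throughput scales with the
# number of parallel workers until you saturate the instance's network.
with ThreadPoolExecutor(max_workers=16) as pool:
    data = b"".join(pool.map(fetch, ranges))
```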
If you want to upload, you use a similar technique called multipart uploads, where you would upload individual parts and S3 will assemble that for you. All right, so I’m very excited about this.
This example was mentioned by Andy in the keynote yesterday as well. Meta FAIR is a research organization that runs research workloads, including LLM training, across thousands of GPUs. They wanted to speed up their checkpointing and data loading, so they used S3 Express One Zone. We scaled up to 60 petabytes of data in the single Availability Zone where their GPU clusters were located. They were able to sustain 140 terabits per second with over 1 million transactions per second. Isn’t that neat?
Let’s talk about the client-side optimization. We saw that some customers were not able to fully utilize the compute resources using the techniques that we just discussed. So we built the AWS Common Runtime, or CRT. Has anybody heard of CRT over here? Oh, not many people. This is great. Let’s talk about it, right? It’s a set of open source libraries, and what we did was we embedded these libraries right into all our SDKs, all our clients, all our connectors, so you get performance improvements with S3 right out of the box.
The great thing is you don’t have to change much. As an example, CRT delivers up to six times faster data transfer with the CLI using some of the techniques we just discussed. Just keep in mind that for certain larger instances we enable CRT by default; if you’re using other instances, please look under the hood, because you might have to opt into CRT to take advantage of some of these techniques.
Similar to how Aditi spoke about multiple NICs and how you can increase the per-instance throughput, we have a similar feature in AWS CRT. Instances such as P5 and P6 can achieve 800 gigabits per second theoretical bandwidth over their ENA network interfaces. This CRT feature allows you to distribute your connections, your S3 connections, over all of these network interfaces. Now this is particularly important if you’re using EFA for your other compute-intensive workloads. It prevents EFA slowdowns, and your compute and storage traffic can both run optimally if you spread the load.
How do you do that? It’s as simple as setting the config and picking the network interfaces that you need using CRT. Let’s look at some sample results. We ran tests on a dl2q.24xlarge instance. We downloaded 1,030 gigabyte files. When we added a second NIC, we were able to almost double the throughput. With four NICs, we got about 2.5 times the performance compared to a single NIC.
Now, the scaling doesn’t look linear, but you will get some meaningful throughput gains when you add multiple NICs. But the more important thing is that you can spread your traffic or even pin a particular NIC to your process. This will help you optimize how you control your workloads when you’re using a much larger instance.
Purpose-Built Integrations: PyTorch Connector, Analytics Accelerator, and SageMaker Fast File Mode
Let’s talk a little bit about integrations. Those were all libraries we spoke about, but we heard from customers that they didn’t want to change their applications; they wanted to take these best practices and use them with their existing frameworks. This is why we developed purpose-built integrations, so that you can get this performance out of S3. We’re only going to talk about three of them, but we have developed a lot more.
The first one is the PyTorch connector. When you use PyTorch, you normally have to build your own data primitives to load data and save checkpoints. We wanted to give you a high-performance integration that takes care of a lot of this, and this is where the S3 Connector for PyTorch comes in. It gives you built-in PyTorch dataset primitives and supports both map-style datasets, for random-access data, and iterable-style datasets, for streaming sequential data.
Now for checkpointing, the connector automatically integrates with the PyTorch distributed checkpointing support. With this, you can checkpoint directly into S3 Express One Zone, and you can bypass the local NVMe. What we found was that you could get up to 40% performance increase by bypassing the local storage and going directly to S3 Express One Zone.
New for 2025, we’ve accelerated partial checkpoint loading by 60%, so this is another example of how we’re innovating on your behalf.
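Here is a hedged sketch of what the connector looks like in training code, assuming the open-source s3torchconnector package and placeholder bucket names; exact argument names may vary between connector versions, so treat this as an outline rather than the definitive API.

```python
import torch
import torch.distributed.checkpoint as dcp
from s3torchconnector import S3MapDataset          # map-style, random access
from s3torchconnector.dcp import S3StorageWriter   # distributed checkpointing to S3

REGION = "us-east-1"
DATA_URI = "s3://my-training-cache--use1-az4--x-s3/images/"
CKPT_URI = "s3://my-training-cache--use1-az4--x-s3/checkpoints/step-100/"

# Each dataset item is a stream over one S3 object; read and decode as needed.
dataset = S3MapDataset.from_prefix(DATA_URI, region=REGION)
first_sample = dataset[0].read()

# Checkpoint directly into S3 Express One Zone, bypassing local NVMe.
state_dict = {"model": torch.nn.Linear(8, 8).state_dict()}
dcp.save(state_dict, storage_writer=S3StorageWriter(REGION, CKPT_URI))
```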
How many people here work with Parquet datasets? Awesome. Quite a few. So you probably already know that your query engine has to read metadata from Parquet files, and this metadata is stored in the footer of each file. It describes the file structure and the schema. When you’re scanning across many of these files, reading that metadata slows down your query execution.
This is where the S3 Analytics Accelerator comes in. It’s an open source library that prefetches metadata using byte-range GETs: it reads the footer and caches it for subsequent reads, which eliminates quite a bit of overhead. We tested this with the TPC-DS benchmark and saw up to 27% performance improvement just by using the Analytics Accelerator.
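The Analytics Accelerator itself is a library you plug into your query engine, but the underlying trick is easy to see with plain byte-range GETs. This hypothetical Python sketch (bucket and key are placeholders) fetches just the Parquet footer, where the schema and row-group metadata live, without downloading the rest of the file; it illustrates the technique, not the accelerator’s actual implementation.

```python
import struct
import boto3

s3 = boto3.client("s3")
BUCKET, KEY = "example-bucket", "warehouse/sales/part-00000.parquet"

# A Parquet file ends with: [footer metadata][4-byte footer length]["PAR1"].
tail = s3.get_object(Bucket=BUCKET, Key=KEY, Range="bytes=-8")["Body"].read()
assert tail[4:] == b"PAR1", "not a Parquet file"
footer_len = struct.unpack("<I", tail[:4])[0]

# One more ranged GET pulls only the footer; caching it for subsequent
# queries avoids re-reading it on every scan.
footer = s3.get_object(
    Bucket=BUCKET, Key=KEY, Range=f"bytes=-{footer_len + 8}"
)["Body"].read()[:footer_len]
print(f"footer is {footer_len} bytes; the data pages were never downloaded")
```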
Another integration I want to talk about is SageMaker. If you’re not familiar with Amazon SageMaker, it’s our fully managed service for building, training, and deploying machine learning models. When you’re running training jobs, you need to load your training data, often millions of small files such as images or text samples, and that can generate millions of requests per second to storage.
Now this is where our S3 Express One Zone integration with SageMaker Fast File Mode comes in. What Fast File Mode will do is it will stream data directly from Express One Zone into your training instance, bypass the local disk, and accelerate your training performance. Now since Express One Zone scales to hundreds of thousands of transactions per second, you should be able to see meaningful performance increases with this approach.
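A minimal sketch of turning this on with the SageMaker Python SDK, using a hypothetical directory bucket and placeholder training script and role; the only change from the default is the input mode.

```python
from sagemaker.inputs import TrainingInput
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",                                 # hypothetical script
    role="arn:aws:iam::111122223333:role/SageMakerRole",    # placeholder role
    instance_type="ml.p4d.24xlarge",
    instance_count=1,
    framework_version="2.3",
    py_version="py311",
)

# FastFile streams objects from S3 on demand instead of copying the whole
# dataset to local disk before training starts.
train_input = TrainingInput(
    s3_data="s3://my-training-cache--use1-az4--x-s3/images/",
    input_mode="FastFile",
)
estimator.fit({"training": train_input})
```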
Bridging File and Object Storage: FSx for Lustre Integration with S3 Buckets
Great. So earlier we covered how you can optimize file interfaces for high-performance workloads, and we went through some optimizations for object storage. But what if you already have a lot of data in your object store and you want to use a file interface to access it? This is where Aditi will show you how you can bridge the two.
Thanks, Manish. All right, I’m only going to speak for five more minutes, and this is my favorite part of the presentation, so listen up. At the beginning of my presentation I shared a lot about high-performance Lustre file systems, fully managed and fully elastic. If any of that resonated with you but you felt, "oh, my data is already stored in an S3 data lake," then don’t worry, we’ve got you: you have the option to connect your file system to your S3 bucket. This is a general paradigm we use within AWS; we want to meet you where you are and minimize the work you have to do to move to the cloud or run your applications in the cloud.
Let me show you how this works. With FSx for Lustre, you can create a file system and link it to an S3 bucket. Under the covers, this is your S3 bucket with all the objects, all the data, stored in it. When you connect your file system to the S3 bucket, the file system immediately takes all the metadata from your S3 bucket and presents it as file metadata, so to you it looks like the data is actually on the file system. When you then read and write data from your compute instance, FSx for Lustre automatically fetches that data from S3.
Another interesting point is that we take care of bidirectional synchronization. Any changes you make to the file system, we export to the S3 bucket, and any changes you make to the S3 bucket, we import into the file system, completely hands off for you and fully managed on the back end. You can think of this as a file system cache in front of your S3 bucket.
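A hedged boto3 sketch of that link, with placeholder file system ID, path, and bucket: a data repository association maps a directory in the file system to an S3 prefix, and the import and export policies provide the bidirectional synchronization described above.

```python
import boto3

fsx = boto3.client("fsx", region_name="us-east-1")

fsx.create_data_repository_association(
    FileSystemId="fs-0123456789abcdef0",        # placeholder FSx for Lustre file system
    FileSystemPath="/training-data",            # directory inside the file system
    DataRepositoryPath="s3://my-ml-datalake/training-data/",
    S3={
        # Import: changes made in the bucket appear in the file system.
        "AutoImportPolicy": {"Events": ["NEW", "CHANGED", "DELETED"]},
        # Export: changes made in the file system flow back to the bucket.
        "AutoExportPolicy": {"Events": ["NEW", "CHANGED", "DELETED"]},
    },
    BatchImportMetaDataOnCreate=True,           # surface S3 metadata as file metadata up front
)
```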
Let’s look at performance when a file system sits in front of S3. We took a Kaggle liver dataset stored as a public S3 dataset and measured the time to do patient classification with different setups. The first setup is where you have your data stored in S3, load it to your local storage, and then train against that local storage.
We then ran the same test using FSx for Lustre as a cache between the compute instance and the S3 bucket. In the first run of the model, we saw a 67% improvement in performance, which came from two things. First, the data loading time, shown in green: with the local-copy setup, your compute instance has to copy the data and wait for it to be completely loaded before it can start training, whereas with FSx for Lustre you spin up your instance and start training immediately, so you save the data loading time. Second is the pure performance a fast file system delivers. Together those lead to the cumulative 67% impact.
Now, this was the first run, where FSx for Lustre was still going to S3 to fetch the data for you. From the second run onwards, performance improved by 83%, because the data is already cached on the file system. Finally, one last customer example to bring all of this together: LG AI Research. This AI research lab within LG wanted to create a foundation model that mimics the human brain, so they decided to build their own model using Amazon SageMaker and FSx for Lustre.
On the left, you see that they store machine learning training data in their S3 bucket. For performance and a file interface, they created a Lustre file system linked to that S3 bucket, and their SageMaker training jobs access the data directly through the file system. When workloads finish, model artifacts get written back to S3 for long-term storage and for inference. So they use S3 for longer-term storage and FSx for Lustre for their hottest, highest-performance data, accelerating their training runtime and getting the best performance out of their instances.
That concludes the presentation. We tried to cover high-performance storage options depending on where you’re starting from: if you’re looking to lift and shift from on-premises, or you prefer a file system interface, that’s FSx for Lustre; if you have an S3-native application, you can use S3 Express One Zone; and if you prefer a file system interface but your data is already stored in S3, you can link your file system to the S3 bucket.
Thank you all for attending. If you haven’t already, please complete the survey. You should get a notification in your app. Manish, Mark, and I will stick around to answer any questions you have on anything related to high-performance storage. Right, thank you.
; This article is entirely auto-generated using Amazon Bedrock.