🦄 Making great presentations more accessible. This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.
Overview
📖 AWS re:Invent 2025 - Revealing the northern lights: Amazon Aurora security deep dive (DAT456)
In this video, Eric Brandwine and Andy from AWS explain how Aurora protects against database vulnerabilities through defense-in-depth architecture. They detail how Aurora separates compute and storage layers, treating single-tenant head nodes as potentially compromised while protecting multi-tenant storage and control planes. When researchers discovered a Postgres zero-day exploit using PL/Perl and PL/Rust, Aurora’s layered defenses—including SELinux, Chronicle telemetry, and VPC flow logs—detected and contained the attack without prior knowledge of the vulnerability. The presentation emphasizes that Aurora doesn’t treat database engines as security containers, instead relying on multiple detection layers, red team testing, and continuous monitoring. This approach allows customers to run older database versions without compromising service security, demonstrating how AWS works backwards from customer needs while maintaining robust protection against sophisticated threats.
This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
Introduction: Security in Aurora, a Cloud-Native Database Service
My name is Eric Brandwine, and I’m a Vice President and Distinguished Engineer with the Amazon Security team. I’m here with Andy, who is a Principal Security Engineer with the AWS databases team. This is a common occurrence at AWS. We have someone on the security team and we have someone in the business, and they’re working together on something. This is one of the ways that we make sure that we actually deliver security for our customers. Today we’re going to pop the hood on Aurora and look at a particular aspect of security in a cloud-scale distributed database service.
So there are a ton of database services out there because there are many ways to build a database service. If you were building a cloud from scratch, you’d probably take an existing database, or maybe six of them, and you’d throw it on an EC2 instance and offer it to customers. This is the obvious service to build.
And so this is RDS, the Relational Database Service. If you want to run Postgres or MariaDB or MySQL or SQL Server or Oracle or DB2, we offer it as a service. It is the database that you know and love, or perhaps the database that you know and hate, running in the cloud as a managed service. This is awesome. It turns out that we’re taking a huge chunk of the cost of owning one of these machines, and we’re taking that on for our customers. The service provides a ton of value, but we’re constrained in what we can do. We are taking what is essentially a single host database and we’re offering it as a service, so we’re going to continue to offer something that is a single host database. Maybe you’ve got replicas and things like that, but now you’ve got multiple single host databases. This is a thing that was fundamentally designed to run on a piece of sheet metal in a data center.
On the other side of the spectrum, you’ve got something like DynamoDB. DynamoDB is awesome. It is built for the cloud. It is this huge, scalable, serverless thing. You can scale up, you can scale down. It doesn’t scale vertically. It’s inherently multi-AZ, so you get better availability, you get better data durability, all of these magnificent benefits. There’s so much more benefit that we can provide to our customers because of the nature of this database service. But it’s DynamoDB. It’s not a SQL interface. You have to build to DynamoDB. And so customers love this service. It’s been very successful. But you’ve seen enough of these PowerPoint slides before. You know that there needs to be something in the middle of that spectrum.
And so this is Aurora. It’s a cloud-native database. We’ve been able to do all sorts of interesting things: elastic storage, multi-AZ availability, increased durability. But we’ve managed to maintain compatibility with existing engines like MySQL and Postgres. And in the case of Aurora, it’s only open source engines because of the way that we built the service.
Aurora’s Architecture: Splitting the Database into Head Nodes and Storage
So let’s look at this. To build Aurora effectively, we took the database and we sawed it in half. And so as a customer, the node that you’re interacting with, the thing that you’re interacting with, is the head node. And it’s running actual MySQL or Postgres code. So the query planner, the query parser, anything that needs to be done in a single memory space like joins, all of that is happening here on this node. But the bottom half of the database has been completely replaced. And so that’s allowed us to make different design choices to provide the availability, performance, and durability that we knew we could get out of a multi-tenant cloud service. And so you have at least one head node, but there can be many of them, and they can be spread across availability zones. And because this isn’t where the storage lives, these things aren’t stateful. When a transaction is committed, this node is no longer important to the availability of your data.
And so inside of this box, it’s a standard EC2 instance, so it’s going to have an Elastic Network Interface, an ENI. And you can connect this to your VPC. This is how you interact with the Aurora service. It appears in your VPC. You get flow logs, you get complete control over the security groups on this ENI. It’s a part of your network.
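Because that ENI sits in your VPC, the customer-side controls are the standard EC2/VPC ones. As a hedged illustration of what "you get flow logs, you get complete control over the security groups" looks like in practice (the ENI ID, bucket, and values below are hypothetical, not from the talk):

```python
# Illustrative only: the ENI ID and S3 bucket are hypothetical placeholders.
import boto3

ec2 = boto3.client("ec2")

# Capture all traffic on the Aurora head node's customer-facing ENI.
ec2.create_flow_logs(
    ResourceIds=["eni-0123456789abcdef0"],
    ResourceType="NetworkInterface",
    TrafficType="ALL",
    LogDestinationType="s3",
    LogDestination="arn:aws:s3:::example-flow-log-bucket/aurora/",
)

# Review which security groups (and therefore which rules) govern that ENI.
eni = ec2.describe_network_interfaces(NetworkInterfaceIds=["eni-0123456789abcdef0"])
for group in eni["NetworkInterfaces"][0]["Groups"]:
    print(group["GroupId"], group["GroupName"])
```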
And so that’s great. That’s how you interact with it. But we need to be able to manage this. We need to be able to send traffic in and out of this instance. So there’s a second ENI connected to the Aurora network. And on this network, we also have the multi-tenant, multi-AZ storage system, and it’s a fascinating system. There’s many, many talks about this storage system. It is designed specifically for this workload. It is the Aurora backend. It’s not something else.
Aurora offers lowered management overhead, increased performance, and increased durability. It’s a wonderful system, but that’s not this talk. You’ve all launched EC2 instances, at least I hope you have, but none of them suddenly started acting like Aurora head nodes. That’s because we have software that we run on there. And so this is the top half of that database. It’s running a chunk of the database engine, and that’s what you interact with. That’s where you’re sending your queries, that’s where they’re being planned out and executed.
Also pretty obviously, we need to have a component that is on the host that we use to manage that host. And this is one of the places where the clever naming schemes failed us. So the Host Manager component is called Host Manager. And the Host Manager is what talks to the control plane for the service. And so the storage system and the control plane are multi-tenant. So now that we’ve got this shared understanding of how we built Aurora, let’s talk about the threat model.
Threat Model: Why the Database Engine Cannot Be a Security Container
So we’ve got the multi-tenant storage system, and then we’ve got a customer VPC and it’s talking to their Aurora head nodes. And of course we have more than one customer. We have many more than three, but three was enough to bring PowerPoint to its knees. And so all three of these, all through these head nodes, and many more, talk to the storage backend, and they also talk to the control plane. So this is the world that we’re living in.
Our customers interact directly with the database engine. These are standard database engines. They were written by people outside of Amazon, and they were written to do database things and not security things. Fundamentally, a database is designed to be owned by someone, hosting data for that person, executing queries for that person. It’s a single party system. And now, we’ve got a database that was written by one company, being hosted by another company, and being offered as a service to a third company. This is not where databases were designed. And so it is a mistake to treat a database as a security container.
So we do have the source code to these. That’s how we sawed it in half. But we don’t want to have a whole raft of proprietary patches that we’re carrying forward. And our customers want new features as soon as they’re released. There’s a new Postgres release. Our customers want those features immediately. We have to get those releases out to them, and the more burdensome the patching process is, the slower we’re going to move.
And our customers want deep, rich interfaces. They don’t just want a little tube through which they can cram SQL queries. They want to be able to manage their data, they want to be able to extract value from it. They want to be able to run complex functionality across that data. And so we need to supply those interfaces to our customers. And database upgrades are often painful. They’re often breaking changes, they require downtime, no one likes downtime.
And so many of our customers want to run an older version of an engine. They don’t want to upgrade. The business isn’t ready for it, it’s their peak season. Maybe they’re having a conference in Las Vegas and they have a change control window that’s closed. Whatever, they don’t want to upgrade. And so we don’t want to drive customers’ upgrades on our schedule. And let’s face it, in many cases it’s not on our schedule, it’s on the vendor’s schedule, because they’re going to release patches when they release patches, not when it’s convenient for us, and certainly not when it’s convenient for our customers.
And so as a result, we do not treat the database container as a security container. We can’t. And so we take security at this layer seriously, we do all sorts of patching, we keep up, but fundamentally, we made the decision that we can’t rely on this layer. That’s okay though. The head node is single tenant. It’s not only dedicated to a single customer, it’s dedicated to a single database for that single customer. But the rest of this infrastructure, the storage system and the control plane, are multi-tenant.
And so the control plane includes the API endpoint that all of our customers interact with. This is how you allocate databases and scale them up and things like that, and also the API endpoints that Host Manager interacts with. And so the storage network holds all of the Aurora storage. These are all of the bits for all of the customer databases in that region. This is the foundation of our threat model.
The Postgres Zero-Day Attack: How Aurora Withstood a Novel Exploit
When we’re thinking about threats to the service, we consider the head nodes to be sacrificial. We don’t want people to gain access to the underlying host, but we assume that they can, and we have to prevent them from moving past this host. They can’t gain access to shared storage or to the control plane. And so this year, in I think this very hotel, at the DEF CON security conference, Tal Peleg and Coby Abrams from Varonis presented research that they’d done on the Postgres database. This was a novel attack. This was a zero day. This was a new vulnerability that they’d discovered in the engine.
PL/Perl is an extension to the database. It allows you to add new functionality to the database in the Perl programming language. This is an example of one of those deep rich interfaces that our customers love to have access to. Well, great, it’s not a predicate for a query, it’s not some simple constrained thing that you can test deeply. It is a complete code execution engine. So if there’s going to be a bug in the database, this is the kind of place I would expect there to be a security bug in a database.
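To make concrete how deep that interface goes, here is a minimal sketch (not from the talk) of defining and calling a PL/Perl function over an ordinary client connection. The connection parameters and function name are hypothetical; the function body is the classic example from the Postgres documentation.

```python
# Illustrative sketch only: shows why PL/Perl is a full code-execution
# surface rather than a constrained query predicate. Connection details
# and the function name are hypothetical.
import psycopg2

conn = psycopg2.connect(
    host="example-cluster.cluster-xyz.us-east-1.rds.amazonaws.com",
    dbname="appdb", user="admin", password="REPLACE_ME")
conn.autocommit = True
cur = conn.cursor()

# PL/Perl must be enabled for the database (requires sufficient privileges).
cur.execute("CREATE EXTENSION IF NOT EXISTS plperl;")

# Arbitrary Perl runs inside the database server process.
cur.execute("""
CREATE OR REPLACE FUNCTION perl_max(integer, integer) RETURNS integer AS $$
    my ($x, $y) = @_;
    return $x > $y ? $x : $y;
$$ LANGUAGE plperl;
""")

cur.execute("SELECT perl_max(3, 7);")
print(cur.fetchone()[0])  # 7
```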
So, Tal and Coby have found a new way to elevate privileges in Postgres, and they attempted this technique against all of the cloud providers that offer Postgres as a service. This is a zero day, previously unknown, no patches available, because no one at Postgres even knew that this existed. We’re running Postgres, so when they tried this on Aurora, it worked. It’s the Postgres front end. They have a bug in the Postgres front end, of course it worked.
But this is what they wound up presenting about AWS and Aurora on stage. The novel attack worked on the database, but it was ineffective against Aurora, despite the fact that this was a previously unknown technique. We stopped further pivoting or exploitation, even though we’d never seen this attack before, and we got some nice kudos from the researchers, which was delightful.
How did we get there? The answer is not flashy. There’s no smoke and mirrors here. We started with a belief that taking a database designed to be installed and operated by its owner and trying to treat it as a security container as part of a multi-tenant service was not feasible. It was just a non-starter. And so we have defense in depth, multiple layers of detections, automation, and this has all been built up over years of investment. We keep testing what we’ve built and improving it. It’s not a flashy answer, but it’s one that we’re, I think, justifiably proud of.
Timeline of the Security Event: Detection, Response, and Decision-Making
So let’s see how this event unfolded. First, the researchers installed PL/Rust on their instance and rebooted it in order to start experimenting. And within about an hour, they’d gotten their exploit working and we started getting alarms from SELinux, which Andy will tell you about. Now, this is new functionality in the database. And these alarms aren’t finely tuned yet. And so our initial response, we weren’t sure if this was valid PL/SQL behavior running into one of our SELinux policies, or if this was some nefarious activity. And so these alarms went off, but they paged the service team. They didn’t page security.
However, as the researchers continued to move around on the box, they tripped another tripwire in a system that we call Chronicle. And these are mature alarms. These paged the security team immediately, and so that brought the security team into the mix. And so this was not a new account. This was not a fraudulent account. This was a mature customer, they had a long history. They were paying their bills. And this instance was not a newly launched instance. It wasn’t tagged research or anything like that. This is not a researcher that had coordinated with us ahead of time.
And so we’re looking at this as a valid customer instance. We’re watching it carefully, we know what’s going on. We know they haven’t pivoted, and so we engaged legal. And we talked about our path forward. We made an intentional decision here. And so had we had any evidence that they were moving past that head node, that they were successfully pivoting, we would have taken immediate action. It turns out that this customer had a TAM, a technical account manager, and so we reached out to the customer via the TAM, and we eventually took the decision to basically cut off all network access, which caused the database to go into a storage failure mode.
And so, it’s important to remember that the purpose of this mechanism is to protect Aurora itself. We offer GuardDuty Advanced Threat Protection. And that is a service that customers opt into. And so that service can look at queries, it can look at access patterns, it can look at sources of connections, and it can alert you as to whether or not this behavior is anomalous or not for your database. But this is us acting on our own behalf. We’ve made some deep privacy and security guarantees to our customers. We’ve told them that we will not look at their data.
And so we can’t tell if this activity is anomalous or normal. It might be a DDoS attack, it might be the best day for your business, and we can’t tell the difference here. So this is for us to protect Aurora. It’s always on. It’s an inherent part of the Aurora system. And again, our choice here is leave the database running or stop the database. And had we seen further activity, we would have immediately stopped it. To share some detail on how we achieve this result, I’ll hand off to Andy.
Security Controls in Aurora: SELinux, Chronicle, and VPC Flow Logs
Thanks Eric. So let’s look at some of the security controls in Aurora that helped us get to this outcome that we just described.
We have a number of controls in place. To start, I’m going to talk about three of them: SELinux, Chronicle, and the VPC flow logs. SELinux provides permissions control, Chronicle is our telemetry service, and we tune both of those to be fairly aggressive. We treat everything from the head node as untrusted. One of our red team members has actually done presentations internally within the company for our builders to learn about poison telemetry and other risks so that they harden the operational consoles they use to run these services and perform common activities. So again, that’s an example of working together with security and the engineers who are building the system from the ground up.
So the first control I want to talk about is SELinux, and this is Security-Enhanced Linux. It was written and released by the US National Security Agency in 2000, and it is a kernel security module that enables a number of stronger security controls, like mandatory access control. It allows you to get very granular about resources, permissions, and the processes accessing them. A lot of people complain about SELinux because it takes deep understanding and nuance to configure and tune this complicated system. When you run it on a general-purpose computing platform, it can be tedious and frustrating. By default, it will block a lot of activities that you may have intended to succeed. But in a specific and constrained environment like Aurora, we know exactly what processes are running, which resources they need access to, and which resources they don’t.
So these are two of the SELinux denials that occurred during the research that Coby and Tal were doing. And this is dense, but you can see that the Cargo process, which is the Rust package manager and build tool, is getting denied when it tries to execute code. And you can see that that code was coming from the Postgres engine for RDS and Aurora. So now looking at this head node diagram, you can see where SELinux fits into the head node. And by running the environment with these strong controls, we’re able to detect and enforce resource permissions, process execution, and more. And we study the history of the necessary database activities and we design the rules in concert with our security engineers to prevent inappropriate access.
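To show the shape of what such a denial-based detection can look like, here is a small hypothetical sketch (not AWS tooling) that scans a Linux audit log for SELinux AVC denials and flags processes, such as cargo, that should never be executing in this environment. The log path, parsing, and allowlist are illustrative assumptions.

```python
# Hypothetical sketch: scan audit records for SELinux AVC denials and
# flag unexpected processes. Not AWS's actual detection pipeline; the
# log path, field parsing, and allowlist are illustrative assumptions.
import re

AUDIT_LOG = "/var/log/audit/audit.log"     # typical auditd location
EXPECTED_COMMS = {"postgres", "pg_ctl"}    # processes whose denials are already tuned

AVC_RE = re.compile(r'avc:\s+denied\s+\{ (?P<perm>[^}]+)\}.*?comm="(?P<comm>[^"]+)"')

def unexpected_denials(path=AUDIT_LOG):
    findings = []
    with open(path) as f:
        for line in f:
            m = AVC_RE.search(line)
            if m and m.group("comm") not in EXPECTED_COMMS:
                findings.append((m.group("comm"), m.group("perm").strip()))
    return findings

if __name__ == "__main__":
    for comm, perm in unexpected_denials():
        # e.g. a denial with comm="cargo" would be routed to the on-call here
        print(f"ALERT: unexpected SELinux denial: comm={comm} perm={perm}")
```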
So Chronicle is a telemetry service with an agent running on all of our AWS hosts. It allows us to monitor process execution, specifically execve invocations, kernel module loads and unloads, and more. It transmits these events to a service that both security engineers and fleet owners can query to analyze the data and create alarms. And as we mentioned earlier, general-purpose compute environments like EC2 are going to have a plethora of different possible expected activities. But on a database server, we don’t have a diverse set of services running. We have alarms that have actually triggered because a database engine upgrade changed behavior.
We make strong privacy promises to our customers, and we can’t look at customer content without explicit customer consent. So even the engine log files are redacted when we view them in an incident like this. We basically see that an error occurred and we see a timestamp, but we don’t get the error itself, because that might have customer content in it. But when we get the Chronicle alarm, we can see exactly what’s happening outside of the database, because that’s a process that’s running. So in this case we got a Chronicle alarm for curl. That’s not expected behavior, and we can react very quickly to that.
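Chronicle itself is internal to AWS, but the idea behind that alarm is simple to sketch. Below is a hypothetical, simplified rule over process-execution events that pages when curl runs on a head node outside the parent-process context we expect; the event schema, the expected parents, and the paging function are all assumptions for illustration, not the Chronicle agent or service.

```python
# Hypothetical sketch of an execve-telemetry rule, not Chronicle itself.
# The event schema, expected parents for curl, and paging call are
# illustrative assumptions.
from dataclasses import dataclass

@dataclass
class ExecEvent:
    binary: str        # e.g. "/usr/bin/curl"
    parent: str        # parent process name, e.g. "postgres"
    argv: list
    host_id: str

# curl legitimately exists on the host (the engine shells out to it for
# remote data connectors), but only under known parent processes.
EXPECTED_CURL_PARENTS = {"postgres"}

def page_security_oncall(event: ExecEvent) -> None:
    print(f"PAGE: unexpected curl on {event.host_id}: parent={event.parent} argv={event.argv}")

def evaluate(event: ExecEvent) -> None:
    if event.binary.endswith("/curl") and event.parent not in EXPECTED_CURL_PARENTS:
        page_security_oncall(event)   # high-confidence alarm: pages immediately

# Example: a shell spawned via a nested extension invoking curl trips the rule.
evaluate(ExecEvent("/usr/bin/curl", parent="sh",
                   argv=["curl", "http://example.invalid"], host_id="head-node-123"))
```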
So going back to this, we’ve now added Chronicle, and you’re seeing this broader picture unfold. And I mentioned VPC flow logs. These are network logs that let you choose which interfaces to capture data on and at what granularity, and then track that IP traffic, the connections between resources. Within Aurora, we monitor these to detect unexpected traffic and attempted connections between VPCs. And we can use network isolation to restrict which VPCs can communicate with each other; for example, customer VPCs can’t talk directly to our storage VPC. This combination of security groups, private subnets, and VPCs allows for robust network monitoring and restrictions.
So this is an example of a flow log entry. It’s fairly inscrutable at first, but we’re going to break it down, and I’m going to deemphasize fields like the version number and timestamps so that you can look at the relevant information. This is an account that ends in 010, with a network interface that ends in 789. We’re seeing that traffic is being sent from the source address that ends in 139 to the destination address that ends in 21, on port 22. And finally, that traffic was accepted. So now you know how to interpret a flow log. You can imagine that having massive amounts of this kind of data allows us to analyze and tune our queries and our analytics to get really valuable information out of this traffic.
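To show how those fields line up, here is a short sketch that parses a record in the default VPC flow log format (version, account-id, interface-id, srcaddr, dstaddr, srcport, dstport, protocol, packets, bytes, start, end, action, log-status). The sample values are made up for illustration; they are not the record shown on the slide.

```python
# Parse a record in the default VPC flow log format. The sample record
# uses made-up values; it is not the entry from the slide.
FIELDS = ["version", "account_id", "interface_id", "srcaddr", "dstaddr",
          "srcport", "dstport", "protocol", "packets", "bytes",
          "start", "end", "action", "log_status"]

sample = ("2 123456789010 eni-0a1b2c3d4e5f6789 10.0.1.139 10.0.2.21 "
          "49761 22 6 20 4249 1418530010 1418530070 ACCEPT OK")

record = dict(zip(FIELDS, sample.split()))

print(f"account {record['account_id']} sent traffic from {record['srcaddr']} "
      f"to {record['dstaddr']} on port {record['dstport']}: {record['action']}")
```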
So here’s a slightly more readable view of how that flow log represents the data. And now we have our VPC controls: we restrict access to both the control plane and storage VPCs to only the ENIs that we allow, and we combine that with iptables firewalls, capturing those flow logs, and continuing to monitor them for anomalies.
Defense in Depth: Additional Controls and Proactive Security Programs
So in addition to the three controls I just discussed, we have a number of others that didn’t come into play in this specific event, but they provide protection against a wide variety of threats. The head nodes need to make API calls, and we could have used EC2 instance roles to restrict what they could do with those credentials, but that would mean every head node has the same policy, which would introduce the possibility of cross-instance privilege escalation. Instead, we have an instance role that only has permission to access the token broker; the instance authenticates to the broker and gets an AWS API credential that is scoped down to exactly what that head node needs to do.
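The internal token broker isn’t a public API, but the effect — per-instance credentials narrowed to exactly one database’s resources — can be approximated with STS session policies. This is a hedged sketch of that general pattern, not how Aurora actually implements it; the role ARN, bucket, and resource names are hypothetical.

```python
# Illustrative sketch of per-instance credential scoping using an STS
# session policy. NOT how Aurora's internal token broker works; the role
# ARN, bucket, and key names are hypothetical.
import json
import boto3

def credentials_for_head_node(db_cluster_id: str):
    # A broad base role, narrowed per head node at credential-vend time.
    session_policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": f"arn:aws:s3:::example-aurora-artifacts/{db_cluster_id}/*",
        }],
    }
    sts = boto3.client("sts")
    resp = sts.assume_role(
        RoleArn="arn:aws:iam::123456789012:role/example-head-node-base",
        RoleSessionName=f"head-node-{db_cluster_id}",
        Policy=json.dumps(session_policy),   # effective permissions = role ∩ this policy
        DurationSeconds=3600,
    )
    return resp["Credentials"]
```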
Similarly, our host manager APIs are only invoked by our internal head nodes, but we built those as internet-facing APIs to provide another layer of defense in depth. If there was some privilege escalation on the head node, we’re still not losing a trust boundary because, as Eric mentioned earlier, we don’t consider that a trust boundary.
So we have the continuous evaluation of SELinux denials, of the Chronicle detections. Those are good, but we’re also looking for what we call dogs not barking. We use a number of canaries to test these controls to make sure that we are getting them correct. And this is expensive, but the trade-off is not one unit of security engineering work to protect one instance. It’s to protect millions of instances. We are amortizing this cost and this investment over a vast scale of the service. To be blunt, even if we started today with near-infinite generative AI software engineering capabilities, we could not rebuild the service overnight because the value of years of experience and analytics of this data is fundamental to what we’re building.
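A "dog not barking" check can be sketched simply: periodically perform a benign action that should trip a detection, then alarm if no corresponding alert shows up. Everything below — the trigger, the alert lookup, the timing — is a hypothetical illustration of the idea, not AWS’s canary framework.

```python
# Hypothetical canary sketch: exercise a control that should fire a
# detection, then verify the detection actually arrived. The trigger and
# alert-lookup functions are illustrative placeholders.
import time

def trigger_known_detection(host_id: str) -> str:
    """Perform a benign action existing rules should flag (e.g. run a
    harmless binary outside the expected-process allowlist)."""
    ...
    return "canary-run-0001"          # correlation id for the expected alert

def alert_seen(correlation_id: str) -> bool:
    """Query the alert store for an alert carrying this correlation id."""
    ...
    return False

def run_canary(host_id: str, wait_seconds: int = 300) -> None:
    run_id = trigger_known_detection(host_id)
    deadline = time.time() + wait_seconds
    while time.time() < deadline:
        if alert_seen(run_id):
            return                    # the dog barked; control verified
        time.sleep(15)
    # Silence is itself an alarm: the detection pipeline may be broken.
    raise RuntimeError(f"Canary {run_id}: expected detection never fired on {host_id}")
```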
So we just told you a bunch of great things. What are our blind spots? We’re proud of the work we’ve done. As we continue to build it, we know that we have these inherent biases. So how do we use repeatable, verifiable mechanisms to demonstrate correctness as opposed to good intentions or just hoping everything’s okay?
What’s in it for our customers? You rely on us to deliver delightful experiences. Part of that is holding the high bar for security. We can’t just follow best practices.
So we invest a lot in proactive security. We have a strong program. I’m going to discuss three components of that. First, our application security program has well-defined processes for threat modeling, risk identification, security evaluation, and the security engineers work with the builders at every phase of the software development lifecycle to build security in. And this is from ideation and design through implementation, the use of security-focused application libraries, testing, deployment, and monitoring. And we do penetration tests as part of this, and those are really valuable, but they’re intentionally scoped and they verify the expectations of system behavior and validate the controls we have in place.
We also have a bug bounty program, and we work with top security researchers in the community. In fact, I fly from here to London to do an in-person event, and we have invited experts, researchers who have already demonstrated a track record of finding critical issues. We are confident in our services. At a recent event, we actually gave those researchers root access to non-production instances to see if they could break out and access the multi-tenant storage, and they were not able to.
So we have external researchers that investigate our systems, and that’s great because they’re independent, but we also employ some incredibly talented offensive security engineers in the company. Our databases red team is designed to simulate sophisticated threat actors and external adversaries, and it essentially has an unbounded purview. This is unfair by design. It is an open-book test. They have extensive knowledge of the internals of our systems and their defenses, and they routinely find novel issues in the database engines themselves. This is just a list of a few that they’ve reported over the last few years.
But they don’t just find an issue and report it and then move on. They work hand in hand with our builders to demonstrate how they’re thinking, how they’re approaching this, what controls would have stopped them, and then they validate that with the builders. So this is a virtuous cycle, and it is constantly evolving and reinforcing itself and our confidence in these controls. They also help when we’re doing forensic analysis. So in one instance we were able to detect some suspicious activity on a customer instance. We reached out to the customer, and their security team had hired pen testers. Their operations team didn’t know that. They didn’t catch them, but we did, and they appreciated the heads up.
Revisiting the Event: How Multiple Security Layers Worked Together
So let’s go back and revisit the event with all this new knowledge and see how the pieces came together. So we start with the researchers installing PL/Rust, PL/Perl on their instance and rebooting. Well, from the beginning, the network isolation means they are prevented from interacting with storage service, the VPC, the control plane VPC. They only have access to their single tenant head node. And this is where the SELinux alarms are appropriately, excuse me, denials are appropriately blocking access from cargo trying to execute commands. So we can see SELinux now doing those denials. It’s a new control. So as Eric mentioned, these are getting routed to the engineering team.
So the engineering on-call starts looking to diagnose it: is this unexpected upgrade behavior or is this something else? Well, by now Coby and Tal have been working at it for a while, and they are able to run code by nesting PL/Perl functions with PL/Rust functions, leveraging the Rust GDB environment variable, and they were able to actually run a shell command. This is when we get Chronicle alarms about curl. And these alarms are well tuned. We have extremely high confidence in them. So this immediately pages the security on-calls.
So during that call, a security engineer in his first ever on-call is paging red team, escalation, legal, everybody, and we’re able to move quickly but not rush because we have confidence in the controls that are in place here. And this is not expected activity. The customer account isn’t part of our bug bounty program. As Eric mentioned, it’s a mature, paying customer. So they have production workloads running, and we can’t just terminate instances. Our red team lead actually reached out through the community and figured out it was Coby based on some metadata in the database name. And then we’re engaging with the TAM. This is a customer in another country, another time zone, so we’re waking people up in the middle of the night. But we’re handling it responsibly and we’re not disrupting their operations. And that kind of nuance and high judgment is difficult, if not impossible, to encode into an AI response agent. Having humans in the loop matters here.
Once we made a decision to snapshot the instance and move it to storage failure rather than disabling the entire account, we actually confirmed with Coby and Tal what was happening. They were pretty surprised, but we were able to safely terminate the instance, and now I’m going to turn it back over to Eric.

Thanks, Andy. So the way these things usually go down is the adversaries have done some sort of testing. Like any of you could install Postgres on your laptop right now, and you could work out this chain, and you could figure out all of the dependencies you have, and you can script it up, and you can run it and it executes like that. Takes a couple of seconds, certainly less than a minute, and you’ve got control over this node.
And one important thing about what Andy walked us through is that the SELinux denials actually worked. We stopped the chain. They were not able to execute cargo. Like that’s great. So now we’ve taken them from their well designed, their baked, their tested plan, and now they were innovating.
There’s a ton of detail in here that Andy just alluded to, like pivoting from PL/Perl to PL/Rust and then abusing an environment variable in the debugger environment. These were smart researchers, but they were innovating in the middle of an attack. That places them at a disadvantage, and that’s a good place to be.
Even though the SELinux denial was not sufficient to prevent them from gaining access, it was a huge speed bump. It immediately killed their momentum. It made them start experimenting and made them start working on new things. The timeline in the first couple of steps here shows this was not them coming in and gaining access in one lightning strike. This is them trying more and more commands. It was only after they were able to execute curl, which should never be run in this context, that they made progress.
The first thing you’re going to ask is, well, why doesn’t SELinux deny access to curl? These database engines are complex, and it turns out that they shell out to curl on their own. There’s all sorts of remote data connectors and things like that. So the curl binary needs to be present, but we know the patterns in which it’s called. We know what the parent process is going to look like, and we know what the environment is going to look like. So the binary is still there. We got our pager ticket, and we were able to respond very quickly.
Just having a partial protection in place, slowing them down, jolting them out of their groove, and making them start innovating gave us the advantage. It changed the balance of power here. Over time, the SELinux controls get tighter and tighter until they release a new database engine and it introduces new behavior. This is a forever job. This is not something that we ever get to finish.
Design Philosophy and Future Challenges: Building Security from Customer Needs
Another wrinkle with what Andy just walked us through is systems like VPC flow logs and Chronicle. We designed the collectors for these to be as simple as possible. I tell people to keep them stupid, because the Chronicle agent is deployed on literally tens of millions, hundreds of millions of hosts around the world. If it starts consuming excess memory, how many exabytes of memory are we going to have wasted on this agent? The more complex functionality we put in the agent, the more likely it is we’re going to have a broad problem on our fleet.
We’ve made promises to customers that we won’t have correlated problems across availability zones. We won’t have correlated problems across regions. Well, if the Chronicle agent starts eating memory in every region, we’re going to break that promise. So we keep these things simple. One of the best parts about securing a cloud is that you have a cloud. We’re hoarders. We collect these Chronicle logs, and we warehouse them.
We have VPC flow logs. Last year at re:Invent, I gave a talk on Sonaris and active defense. We have VPC flow logs for every ENI. When we get smarter, we don’t have to do a fleet-wide deployment. We already have the data flowing through our analytical systems. We already have the data flowing into S3. When we get smarter and we want to build a new detection, we just build the new detection and we deploy it to the analytics fleet, which is dramatically smaller than the entire cloud.
We can then use the logs that we’ve got stored as a time machine. We can say, in the past 30, 60, 90, 4,000 days, have we seen this behavior? And we can definitively say we have or have not seen this behavior. It’s incredibly valuable. As we build these systems in the future, I really like this design where you’ve got the thinnest collector possible, smart analytics in a central system, and then a data warehouse.
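One way to picture that "time machine" is a retrospective query run over logs warehoused in S3, for example with Athena. This is a hedged sketch of the pattern, not AWS’s internal analytics; the database, table, column names, and output location are hypothetical. The point is that a new detection can be asked of years of already-collected data without touching the fleet.

```python
# Hypothetical sketch: ask a new question of warehoused flow logs with
# Athena. Database, table, columns, and output location are assumptions.
import boto3

athena = boto3.client("athena")

QUERY = """
SELECT srcaddr, dstaddr, dstport, count(*) AS hits
FROM vpc_flow_logs
WHERE action = 'REJECT' AND dstport = 22
GROUP BY srcaddr, dstaddr, dstport
ORDER BY hits DESC
LIMIT 100
"""

resp = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "security_logs"},
    ResultConfiguration={"OutputLocation": "s3://example-analytics-results/athena/"},
)
print("Query submitted:", resp["QueryExecutionId"])
```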
The data warehouse is absolutely a cost consideration. You can burn a lot of money in S3, and so you get to tune the dials there. What do you keep? How do you cook it? How long do you retain the cooked stuff? How long do you keep the raw stuff? But it’s incredibly valuable. It is one of my favorite things about working in a cloud.
This is another example of a different design point. If you look at a typical client endpoint, an EDR agent or something like that, I might be on terrible Wi-Fi, certainly not at re:Invent. Wi-Fi at re:Invent is awesome. But I might be on a cellular connection, I might be intermittently connected. You have to make that agent smart. If I double-click on a file and it has to call out over cellular tethering, and it takes 40 seconds to open that file, the tickets are going to roll in, the users are going to burn me in effigy, and that agent’s going to get uninstalled.
But in the cloud, everything’s in the cloud. S3 is a millisecond away, it’s glorious. And so this design point for our cloud-based systems has served us incredibly well. So we believe that this is unique in the industry. I’m not aware of any capability like this anywhere else. Aurora itself is already unique. It’s an off-the-shelf code and SQL compatible database that provides you multi-AZ, multi-tenant performance. It’s incredible, but this is further.
And dealing with the database, even when you’ve got RDS, even when you’ve got Aurora, owning a database isn’t free. These things are baked deep into the core of your applications. This is the standard thing. Remember three-tier applications? This is it. This is where all of your data lives. And so upgrades are hard, changes are hard, it’s a huge testing burden. And one of the paths that we could have chosen to take is, look, we’re offering you this service, we’re making security promises. In order to deliver on those security promises, you have to keep up and you get this window. And if you don’t patch within this window, I don’t know, we’ll force patch you, we’ll turn your database off. There’s no good answers here. In this case it wouldn’t be the users burning me in effigy, it’d be the customers burning me in effigy. This is a bad path to go down.
And so we didn’t go down this path. As I opened with, we also decided that we couldn’t possibly treat a database as a security container. And that left us with a harder problem. We have to figure out how to enable customers to run whatever database it is that they think that they need to run, and it’s not my job to tell them that they’re right or wrong. This is their choice to make. I’m not deeply familiar with the constraints of their business. You know, maybe there was some regulatory discovery thing and they had to exhume a five-year-old snapshot of an application. That’s what they need. Fine, we’ll support that.
And so if we learned of a security issue in a database and that became critical to the safety of our service, you know, if a customer chooses to run an old version of a database, they’re doing that with their eyes open, they’re taking on some risk. That’s fine, that’s their risk to take on. They’re allowed to place their database at risk. They’re not allowed to place the service at risk. They’re not allowed to place other customers at risk. And so if we learn of a security problem in a database and we have to tell customers now in the middle of an SEV-2 pager ticket, you’re upgrading, it’s not the right customer experience.
And so add in the fact that this particular CVE that the researchers found was a zero-day. We didn’t know of the patch, because there wasn’t a patch. No one could have prevented this, because it was previously unknown. Had we gone down the path of treating the database as a security container, this would have been a real problem for us. As it was, we handled it reasonably well. And so this is an example of working backwards from our customers. You hear Amazonians say this all the time. You know, we’ve got this set of leadership principles, and the leadership principles are an unordered set, but every single time you see the leadership principles printed out, customer obsession is the first one.
And so you hear us talk about working backwards from our customers. What is the right outcome for customers? The right outcome for customers is their patching timelines aren’t driven by us. Their databases aren’t constantly taking down time. And so this is where we wound up. We’ve made sure that the security of our service does not depend on the security of the engine itself. And this is an inherent part of Aurora. This isn’t something you have to configure, this isn’t something you have to turn on. And so while it doesn’t provide the kind of granularity that GuardDuty provides, it’s always there. 100% of Aurora instances have these mechanisms on them.
And while its primary purpose is making sure that the Aurora service is safe, it makes the tough decision for the customer’s part to run an older database a lot easier to take on. Had we seen pivoting here, had we seen malicious activity, had we seen data exfiltration or destruction, we could have shut the doors. And this is another example of our deep ownership here. This is not something that you can do in a database engine, this is not something you can do in a database. It’s something you can only do in the context of a service. And because we own all of the layers of this stack, we’re able to work backwards from the right customer experience, from first principles, and provide our customers with unique protection, in this case, even against zero-day vulnerabilities.
And so this talk was a deep dive into a single aspect of security in Aurora. It’s not an exhaustive list of all of the security mechanisms we have. It didn’t talk about our extensive use of encryption on the network and in storage, our integrations with KMS where you can choose which keys get used and control access to them. There have been many talks and many blog posts and many papers about those features, but that’s not this talk.
We wanted to peel back the covers a bit and give you a glimpse into what goes into securing a system like Aurora. The result is something that we’re really proud of, but it’s not the end. There’s no clever one simple trick here. Our job was definitely made easier by the investments that we’d made. Chronicle is used broadly across Amazon, and that was a building block that we could take advantage of. VPC flow logs are ubiquitous. They come with the AWS services, and that was something that we could build on.
And we have a world-class streaming analytics team, really, really good at it, because we stream so much data through. We’ve refined these systems over years. And so it’s this endless process of iteration and improvement. I’ve made it most of the way through this talk, and I have not said the term. I will now say the term. I’m contractually obligated to use the term, I’m sorry. Gen AI is going to change everything.
And one of the things that we have seen people doing is using Gen AI to generate variants of attacks, variants of shellcode, variants of SQL injections, variants of everything you can imagine. And so as service owners here, our lives are going to get exciting. There have been a bunch of papers, a bunch of blog posts across the industry about people using Gen AI to do code analysis, to find novel vulnerabilities, turning it into a zero-day factory. We’re expecting the pace to continue to accelerate, and so we’re keenly aware that our job here isn’t done, that we can’t rest on our laurels.
We’re proud of this, we like what we’ve built, but one of the huge ingredients here is our red team. And I described the red team as world-class. There’s that list of CVEs that they found in a whole bunch of different database engines, and they go deep. But the thing that makes this red team world-class is the fact that in many organizations, a red team is a breaker. Their goal is to make the biggest crater they can. It is to demonstrate their prowess as a force of evil.
And at Amazon, our red teamers are builders. You do the thing, you gain the access, you plant the flag, and your work has just started. You now need to go back to the team, you need to educate them, you need to make sure they understand not just what you did, but how you thought about this, what led you to this path, what all of the different little things are. Because unless your service is brand new and you haven’t invested in security, there’s no obvious path in.
It’s a little bit of access here, a little bit of access there, and information disclosure over there, and every time you close one of those, you make their job harder. And so our red teamers stay engaged until the thing is fixed. That’s what they’re measured on, that is what their aim is, to fix things, not to break things. So thank you very much for joining us here in Las Vegas. There’s not that much left in the conference, but I hope it goes well for you. I hope you enjoy re:Invent. And that’s it. That’s the recipe. Please do fill out the survey.
This article is entirely auto-generated using Amazon Bedrock.