🦄 Making great presentations more accessible. This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.
Overview
📖 AWS re:Invent 2025 - Advanced data modeling with Amazon DynamoDB (DAT414)
In this video, Alex DeBrie covers advanced DynamoDB data modeling, focusing on three key areas: secondary indexes with the new multi-attribute composite keys feature that eliminates synthetic key overhead, schema evolution strategies including handling new attributes and backfilling existing data, and common anti-patterns like kitchen sink item collections and over-normalization. He emphasizes understanding DynamoDB’s partitioning model, consumption-based pricing, and API design to achieve consistent performance at scale while keeping implementations simple and cost-effective.
This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
Introduction: Advanced Data Modeling with Amazon DynamoDB
Thank you all for coming. This is Advanced Data Modeling with Amazon DynamoDB. I’m Alex DeBrie. I’m really grateful for you showing up here on a Wednesday morning. This is my seventh year talking at re:Invent.
In an hour, I can’t cover everything about DynamoDB. I try to cover different topics every year, but if you want to look at previous years, these are all on YouTube. You can check out those topics. Additionally, there are a lot of really great talks this year from all over the spectrum—data modeling, architecture. There’s a great one by Craig Howard that’s tomorrow about the service disruption that just happened. There’s a lot of really great stuff. Some of these have already happened, and some you probably can’t get into, but check these out on YouTube as well. Really great speakers for all of these.
In terms of what we’re going to talk about today, I’ll start off with a little bit of background and some data modeling goals and process, and then we’ll dive into some topics around using secondary indexes well. There’s a really good release we had two weeks ago that I’m super excited about—talking about schema evolution in DynamoDB because I get questions around that a lot. And then there’s a quick anti-pattern clinic around some anti-patterns I see and how you can address them instead.
I like to think about it as: we want to first set some baseline facts just so we’re all on the same page, maybe have some higher level concepts and things we should be keeping in mind as we’re doing this modeling, and then some actual application where we’re applying it to our data modeling. I’m Alex DeBrie, AWS Data Hero. I’m going to talk fast. I have a ton of slides, and they’re worried I’m actually not going to get through it all. So I probably won’t be able to do Q&A in here, but I will do Q&A out there if you want to. I will also be at the DynamoDB booth in the expo hall most of this afternoon. So bring questions, bring your friends, and let’s talk DynamoDB.
DynamoDB’s Unique Characteristics: Fully Managed, Consumption-Based Pricing, and Consistent Performance
All right, let’s get started with the background. I want to start off with some unique characteristics of DynamoDB because DynamoDB is a unique database. It’s different. Most people know relational databases, right? And if you want to model well in DynamoDB, you need to learn some different things. You need to change what you’re doing. You need to teach a lot of people on your team. Given that, I think if you want to use DynamoDB, you should want one of its unique strengths and unique characteristics.
The three that I always think about are: number one, it’s fully managed, right? It’s fully managed in a different way than RDS or some other managed database service. With DynamoDB, they have this region-wide multi-tenant, multi-service, self-healing giant fleet with storage nodes, load balancers, request routers, and all sorts of different infrastructure. You cannot take down DynamoDB. That’s not going to happen. You could overload a relational database or OpenSearch or some other database, but not DynamoDB. So it’s really fully managed and pretty hands-off operationally compared to most other databases.
In addition to that, it has a consumption-based pricing model that I love. You’re not paying for an instance. You’re not paying like a traditional database where you pay for CPU, memory, IOPS, and however big your instance is. You’re actually paying for consumption. With DynamoDB, you’re paying for read capacity units—actually reading data from your system. Every four kilobytes of data you read consumes a read capacity unit. Same with writes: every one kilobyte of data you write to DynamoDB consumes a write capacity unit. And then you’re also going to pay for the data you’re actually storing.
There are some unique implications of that. The big one that I like is very predictable billing impacts. If you have an existing application and you want to add a new secondary index, you should be able to calculate pretty easily how much it’s going to cost to backfill that index and how much it’s going to cost going forward, knowing your write patterns. I also say that you can work out a bunch of DynamoDB costs in Excel. You don’t have to actually spin up a database and see how much it’s actually using. You can do this in Excel, and it’s pretty nice to do that.
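As a rough illustration of that kind of estimate, here is a minimal sketch in TypeScript. The item count, item size, and per-million-WCU price below are placeholder assumptions, not quoted figures; plug in your own numbers and your region's current pricing.

```typescript
// Back-of-the-envelope estimate for backfilling a new secondary index.
// All inputs are illustrative assumptions; substitute your own values.
const itemCount = 10_000_000;        // items to backfill (assumption)
const avgItemSizeBytes = 500;        // average projected item size (assumption)
const pricePerMillionWcu = 1.25;     // placeholder on-demand write price in USD (assumption)

// Writes are billed in 1 KB increments, so round each item up to a whole WCU.
const wcuPerItem = Math.ceil(avgItemSizeBytes / 1024);
const totalWcu = itemCount * wcuPerItem;
const backfillCost = (totalWcu / 1_000_000) * pricePerMillionWcu;

console.log(`Estimated backfill: ${totalWcu} WCUs, roughly $${backfillCost.toFixed(2)}`);
```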
There’s also a really tight connection between how efficient you are, how you model and think about efficiency, and your bill. DynamoDB is pricing it in a way that’s giving you signals on how you should be using it. I’d say take those signals, use them, and you’ll save a lot of money on your bill. I love that consumption-based pricing model. And then performance-wise, the one people think of is consistent performance at any scale. You can have DynamoDB tables that are small—there are lots of megabyte level tables—but there are also lots of terabyte level tables and petabyte level tables. DynamoDB gives you consistent performance at any scale, no matter which size you’re at.
Understanding DynamoDB Architecture: Tables, Items, and Partitioning
I would say these three things—ops, economics, performance—at least one of these should stand out to you. I want that from DynamoDB. That’s why I’m going to learn DynamoDB. That’s why I’m going to change how I do some modeling to make it work within DynamoDB. Let’s do just enough architecture to understand that consistent performance at any scale part. If we have an example data set here, this is some users in my application. Those are going to be contained in what’s called a table. You’ll create a DynamoDB table to store your records. Each record is going to be called an item. When you create your table, you have to specify a primary key for your table. That primary key must be included on every record and must uniquely identify every record in your table.
In addition to that, when you’re writing items in DynamoDB, you can include other attributes on it. You don’t need to specify these attributes upfront. There’s no schema that’s managed by DynamoDB itself, other than that primary key. The schema is going to be managed within your application layer. Those attributes can be simple scalar values like strings, integers, or they can be complex attributes like lists and sets and maps.
The primary key concept is so important to how DynamoDB works. There are two types of primary keys when you’re creating your DynamoDB table. There’s a simple primary key, which just has a partition key. Our users table, for example, just has that username that uniquely identifies every record in our users table. But we also have a composite primary key, which is probably more common depending on your use cases. A composite primary key has both a partition key and a sort key. So we can imagine in our application, maybe users can make orders. In that case, we might want to use this composite primary key which has these two different elements: that partition key of a username and that sort key of an order ID.
Note that each one of these is a distinct item. Some of them share the same partition key, but the combination of that partition key and sort key needs to be unique. You’re going to choose this type when you create your table—simple or composite—and you can’t change the key names or anything like that after you create your table. Either way, the primary key has to be unique for each item.
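As a sketch of what those two key types look like when creating tables with the AWS SDK for JavaScript v3 (the table and attribute names just follow the users and orders examples above):

```typescript
import { DynamoDBClient, CreateTableCommand } from "@aws-sdk/client-dynamodb";

const client = new DynamoDBClient({});

// Simple primary key: just a partition key (username).
await client.send(new CreateTableCommand({
  TableName: "Users",
  AttributeDefinitions: [{ AttributeName: "username", AttributeType: "S" }],
  KeySchema: [{ AttributeName: "username", KeyType: "HASH" }],
  BillingMode: "PAY_PER_REQUEST",
}));

// Composite primary key: partition key (username) plus sort key (orderId).
await client.send(new CreateTableCommand({
  TableName: "Orders",
  AttributeDefinitions: [
    { AttributeName: "username", AttributeType: "S" },
    { AttributeName: "orderId", AttributeType: "S" },
  ],
  KeySchema: [
    { AttributeName: "username", KeyType: "HASH" },
    { AttributeName: "orderId", KeyType: "RANGE" },
  ],
  BillingMode: "PAY_PER_REQUEST",
}));
```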
Now as you look at that primary key, both of those types include a partition key, which is probably one of the more important things you need to know about DynamoDB and using it well. When you’re creating a DynamoDB table, behind the scenes, DynamoDB is going to spread your data across multiple different partitions. Let’s say you start with two partitions. When a write request comes into DynamoDB, say you want to insert a new user into our table, that front end request router is going to look up the metadata of your table, understand what its primary key is, look at that partition key value, hash that value, and then figure out which partition it needs to go to. So in this case, that item belongs to partition one.
The great thing about this is that as your data grows, DynamoDB behind the scenes can add a third partition, add six more partitions, add a thousand more partitions. It doesn’t matter. That first step of figuring out which partition an item belongs to is a constant time operation. So even if you have one of those ten terabyte tables, it’s still going to be a constant time operation to get down into about ten gigabytes of data. Those partitions are all managed for you. You don’t have to add new partitions. DynamoDB is doing that behind the scenes for you. That’s where that consistent performance at any scale happens.
The DynamoDB API: Single Item Actions, Query Operations, and Mental Models
The most important thing you need to know is the combination between partitioning and how the DynamoDB API exposes that to you. These items are going to be spread across your partitions by that partition key, and then you want to be using that partition key and using that whole primary key to identify your items. Rather than giving you a SQL interface with a query planner behind the scenes, you’re basically getting direct access to those items through the API.
DynamoDB has what I call single item actions, which are all your basic CRUD operations. If you’re inserting a record, reading a record, updating a record, or whatever, you’re doing that with your single item actions. In this case, it requires the full primary key. You need to specify the exact item you want to go and manipulate. All your write operations are going to be with those single item actions. There’s not like an update where you can update a bunch of records given some predicate.
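For illustration, using the orders table from the earlier example (the item values here are made up), the single item actions always address one item by its full primary key:

```typescript
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, GetCommand, UpdateCommand } from "@aws-sdk/lib-dynamodb";

const doc = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Read one item: the full composite key is required.
const { Item } = await doc.send(new GetCommand({
  TableName: "Orders",
  Key: { username: "alexdebrie", orderId: "order-123" },
}));
console.log(Item);

// Update one item: again addressed by its full primary key.
await doc.send(new UpdateCommand({
  TableName: "Orders",
  Key: { username: "alexdebrie", orderId: "order-123" },
  UpdateExpression: "SET #status = :status",
  ExpressionAttributeNames: { "#status": "status" },
  ExpressionAttributeValues: { ":status": "SHIPPED" },
}));
```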
Now if we go back to that composite primary key and think about how that partition key is used mostly to assign records to the same storage partition, we can see that records with the same partition key in this composite primary key table are going to be grouped and stored next to each other. They’re going to be ordered according to that sort key. It’s really useful to us because sometimes we need to get a set of related records, and that’s where DynamoDB gives you the query operation as well. If you have a table with a composite primary key, you can do this and you can fetch multiple records with the same partition key in a single request.
When you’re doing this, it requires the full partition key because it needs to know which partition to go to to actually fetch those records. You can also have conditions on that sort key to say, maybe, values before the sort key value or after the sort key or between those sort key values. Finally, it’s only going to let you get one megabyte of data per request, which is how it gives you that consistent performance at any scale. It doesn’t want you to have a gigabyte of data that you’re pulling back because that’s going to change your response times.
Additionally, the DynamoDB API has a scan API, which is just fetch all. I’d say you’re going to use this pretty sparingly, other than like exports and different things like that. So given this partitioning and the API, my mental model for DynamoDB is that when I make a request to DynamoDB, I get one contiguous disk operation from some unbounded amount of storage. DynamoDB is just giving me one contiguous disk operation from an unbounded amount of storage.
Dynamo is providing this infinite fleet of storage that expands as much as I need. However, I need to physically think about how I’m setting it up within my application to ensure I can get what I need when I need it. If I want to insert a new record, I need to identify the primary key to determine exactly where that goes. If I need to read a record, I need the primary key so I can find it easily. Or if I need to fetch a set of related records, I need to arrange them so I can use the query operation with all items grouped together and sorted as needed. I can read up to a megabyte of data there.
The key point here is that every API request gives you one contiguous operation from an unbounded amount of storage, unlike SQL, where you get a query planner that hops all over the disk doing multiple reads. Given that you get basically one contiguous read, I really love the DynamoDB API. I think it works very well with the partition key, and you need to understand that. With that in mind, I would say don’t start with the PartiQL API. There is a PartiQL API, which is a SQL-ish API you can use to query DynamoDB. Under the hood, it’s basically just turning your statement into one of these operations: a single item action, a query, or something like that. However, I think it hides a lot of what you should actually be thinking about in your application: how do I want to arrange my data to fit with the DynamoDB mental model?
Secondary Indexes: Enabling Additional Read-Based Access Patterns
The last quick concept we need to cover is secondary indexes. We’ve seen that on this table we can fetch records by the primary key. I can fetch the Alex DeBrie record or fetch Madeleine Olson by username. But what if I want to fetch someone by a different attribute, like their email address? That’s where the concept of secondary indexes comes in. You can create secondary indexes on your table, and it’s basically a fully managed copy of your data with a new primary key. This enables additional read-based access patterns by repartitioning your data and reorganizing it so you can fetch it more easily. We can create a secondary index on the email address, and now we have this email index which allows us to fetch a record by a given email address.
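For illustration, querying such an index looks just like querying the table, plus an index name. The index name here, EmailIndex, is an assumed example.

```typescript
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, QueryCommand } from "@aws-sdk/lib-dynamodb";

const doc = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Look up a user by email via the secondary index instead of the base table.
const { Items } = await doc.send(new QueryCommand({
  TableName: "Users",
  IndexName: "EmailIndex", // hypothetical index name
  KeyConditionExpression: "email = :email",
  ExpressionAttributeValues: { ":email": "alex@example.com" },
}));
console.log(Items);
```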
There are two types of secondary indexes. The first kind is a global secondary index, which you should prefer in almost all cases. The second type is a local secondary index, which you should really understand before using. I was going to do a deep dive on why and when you should use local secondary indexes, but I had to cut it for time. I would just say that a local secondary index is kind of a one-way door and has some serious downsides. Make sure you really understand a local secondary index before putting one on your table.
To summarize and get us on the same page with some basic facts, make sure you understand the concept of partitioning, using the partition key to spread your items across your table. Understand how the DynamoDB API works to provide consistent performance at any scale and the importance of the primary key in that. Then understand how secondary indexes enable additional read-based access patterns and the consumption-based billing, which I think is unique and pretty interesting about DynamoDB. You get predictable and visible billing there. Let’s move on to some high-level concepts around data modeling goals and process.
Data Modeling Goals and Process: Keeping It Simple
The first thing I would say is that you want to keep it simple as much as possible. I think a lot of times when I see errors or issues with people’s DynamoDB data models, it’s just that they’re more complex than they need to be. I was guilty of this for a long time as well. I saw this recently on Twitter about how a novice does too much in many different areas, while a master uses the fewest motions required to fulfill their intention. So keep it simple as much as you can.
I like to think about what your modeling meta goals are, regardless of what database you’re using. What are we doing when we’re doing data modeling? Number one, you have to think about how to maintain the integrity of the data you’re saving. Ultimately, a database is serializing and storing some state of the world, whether that’s representing physical objects like inventory, offices, or people, or digital stuff like social media likes and comments. You need to be able to maintain the integrity so when you read that back out and represent it, you can still understand what you have in your database. In addition to maintaining integrity, you need to make it easy to operate on the proper data when you need it. If I have one of those big tables, how do I get down to just the records I actually need?
Thinking about that in the context of DynamoDB, how do we apply that? When thinking about maintaining the integrity of the data you’re saving, the first thing you have to do is have a valid schema. DynamoDB is not going to maintain your schema for you like a relational database would. DynamoDB is a schemaless database, so you can write whatever attributes you want, which means you’re going to maintain that in your application code. It’s an application-managed schema rather than a database-managed schema. You should have some sort of valid schema in your code. I use TypeScript and Zod, but you can use whatever you want. When you’re writing records into your DynamoDB table, especially if you’re inserting full records, you should almost always be validating that you know what you’re writing in and that it’s valid data.
Validate before you put it in there, because you don’t want to put junk into your database. The same applies when you’re reading data back from DynamoDB. You should parse that out and make sure the shape matches what you expect. If it doesn’t, throw a hard error rather than limping along. You want to know that you got something back you did not expect and where that issue happened. You don’t want to keep corrupting your data worse and worse over time.
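Since the talk mentions TypeScript and Zod, here is a minimal sketch of that application-managed schema idea: validate before writing and parse on the way out, failing loudly on a mismatch. The attribute names are illustrative.

```typescript
import { z } from "zod";

// Application-managed schema for a user item (attribute names are illustrative).
const UserSchema = z.object({
  username: z.string().min(1),
  email: z.string().email(),
  fullName: z.string(),
  createdAt: z.string(), // ISO 8601 timestamp
});
type User = z.infer<typeof UserSchema>;

// Validate before writing so junk never lands in the table.
export function toItem(input: unknown): User {
  return UserSchema.parse(input); // throws if the shape is wrong
}

// Parse on the way out too, and fail loudly instead of limping along.
export function fromItem(item: unknown): User {
  const result = UserSchema.safeParse(item);
  if (!result.success) {
    throw new Error(`Unexpected item shape: ${result.error.message}`);
  }
  return result.data;
}
```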
So that’s a big one: make sure you have a valid schema somewhere in your data. Additionally, when you’re maintaining the integrity of your data, you want to maintain constraints across items. If you have some uniqueness requirements, you don’t want to have multiple users with the same username. You need to maintain uniqueness that way, or maybe you have some sort of limits you need to enforce in your application. How are you going to do that? That’s where DynamoDB condition expressions are going to be your friend, maybe transactions, but think about those constraints and make sure you’re modeling for them.
Then with DynamoDB, sometimes we duplicate data. Sometimes we do a little bit of denormalization. So think about how you avoid inconsistencies there. When you’re duplicating data, think about whether this data is immutable. Is it ever going to change? Sometimes you will have functionally immutable data. If someone makes an order and you want to store the payment method that they used for that order, even if they change something about that payment method later on, you don’t really need to go change that order. You’re capturing a snapshot of what the payment method was at that time.
But sometimes you’ll be duplicating mutable data. If I have mutable data, how am I going to update my data when it changes? First of all, how many times has it been copied? Has it been copied five times or a thousand times? How do I identify the items to which it’s been copied where I need to go update all these different records for this data that’s changed? And probably the hardest question now: how quickly do I need to update it? Do I need to update it in the same request, where if I’m updating that parent record and now it’s been spread out to five different items, do I need to do that in a transaction to make sure they’re all updated at the same time? Or can I do that update asynchronously? Am I going to have data inconsistencies across that? What does that mean for my users or clients?
So be thinking about this when you’re duplicating data. That’s what I think about when I’m talking about maintaining the integrity of the data within DynamoDB. Then you also want to think about how to make it easy to operate on the right data when you need it. That’s where your primary keys are coming in. If you’re writing, think about what’s your proper primary key to maintain uniqueness and what’s the proper context. How do I canonically identify a record that I’m always going to have available that I can use to address that relevant item? When I’m reading, what’s the primary key structure? What are the indexes I need to filter down efficiently rather than significantly overreading and filtering after the fact?
Those are some meta-goals we’ll keep in mind as we look throughout this. Just a quick run through the data modeling process: I always say that most of DynamoDB data modeling happens before you write any application code. You should be thinking about this, writing it down, and then the implementation aspect is actually pretty straightforward. First thing you need to do is know your domain. What are the entities in my domain that I’m actually modeling? What are those going to look like? What are the constraints that I have in my application? What’s the data distribution? If I have a one-to-many relationship, how many can that related aspect be? Is it ten related items per parent, or is it a thousand or a million or something unbounded?
How big are these records? Because that’s going to affect modeling and some of the choices I’m making there. Then with DynamoDB, you want to know your access patterns up front and model specifically for your access patterns with your primary key. I always say be very specific with this. You should actually list out your write-based access patterns and go through those mechanically. Same thing with your read-based access patterns. As you’re modeling your primary keys, you should make sure you know how to handle each one of these. If I have conditions in my write operations, do I have that set up properly? All these sorts of things. So know your access patterns, and then the last thing: just know the DynamoDB basics, the things we talked about before: the primary key, the API, and secondary indexes. That’s going to do most of it for you.
Multi-Attribute Composite Keys: A Game-Changing Release for Secondary Indexes
So please just keep it simple on this stuff. I think the basics are going to get you a long way. Using those single item actions for your write operations and your individual reads, using some queries for range lookups and list operations, sprinkling in those secondary indexes when you need them for additional read-based patterns, and then sometimes using transactions sparingly for certain operations. All right, so that gets us out of sort of background conceptual type stuff. Let’s go apply it somewhere. And I want to start off talking about secondary indexes. The reason I want to talk about this is because there was a huge release just two weeks ago about how DynamoDB now supports multi-attribute composite keys for your GSIs. This is a huge release. I think this will simplify a lot of things for people. But in terms of walking through why this is useful, let’s look at an example table we have here that’s just tracking orders within a warehouse within some system. We have multiple warehouses, and we have assignees within those different warehouses that have to go process those orders, pick them, and make sure they’re all ready to be shipped.
So we have these different attributes on our table. We might have some sort of access pattern that says, for this warehouse, for this assignee, for this status, what are the things that they should be working on?
You might see some sort of attribute like this in your table: GSI 1 PK and GSI 1 SK, which are synthetic keys made up of other attributes that are already in your table. If you look at this, we have the warehouse ID put in there, then we have a pound sign, and we have the assignee ID jammed in there. Then we’ve got status, we’ve got priority, we’ve got created at—all of this is made up into these synthetic keys in our GSI PK and SK.
This was a very common pattern we used to have to do with these synthetic keys, where we’re manually combining these attributes to partition, to group, and then sort as needed. I didn’t realize how much I disliked these until this new release came out, because there’s a ton of downsides to this. The number one is just the item size bloat.
If you look at that item that we have in our table, the meaningful attributes on it are 100 or 101 bytes, so it’s a pretty small record. If you look at the other attributes—these synthetic keys—they’re 92 bytes. So this is almost half of our item. It won’t always be half of your item because you’ll actually have larger other attributes there, but if you have two indexes with GSI 1 PK, GSI 2 PK, and SKs as well, you might be talking about 200 bytes, which is 200 bytes you’re storing on every single item.
You’re paying storage for that. 200 bytes is 20 percent of a WCU, so it’s likely to kick you over the WCU limit a lot of times. So every time you write, you’re paying for an extra WCU and into index replications as well. This is just a lot of cost for very low value. These attributes are already on your table.
There’s item size bloat, and there’s also just the application modeling complexity. When you’re doing all your access patterns and setting up these indexes, you have to think about putting all these together and have I done it right? Have I implemented it right in my application? There’s sorting complexity around taking all these attributes and turning them into a string. But if one of those attributes is an integer, now you have to sort that integer like a string, so you have to zero pad it to the longest length it could potentially be and think about that sort of thing.
Then there’s the update logic. It gets harder. If someone comes and says, hey, update the status for this order—it’s no longer pending, it’s prepared or whatever—I have to know all those other values. If I don’t know all those other values and I have to read that record to pull down those values just to do my update, it’s kind of a pain to do all this sort of stuff.
That’s the old way. But now we have these multi-attribute composite keys. The way this works is you can support up to four attributes each for the partition key and the sort key when you’re creating your secondary index. If we have our existing table and we want to use this multi-attribute composite key pattern, what we do is when we’re creating our partition key, we say, OK, I want my first element to be that warehouse ID. I want my second element to be that assignee ID. I want that third element to be my status.
Same thing with the sort key—I want the first element to be that priority. I want the second one to be that created at. Now I don’t need those GSI 1 PK and GSI 1 SK values at all anymore. DynamoDB just reuses the existing attributes when it creates my secondary index to know how to set that index up.
If we go back and look at those downsides, how does this work with our multi-attribute composite keys? We don’t have that item size bloat anymore because it’s actually reusing the actual attributes in our table. We’re not bloating it up with another 100 or 200 bytes on our table. It’s a lot easier to reason about because when I’m writing or updating an item, I don’t have to think, OK, which other GSI synthetic keys do I have to update as well?
We don’t have that sorting complexity. If one of my partition or sort key attributes is an integer, it sorts like an integer. It doesn’t sort like a string, so you can just do normal sorting on it. Then my update logic is a lot easier because again, I’m only updating the actual attributes in my table. They’re handling all the work for that.
In terms of how it works, you get up to four attributes each for your partition key and sort key. So you can still just use one attribute for each if you want to, but you can specify up to four. Now, when you’re doing a query operation, you have to include all your partition key attributes in your key condition expression. Just like before, you still need to know exactly which partition to go to and where this data is located, so you need to make sure you have all your partition key attributes in there.
You can do conditions on that sort key as well. The important thing is that sort key is going to be ordered. The ordering of those attributes matters, and I would think of it like a SQL composite index. It’s basically left to right, no skipping, stops at the first range. So if you have four values in your sort key, you can match on all four of them, you can match on the first three and do a range on the fourth one.
But what you cannot do is an equality match on the first attribute and an equality match on the third attribute without providing a value for the second attribute. The condition stops at that missing attribute and will just scan from there.
The one thing I will say is this will not solve your overloaded index issues, probably. So if you are doing single table design in this case where we have some user entities in one table, we also have some organization entities. You can see we have these GSI 1 PK and GSI 1 SK values here for a secondary index.
If we look at our secondary index, we have an item collection, which is a set of records with the same partition key. We have an item collection that contains two different types of entities. We have pre-joined these entities, organization and user, in that same item collection. This is going to be hard to do with those multi-attribute composite keys because it is unlikely they are going to have the same attribute names across these different entity types. For our partition key, both entity types have an organization name, so we could use that. But you can see in the sort key, the value is coming from the username on the user items, while on the organization item it is just a static string, so it would take a little bit of work to do this if you are doing these overloaded index patterns. So it probably will not help that one there, but in all other cases, this is going to be a huge win for you.
Cost Management with Secondary Indexes: Strategic Use and Optimization
All right, so for secondary indexes again, use these multi-attribute composite keys. This is huge. I would use this for almost all cases except for those overloaded indexes. Honestly, for existing tables, this might make sense too just to save on item sizes. Create a new index with this, switch over to it, and drop your old index. You can stop writing that synthetic key, which could actually save you money depending on your use case.
In addition to that, let us talk about cost management with secondary indexes, because I think this is undervalued. Every time you are writing to a secondary index, it is going to consume write capacity units. Secondary indexes are a read-time optimization that you pay for with writes. And writes are more expensive than reads: a write capacity unit costs five times as much as a read capacity unit, and it covers only one kilobyte versus four kilobytes for a read, so writes are going to be five to twenty times as expensive as reads depending on the size of your items. So make sure you are getting the value from that.
Here are some cost management tips on secondary indexes. The first thing I think is, do I actually need a secondary index? Because I think a lot of times we will write our access patterns, we will solve that first access pattern with the primary key in our base table, and then we will say, okay, every other read-based access pattern, I am just going to add another secondary index for that. But now in this case, I have three secondary indexes. Every time I write my item, I am going to have to replicate to each one of those. My write costs are now four times as much as they would be. So make sure you actually need all your indexes.
Sometimes you can reuse secondary indexes for multiple different use cases. I would say the two areas I usually see this is like if you have a really high correlation or overlap between different read patterns, sometimes you can do that. I had a talk recently which was about an order delivery app, something like DoorDash. Imagine you want to show all the orders for a given restaurant over time. They want to say, hey, what orders did I have last month or the month before that? They are grouped by restaurant, ordered by time. But also that restaurant wants to say, hey, what are my active orders that I should be working on now? I want to put up on the board in my restaurant to make sure people are working on them.
Well, the thing is, all your active orders are going to be the most recent orders. You are not going to have an order from two weeks ago that you forgot to deliver and you need to be working on now. So you just look at the last fifty or one hundred orders, filter out those that are actually completed, and those are your active orders. You do not need a secondary index for that one.
The second place you can reuse a secondary index is just when your overall search space is pretty small. So searching for admin users within all users. Like if you have a SaaS application where organizations sign up and they have lots of different users in there, and somewhere deep in your user management page you want to look for just who are the admins in my application. Well, if you only have like one hundred, two hundred, or three hundred total users max within a given organization, you probably do not need this separate index just to show admin users. You can just fetch all the users and then filter down to admin after the fact.
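A sketch of that approach, with assumed table and attribute names: fetch the organization's whole (small) item collection and filter to admins in application code instead of maintaining another index.

```typescript
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, QueryCommand } from "@aws-sdk/lib-dynamodb";

const doc = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// With only a few hundred users per organization, skip the extra index:
// fetch the whole item collection and filter to admins in the application.
const { Items = [] } = await doc.send(new QueryCommand({
  TableName: "AppTable", // assumed table and key names
  KeyConditionExpression: "orgName = :org",
  ExpressionAttributeValues: { ":org": "acme" },
}));
const admins = Items.filter((user) => user.role === "admin");
console.log(admins);
```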
My rough rule of thumb here is: if fetching that total search space, in this case all the users for a given organization, is less than one megabyte, which is one DynamoDB query page, you usually do not need a secondary index for that, depending on how many times you are reading from it and things like that. All right, so that is the first one. Do I need an index at all? Can I avoid having an index? The second one is, if I do need an index, do I need to put all my items into my index? And this is where the sparse index pattern shows up.
The thing about secondary indexes is DynamoDB is only going to write items to your secondary index if they have all the index key attributes. So if you have that GSI 1 PK and GSI 1 SK, or if you are using these multi-attribute composite keys, it has to have all those different elements to be replicated in your secondary index. And you should use this to your advantage.
If we go back, we had that orders table that we showed in the beginning, let’s say we had an access pattern that said find all the orders for a given customer that had a support interaction. Maybe what we do is when an order has a support interaction, we add this support interaction at timestamp attribute on it just to indicate when that interaction happened. Notice that not every record is going to have one of these.
If we set up a secondary index on that table using that support interaction, partitioned by that username and sorted according to that support interaction at timestamp, now when we have our order support index, it’s only going to have the subset of items that have both of those attributes. We’ve filtered out all the records that don’t have a support interaction. So again, use this to your advantage both from a modeling perspective. This is like a global filter over your data set. We basically said where support interaction is true. You want to be doing filtering with your primary key, with your partition key, with your sort key, but also with your secondary indexes using that sparse index pattern, which is another good way to filter data.
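A sketch of how an item enters and leaves a sparse index, assuming the attribute is named supportInteractionAt: setting the attribute makes the item show up in the index, and removing it drops the item back out.

```typescript
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, UpdateCommand } from "@aws-sdk/lib-dynamodb";

const doc = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Setting the attribute makes this order appear in the sparse support index;
// items without supportInteractionAt are never replicated there at all.
await doc.send(new UpdateCommand({
  TableName: "Orders",
  Key: { username: "alexdebrie", orderId: "order-123" },
  UpdateExpression: "SET supportInteractionAt = :ts",
  ExpressionAttributeValues: { ":ts": new Date().toISOString() },
}));

// Removing the attribute later drops the item back out of the sparse index:
// UpdateExpression: "REMOVE supportInteractionAt"
```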
It’s also going to reduce costs because now you’re not paying to replicate those items that you’re never going to read from this index, or you don’t have to overread and filter out records that don’t match your conditions. So that’s the second cost management tip. First of all, do I need an index? Second, do I want all the items in my index? The last one is, do I need the full item in my index, where you can choose how much of that item to project into your index.
I used to say just project the whole thing, but that actually can get really costly in a lot of ways. Think about our user record again, and maybe we have a user detail page that has a lot of information about that user. It’s got this long bio section, preferences, an activity log, maybe we’re persisting some of their most recent actions on there, just a lot of stuff. But if we have a list users access pattern, we don’t need almost any of that data. If you look at that, we need a user ID, name, email, just a few little bytes of information.
So if we’re creating a list of users in an organization access pattern, we don’t need to replicate all that or project all that data into that index. With your index projection, think carefully about which attributes you’re projecting into that secondary index because there’s significant savings you can have from not doing that, and it comes in three ways. Number one, it’s going to reduce the WCUs that are consumed for a write. If I have this five kilobyte user record but I’m projecting less than one kilobyte of it into my secondary index, I’m reducing my WCUs from five to one, that’s eighty percent savings right there. But even better is that it prevents some writes entirely.
If a user goes and updates their bio and I’m not replicating that to my secondary index, it’s not going to update that record in my secondary index. I skip that write entirely, and now I save one hundred percent of that write for that secondary index. Additionally, it’s going to help you on the read side because now when you’re reading all those users, you’re not paying to read the full record for each user; you’re just paying for a much smaller item. It’s going to reduce the number of pages when reading, so really think about your projections carefully around this stuff. I’d also say it’s not a one-way door.
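As a concrete sketch, a narrower projection is just part of the index definition you pass to CreateTable or UpdateTable. The index and attribute names here are illustrative.

```typescript
// A GSI definition that projects only the handful of attributes the
// "list users in an organization" pattern needs, instead of the full item.
// This object would go inside a CreateTable definition or an UpdateTable
// GlobalSecondaryIndexUpdates.Create block.
const listUsersIndex = {
  IndexName: "OrgUsersIndex",
  KeySchema: [
    { AttributeName: "orgName", KeyType: "HASH" },
    { AttributeName: "username", KeyType: "RANGE" },
  ],
  Projection: {
    ProjectionType: "INCLUDE",
    // Index and table keys are always projected; only add the extras you need.
    NonKeyAttributes: ["fullName", "email"],
  },
};
```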
You can create secondary indexes after the fact. We’re going to look at that in the schema evolution section, but you can if you need to change your projection over time, create a new secondary index either with a larger or smaller projection and then drop your old index and start reading from the new one. So that’s secondary indexes. Again, key takeaways here is use this multi-attribute composite key pattern wherever possible. I think it’s a huge addition that’s really going to reduce your costs and simplify a lot of your logic there.
Schema Evolution: Addressing the Myth That DynamoDB Can’t Change
Look into sparse indexes for global filtering over your table. Then think about that index cost flow. Do I need that index? If I do, do I need to replicate all those items into my index? And then finally, do I need to replicate the full item into that index? All right, next, let’s talk about schema evolution. I get this question a lot where we talked earlier about how you have to know your access patterns in DynamoDB, and that leads a lot of people to say DynamoDB is great if your application will never change.
I don’t think that’s true. I’ve worked on a lot of different DynamoDB applications and they’ve all evolved over time in different ways, so I don’t think that’s true. What I think is true is that certain patterns are always going to be hard to model in DynamoDB, and sometimes those come up later and then you feel frustrated thinking this is too hard to do. So let’s talk about some patterns that are just always hard in DynamoDB before we actually move into schema evolution. The big ones are going to be if you have any aggregations around your table. DynamoDB doesn’t have native aggregation functionality. You’re going to have to write it in your application code. So if you have questions like "How many transactions has this customer done each month?" or "What’s the largest purchase done by customer Y?" it’s tricky for DynamoDB. You’re going to have to manage some of that yourself.
I think the more common one, and the one that comes up when people are saying "Hey, DynamoDB can’t evolve" is complex filtering needs. I say that’s when you’re filtering or sorting by two or more properties, all of which are optional. That’s really hard, right?
Let’s say you have just a list of records in a table view, and you want to show your users lots of different attributes. You want to let them choose which fields you’re filtering by and sort by different things. That’s really hard, right? If I go and say, "Hey, find me all the trips by this company that came out of the Omaha airport and maybe were over 500 miles and within this time range," this is going to be a really hard pattern to model in DynamoDB, even if you knew this on day one before you wrote any code for your table. This is a hard pattern to model.
You can’t do it easily. I’ve talked about complex filtering more over the last couple of years, so you can look at that. There are some ways to do it in DynamoDB. Sometimes you want to use something else like OpenSearch or ClickHouse or something like that. And sometimes you’re just like, "Hey, you know what, this would fit better in a different database." But complex filtering is a hard one to do. So this is true: certain patterns are always hard in DynamoDB, and if they come up later, they’re going to be hard. But that’s not because evolving DynamoDB is hard. It’s because this pattern is hard in DynamoDB.
Three Types of Schema Evolution: From Application-Only Changes to Data Backfills
So I do want to talk about more traditional schema evolutions that you do see and how you handle that in DynamoDB for access patterns that actually fit within DynamoDB. I’ll do that with just an example here, which is a support ticket application, right? Customers can come file support tickets, they get assigned to different users, and we’d have some sort of table like this with a partition key of our tenant ID. We have our different tickets all in this table that have our different attributes on them.
Now, as your application is evolving, I would say first you want to understand the type of evolution you’re performing. I think there are three main types of evolution that you’re going to see pretty commonly in your application, and the way you handle them is just a little bit different, right? So the first one is you might have a schema change that does not affect data access, right? You’re not fetching based on this schema change.
So if we go back to our support tickets here, maybe just on the left here, we’ve added these little badges based on the tier of the customer, right? Maybe they’re a platinum customer, maybe they’re gold. All we’re doing is helping our support agents understand what that customer tier is. But you see there’s no filtering on that customer tier or anything like that. It’s purely this little badge that we’re putting on there. So in this case, we’re adding a new but unindexed attribute. We’re not indexing it. And this is generally the easiest type of evolution to handle with DynamoDB, right?
We talked about how DynamoDB is schemaless, so that means you can just start writing new attributes to the table for new items as you want, right? So we have this new customer tier attribute that we’re starting to write to our item. Notice that some items don’t have them. Existing items might not have this customer tier attribute, and we’re okay with this, right? This is just like if you’re changing your SQL table to add a new column with a default value, but now that default value is probably going to be in your application code rather than in DynamoDB, right?
So this is the easiest type of evolution. What you want to do is update that schema in your application code. That’s going to be mostly where you handle that. Add default values and change your schema to handle that. So we talked about having that valid schema, that modeling meta goal, before. If we had our different ticket schema here, we might add this new customer tier attribute on the bottom. It has the different values it can be. It has a default for items that don’t have that particular attribute. Depending on how complex our schema change is, maybe now you need some versioning of different schemas. The first thing you do is detect that version. Maybe you have to parse the ticket differently based on what version it is and then normalize it into one schema. But you can mostly handle this within your application code.
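A minimal sketch of that with Zod, assuming a customerTier attribute: older items without the attribute pick up a default on read, so one schema covers both shapes.

```typescript
import { z } from "zod";

// Older tickets have no customerTier; newer ones always carry it.
// A defaulted field lets the application normalize both shapes into one schema.
// Attribute names and the "standard" default are assumptions for illustration.
const TicketSchema = z.object({
  tenantId: z.string(),
  ticketId: z.string(),
  status: z.string(),
  customerTier: z.enum(["standard", "gold", "platinum"]).default("standard"),
});
type Ticket = z.infer<typeof TicketSchema>;

export function parseTicket(item: unknown): Ticket {
  // Items written before the schema change simply pick up the default on read.
  return TicketSchema.parse(item);
}
```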
That being said, you can handle it completely within your application code. However, you might decide you actually want to backfill all your existing data, right? There are a couple of reasons for that. Number one is you might end up with schema bloat over time where you have twenty different versions of your schema, and it’s hard to reason about. Like, okay, if I have a V2 item, how does it get to a V16 item or something like that? So it might be easier rather than managing that. You just say, "Hey, I’m going to backfill my existing records and handle those." Or another thing is you might be exporting your data to external systems, OpenSearch for search or ClickHouse or S3 and Athena for analytics, right? And while you can handle the default values in your application for your OLTP stuff, now you also have to communicate all those values to whoever’s maintaining those systems, and it can be hard to deal with. So you might, just for long-term data hygiene reasons, say we’re going to backfill and update this new value on existing items. If that’s the case, now you’re out of this first type of evolution. You’re going to be into the third type we’ll talk about in a second. But at the very least, what you can do is handle this completely in your application code. That’s handling a schema change, adding new attributes, renaming attributes, things like that, that does not affect access. This is a mostly easy application-only change.
The second type is a new index on an existing attribute. If we look at our support tickets, at first we’re just showing them in a flat list with no filtering. This works well when we have 5 tickets, but over time we’re going to have 5,000 tickets or 500,000 tickets or 5 million tickets. Now we need a way to filter down to just my tickets. I’m a support agent, and I want to filter by assignee so I can say, just give me the tickets that I have. This goes back to our modeling and modeling meta goals—making it easy to operate on the right data when we need that data. So what we need is a new index on an existing attribute. The good thing is global secondary indexes can be added at any time. You can go in and add a new secondary index to your table, and DynamoDB is going to handle that for you. If we go back, you can see here we already have this assignee value. We can set that up as our partition key. We can use created at or ticket ID as our sort key. DynamoDB is going to do the work to backfill that for us, and now we can query from that accordingly.
The general process for this is number one, you add that index, whether that’s in your infrastructure as code tool or maybe you just do it directly in the AWS console. DynamoDB is going to kick off a backfill for you and basically scan all the existing items in your table and write and replicate them into your secondary index for you. Once that backfill is done, then you can start querying your secondary index. You can’t query it until that initial backfill is done, so that’s where you add the application access pattern to start reading from that index. This is a fairly easy, straightforward change—I want to access my existing data in different ways. DynamoDB is going to do that backfill of your new index for you, and it’s not particularly hard.
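A sketch of that process with the AWS SDK for JavaScript v3, assuming an on-demand table and illustrative names: add the index on the existing attribute, then poll until the backfill finishes and the index is ACTIVE before querying it.

```typescript
import {
  DynamoDBClient,
  UpdateTableCommand,
  DescribeTableCommand,
} from "@aws-sdk/client-dynamodb";

const client = new DynamoDBClient({});

// Step 1: add the index on the existing assignee attribute.
// Assumes on-demand billing, so no ProvisionedThroughput block is needed.
await client.send(new UpdateTableCommand({
  TableName: "SupportTickets",
  AttributeDefinitions: [
    { AttributeName: "assignee", AttributeType: "S" },
    { AttributeName: "createdAt", AttributeType: "S" },
  ],
  GlobalSecondaryIndexUpdates: [{
    Create: {
      IndexName: "AssigneeIndex",
      KeySchema: [
        { AttributeName: "assignee", KeyType: "HASH" },
        { AttributeName: "createdAt", KeyType: "RANGE" },
      ],
      Projection: { ProjectionType: "ALL" },
    },
  }],
}));

// Step 2: poll until the backfill finishes; the index can't be queried until it's ACTIVE.
// The backfill can take a while on large tables, so this is a slow loop.
let status = "CREATING";
while (status !== "ACTIVE") {
  await new Promise((resolve) => setTimeout(resolve, 30_000));
  const { Table } = await client.send(new DescribeTableCommand({ TableName: "SupportTickets" }));
  status = Table?.GlobalSecondaryIndexes
    ?.find((index) => index.IndexName === "AssigneeIndex")?.IndexStatus ?? "CREATING";
}
```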
The third type of evolution that you’ll go through commonly is when we need to make some change to existing data. As we talked about doing a backfill before, the example I came up with here is we have a lot of records with things like priority and status, but maybe what we want is to add this urgent button where I can filter down to just the most urgent tickets and things I want to handle there. If I click that button, it filters down these urgent tickets so I know what I want to get d