🦄 Making great presentations more accessible. This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.
Overview
📖 AWS re:Invent 2025 - Delighting Slack users safely and quickly with Amazon Nova and Bedrock (AIM384)
In this video, Gene Ting from AWS and Austin Bell and Shaurya Kethireddy from Slack discuss how Slack scaled its AI features to millions of users using Amazon Bedrock and Nova. Slack migrated from SageMaker to Bedrock, achieving over 90% cost savings (exceeding $20 million) while expanding from one to 15+ LLMs in production. The team developed an internal experimentation framework measuring quality through objective metrics, LLM judges, and Bedrock Guardrails. By switching to Nova Lite for search query understanding, they reduced P50 latency by 46% and costs by 70% with no quality regression. Overall, Slack achieved a 90% reduction in cost per monthly active user while increasing scale 5x and improving user satisfaction by 15-30% across features.
; This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
Introduction: The Evolution of Generative AI Applications at re:Invent 2025
Hey, good afternoon, everyone. I hope everyone's enjoying re:Invent 2025 so far this week. Thank you so much for coming today. Before we actually kick things off, I just want a quick show of hands: how many of you have already built a generative AI application to help some of your users out there, trying to make their lives more effective or more productive? Okay, a quick show of hands, don't be shy. That's a good showing, actually. It's really a testament to how dramatically the industry has transformed over the last couple of years.
If you think about it, it all started with some early research and development that actually led to the first set of frontier models that showed us, showed the entire world, the capability and the art of the possible for letting users interface with the system in their own natural language and creatively responding in kind. With that, it all led to a whole lot of tools today that not only could actually go and search for information for you when you can’t answer the question yourself, but it generates new rich media content based on a set of descriptions or captions that you feed it. It can even generate brand new code for you that can either just prototype a new concept you want to share with people, or you want to fill a gap or do some small repair on your applications. That’s how quickly the industry evolved in the last couple of years.
But I think everyone who said they've deployed something out there has probably also seen all the challenges involved in getting that type of application to a production level of quality you feel comfortable giving to your users over time. So with that, my name is Gene Ting. I'm a Principal Solutions Architect at Amazon Web Services, and joining me today are Austin Bell and Shaurya Kethireddy from Slack, as we talk about their journey of overcoming some of those challenges so that they can delight millions of Slack users around the world with Amazon Bedrock and Nova.
Production-Ready Generative AI: Key Design Considerations and User Requirements
So to start things off, making a generative AI application production worthy follows steps similar, at a high level, to what you would have thought about in the past with more traditional applications. You're going to have to think hard about who you're building for and the types of personas they are, which defines the user requirements and the functions you're going to design your application around. Once those decisions are made, you're going to have to think about reliability, operational excellence, and cost management, especially as your application grows over time. You always want to make sure you can protect the application from harm and protect your users' sessions and their data.
But what's new, and an interesting nuance for generative AI applications, is how to make sure that your system responds to your users in a safe manner and only handles the types of workflows and requests that you designed the system for. So some key factors to think about when you're defining your user requirements and designing a generative AI application include a couple of facets. Sometimes you're going to build functions through static LLM workflows. Sometimes you're going to want to see how you could apply more dynamic agentic workflows. But the common thing is that they all involve different levels of complexity based on the task you're asking the system to accomplish.
Sometimes you want to respond instantaneously or as quick as possible in real time, such as answering a search phrase and giving back a rich summarization of search results. Sometimes you want to take a whole lot of conversational threads and text and summarize that on a daily basis. Those are more latency insensitive types of workflows over there. But the main thing is also, just like traditional applications, you always have to be cognizant and aware around latency sensitivity for your users based on the type of use case that’s being considered at the time.
This is where Amazon Bedrock comes in. When we talked about different complexity for different tasks, you have a lot of selections of models within Amazon Bedrock to choose from to actually be able to accomplish those tasks.
Having that selection and making those choices deliberately helps you see which models are best for the type of use case you're trying to accomplish, and it also helps with managing costs over time. Similarly, when figuring out whether you need a real-time or asynchronous response, you also have choices with Bedrock. You can choose between provisioned throughput, priority tiers, or just the standard tier to answer most questions. You can defer some inference to batch inference or to the flex tier if you're not in such a rush to respond to those questions. Making those choices also helps with managing costs.
As your application matures and improves over time, you want to have a strategy for how you're going to upgrade and replace those models, either upgrading to a new version of the same model or switching to a different model that can accomplish the same task. You also have to have a way to optimize the model responses iteratively over time. This is where Amazon Bedrock prompt management and prompt tuning come into play. You can use LLM-as-a-judge and catalog a set of prompts so you can apply them to different models and compare them over time. You can use the prompt optimizer to get suggestions, submit different prompts and context, and see what the results are.
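To make that model-comparison idea concrete, here is a minimal sketch, in Python with boto3, of sending the same prompt to several Bedrock models through the Converse API so their answers and token usage can be compared side by side. The model IDs, prompt, and region are illustrative placeholders rather than anything Slack or AWS prescribes.

```python
import boto3

# Hypothetical shortlist of candidate models; substitute IDs enabled in your account.
CANDIDATE_MODELS = [
    "amazon.nova-lite-v1:0",
    "amazon.nova-pro-v1:0",
    "anthropic.claude-3-5-haiku-20241022-v1:0",
]

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def compare_models(prompt: str) -> dict:
    """Send the same prompt to each candidate model and collect responses plus token usage."""
    results = {}
    for model_id in CANDIDATE_MODELS:
        response = bedrock.converse(
            modelId=model_id,
            messages=[{"role": "user", "content": [{"text": prompt}]}],
            inferenceConfig={"maxTokens": 512, "temperature": 0.2},
        )
        results[model_id] = {
            "text": response["output"]["message"]["content"][0]["text"],
            "input_tokens": response["usage"]["inputTokens"],
            "output_tokens": response["usage"]["outputTokens"],
        }
    return results
```

From here, the collected outputs can be fed into whatever comparison you run, manual review, an LLM judge, or simple token-cost math, to decide which model fits the task.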
Amazon Bedrock: Operational Excellence, Security, and Safety at Scale
So after all those fundamental application decisions are made, you want to think about how you’re going to operate and basically run at scale. Starting with a choice around which region you’re going to pick to actually host your application based on your user requirements or regulatory needs, that’s going to define where you’re going to start your interactions with Bedrock. You want to establish a monitoring strategy immediately. From there, you’ll be monitoring for time to first token for your request. You may be using Amazon CloudWatch metrics to actually view end-to-end latency, how often you’re getting errors, and how often you’re throttling. You want to use CloudWatch logs, for example, or your own logging system to actually see how the agents are responding or what the LLM is doing on the back end.
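As a rough illustration of that monitoring point, the sketch below streams a response with the Converse streaming API, measures time to first token and end-to-end latency, and publishes both as custom CloudWatch metrics. The metric namespace and names are hypothetical; a real dashboard would also track error and throttle counts.

```python
import time
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

def timed_stream(model_id: str, prompt: str) -> str:
    """Stream a response, recording time to first token and end-to-end latency."""
    start = time.monotonic()
    first_token_at = None
    chunks = []

    response = bedrock.converse_stream(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    for event in response["stream"]:
        if "contentBlockDelta" in event:
            if first_token_at is None:
                first_token_at = time.monotonic()
            chunks.append(event["contentBlockDelta"]["delta"].get("text", ""))
    end = time.monotonic()

    if first_token_at is not None:
        # "GenAI/Latency" is a hypothetical namespace; use whatever your team standardizes on.
        cloudwatch.put_metric_data(
            Namespace="GenAI/Latency",
            MetricData=[
                {"MetricName": "TimeToFirstToken", "Value": (first_token_at - start) * 1000, "Unit": "Milliseconds"},
                {"MetricName": "EndToEndLatency", "Value": (end - start) * 1000, "Unit": "Milliseconds"},
            ],
        )
    return "".join(chunks)
```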
You want to be able to tune for performance, and that usually involves some sort of caching strategy. You can use explicit prompt caching on Bedrock, for example, to cache common and frequently used inputs so that you save time and cost on transforming that input into input tokens. And we talked about how Bedrock's model flexibility helps with cost management; it also very much helps with resiliency, so that if for whatever reason a model starts misbehaving or you have availability concerns around it, you have a backup model strategy in place to keep your application covered.
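Here is a hedged sketch of what explicit prompt caching can look like with the Converse API, assuming the chosen model supports prompt caching: a long, frequently reused system prompt is marked with a cachePoint block so repeated requests don't pay full input-token cost for it. The prompt text and model choice are placeholders.

```python
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# A long, frequently reused instruction block is a good candidate for explicit caching.
SHARED_SYSTEM_PROMPT = "You are an assistant that summarizes workplace conversations ..."  # placeholder

def summarize(model_id: str, conversation_text: str) -> str:
    """Reuse a cached system prompt; only the per-request conversation is fresh input."""
    response = bedrock.converse(
        modelId=model_id,
        system=[
            {"text": SHARED_SYSTEM_PROMPT},
            # Marks everything above as cacheable on models that support prompt caching.
            {"cachePoint": {"type": "default"}},
        ],
        messages=[{"role": "user", "content": [{"text": conversation_text}]}],
    )
    return response["output"]["message"]["content"][0]["text"]
```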
As your application really grows out over time, you are able to use other regions that have capacity that’s also hosting those models. With Bedrock cross-region inference, you can ask it to actually send your request to another region that’s hosting that model, whether it’s on the same continent or halfway across the world as well. Secured by design is always an essential part of productionizing your application. Starting with that basic blueprint we talked about before, the first thing that you can rest assured is that Bedrock has been designed so that model providers cannot access the environment that’s actually hosting their model. And on the flip side as well, we don’t by design share any of the inputs and outputs that are being fed in and coming out from the models that you use on Bedrock as well.
You’re going to apply IAM policies to make sure that only the specific parts and specific users in your organization and in your environment can actually access the models that you have already started leveraging on Bedrock. You’re going to use service control policies to make sure that only the specific models that your organization has approved are actually available within your infrastructure. You even have the added control with VPC endpoints so that reaching Amazon Bedrock through your infrastructure has to go through a specific path. And remember how we talked about using cross-region inference to be able to reach capacity wherever it is available on Bedrock. Based on your specific regulatory requirements or data sovereignty needs, you can even restrict that cross-region inference behavior to only certain geographic locations of your choosing as well.
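As an illustration of that least-privilege idea, below is a sketch of an identity-based policy, expressed here as a Python dict, that allows invoking only an approved set of foundation models. The policy name, model ARNs, and region are placeholders; an organization would typically pair this with service control policies and VPC endpoint policies as described above.

```python
import json
import boto3

# A sketch of an identity-based policy limiting invocation to approved models.
# Model ARNs, region, and the policy name are illustrative placeholders.
approved_model_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowApprovedBedrockModelsOnly",
            "Effect": "Allow",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream",
            ],
            "Resource": [
                "arn:aws:bedrock:us-east-1::foundation-model/amazon.nova-lite-v1:0",
                "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-5-haiku-20241022-v1:0",
            ],
        }
    ],
}

iam = boto3.client("iam")
iam.create_policy(
    PolicyName="BedrockApprovedModelsOnly",  # hypothetical policy name
    PolicyDocument=json.dumps(approved_model_policy),
)
```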
So coming back to this nuance that we have around making sure your application responds in a safe manner and only handling things that you have specifically designed for, let’s talk about the basic request-response workflow. Looking at the user input side of things, you could apply Amazon Bedrock Guardrails to make sure that you could filter out certain keywords. You could basically deny dangerous content. You could actually pick other types of custom topics that you don’t want the system to handle. And if any of those policies are triggered, you can have Bedrock respond back with a custom moderated response.
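A minimal sketch of that input-side check, assuming a guardrail has already been created in Bedrock: the ApplyGuardrail API evaluates the raw user text before any model call, and if the guardrail intervenes, the configured moderated message is returned instead. The guardrail ID and version are placeholders; the same call with source="OUTPUT" can be used on model responses.

```python
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def check_user_input(text: str) -> tuple[bool, str | None]:
    """Run a standalone guardrail check on the raw user input before inference."""
    result = bedrock.apply_guardrail(
        guardrailIdentifier="gr-EXAMPLE1234",   # placeholder for your guardrail ID
        guardrailVersion="1",
        source="INPUT",
        content=[{"text": {"text": text}}],
    )
    if result["action"] == "GUARDRAIL_INTERVENED":
        # The guardrail's configured moderated message comes back in outputs.
        moderated = result["outputs"][0]["text"] if result["outputs"] else "Request blocked."
        return False, moderated
    return True, None
```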
On the other end of the spectrum, after your initial inputs have been vetted and then fed into the model, you can have Bedrock Guardrails again apply those same types of policies and add additional automated reasoning checks for correctness before sending the response back to the user. So now that we’ve talked about a basic blueprint on how to go from ideation to production, let’s see how one of our biggest customers, Slack, applies some of these principles and rigor so that they can actually deliver Slack AI effectively, securely, and efficiently. All right, so take it away, Austin.
Slack AI: Delivering Generative AI Features Across the Product
Great, thank you for that introduction, Gene, and thank you all for coming. We are going to spend this time talking about how we at Slack are able to deliver our Slack AI features effectively, securely, and efficiently. Specifically, our goal here is to be able to enable generative AI to permeate throughout the Slack product, and it’s very easy to get to that 80% of a generative AI feature. But what really sets that moat and sets you apart is that extra 20% of becoming more cost efficient or having higher quality. We’re going to spend this presentation going through our journey of how we were able to meet our scale and meet our quality needs here at Slack.
Specifically, we’re going to talk about three different areas. We’re going to discuss how we developed our infrastructure layer that allowed us to meet our scaling needs in a way that was highly secure and met all of our compliance requirements. We’re going to talk about how we developed an internal experimentation framework that allowed us to objectively measure the quality of our generative AI outputs and give us the confidence that we are actually tangibly moving the needle in that quality. Lastly, we’re going to talk about how those two different sections come together to actually enable us to integrate generative AI more seamlessly across the product.
But first, a little bit about who we are. My name is Austin Bell. I am a director here at Slack responsible for machine learning, search, and AI, and I’m joined by my colleague. Hey everyone, my name is Shaurya. I’m an engineer here at Slack working on our infrastructure and platform team. So just to level set a little bit, I’m going to assume some familiarity with what Slack is, but for anybody who’s not aware of what our Slack AI offering is, we support over a dozen different generative AI features and tools that span the spectrum of complexity and use cases.
For example, we offer the ability to do AI summaries across a variety of different product surface areas. We have our in-house QA systems where you can ask questions and find information related to the data that you house within Slack. We have the ability to create daily digests to get up to date very quickly on everything that’s going on. These are just a small number of examples, and we continue to add more on a very regular basis. Now for our first topic, I’m going to pass it over to my colleague Shaurya, who’s going to discuss how we have built our infrastructure layer here at Slack.
Building Trust at Scale: Slack’s Product Pillars and Infrastructure Challenges
Thank you, Austin, for the introduction. Hi everyone, I'm here to talk about how we scale Slack AI to the millions of daily active users that we have. Generally when we talk about generative AI, the conversation is about speed: how fast tokens are coming, how fast the product is moving, and how fast the market is moving. But here at Slack, we also add another factor: trust. We ensure that the services we use meet the bar we need to protect our customers' data and to preserve their trust in us, so they can keep working within our platform.
One quote that we usually go by is the real test of scalable infrastructure isn’t just how fast it grows but also how well it protects what matters as it grows. As you can imagine, a lot of people have all their working content within Slack. We want to ensure that whichever features we do add enable them to keep working with Slack while not having to add any roadblocks in there.
Before going into the technical aspect of how we are scaling Slack AI, I'd like to set a little bit of context about the product promises we make for our features. The first pillar is trust. When we started scaling out Slack AI, a question we got a lot from customers was: is our data being used to train generative AI models? The answer is no. We don't train generative models on customer data, and we don't log customer data. We also allow the admins of Slack workspaces to opt in to or out of individual features. We have zero retention of the data, and the data sent to the LLM providers is not shared; as Gene mentioned, the inputs and outputs are not shared.
The second pillar is security. We operate in a FedRAMP Moderate compliance space, so we ensure that the services that we use meet that rigorous standard set by the federal guidelines. We also make sure that the services that we do use stay within our trust boundary as there’s a lot of data being moved around. We also have technical access controls, so this means that, for example, if there’s a message in a private channel, we don’t want people who are not in the channel to have access to it. We ensure that there’s security of the messages and even the answers being shared.
The third pillar I’d like to cover is reliability. Having trust and security is not enough. We also need to have the features be highly available. This is what enables the customers to actually use it in their daily process. On top of that, we also want to have contextual relevance. This means that whenever a customer is sending a request, we want to make sure that the answer they’re getting is actually relevant to what they’re asking for, and transparency as well. This means that whenever they do get an answer, we’re able to add citations wherever possible so the customers can backtrack to see the main message that the LLM may have used to generate that particular answer.
Having all these pillars may be pretty easy when you have 10 users, but let's take a look at how Slack is operating right now. Every week we're processing on the order of 1 to 5 billion messages, 100 to 500 million files, and 1 to 5 billion searches. As you can see, Slack operates at a massive scale. We can't just plug our AI features into a public API and hope for the best. We need to ensure that the services we use are properly tested behind the scenes and can handle our millions and billions of requests without breaking a sweat, while maintaining our security and trust postures.
Now that we've set the context of the product and the scale of Slack, let's start with the journey of how we used to serve Slack AI from mid-2023 up until mid-2024. At the time we had limited LLM options, high costs, and low flexibility, and we were running in a provisioned throughput environment. From there, we have moved to a state with high flexibility, high utilization efficiency, and increased reliability.
From SageMaker to Amazon Bedrock: Slack’s Migration Journey
When we first started in mid-2023, we were looking for services that Amazon offered within the FedRAMP Moderate compliance space that also complied with our security and trust postures, so we started with SageMaker. For context, Slack mostly operates within the US East One region, and we have our VPC, which contains the Slack instance. From there, the Slack AI requests go through the VPC endpoint, which basically allows the request to travel over AWS internal networking, and on to the SageMaker endpoint, which is basically the wrapper around the model. When we first started, we only had one model, and this model was served in a provisioned throughput manner. You may also see a small box in our VPC
called the concurrency checker, and this is what we use to maintain load. At times of peak load, we were able to shed the lower priority requests. For example, we have three priorities internally: the highest priority, which is the most latency sensitive; the medium priority, which can be done within a five to ten minute SLA; and our third tier priority, which is like our batch jobs which run overnight. During times of high concurrency when we have our cluster pretty much at peak utilization, instead of having to scale up, which could take more than an hour at that time, we were able to load shed pretty quickly.
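To illustrate the idea behind that concurrency checker, here is a small, hypothetical sketch of priority-based load shedding: each tier has a utilization threshold above which its requests are shed or deferred rather than scaling the cluster. The tier names and thresholds are assumptions for illustration, not Slack's actual values.

```python
from enum import IntEnum

class Priority(IntEnum):
    REALTIME = 0   # most latency-sensitive, effectively never shed
    STANDARD = 1   # can be done within a 5-10 minute SLA
    BATCH = 2      # overnight batch jobs

# Illustrative thresholds: how full the cluster can be before each tier is shed.
SHED_THRESHOLDS = {
    Priority.BATCH: 0.70,     # shed batch work first
    Priority.STANDARD: 0.90,  # then medium-priority work
    Priority.REALTIME: 1.01,  # realtime traffic is never shed
}

def should_shed(priority: Priority, current_utilization: float) -> bool:
    """Decide whether to reject or defer a request instead of scaling up."""
    return current_utilization >= SHED_THRESHOLDS[priority]

# Example: at 85% utilization, batch requests are shed while realtime traffic proceeds.
assert should_shed(Priority.BATCH, 0.85) is True
assert should_shed(Priority.REALTIME, 0.85) is False
```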
Though this worked during the initial days of Slack AI while we were still scaling up, Slack AI adoption was also growing exponentially across the customer base, and we noticed a couple of problems. The first was peaky traffic. We had two different types of requests coming in: the time-sensitive ones and the batch workloads. These are fairly predictable day to day, but they have very different throughput profiles. It was also difficult to obtain GPUs at the time. It was a GPU crunch, so it was taking us weeks to get the GPUs we needed to scale up for the new customers being added to Slack AI. Because of the GPU crunch, we weren't able to scale up and down easily; we had to hold our GPU instances in on-demand capacity reservations so that we could keep them through the non-peak times.
The impact of these two problems was that we were overprovisioned for the majority of the day, and this was because of having to keep our GPUs up and running. Because of this, we had our infrastructure scaled to support the peak traffic, which isn’t ideal as this basically causes cost inefficiency. We’re paying for instances and GPU time which we aren’t actively using to serve customer requests. Because of this fixed cost that we have, we weren’t able to diversify our LLMs. We only had one or maybe even two, but this also slowed us down when experimenting with new models or even adding new features.
Keeping these problems in mind, we had a vision of where we wanted to go. We wanted a new service that is managed but also gives us a diversity of models. This is when we came across Bedrock. In mid-2024, Bedrock became FedRAMP Moderate compliant, and we were able to take a more serious look at the service. We could serve all these requests within our trust boundary, and Bedrock also promises that the inputs and outputs don't get shared with the providers, so they can't train on them. Bedrock also has a whole collection of frontier models, which enables our product engineers to add more features. With Bedrock, we also see models getting added to the model registry at a faster rate; often within a day of being published, new models are available on the platform.
Once we set Bedrock as the service to migrate to, we followed a series of migration steps to ensure a very reliable migration with no downtime while keeping customer trust. The first step was to understand the Bedrock infrastructure, where the two options were provisioned throughput and on-demand. Because we were already running provisioned throughput in the SageMaker world, we decided to keep the same in Bedrock to make the migration easier and handle the on-demand transition at the next stage. The second step was internal load testing and compute calculations: scientifically figuring out the equivalent in Bedrock model units of the SageMaker compute we already had, so that we could get the equivalent compute without having to guess and check.
Here's an example of one of our load tests. We ran Claude Instant and Claude Haiku on both SageMaker and Bedrock. For Claude Instant it was a one-to-one mapping: running on a P4d in SageMaker was equivalent to one Bedrock model unit. For Claude Haiku on P5 instances, two P5s were equivalent to one Bedrock model unit. Using these ratios, we derived the equivalent compute, and we worked with the Amazon Bedrock service team to get that delivered to our account.
Once we got that compute delivered to us, we started running shadow traffic. This means that whenever a request was sent to SageMaker, we sent a duplicate request to Bedrock so that we could get an understanding of the internals of the service. This also helped us build out our monitoring dashboards, so we could verify the latencies, the time to first token, and the other metrics we had in place.
Once the shadow traffic had been running at 100% for two weeks, we started the full cutover process. This was done in stages: we started with 1%, then 5%, 10%, and finally 100%. That essentially means that instead of shadow mode, we were now serving the response from the Bedrock service rather than the SageMaker service. Going to Bedrock helped us save money and let us experiment more, but we noticed a couple more gaps we could work on to make the service more efficient.
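The sketch below illustrates the general shape of that shadow-traffic and staged-cutover pattern: while SageMaker still serves the response, a duplicate request is mirrored to Bedrock off the hot path, and a cutover percentage gradually shifts live traffic. The function names, flag, and mechanism are illustrative; Slack's actual routing layer is not public.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def call_sagemaker(payload: dict) -> str:
    """Placeholder for the existing SageMaker endpoint invocation."""
    raise NotImplementedError

def call_bedrock(payload: dict) -> str:
    """Placeholder for the equivalent Bedrock Converse call."""
    raise NotImplementedError

_shadow_pool = ThreadPoolExecutor(max_workers=8)

# Staged cutover percentage, e.g. driven by a feature flag: 1 -> 5 -> 10 -> 100.
BEDROCK_CUTOVER_PERCENT = 5

def handle_request(payload: dict) -> str:
    if random.uniform(0, 100) < BEDROCK_CUTOVER_PERCENT:
        # Serve this slice of live traffic from Bedrock.
        return call_bedrock(payload)

    # Serve from SageMaker, but mirror the request to Bedrock off the hot path
    # so latency, time to first token, and errors can be compared on dashboards.
    _shadow_pool.submit(call_bedrock, payload)
    return call_sagemaker(payload)
```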
Achieving 90% Cost Savings: Transitioning to Bedrock On-Demand and Cross-Region Inference
The first was that we weren't able to scale Provisioned Throughput. It was still a static compute cluster, and because of that we were still experiencing some cost inefficiencies. On the platform side, we also realized it would be helpful to add backup models. During times when Bedrock models had regressions, or we noticed some features not performing as expected, backup models let us reroute requests without having to make any code changes during incidents.
We also added emergency stops for particular features and models. This goes hand in hand with the backup models: whenever an emergency stop is turned on, a whole feature can be turned off, and if a model is turned off on our side, all of its requests are rerouted to our backup models. Using Amazon Bedrock also exposed us to the different capabilities the models offer, for example tool use, prompt caching, guardrails, and other features, and we expanded our internal AI platform to give our developers access to these.
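As a rough sketch of how backup models and emergency stops can be wired together, the configuration below routes a feature to its primary model unless a kill switch is flipped, in which case requests fall back to the backup model or the feature is disabled entirely. The feature name, model IDs, and config shape are assumptions; in practice this would be driven by a live configuration store rather than hard-coded values.

```python
from dataclasses import dataclass

@dataclass
class FeatureModelConfig:
    primary_model: str
    backup_model: str
    feature_stopped: bool = False   # kill switch for the whole feature
    primary_stopped: bool = False   # kill switch for the primary model only

# Illustrative config; in practice this would come from a live config store.
CONFIGS = {
    "channel_summary": FeatureModelConfig(
        primary_model="anthropic.claude-3-5-haiku-20241022-v1:0",
        backup_model="amazon.nova-lite-v1:0",
    ),
}

def resolve_model(feature: str) -> str | None:
    """Pick which model serves a feature, honoring emergency stops."""
    cfg = CONFIGS[feature]
    if cfg.feature_stopped:
        return None                 # feature is turned off entirely
    if cfg.primary_stopped:
        return cfg.backup_model     # reroute during an incident, no deploy needed
    return cfg.primary_model
```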
Because we couldn't scale Provisioned Throughput, we decided to move toward the On-Demand world. This is a different infrastructure type: it is based on quotas rather than bare instances, granted on a tokens-per-minute, input-tokens-per-minute, and requests-per-minute basis. We also had a lot of metadata on our side from the Provisioned Throughput era, so we were able to calculate the requests per minute and tokens per minute from that metadata and pass the compute request to Amazon Bedrock to get those quotas delivered to our accounts.
Similar to how it was with the SageMaker world, we had our Slack app within the US East One region. The VPC holds the Slack instances. The request goes to the VPC endpoint which does all the internal routing, goes to the Bedrock service, and then based on the model ID that we add to the request, it will point it to the particular model. One difference here from the SageMaker architecture diagram is instead of a concurrency checker, we had a requests per minute and tokens per minute checker on our side. This enabled us to essentially keep our different features under control, and this means we’re able to isolate certain features from taking over our entire Bedrock cluster.
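Here is a hypothetical sketch of that requests-per-minute and tokens-per-minute checker: each feature gets its own one-minute budget so a single feature cannot consume the whole Bedrock quota. The limits shown are illustrative, not Slack's real quotas.

```python
import time
from collections import defaultdict

class FeatureRateLimiter:
    """Per-feature requests-per-minute and tokens-per-minute budgeting.

    Limits are illustrative; real budgets would be derived from the Bedrock
    quotas granted to the account.
    """

    def __init__(self, rpm_limit: int, tpm_limit: int):
        self.rpm_limit = rpm_limit
        self.tpm_limit = tpm_limit
        self.windows = defaultdict(
            lambda: {"start": time.monotonic(), "requests": 0, "tokens": 0}
        )

    def allow(self, feature: str, estimated_tokens: int) -> bool:
        w = self.windows[feature]
        now = time.monotonic()
        if now - w["start"] >= 60:                      # roll the one-minute window
            w.update(start=now, requests=0, tokens=0)
        if (w["requests"] + 1 > self.rpm_limit
                or w["tokens"] + estimated_tokens > self.tpm_limit):
            return False                                # shed or queue instead of hitting throttles
        w["requests"] += 1
        w["tokens"] += estimated_tokens
        return True

limiter = FeatureRateLimiter(rpm_limit=600, tpm_limit=400_000)
if limiter.allow("daily_digest", estimated_tokens=3_000):
    pass  # proceed with the Bedrock call
```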
A different benefit that we also got from Bedrock On Demand is now we’re able to use US cross-region inference profiles. Because we are a FedRAMP moderate shop, we needed to keep it within the US boundaries, and Bedrock provides that ability. By going to US West Two as well, we’re able to get our compute delivered to us at a much faster rate as we have two regions to choose from.
With all of this, we had a bunch of wins at the end of the migration. When we first started, we had one LLM to choose from; by the end of the migration, and in our current state, we're experimenting with and serving more than 15 LLMs in production. We also have increased reliability. Because of the higher flexibility of LLMs, we can fall back to certain models, quickly switch models during incidents, and experiment to find the best model for quality and cost. On top of that, our biggest win was utilization efficiency.
During the whole migration, we were still experiencing an exponential increase in Slack AI usage as customers onboarded. But even with that increased customer base, we saw greater than 90% savings, which in dollar value is more than $20 million. So to close it off, we started with a quote saying it's not just about how fast you can scale infrastructure, but about how well you protect your data as you scale. By using the services that Amazon offers, we were able to deliver all of our AI features at scale, collaborating with our internal cloud and platform teams as well as with the AWS side.
Measuring What Matters: Slack’s Internal Experimentation Framework for Quality Evaluation
Thank you, and now I'll pass it to Austin to talk about how we're able to serve it with high quality. Thank you for that, Shaurya. We're going to switch gears here a little bit and talk about how we've developed an internal experimentation framework that allows us to more objectively measure the quality of our generative AI outputs so we can actually start to move the needle on that quality. Now this is a quote from an engineer at Slack, and I don't know if you've ever been in this situation, but you demo a generative AI feature, you get it looking really nice, and you're really excited. Then it comes time to productionize it, and you just have to refine a couple of edge cases in terms of quality. You make prompt changes, you make pipeline changes, and you start to ship. But after a few days, you notice you've actually regressed in certain areas. This turns into a cycle, a bit of a whack-a-mole situation, where fixing one issue leads to another, with no actual quality improvement over the course of weeks.
So why is this the case? Well, evaluating generative outputs is difficult. We are, or have been, in a bit of a paradigm shift from classical machine learning where in online settings you could leverage engagement metrics to evaluate quality, or in offline settings you could leverage things like precision or recall. The outputs of generative AI are significantly more subjective. What I consider a good output may be very different than what you consider a good output. It may also be dependent on the product surface area that you actually want to display this generative AI output. So the question becomes, how do you actually start to measure this in a way that meets your goals and actually continues to allow you to improve?
Now a key thing that we believe here at Slack is that you can really only improve what you have the ability to measure. So we first went on a journey of defining what we wanted to measure. We defined it by two pillars: quality, which we refer to as whether the answer gives you what you actually wanted and whether it is accurate, and safety, whether we are fostering the environment we want at Slack and ensuring that your data is as secure as it possibly can be. Within each of these, we broke things down further. On the quality side, we broke it down into two separate categories.
First, objective measurements. This is what we consider more of our deterministic outputs, things that are somewhat table stakes: being able to render it properly, being able to parse a JSON output or an XML output. Are we formatting IDs correctly? Are we formatting links correctly so you can navigate Slack properly? These are things that, if we don't have them, the user is going to notice, and the content of your generative output doesn't actually matter because they won't be able to look past those issues. On the harder side, we have the subjective measurements. These are things like factual accuracy: are you in fact telling the truth based on your grounded context? Answer relevancy: are you answering the question that the user actually asked? Attribution accuracy, which is a big problem for us at Slack: are we attributing the correct content to the correct user or to the correct file?
On the safety side, we also broke this down into two separate categories: harm and security. Harm refers to what we measure around toxicity. Are we capturing bias? Are we ensuring a good environment, with our LLM responding in a way that aligns with our values? Second, security: are we protecting against prompt injection attacks to the extent that we can? Are we protecting against search poisoning, whether it happens unintentionally or is deliberately intended as search poisoning?
Now that’s step one, defining what you actually want to measure. The next part is actually generating the evaluators that will give you the ability to actually measure this. This has been a journey starting at the beginning. We wanted to do a few things. We wanted to ensure that our product engineers had the ability to very quickly manually review the outputs across small data sets based on their prompt changes or their pipeline changes. We also wanted to introduce a variety of automated programmatic metrics that allowed us to capture in an automated fashion these objective measurements within the quality pillar, things that allow us to ensure that we’re not regressing on formatting or rendering capabilities that we consider table stakes here at Slack.
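To give a flavor of those automated programmatic metrics, here is a small sketch of deterministic "table stakes" checks, does the output parse as JSON, are user mentions and links well formed, run over every candidate output. The regular expressions are simplified stand-ins for Slack's real formatting rules, not the actual checks the team uses.

```python
import json
import re

# Illustrative patterns; Slack's real ID and link conventions may differ.
MENTION_RE = re.compile(r"<@U[A-Z0-9]+>")
LINK_RE = re.compile(r"<https?://\S+?(\|[^>]*)?>")

def objective_checks(raw_output: str) -> dict[str, bool]:
    """Deterministic, table-stakes checks run over every model output."""
    results = {}

    # 1. Does the output parse as the JSON structure the pipeline expects?
    try:
        json.loads(raw_output)
        results["parses_as_json"] = True
    except ValueError:
        results["parses_as_json"] = False

    # 2. Is every user mention fully formed (no dangling "<@U..." fragments)?
    results["mentions_well_formed"] = (
        raw_output.count("<@") == len(MENTION_RE.findall(raw_output))
    )

    # 3. Is every link fully formed so it renders and navigates correctly?
    results["links_well_formed"] = (
        raw_output.count("<http") == len(LINK_RE.findall(raw_output))
    )
    return results
```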
We’re increasing complexity in this journey, and this is where we are today here at Slack. We start to tackle some of those more complex definitions of quality. We leverage LLM-based quality metrics to be able to measure factual accuracy or to be able to measure answer relevancy. We leverage guardrails to capture safety and harm issues. The overall goal here is that by combining these automated programmatic metrics alongside LLM judges and guardrails, we can start to evaluate the quality of our generative AI on much larger data sets that are much more representative of production. This gives us the ability to run much larger scale experiments and tests.
Now where do we go from here? Well, the goal is to essentially start to develop CI/CD for generative AI, giving us the ability to define verified outputs where we can automate a series of tests so we can capture regressions and lead to quality improvements in a much quicker and more efficient way. Let me take a step back here and talk a little bit about some of the things that I just mentioned. At Slack, we run dozens of different task-specific LLM judges. We work with each product team that develops AI features here at Slack to come up with a rubric on what actually is the definition of quality for their particular feature. This changes for every single feature. Giving the ability to our engineers to only define a rubric without actually having to write any code to deploy these allows us to get these out very quickly.
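The sketch below shows one way a rubric-only, task-specific LLM judge can be implemented on Bedrock: a product team supplies a rubric, and a judge model scores each output against it, returning structured scores. The rubric text, judge model ID, and JSON schema are illustrative assumptions, not Slack's internal implementation.

```python
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# A hypothetical per-feature rubric; each product team would author its own.
SUMMARY_RUBRIC = """
Score the candidate summary from 1-5 on each dimension:
- factual_accuracy: every claim is supported by the source messages
- answer_relevancy: the summary addresses what was asked
- attribution_accuracy: statements are attributed to the correct people
Return JSON with keys factual_accuracy, answer_relevancy, attribution_accuracy.
"""

def judge(source_messages: str, candidate_output: str,
          judge_model: str = "amazon.nova-pro-v1:0") -> dict:
    """Ask a judge model to grade one output against a team-authored rubric."""
    prompt = (
        f"{SUMMARY_RUBRIC}\n\nSource messages:\n{source_messages}\n\n"
        f"Candidate output:\n{candidate_output}\n\nRespond with JSON only."
    )
    response = bedrock.converse(
        modelId=judge_model,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"temperature": 0.0, "maxTokens": 300},
    )
    # A production harness would validate or retry if the judge returns non-JSON text.
    return json.loads(response["output"]["message"]["content"][0]["text"])
```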
We also leverage Amazon Bedrock Guardrails, which gives us the ability to measure toxicity, harm, and prompt injection in a very easy way on both inputs and outputs of our LLMs. So we’ve highlighted two steps here. We first defined what we wanted to measure, and then we defined how we measured it. But what really allows our engineers to be productive is developing the workflow that allows them to utilize these capabilities in their day-to-day development experience. If you have a machine learning background, this may seem somewhat familiar to you.
We first do a series of offline experimentation. We start on what we call golden sets. These are verified outputs from our internal data that allow users to make prompt changes to their LLMs and perform manual review on small data sets of 10 to 20. If they feel confident with that, they can move to running their experiments on what we call validation sets. These are much larger in size and are much more representative of our production data. This way, using the combination of these automated programmatic metrics as well as these quality metrics, they can look to capture large-scale regressions as well as ensure that they’re actually making the quality improvement that they intended to.
Now a key thing here is that at each step, we want to provide the fastest feedback loop possible. This gives our engineers the ability to fail as quickly as possible, go back to the drawing board, and ship features more quickly. While our goal here is to ultimately make this as automated as possible, leveraging human in the loop is key to this. We aren’t perfect in a lot of our development of these evaluators, so giving our engineers the ability to actually see the impact of their changes and the LLM response very quickly across data will allow them to move significantly faster.
Now after you've validated a lot of this in your offline setting, we move into online evaluation. This is where we start to run A/B tests, and we integrate all of the evaluators I've mentioned into those tests as well, so you can measure across both quality metrics and user feedback metrics before you actually make the decision to roll out to production. The key thing here is that you have the confidence that you are making the quality change, and that you are not regressing in certain areas, before too many of your users actually see it.
Stepping back, I'm going to talk about a few things that I just mentioned as part of that. We have three different types of data sets that we typically operate with. Golden sets, which have been manually vetted, are very small, ranging from 10 to 50 samples, and give our engineers the ability to manually review the outputs. Validation sets typically range between 100 and 1000 samples. This is where those automated quality and programmatic metrics really start to shine; as you can imagine, reviewing, say, 1000 samples is essentially impossible for a human. Then there's A/B testing, where we typically run these experiments on somewhere between 1 and 25% of Slack AI queries for that particular feature.
That’s all nice, but is it worth it? I’ve just picked three examples here that showcase some of the different areas that we look to tackle here at Slack. On the prompt engineering side, we recently changed how we serialize the content that we send to the LLM, resulting in a 5% and 6% improvement in both factual and user attribution accuracy respectively. We oftentimes run model upgrades when a new version of a model or a new LLM comes out. A recent upgrade resulted in an 11% increase in user satisfaction, as well as a 3 to 5% increase in key quality metrics. Now we run that flow in a much more automated fashion for every single LLM upgrade that we do because we’ve actually seen a new version lead to regression and decided to not roll it out. Outside of quality, cost management is a key area for us. Sometimes you want to just maintain quality while reducing costs or improving latency. A recent change allowed us to move to a similar quality LLM, resulting in a 60% cost reduction.
Selecting the Right LLM for the Job: Search Query Understanding and Production Results
We’ve talked about two different things independently, how we built the infrastructure layer which gave us the ability to utilize our LLMs much more efficiently, choose between different LLMs, and gave us the ability to switch seamlessly. We then talked about how we developed this internal experimentation framework that allows us to actually measure the quality of our AI outputs and have confidence that the changes we want to make are the changes we are making. In this section I’m going to go through how we utilize both of those to more seamlessly integrate generative AI across the application in a way that is cost efficient while maintaining the quality and scale that our users expect.
To do that I'm going to walk through a couple of use cases, but before that I just want to highlight that there's a spectrum of generative AI complexity. On the low complexity side you have things that in the past may have been done by traditional machine learning models: classification, converting unstructured to structured data, parsing data. Jump to medium complexity and it's summarization, very basic image generation, content editing capabilities. Then on the high complexity side, a little more of what the media likes to talk about: agentic workflows, video generation, tool use, these bigger things. Here at Slack we have a large number of generative AI applications, and they span the spectrum from low complexity all the way to high complexity. You don't want to use your state-of-the-art frontier LLM for every single use case, but it's a tough question to answer: what exactly is the right LLM to use that will meet your goals without sacrificing quality for your users?
So we're going to step through a specific use case tackling this low complexity area that we refer to as search query understanding here at Slack. But first I'm going to revisit the scale slide that Shaurya showed, highlighting the search area. Now we run a lot of searches here at Slack. While the bulk of those may just
be finding a person, finding a channel, a fraction of those are highly complex searches where you want to ask questions to your data or run searches across a large number of messages. Even a fraction of this number can result in a high cost if you are not running efficiently.
So what is search query understanding? Let's say a user comes into Slack and adds the following search term: Can you find me the FY25 sales deck that John Doe sent me recently? Now we may not want to send that to our search cluster directly. It might not be helpful, but there's a lot of information hidden in there that allows us to better target our searches and improve the chances that we're actually finding the files or messages you're looking for. So we run it through a large language model pipeline that generates JSON including filters and other information that might be relevant to you.
Let’s break this down. We can see that the actual query you might want to search isn’t that full thing. Maybe it’s just FY25 sales PowerPoint deck. Now this is just one query, but oftentimes you run multiple queries in parallel, so there might be multiple versions of this search term. You used the term recently, so this gives us the indication that we might be able to condense this to a specific time range in the past three months to increase the chances of finding what we’re looking for. You said that you’re looking for a sales deck, and that likely includes a type of file like a presentation, so can we limit to only threads or conversations that include presentations. You also mentioned that it was John Doe that sent it to you, so let’s only look at conversations or threads that included him.
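Putting that together, here is a hedged sketch of a query-understanding call: a small model is prompted to rewrite the search into structured filters like query variants, a time range, file types, and senders. The prompt wording, JSON schema, and model ID are illustrative; the exact schema Slack uses is not public.

```python
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# Illustrative prompt; the real pipeline's schema and instructions are not public.
QUERY_UNDERSTANDING_PROMPT = """
Rewrite the user's search into structured filters. Return JSON with keys:
  "queries": list of rewritten search strings,
  "time_range_days": integer or null,
  "file_types": list of file types or [],
  "from_users": list of user names or [].
User search: "{search}"
Respond with JSON only.
"""

def understand_query(search: str, model_id: str = "amazon.nova-lite-v1:0") -> dict:
    """Turn a natural-language search into structured filters for the search cluster."""
    response = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user",
                   "content": [{"text": QUERY_UNDERSTANDING_PROMPT.format(search=search)}]}],
        inferenceConfig={"temperature": 0.0, "maxTokens": 300},
    )
    return json.loads(response["output"]["message"]["content"][0]["text"])

# For the example above, a plausible result might look like:
# {"queries": ["FY25 sales PowerPoint deck", "FY25 sales presentation"],
#  "time_range_days": 90, "file_types": ["presentation"], "from_users": ["John Doe"]}
```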
So what does this actually look like in our search pipeline? User sends a search. We have machine learning models that allow us to very quickly determine is this an informational search or is this a navigational search. If it’s a navigational search, you go through a very specific search and machine learning ranking pipeline. If it’s an informational search, we then go through our query understanding pipeline to try to better target your actual search before then going to the rest of our search retrieval machine learning and then eventually LLM response.
So where do we start here? We had a problem. The existing LLM that we were using for this use case worked. It met our quality goals, but it exceeded our search latency budget and was extremely costly given the scale we needed to operate at. So in this particular use case, we wanted to switch models and leverage prompt engineering in a way that maintained quality while simultaneously reducing latency and cost. The benefit is that we had both the infrastructure layer changes and the experimentation framework, so we wanted to see how we could use both to select the right LLM for this particular problem while having the confidence that we were meeting all three of those goals.
So I’m going to walk you through how this actually looks in our internal experimentation framework. This is just a small summary of what it might look like when we run an offline experiment. We compare the LLM, in this case Nova Lite, to the original LLM that we were using to determine if this is actually leading to the improvements that we want it to be. Most specifically we’re looking at latency as well as key quality metrics. This is a rolled up summary. We actually have dozens of different evaluators that then aggregate up into these summaries. Now this is run on just a very small sample set, but as you can see, we saw significant reductions in latency as well as improvements in quality on this small sample. Now with these offline tests we felt confident that we were actually making the improvements that we wanted to.
So we moved to an online evaluation. We started to run an A/B test, and these are just a couple of snapshots from that test covering the key metrics we were concerned about. Specifically, we wanted to look at latency as well as quality, in terms of both our automated quality metrics and user satisfaction. What we saw is that we significantly improved latency and had no significant change in either user satisfaction or our automated quality metrics.
So what does this mean? Well, for query understanding, we ultimately switched to Nova Lite, and the result was that we were able to do so after a series of prompt changes with no user-visible regression in quality, while simultaneously reducing P50 latency by 46% and reducing our cost to serve for this particular feature by 70%.
So what have we unlocked here? We have these two independent sections.
These infrastructure improvements gave us the ability to select the right LLM for each feature and to run our infrastructure at a much more efficient utilization. We also had our quality and evaluation system, which gave us the ability to perform more objective benchmarking and evaluation of the quality of our generative AI outputs. With this, we are able to select the right LLM for the job, maximize the time we actually spend prompt engineering while minimizing the engineering time that goes into it, run our infrastructure much more efficiently, and have confidence that when we communicate the improvements we are making to other teams and to leadership, those improvements are real.
We do this across dozens and dozens of different features. What has been the result of all these changes in the past year? We've seen a 90% reduction in the cost to serve Slack AI per monthly active user. Simultaneously, we've increased the scale that we operate at by nearly 5x. And we did all of this while increasing user satisfaction and user feedback by 15 to 30% across Slack AI's marquee features.
Great. Thank you all for taking the time to watch us today. Happy to answer any questions.
; This article is entirely auto-generated using Amazon Bedrock.