KV Cache, Paged Attention, Flash Attention, Batching, MQA, GQA & Parallelism techniques

A typical article on this topic might start off by explaining key innovations like KV caching, Paged Attention, Dynamic Batching, Flash Attention, MQA, GQA, etc. Instead, let us start by simply observing the LLM inference process more closely. If we do a good enough job, we will be in a position to "predict" the typical bottlenecks in the inferencing operation. Once we know what these bottlenecks are, we can discuss "general" strategies to fix these problems, and only then see what the industry solutions are. These solutions are surprisingly easy to grasp once the problems are well understood!

Let us start with the "LLM Inference" operation then. Basically, "inference" is everything that happens from the moment you enter your prompt until the LLM generates the output. There is a lot of material on how Transformers (the underlying architecture of LLMs) work & how LLMs generate output, but let us quickly summarize the general flow in plain English.

LLM Inference Process Explained

After the prompt is submitted, we enter what is called the Prefill stage. During this Prefill stage, the prompt is broken into tokens. This operation happens on the CPU as it involves string manipulation & dictionary lookups, which are not well-suited for the GPU. (Note: I sometimes use the terms "tokens" and "words" interchangeably while going with the flow. Likewise, "LLM" and "Model" are used interchangeably.)

The tensors (numerical representations) corresponding to these tokens are sent to the GPU, where the LLM model is eagerly awaiting them. The tensors move through the model's layers. This is the forward pass that we are familiar with. In simple terms, the input is transformed by each neural layer's weights & passed on to the next layer. The process repeats until the final layer, which outputs a probability distribution over the entire dictionary for what the "predicted word" could be. This probability distribution & the LLM settings are used to select a single word from the dictionary. This becomes the first LLM-generated word, and it ends the Prefill stage. The time taken for this stage is called the Time To First Token (TTFT).

In the Prefill stage, the GPU compute is well utilized, since operations like the key, query & value computations can be done in parallel for all the input tokens. Performance is therefore bound by the compute power available. Note: keys, queries & values are 3 different numerical representations of the tokens. The transformer architecture uses this kind of representation because it works neatly during the "attention" operations that happen inside its layers. This article assumes a basic understanding of the transformer architecture & the attention operation for understanding certain parts.

So far we have generated one word, and taken TTFT time to do so. Next comes the Decode stage. The rest of the words of the LLM answer are generated in this Decode stage in an auto-regressive way, which initially sounded like a derogatory term to me but actually just means that the output is generated one word at a time! So one can't parallelize this to generate the output all at once. The Decode phase works as follows: the model uses the first token generated after the Prefill stage, together with the user-entered prompt, to generate the 2nd word of the response. A forward pass across the model is needed for this to happen.
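To make the two stages concrete, here is a minimal, deliberately naive sketch of greedy generation with HF Transformers (gpt2 is just a convenient stand-in for any causal LM). The very first forward pass over the whole prompt plays the role of the prefill; every pass after that is one decode iteration. Note that this version recomputes everything from scratch on each pass, which is precisely the inefficiency we fix later with the KV cache.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"                         # stand-in model; any causal LM works
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

prompt = "The capital of France is"
input_ids = tok(prompt, return_tensors="pt").input_ids   # tokenization happens on the CPU

with torch.no_grad():
    generated = input_ids
    for _ in range(20):                                   # decode loop: one token per forward pass
        logits = model(generated).logits                  # full forward pass (no cache reuse here)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy pick from the distribution
        generated = torch.cat([generated, next_id], dim=-1)      # append & feed back in (auto-regression)
        if next_id.item() == tok.eos_token_id:            # stop once the model emits EOS
            break

print(tok.decode(generated[0]))
```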
This process repeats for the 3rd word & so on. So the Decode stage involves multiple forward passes through the model, each dependent on the previously generated tokens & the original prompt. The speed of this stage influences the average Time Per Output Token (TPOT) metric. This brings us to the all-important latency metric.

LLM Inference Metrics

We can only optimize what we measure. Let us start with Latency. It is the delay between a cause & its effect. In this context, it is the delay between a user entering the prompt & the LLM generating the full response. LLM latency is just TTFT + TPOT * (number of output tokens - 1). The minus one excludes the first token, since your TPOT calculations should not include TTFT.

Another important metric is Throughput. In general, it measures the amount of work a system can do in a given time period. In our case, the throughput is simply the number of output tokens per second an inference server can generate across all users and requests! Latency gives us an idea of the performance w.r.t. time taken for a particular user, whereas throughput gives us an idea of the capability of the system as a whole to handle voluminous traffic. These are all the metrics we need to study as far as LLMs are concerned (well, almost).

As you may have already guessed, the Decode stage is where we could run into bottlenecks due to this whole auto-regressive business. Unlike the Prefill stage, which was compute-bound, the bottleneck in the Decode stage is on the memory side, whereas the compute is usually under-utilized (for now, just take my word for it; as we progress, we will see precisely why this happens). So what can we do apart from twiddling our thumbs while tokens are generated one at a time in the Decode stage?

LLM Inference Optimization: Key strategies

Well, we can do Batching. We need not send one user request at a time to the model. Based on the memory available, we could group several "different" user requests & then send them jointly through the forward pass in a batch. We then get one output token for each individual user request in one forward pass. The challenge, of course, is that user requests may arrive at different times & we would have to wait for N requests before we run them all in a batch across a single forward pass. In any case, different styles of Batching are a promising direction worth exploring.

What else can we do? Let us go back to the auto-regressive nature of the Decode stage. We have a forward pass which generates a single output token. In the next forward pass, this newly generated token also needs to be considered along with the user prompt. This process repeats till an EOS token is generated, which implies that the response is completely generated. We know that a forward pass involves computing the keys, values & queries of the tokens, which are used internally by the model's attention mechanism. Instead of computing them afresh for every forward pass, can we cache them somewhere? Enter the KV Cache.

Lastly, let us take a step back and look at the situation holistically. In modern GPUs, the computation itself is always fast due to the availability of parallel Cores. The bottleneck is usually the time it takes to load these computation Cores with the next layer of operations to perform. Loading involves moving tensors here & there. More memory means less movement. Any hack that helps reduce our memory footprint can improve throughput.
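Before exploring these directions, it helps to put rough numbers on the two metrics we are trying to improve. The figures below are made up purely for illustration; real values depend on the model, hardware & batch size.

```python
# Back-of-the-envelope latency & throughput, with illustrative numbers only
ttft_s = 0.40            # time to first token (prefill), seconds
tpot_s = 0.03            # average time per output token (decode), seconds
num_output_tokens = 200

latency_s = ttft_s + tpot_s * (num_output_tokens - 1)    # ~6.4 s for this one request
print(f"Per-request latency: {latency_s:.2f} s")

# Throughput is measured across ALL concurrent requests on the server;
# if decode steps are batched, many requests share each forward pass.
concurrent_requests = 16
tokens_per_second = concurrent_requests / tpot_s          # ~533 tok/s under ideal batching
print(f"Rough server throughput: {tokens_per_second:.0f} tokens/s")
```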
So reducing the memory footprint is another promising direction to explore, and we will look at PagedAttention & FlashAttention. There are smaller architectural hacks like MQA & GQA which also reduce the memory footprint. Let us dive into each of these promising avenues & see how they can help speed up our LLM. Later, we also discuss a few miscellaneous optimization techniques & look at which LLM serving framework supports what feature. Let us start with the KV Cache!

KV Cache Optimization for faster LLM Inference

The attention computation in a forward pass is costly. The key, value & query embeddings need to be generated for each token in the prompt & attention operations performed on them. Once a token is generated by the LLM, it is appended back to the sequence, and the next forward pass considers this token as well in the attention operations. This continues until the EOS token is generated. Here, the idea is to cache the Keys & Values generated so far, instead of calculating them over & over again. This cache consumes some memory, but yields huge benefits since a lot of operations can be skipped.

Hey, why leave the poor Query embedding out of this scheme? It makes sense to do a small dive into how a typical next token is generated in an LLM. We particularly want to know what happens after the model's last layer & just before the token is spit out. We didn't cover this in the LLM flow at the beginning of this article because it needs an understanding of the attention operation. Those wanting a lighter read can choose to skip this sub-section.

Let us work backwards from the model output, which is a probability distribution over all words in the dictionary. How is this probability distribution generated? From the last layer of the model plus a softmax operation, which takes a Context as input. What is this Context? It is not to be confused with the "context" used in RAG solutions. The Context vector we are referring to here is the one used in the general GPT architecture. It is the weighted sum of all Value vectors of the tokens in the sentence so far (including the current token). Where do these weights come from? They are the attention scores of the latest token w.r.t. all previous tokens (including itself). And, pray, how are these attention scores calculated? Simply as Q · K, where Q is the Query embedding of the latest token and K is the respective Key embedding of each token in the sequence so far. So, to predict the next word, we need the:
- Value vector of all tokens in the sequence thus far
- Key vector of all tokens in the sequence thus far
- Query, Value & Key vector of the current token

So past Query values are not used & need not be stored. Yippee, one less headache! By design, older tokens DON'T attend to the newly generated token, unlike in a bidirectional (encoder-style) self-attention architecture where every token attends to every other token in the sequence. GPT simply doesn't follow that pattern. Only the latest token attends to all the previous tokens (& not vice versa). Attention here always looks backwards & not forwards. This is causality.

Optimizing KV Cache Memory

So caching Keys and Values helps a lot because we keep doing forward passes repeatedly. The flip side is that saving the KV cache needs memory… lots of it! We mitigate this with various strategies.
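Before turning to those mitigation strategies, here is a toy PyTorch sketch of the caching pattern itself: on each decode step we compute Q, K, V only for the newest token, append the new K and V to the cache, and never store past Queries. Everything here (single head, random weights, no layers) is simplified purely for illustration.

```python
import torch

d = 64                                             # head dimension (illustrative)
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3)) # toy projection weights

K_cache = torch.empty(0, d)                        # grows by one row per processed token
V_cache = torch.empty(0, d)

def decode_step(x_new):
    """x_new: (d,) hidden state of the newest token only."""
    global K_cache, V_cache
    q = x_new @ Wq                                 # query of the newest token
    k = x_new @ Wk
    v = x_new @ Wv
    K_cache = torch.cat([K_cache, k[None, :]])     # cache K and V, but never Q
    V_cache = torch.cat([V_cache, v[None, :]])
    scores = (K_cache @ q) / d ** 0.5              # new token attends to all cached tokens (causal)
    weights = torch.softmax(scores, dim=0)
    return weights @ V_cache                       # context vector used to predict the next token

for _ in range(5):                                 # pretend we decode 5 tokens
    context = decode_step(torch.randn(d))
print(context.shape)                               # torch.Size([64])
```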
Some of these strategies work during the inferencing process; others need to be baked in during the training stage:
- Models can be trained not to pay "attention" to the whole sequence & instead focus only on the last X tokens (Sliding Window / local attention).
- We could discard tokens that contribute little to the generation process. It has been empirically observed that certain tokens "attend" & impact weights disproportionately compared to other tokens. Such tokens have the maximum say in the results & the others can simply be dropped from the attention calculations without impacting accuracy much.
- Implementing various cache eviction policies (say, time-based, etc.).
- Reusing the KV cache across requests. This applies when prompts share a common prefix, which commonly occurs in multi-turn use cases like chat or when using prompt templates.
- Storing KV cache data with lower precision.
- If Beam search is used during inference, one can allow sharing of cached Keys & Values across beam candidates. In Beam search, instead of simply choosing the single most likely next token at each step (greedy search), we keep a set of the most promising partial sequences, known as "beams". Note: in scenarios where latency is important, such as live chatbots, beam search may introduce delays.

KV Cache memory needs & hyper-parameter settings

The memory allocation needs for a KV cache are straightforward. Assuming half precision:

Total size of KV cache in bytes = 2 * batch_size * sequence_length * num_heads * num_layers * num_dimensions * sizeof(FP16)

Here num_dimensions is the per-head embedding dimension, and the multiplication by 2 accounts for the Keys and the Values. num_heads is discussed later (for those who may not be familiar with the transformer architecture). The cache size can easily blow up for large models & hence the hacks discussed above are really useful in the field. KV caching is controlled by the use_cache boolean parameter in HF Transformers & by similar settings in other LLM serving frameworks. The good news: most LLM serving frameworks support & turn on this feature by default. With the KV cache, the attention work per generated token scales linearly with the sequence length rather than quadratically, easily leading to 2–4x speed improvements.

PagedAttention optimization in LLMs

Now that we know KV cache management is the key to better LLM performance, let us explore a memory-related innovation. This idea lets us fit even larger KV caches, thereby further improving performance. What is this idea? Think of how you would reserve memory for storing KV cache data. Typically, we do not know the prompt length in advance. We also do not know how big the final response will be. A standard practice until recently was to reserve contiguous memory chunks sized for the maximum possible sequence length of the KV cache. The challenge here is that, most of the time, parts of these memory chunks go to waste. The other challenge is that, due to the auto-regressive nature of the decoder, memory is consumed slowly over time, so reserving large chunks right from the beginning makes no sense. What can we do? Enter PagedAttention!

PagedAttention is inspired by the idea of virtual memory and paging in operating systems. Instead of allocating a contiguous space in memory for a request's KV cache, the memory is allocated in blocks, dynamically. The blocks are not necessarily contiguous in the inference cluster's physical memory.
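Before going deeper into how PagedAttention manages these blocks, it is worth plugging some numbers into the KV cache formula above to see why naive, max-length contiguous allocation hurts. The dimensions below are roughly Llama-2-7B-shaped (32 layers, 32 KV heads, head dimension 128) and are used purely as an example.

```python
def kv_cache_bytes(batch_size, seq_len, num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    # 2x accounts for storing both Keys and Values; dtype_bytes=2 assumes FP16/BF16
    return 2 * batch_size * seq_len * num_layers * num_kv_heads * head_dim * dtype_bytes

per_request = kv_cache_bytes(batch_size=1, seq_len=4096, num_layers=32,
                             num_kv_heads=32, head_dim=128)
print(f"{per_request / 2**30:.1f} GiB reserved per request at max length")   # ~2.0 GiB

# If the actual conversation only ever uses 500 tokens, most of that reservation is wasted
actual = kv_cache_bytes(1, 500, 32, 32, 128)
print(f"{(per_request - actual) / per_request:.0%} of the reservation goes unused")  # ~88%
```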
The contiguous logical blocks of a sequence are mapped to non-contiguous physical blocks via a block table. Whenever needed, the PagedAttention kernel identifies and fetches the relevant blocks efficiently. The physical blocks are allocated on demand as new tokens are generated. PagedAttention achieves near-zero waste in KV cache memory, thus enabling us to store larger (or more) sequences in the same amount of memory as before. It was first implemented by the vLLM inference system in 2023 but is now supported by all the major inference frameworks. This is usually turned ON by default in most servers & there is no need to do anything more.

Batching strategies for LLM Inference Optimization

Batching, as we saw, improves inference throughput drastically. The challenge, of course, is that requests may arrive at different times. What to do?

Static Batching or Request-level Batching

A naive batching strategy would simply make earlier requests wait until the batch is full & then start the feed-forward process. This can lead to queueing delays. This kind of batching is called static batching or request-level batching, i.e., we pick a batch of requests and execute it until all requests in the batch complete. Apart from the queuing problem, there is also the problem of lengthy answers. In a batch, there could be requests which have lengthy answers. New requests have to wait till the lengthiest request in the existing batch is completed before the next batch of requests can be loaded.

Continuous batching

The solution is a more fine-grained batching mechanism. It was Orca that introduced a mechanism where requests can dynamically enter and exit a batch after each "iteration". An iteration here is a single forward pass that generates exactly one new token for every active request in the batch. Unlike traditional batching methods that work at the request level & do not change the members of the batch till all output is generated for every request, these techniques operate at the iteration level. At each iteration, check to see if any request emits an EOS token, indicating the end of generation for that request. If so, let a new request take its place. So at every iteration, requests may be removed from or added to the batch. Thus, we have continuous batching going on. Note: the semantics (continuous vs dynamic vs in-flight batching) may vary slightly between frameworks, but the general principles are covered above. The good news: dynamic batching is supported by most LLM inference serving systems today, whether vLLM, TensorRT-LLM, etc.

However, truth be told, we have over-simplified it. Let us go back to our requests, which can either be in the prefill stage or in the decode stage. I also hope you recall the two metrics we discussed earlier: latency & throughput. As discussed, a prefill request uses the compute cores effectively & has different bottlenecks compared to the decode phase. Batching is highly effective for decode but provides little benefit for prefill. Naturally, we should exercise some caution when batching them together. Usually this is controlled via a hyperparameter like waiting_served_ratio, the ratio of requests waiting for prefill versus those waiting for EOS tokens (in the decode stage). We also have innovations like prefill-prioritizing scheduling, which, as the name clearly suggests, involves prioritizing the prefill requests. This results in better throughput because it allows the subsequent decode phase of requests to operate at high batch sizes (we return to this trade-off right after the short sketch below).
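The core loop behind continuous batching can be captured in a toy simulation. This is purely conceptual, not any real framework's scheduler: the "forward pass" is just a counter decrement, and real schedulers would also budget KV cache memory and account for prefill cost before admitting a request.

```python
import random
from collections import deque

class Request:
    def __init__(self, rid):
        self.rid = rid
        self.remaining = random.randint(3, 12)    # tokens left until this request's EOS

MAX_BATCH = 4                                     # illustrative slot limit
waiting = deque(Request(i) for i in range(10))    # requests queued up (arrival times simplified away)
active = []

iteration = 0
while waiting or active:
    # Admit new requests whenever a batch slot (and, in reality, KV cache memory) is free
    while waiting and len(active) < MAX_BATCH:
        active.append(waiting.popleft())          # the request's prefill would run here

    # One iteration = one forward pass = one new token for every active request
    iteration += 1
    for req in active:
        req.remaining -= 1

    # Requests that hit EOS leave immediately; newcomers take their slots next iteration
    for req in [r for r in active if r.remaining == 0]:
        active.remove(req)
        print(f"iter {iteration:2d}: request {req.rid} finished")
```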
Prefill-prioritizing scheduling, however, leads to higher latency because it de-prioritizes the decode phase of requests. Since prefills can take arbitrarily long (depending on the lengths of the prompts), this can lead to stalling. Chunked prefill addresses this by splitting long prompts into smaller chunks & interleaving their prefills with decode. A slightly more complex alternative to iteration-level batching is to perform batching at the granularity of subgraphs in the dataflow graph. As each request arrives, its computation graph is broken into a graph of cells & an algorithm is run to dynamically decide the set of cells that should be batched.

FlashAttention for LLM Inference optimization

FlashAttention is a technique for throughput improvement that re-organizes the attention computation to require less memory I/O. The original paper showed around a 15% end-to-end wall-clock speedup with NO approximation, i.e., no deterioration in output quality. To understand FlashAttention beyond the settings, we need to understand the basic anatomy of a GPU & observe the interesting things that happen inside a GPU during an LLM inference operation. I couldn't get hold of a good article for this, so I actually paused this (current) article for several months while I wrote about GPU Architecture & Working intuitively explained.

FlashAttention basically optimizes traffic on the SRAM-HBM highway. FlashAttention is IO-aware, a term which literally means it knows what is being transferred back and forth from memory & is designed to avoid redundant HBM reads/writes. By using FlashAttention, we reduce attention's memory usage from quadratic to linear in sequence length. How is this done? A transformer typically has the attention math, some feed-forward math & some miscellaneous operations like softmax, dropout, etc. It is some of these miscellaneous operations (like softmax) that are memory-bound & take time. A core drive behind FlashAttention is to accelerate the computation of the softmax function. The inspiration is an old concept called "tiling". At the end of the day, a matrix is a 2D layout of individual tiles of adjacent rows & columns. This allows us to split the larger operation into blocks of key, query & value computations and, through some mathematical tricks, compute each block's softmax individually & later combine the individual pieces. So the main trick here is to load blocks of inputs from HBM into SRAM, perform attention with respect to that block & use the tiling tricks to calculate & combine the softmax in SRAM, instead of repeatedly moving large intermediate matrices between SRAM & the slower HBM memory. This optimization reduces memory bottlenecks, allows models to handle longer sequences & speeds up both training & inference.

FlashAttention was released in 2022. The latest iteration, FlashAttention-3, incorporates enhancements specifically designed for NVIDIA's Hopper GPUs, & FlashAttention-4 is on its way for the Blackwell GPUs. While FlashAttention is now supported & enabled by default in nearly all LLM serving frameworks, there could be differences in the version supported. Recent PyTorch has native support for FlashAttention-2, & it can be used directly in your PyTorch models by selecting Flash Attention as the backend for Scaled Dot Product Attention. Likewise, HF Transformers supports Flash Attention for many models via the attn_implementation="flash_attention_2" setting when initializing a model. FlashAttention-2 is enabled by default for vLLM as of v0.1.4, as well as for TGI.
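Since FlashAttention is mostly a matter of picking the right backend, enabling it is usually a one-liner. Here is a sketch of the two routes mentioned above. Treat it as indicative: the model id is just an example, the HF route additionally requires the flash-attn package plus a CUDA GPU with fp16/bf16 weights, and the sdpa_kernel context manager shown is the API from recent PyTorch releases (older versions expose a different module path).

```python
import torch
from transformers import AutoModelForCausalLM

# Route 1: HF Transformers -- request the FlashAttention-2 kernels at load time
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",          # example model id
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="cuda",
)

# Route 2: plain PyTorch -- scaled_dot_product_attention normally picks a fused kernel itself;
# the context manager below restricts it to the flash backend
from torch.nn.attention import sdpa_kernel, SDPBackend

q = k = v = torch.randn(1, 8, 1024, 64, dtype=torch.float16, device="cuda")  # (batch, heads, seq, head_dim)
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)
```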
MQA & GQA Architectural Optimization tricks Let us now look at a small but very effective hack to conserve memory. We have MQA where the same set of K eys and Values are shared across multiple Heads while maintaining different Queries for each Head . What are these Heads & what does this mysterious statement mean? For those who may not be aware of the transformer architecture, let me paint a simplified picture. The attention computation operations are replicated to improve the output. Each such attention computation is done with a different Key , Query , Value pair. So an input token does not have a single Key , Query , Value vector associated with it. It actually has many Key , Query , Value vectors associated with it. A 16 Head model has 16 Key , Query , Value vectors for a single token. The formal term is MHA ( M ulti H ead A ttention). But why do we need so many Heads doing the same Attention business ( but with different numerical representations of keys/queries/values? ). Well, each Head learns some different aspect of the language, grammar & meaning and together all of them contribute to generating the next word. Back to MQA — Here, we simplify and let all Heads use the same Key & Value values. But each Head still behaves differently since the Query is different. So the attention patterns generated for a sequence varies from Head to Head . This means, each Head learns something different & they all come together at the end to generate the next word. The advantage: We need to store only one Key & Value pair per token in the KV Cache irrespective of the number of Heads. The multiplcation by num_heads in the KVCache formula discussed earlier disappears & we can declare larger Caches leading to faster inferences. The disadvantage: It can lead to mild quality degradation since we are re-using the same Key & Value across all Heads . Mildly edited version of Fig 2 of the original GQA Paper — https://arxiv.org/pdf/2305.13245v2 To address this, GQA was introduced. Here we have N groups (of heads), each sharing a single Key, Value . N is >1 and is Heads . So, here we are walking the middle path between MQA and a full blown M ulti- H ead A ttention (MHA), achieving a balance between quality & speed. A GQA where N=1 is equivalent to MQA. When N = Number of Heads , it is equivalent to an MHA. We can set an N somewhere in between these extremes. Since its introduction in late 2023, GQA has been adopted in several popular models. Observe carefully, how GQA & MQA are model-level settings & cannot be simply turned ON or OFF by a setting in the LLM serving framework. In other words, the model is trained in a GQA or MQA mode i.e. the optimization hacks are baked in the training stage itself! Misc LLM Optimization Techniques There are many similar solutions & concepts which improve inference. Quantization We could compress LLMs while ensuring their performance does not deteriorate. This allows us to use larger LLMs within the memory available. Larger LLMs mean better output. Quantizaton was discussed in detail here . Distillation Another approach is to transfer the LLM knowledge to a smaller model through a process called distillation . This process involves training a smaller model (student) to mimic the behavior of a larger model (teacher). Pruning Here, we use techniques to drop the un-necessary weights from an LLM. S peculative decoding Here, the concept is to use a smaller, faster model alongside the main large LLM. 
The smaller model quickly drafts several candidate tokens ahead, acting as a sort of predictive aid to the larger, slower model. The larger model then validates these candidate tokens in a single forward pass, potentially accepting several tokens at once rather than generating each one individually. If the small model's predictions align well with what the large model would have generated, several forward passes through the large LLM are saved.

Fig 1 of https://arxiv.org/pdf/1811.03115

RoPE
Lastly, we come to RoPE. It is more of an output-quality optimization than a speed-improvement technique, so maybe this shouldn't be here, but a quick gist can't harm. Let us start with a situation: what happens when the input sequence to the model is longer than the longest text sequence it was trained on? It has been shown that (certain) LLMs perform poorly & perplexity increases. The hacks used to fix this fall under the category of Position Interpolation, a method to increase the ability of models to process long sequences by interpolating positional information based on the training context. This is important since many practical applications in the field involve long input sequences. RoPE is a training-time technique that produces better outputs during inference by allowing for model extrapolation to longer sequences.

One other important technique needs to be covered: Parallelism.

Inference Parallelism in LLMs

If the model is small enough, it can run on a single GPU & distributed inference is not needed. Of course, this isn't always a practical situation. Quite often, a model may be too large for a single GPU but fit on a single node with multiple GPUs. What is this node business? A node is simply a unit having multiple GPUs, CPUs, RAM, local storage & high-speed network interfaces for communication. There are usually 4-8 GPUs in a node. Models that are too large for a single GPU but small enough to fit on a node can be served on that node using tensor parallelism (discussed shortly). But if we consider the Llama 3.1-405B model released last year, with close to half a trillion parameters, this model is not going to fit on a single node. We can try reducing the model size with quantization, distillation, pruning & all the other hacks we learnt earlier, but we will still likely have trouble fitting it onto a node. What to do? We can form a GPU cluster consisting of several GPU nodes & let the Llama model loose on this cluster using tensor parallelism & pipeline parallelism. So how does this all work?

Pipeline Parallelism

LLMs are deep models with multiple neural layers stacked on one another & data flows sequentially through these layers in a forward pass. In pipeline parallelism, we simply split the model sequentially, with each GPU (we call it a Device from here on) handling a set of layers. You may have guessed the limitation already: yes, idle time, when a Device waits for its predecessor to finish.

Tensor Parallelism

Tensor parallelism, on the other hand, splits individual layers across multiple Devices, parallelizing the tensor operations within a single layer. The disadvantage of tensor parallelism may not be that obvious: it demands a fair bit of communication between Devices. You see, the tensors are split into chunks along a particular dimension such that each Device only holds a 1/N chunk of the tensor. Computation is performed using this partial chunk to get a partial output. These partial outputs are then collected from all Devices and need to be combined.
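A stripped-down sketch of this idea for a single linear layer, split column-wise across Devices. It assumes torch.distributed has already been initialized with one process per GPU (e.g. launched via torchrun); the layer sizes are illustrative and this is not any particular framework's implementation.

```python
import torch
import torch.distributed as dist

# Assumes dist.init_process_group(...) has already been called, one process per GPU
rank, world = dist.get_rank(), dist.get_world_size()
device = torch.device(f"cuda:{rank}")

hidden, out_features = 4096, 11008                 # illustrative layer sizes
shard = out_features // world                      # each Device owns one column slice

# Each Device holds only its 1/world chunk of the weight matrix
W_shard = torch.randn(hidden, shard, device=device)
x = torch.randn(1, hidden, device=device)          # the same input is replicated on every Device

partial = x @ W_shard                              # partial output: (1, shard)

# Combine the partial outputs from all Devices into the full (1, out_features) result
chunks = [torch.empty_like(partial) for _ in range(world)]
dist.all_gather(chunks, partial)                   # <-- the extra inter-Device communication step
full_output = torch.cat(chunks, dim=-1)
```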
This collecting and combining demands a fair bit of additional inter-Device communication & processing, and that is tensor parallelism's Achilles heel. Deep models (e.g., GPT-4) benefit from pipeline parallelism, whereas wide models with large neural layers (e.g., LLaMA) benefit from tensor parallelism.

Data Parallelism

Here we deploy multiple copies of our LLM on different GPUs. Each copy independently processes separate user requests, thereby increasing throughput and availability. Of course, the model needs to be small enough for this to be possible.

Expert Parallelism

This is more of a model design pattern & is often implemented as a Mixture of Experts (MoE). In this technique, the model is divided into multiple expert sub-networks. Each expert is trained to handle specific types of inputs. A router determines which expert(s) to use for each input, and only a subset of experts is activated for any given input. Different experts can be distributed across different GPUs, and inactive experts don't consume computational resources. Mixtral 8x7B and the Switch Transformer are examples of MoEs.

Sequence Parallelism

This addresses the problem of very long inputs. Suppose we want to feed a very large document as context to the model along with the user query. Though the model fits on the GPU, it is still possible to run into memory issues when processing the large context. Sequence parallelism splits long input sequences across multiple Devices, with each Device processing a portion of the sequence (through the entire model). Ring attention is the key. In Ring attention, each Device computes a local attention & these are later combined to calculate the total attention. The ring communication is carefully designed to make this happen with minimal overhead. KV sub-blocks are passed between Devices through peer-to-peer communication: each processor exchanges information only with its predecessor & successor, forming a ring-shaped network. This allows intermediate results to be efficiently transmitted between processors without global synchronization.

Hybrid Parallelism

These parallelism strategies are not incompatible with one another & can be mixed to yield hybrid parallelism. Since communication between Devices on a single node is much faster than communication between nodes, it makes sense to keep tensor-parallel groups within a single node and use pipeline parallelism across nodes.

LLM Inference Optimization: Putting it all together

We started by observing the LLM inference process intimately. We could immediately identify some potential bottlenecks, given the flow. A few intuitive solutions were discussed, and later the elaborate industry solutions built around these approaches were outlined. Most of these solutions involve run-time optimizations, some involve architectural changes during training, & yet others (like quantization) involve changes after training but prior to inference. As the reader may have already observed, a lot of these solutions are available in most LLM serving frameworks & turned ON by default. Of course, one still needs to play around with the settings. Below are some of the settings in the vLLM engine. How many of these can you make sense of (and surprise your AIOps engineer in the process)?
- max-num-seqs: maximum number of sequences allowed in a batch
- gpu-memory-utilization: the fraction of GPU memory reserved for model weights, activations and the KV cache. Default: 0.9 (90%). This parameter directly dictates the maximum size of the KV cache.
- block-size: the size of the KV cache blocks used by PagedAttention
- tensor-parallel-size: number of GPUs to shard the model across
- flash_attn_version: force vLLM to use a specific FlashAttention version
- Many more: kv-cache-dtype (auto or fp8); pipeline-parallel-size; quantization; speculative-model (the model used for speculative decoding)

What remains to be covered are the LLM serving strategies & the popular LLM serving engines in the market, which will be the subject of the last article of this series.

Other articles of mine explaining AI in a simple, intuitive & fun way:
- Advanced RAG Techniques & Concepts: Summary of a 1000 papers
- HNSW — Story of the world's most popular Vector search algorithm
- Understanding LLM Agents: Concepts, Patterns & Frameworks
- In-Context learning: The greatest magic show in the kingdom of LLMs
- Anatomy of a GPU — A peek into the hardware fuelling LLM operations
- LLM Quantization — From concepts to implementation
- LoRA & its newer variants explained like never before
- LLM Inference Optimization: Inference Process -> ChokePoints -> Spontaneous Solutions -> Industry techniques

Probabilistic AI series explained to the intrepid software coder. Less math, more intuition & an emphasis on practical applications of the concepts:
- Random Variables & Probability Distributions explained — Intuition, Basic Math & application in AI/ML
- A deeper dive into Distributions — Intuition & application in models like Stable diffusion & GMM clustering
- The Enchanting world of GNNs — MPNNs, GCN, GAT, GTN, GraphSAGE
- Of Poets, Pragmatists & the Practitioner — Perspectives on Ridge, Lasso, Dropout regularization from a Bayesian lens
- MCMC & the magical art of Sampling without Sampling — Story, Intuition & the gentle Math behind the greatest algorithm of the 20th century
- Secrets of the VAE — An appreciation without the apprehension

Lastly, Cosine Distance vs Dot Product vs Euclidean in vector similarity search — Why my webpages will never be ranked at the top of a search engine result (well, looks like the folks at Google are having the last laugh. This is the only article of mine that has a decent search rank!)

For the yet-to-be-authored pieces, you can check directly here.