TLDR: This guide offers a concise, accessible, intermediate-level overview addressing training latency issues for ML researchers / programmers. It covers best practices for systems optimisation, while emphasising (especially for newcomers) the crucial details that LLM-generated code often overlooks, with a special focus on MLOps and ML systems engineering. I paid particular attention to performance profiling and optimisation.
Do you remember the internet in the 90s?
Some of you do (including myself).
It took AGES to open a website (but it was still super exciting!). For programming newbies who have only just started exploring ML, that’s probably how latency problems feel. For more advanced coders, the wait time is a pain in the backside and calls for systems optimisation…
When deploying a large model, the priorities shift from learning to efficient inference. Before a model goes live, engineers often apply techniques that shrink its footprint and cut hardware costs while preserving accuracy. Knowledge distillation, pruning, and quantisation are all aimed at making predictions faster, lighter, and more energy efficient [1].
Learning how to address latency is crucial on the path to becoming an ML engineer, and many people do not know where to start. Perhaps take the trendy path and ask an LLM? Unfortunately, most coding assistants won’t help much here: LLMs aren’t designed to reason about systems performance, so you will need to look elsewhere for this information. Realistically, there is a reason the “vibe coding” era has created a new niche of vibe-coding fixers, engineers who still understand what’s going on under the hood.
So, how do you fix a model with training latency issues, when the process feels longer than an A&E waiting time on Saturday night?
Irrespective of which framework you are working with, performance matters. It is not just about basics like accuracy; system performance counts too. The difference between a 3-hour training job and a 3-day one is the difference between iteration and stagnation.
Waiting for training to finish. Waiting for code to compile. Waiting for your budget to recover from last month’s GPU bill… Welcome to ML latency hell.
So how could we decrease the waiting time?
One of the main strategies is to increase parameter sparsity [2], i.e. reduce the number of non-zero weights and biases. However, it is important to note that sparsity only reduces compute if the framework and hardware support sparse kernel acceleration [3,4].
Sparse parameter matrices demand less bandwidth and computation, because operations involving zeros can be skipped [5]. This property enables quicker matrix multiplications and is especially advantageous for applications where low latency or real time response is critical [5]. However, there are so many other ways to address this!
This post is your no-nonsense guide to optimising ML systems: where to start, what to measure, and how to get your pipeline flying without losing your sanity (or your team’s goodwill).
1. Optimisation isn’t a checklist. It’s a cycle.
Most experienced developers would jump straight into tweaking hyperparameters (makes perfect sense) or rewriting half of their model in C++ (eh, the lucky ones who actually know what they are doing in C++). But unless you are working on something with extreme computational requirements (that’s when you would benefit from C++ being a compiled, low-level language), you rarely need to rewrite whole models in C++. Rewriting everything in C++ would be like fixing a leaky pipe by tearing down the whole kitchen. Realistically, using C++ or CUDA for targeted kernels helps when a Python loop or custom op is clearly the bottleneck, but it is rarely needed beyond that.
So what should you try instead?
What you need is a strategic, iterative loop approach.
The Performance Loop [6]:
Profile → Strip Down → Fix → Repeat. (Then go outside. Seriously.)
Step 1: Understand Where the Pain Is (Profiling, but Smarter)
Don’t start “optimising” before you know what needs fixing. Guessing is like gambling: unnecessary, and it can turn out quite expensive.
Layered Benchmarking
Think of profiling as triage.
Layered benchmarking is a structured way to profile and optimise training latency by moving from a broad, system-level view to progressively finer levels of detail. It helps you find the real bottleneck instead of guessing.
Macro (System Level) Metrics
Start with coarse metrics to see if the training loop is underperforming overall.
Throughput: samples/sec or batches/sec.
Batches/sec: Are we processing fast enough?
# basic code for a batches/sec counter
import time

start = time.perf_counter()
for i, batch in enumerate(train_loader):
    ...  # training step
    if i % 50 == 0 and i > 0:
        elapsed = time.perf_counter() - start
        print(f"{i / elapsed:.2f} batches/sec")
There are also framework-native solutions (check out PyTorch Profiler, TensorFlow Profiler Experimental Trace).
FLOPS vs Theoretical Peak: Are you using your hardware or just warming it up?
(Important note: many tutorials recommend FlopCountAnalysis, but that is now somewhat outdated. Generally speaking, torch.profiler [7] and torch._dynamo [8] offer better metrics in PyTorch 2+.)
(For NVIDIA GPUs, try nvidia-smi for hardware specs.)
# in PyTorch you can attempt a basic FlopCountAnalysis()
>>> import torch
>>> from fvcore.nn import FlopCountAnalysis
>>> model = TestModel()
>>> inputs = (torch.randn((1, 3, 10, 10)),)
>>> flops = FlopCountAnalysis(model, inputs)
>>> flops.total()
# For TensorFlow: launch TensorBoard and point it at logdir, then open
# localhost:6006/#profile in your browser to view the profiling results.
import tensorflow as tf

options = tf.profiler.experimental.ProfilerOptions(
    host_tracer_level=3, python_tracer_level=1, device_tracer_level=1
)
tf.profiler.experimental.start('logdir_path', options=options)
# Training code here
tf.profiler.experimental.stop()
Check the GPU/CPU Utilisation: Under 90%? You’ve got idle time.
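If you prefer to check utilisation from inside Python rather than in a separate terminal, here is a minimal sketch that polls the card via nvidia-smi (an assumption: the NVIDIA driver is installed and nvidia-smi is on your PATH).

```python
# A minimal sketch: poll nvidia-smi during training to spot idle GPUs.
import subprocess

def gpu_utilisation():
    """Return a list of (gpu_util_percent, memory_used_mib) tuples, one per GPU."""
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=utilization.gpu,memory.used",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return [tuple(int(v) for v in line.split(",")) for line in out.strip().splitlines()]

print(gpu_utilisation())  # e.g. [(87, 10241)] -> 87% busy, ~10 GiB in use
```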
Still haven’t found your problem? Time to dive deeper, then!
Micro Profiling
Timing: Try:
a) cProfile (a minimal sketch follows this list)
“A profile is a set of statistics that describes how often and for how long various parts of the program executed. These statistics can be formatted into reports via the [pstats](https://docs.python.org/3/library/profile.html#module-pstats) module.”
b) or your framework’s built-ins.
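For the cProfile route, a minimal sketch looks like the one below; train_one_epoch() is a hypothetical stand-in for your own training function.

```python
import cProfile
import pstats

# Profile one epoch and dump the raw stats to disk.
# (From the CLI, `python -m cProfile -s cumtime train.py` does much the same.)
cProfile.run("train_one_epoch()", "train.prof")  # train_one_epoch() is hypothetical

stats = pstats.Stats("train.prof")
stats.sort_stats("cumulative").print_stats(20)   # top 20 functions by cumulative time
```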
Memory:
memory_profiler (the package is no longer maintained, but some people still use it); simple @profile decorator & CLI plots.
# To get line-by-line profiling,
# add the @profile decorator in front of the function definition.
from memory_profiler import profile

@profile
def my_func():
    ...  # some random function of your choice

if __name__ == '__main__':
    my_func()

(Other options: tracemalloc, covered below, or TensorBoard.)
tracemalloc (stdlib) tracks Python allocations with tracebacks; a great first pass (a minimal sketch follows this tool list).
Scalene provides per-line CPU and memory (incl. Python vs native) + GPU memory; low overhead.
Pympler supplies object summaries (asizeof, growth over time); find bloated types.
objgraph allows for exploring reference cycles / who’s keeping objects alive.
Fil (filprofiler) is able to find leaks and shows allocation call trees with file/line heatmaps.
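As a first pass with tracemalloc (mentioned above), a minimal sketch compares snapshots around a suspect code path; run_a_few_training_steps() is a hypothetical placeholder for whatever you think is leaking.

```python
import tracemalloc

tracemalloc.start(25)                  # keep up to 25 frames of traceback per allocation

before = tracemalloc.take_snapshot()
run_a_few_training_steps()             # hypothetical: the code path under suspicion
after = tracemalloc.take_snapshot()

for stat in after.compare_to(before, "lineno")[:10]:
    print(stat)                        # top 10 lines by newly allocated memory
```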
Network: Even local networks can bottleneck. (I like timing I/O functions over setting up Wireshark.)
Bonus? Use continuous profiling (like Pyroscope) in production to catch regressions early.
Step 2: Break it up into chunks to fix the system
It is just easier to do it step by step.
Simplification: The art of reduction
You want the smallest, simplest, or maybe even dumbest version of your system that still shows the problem.
Try this:
Stub out all unrelated modules.
Replace slow functions with sleep() or dummy ops.
Use synthetic data (zeroes = faster).
If possible, run everything locally in a single process.
Minimise: one actor, one GPU, one batch.
Essentially, the goal is clarity about where the bottleneck sits. Cutting the system down into small chunks is not really about speed or reducing memory requirements; it is mainly about bottleneck detection.
Boom, by now you should have identified the bottleneck.
Step 3: Bottleneck identified? Time to attack (and ignore the rest).
Time to optimise.
Most bottlenecks fall into three buckets: compute [9], communication [10], and memory (including the famous von Neumann bottleneck [11]). Each has its own fix strategy.
Compute: speed up calculations, train smarter
Techniques:
Vectorisation > for-loops
Vectorisation replaces explicit loops with matrix and vector operations, yielding code that is potentially faster, more readable, and more efficient [12].
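To make that concrete, here is a small sketch contrasting an element-wise Python loop with its vectorised equivalent (PyTorch here, but NumPy behaves the same way).

```python
import torch

x = torch.randn(1_000_000)
w = torch.randn(1_000_000)

# Loop version: one Python-level multiply-add per element (slow).
total = 0.0
for xi, wi in zip(x.tolist(), w.tolist()):
    total += xi * wi

# Vectorised version: a single tensor op dispatched to an optimised kernel.
total_vec = torch.dot(x, w)            # same result up to floating-point error
```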
Mixed Precision (FP16/BF16): Half the memory, faster ops [13, 14].
This is a training and inference technique that mixes numerical formats, typically 16-bit floating point for most tensor math and 32-bit for numerically sensitive operations, in order to boost speed and reduce memory without sacrificing accuracy. Modern GPUs accelerate 16-bit matrix multiplies on “tensor cores,” letting you fit larger batches or models into the same VRAM while cutting latency and energy use. Two common 16-bit formats are FP16, which is very fast but has a limited exponent range and often requires loss scaling, and BF16, which trades a little precision for FP32-like range and usually trains stably without scaling. In practice, models run under an autocast context so heavy matmuls/convs use FP16/BF16, while softmax/log-probabilities, normalisation statistics, and reductions stay in FP32 [13,14]. Done right, mixed precision delivers substantial throughput gains with minimal code changes and near-identical final quality. Basically, MAGIC.
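A minimal PyTorch sketch of that pattern is below; model, loader, optimizer and loss_fn are assumed to exist already, and it uses FP16 with loss scaling (with BF16 you can typically drop the scaler).

```python
import torch

scaler = torch.cuda.amp.GradScaler()            # rescales the loss to avoid FP16 underflow

for inputs, targets in loader:
    inputs, targets = inputs.cuda(), targets.cuda()
    optimizer.zero_grad(set_to_none=True)

    with torch.autocast(device_type="cuda", dtype=torch.float16):
        outputs = model(inputs)                 # heavy matmuls/convs run in FP16
        loss = loss_fn(outputs, targets)

    scaler.scale(loss).backward()               # backward pass on the scaled loss
    scaler.step(optimizer)                      # unscales grads, skips step on inf/NaN
    scaler.update()
```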
Gradient Accumulation: Simulate large batches with less memory.
Gradient accumulation allows you to train with a large effective batch while fitting only small “micro-batches” in memory: you run multiple forward/backward passes without calling optimizer.step(), summing gradients each time, then update once and zero them (a minimal sketch follows below).
Gradient accumulation is a way to use a batch size that would not otherwise fit in memory, and it can stabilise training in certain setups. For novices it is therefore mostly relevant in particular niche cases.
But remember (!!!) there’s mixed evidence on batch size: theory and some small-scale results suggest large batches approximate the full-dataset gradient and reduce stochasticity, while other studies report that small batches can generalise better [15, 16]. In practice at scale, we push batch sizes as high as hardware allows to maximise parallelism; the supposed small-batch advantages tend to vanish when you can train bigger models on more data, and larger batches (or their emulation via gradient accumulation) often smooth training on noisy datasets [17]. Ultimately, gradient accumulation is just one tuning knob: your effective batch size is set by per-device batch × accumulation steps × number of devices, and together with the learning rate it governs the optimisation dynamics.
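Here is the minimal sketch promised above; model, loader, optimizer and loss_fn are assumed to exist, and the effective batch size is the micro-batch size times accum_steps.

```python
accum_steps = 8                               # effective batch = micro-batch × 8

optimizer.zero_grad(set_to_none=True)
for step, (inputs, targets) in enumerate(loader):
    loss = loss_fn(model(inputs), targets)
    (loss / accum_steps).backward()           # scale so accumulated grads average, not sum

    if (step + 1) % accum_steps == 0:
        optimizer.step()                      # one update per accum_steps micro-batches
        optimizer.zero_grad(set_to_none=True)
```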
Caching: Precompute expensive features or masks.
This way, your training loop can simply look values up instead of recomputing them. This cuts input-pipeline latency, stabilises throughput, and frees GPU time from waiting on preprocessing.
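A tiny sketch of the idea, with expensive_feature_fn and raw_data as placeholders for your own pipeline: compute once, save to disk, and reload on every later run.

```python
import os
import torch

CACHE = "features.pt"

if os.path.exists(CACHE):
    features = torch.load(CACHE)                # cheap lookup on every subsequent run
else:
    features = expensive_feature_fn(raw_data)   # slow preprocessing, done exactly once
    torch.save(features, CACHE)
```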
Custom Kernels: Use TorchScript or even raw CUDA for critical ops.
“Whaaaaat, you just applied a single function call and it is guaranteed to speed up the code? Seems too good to be true.”
Yeah. By adding one simple decorator, you can often speed things up without major changes to your code. Voilà!
# literally, as simple as the torch documentation suggests
# (adding @torch.compile)
import torch

t1 = torch.randn(10, 10)
t2 = torch.randn(10, 10)

@torch.compile
def opt_foo2(x, y):
    a = torch.sin(x)
    b = torch.cos(y)
    return a + b

print(opt_foo2(t1, t2))
Important note: in certain cases some workloads may slow down after @torch.compile (results vary by model; always benchmark, and check whether you are working with static or dynamic shapes).
@triton.jit is also an option.
Communication: a key to success, also in coding
Especially in distributed training, communication overhead can destroy scaling.
Techniques:
Stay local as long as you can (a multi-node setup and the associated distributed training add overhead, which can obscure the real bottlenecks).
Asynchronous everything: prefetching, IO, logging (see the sketch after this list).
AllReduce optimisation: Use NCCL or Horovod with proper tuning.
Don’t move data unless you must: Keep tensors on the device where they’re used.
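The sketch below ties the prefetching and device-placement points together; dataset is a placeholder for your own Dataset, and the exact DataLoader knob values are illustrative, not prescriptive.

```python
from torch.utils.data import DataLoader

loader = DataLoader(
    dataset,                     # your Dataset (placeholder)
    batch_size=64,
    num_workers=4,               # decode/augment in background worker processes
    pin_memory=True,             # page-locked buffers enable async host-to-GPU copies
    prefetch_factor=2,           # batches prefetched per worker
    persistent_workers=True,     # avoid re-spawning workers every epoch
)

for inputs, targets in loader:
    inputs = inputs.to("cuda", non_blocking=True)    # overlap copy with compute
    targets = targets.to("cuda", non_blocking=True)
    ...  # forward/backward stay on the GPU; only scalars come back for logging
```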
Memory: The Hidden Killer
Most people would not monitor their memory usage unless the process crashes or suddenly gets slower.
Techniques:
Downcast your data types (float32 → float16, int64 → int32)
When to avoid: things needing high dynamic range or exact arithmetic (money, timestamps, large accumulators, softmax/logsumexp/entropy math, running stats). In ML, a common pattern is to compute in mixed precision but keep sensitive parts in FP32 (losses, log-probs, reductions).
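A small sketch of that downcasting pattern, keeping the numerically sensitive accumulator in FP32:

```python
import numpy as np
import torch

embeddings = np.random.rand(10_000, 768).astype(np.float32)
embeddings_fp16 = embeddings.astype(np.float16)        # roughly half the RAM and disk

ids = np.arange(10_000, dtype=np.int64)
ids_small = ids.astype(np.int32)                       # fine as long as values fit in int32

weights = torch.randn(1024, 1024)
weights_half = weights.to(torch.float16)               # ok for storage or GPU inference
running_loss = torch.zeros((), dtype=torch.float32)    # accumulators stay in FP32
```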
Pre-allocate memory where possible (especially on the GPU)
Generators > Lists
If your object is huge, holding all of it in memory during the calculations is a lot of baggage. It is often better to replace that list of lists with a generator of lists, so that only a small chunk is held in memory at a time. In this way, the peak memory load is waaaaay smaller than with a list of lists.
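For example, a minimal chunked reader (the file path and the process() call are placeholders):

```python
def read_chunks(path, chunk_size=10_000):
    """Yield lists of at most chunk_size lines; only one chunk lives in memory at a time."""
    chunk = []
    with open(path) as f:
        for line in f:
            chunk.append(line.rstrip("\n"))
            if len(chunk) == chunk_size:
                yield chunk
                chunk = []
    if chunk:
        yield chunk

for chunk in read_chunks("train.txt"):    # "train.txt" is a placeholder path
    process(chunk)                        # hypothetical per-chunk processing
```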
Use shared memory if you’re running multi-processing.
Watch for memory leaks from holding onto large objects or circular refs.
Worth noting: in pure Python, memory leaks are rare. The interpreter manages memory well, so if you suspect a leak, it is probably something else; and if you genuinely do have one, it is likely in a C++-backed dependency. That said, ML frameworks can easily create leak-like behaviour via lingering tensor references, retained autograd graphs, or GPU tensors that are never released.
Feeling Brave? Advanced Power-Ups
Time to go full wizard mode!
Automated Optimisation
Optuna / Ray Tune: Not just for model hyperparameters; it is worth tuning system-level knobs like num_workers, batch size, or prefetch count too.
Optuna/Ray Tune can directly target latency (not just accuracy) if you make latency the metric and search over system knobs. They won’t magic away a fundamentally slow model, but they do find faster configurations of the input pipeline, runtime and serving stack.
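A hedged sketch of that idea with Optuna, making latency (seconds per 100 batches) the objective; make_loader and time_n_batches are placeholders for your own helpers.

```python
import optuna

def objective(trial):
    num_workers = trial.suggest_int("num_workers", 1, 8)
    batch_size = trial.suggest_categorical("batch_size", [32, 64, 128, 256])
    prefetch = trial.suggest_int("prefetch_factor", 2, 8)

    loader = make_loader(batch_size=batch_size,    # placeholder helper
                         num_workers=num_workers,
                         prefetch_factor=prefetch)
    return time_n_batches(loader, n=100)           # seconds for 100 batches: lower is better

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=30)
print(study.best_params)
```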
or …
you could also use Learned Schedulers
Hardware-Aware Training
TensorRT, ONNX Runtime, CoreML for optimised inference.
NVIDIA servers/edge → TensorRT; multi-platform product → ONNX Runtime; Apple apps → Core ML. Quantization and mixed precision help all three; always validate accuracy after conversion.
XLA (TPUs) or DeepSpeed for large-scale acceleration.
When to pick which:
- You have TPU access (e.g., Google Cloud) and want compiler-driven speed → XLA/TPU.
- You have one or more GPUs (workstation or cluster) and need to scale big models efficiently → DeepSpeed.
Gotchas: XLA likes stable shapes; DeepSpeed needs fast interconnects (NVLink/InfiniBand) and careful batch/accumulation tuning. In both cases, BF16/FP16 mixed precision is the norm for speed.
Target architecture-specific optimisations (e.g., AVX512 on Intel, ROCm on AMD GPUs)
Data Pipeline Acceleration
Use DALI or WebDataset for fast decoding and loading
(WebDataset is a PyTorch-friendly data format + loader for streaming large datasets from sharded TAR files (locally or over HTTP/S3), so training reads data sequentially instead of opening millions of tiny files. That solves the “small files” bottleneck, feeds GPUs faster in distributed training, and works as an IterableDataset/DataPipe in PyTorch; it is also usable from other stacks or with DALI.)
Important note: Make sure to enable shuffling with WebDataset to preserve stochasticity across epochs.
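A hedged WebDataset sketch, streaming sharded TARs with buffer-based shuffling; the shard pattern and the "jpg"/"cls" keys are assumptions about how your shards were written.

```python
import webdataset as wds
from torch.utils.data import DataLoader

dataset = (
    wds.WebDataset("shards/train-{0000..0099}.tar")   # assumed shard naming
    .shuffle(1000)                                    # buffer shuffling keeps stochasticity
    .decode("pil")                                    # decode images to PIL
    .to_tuple("jpg", "cls")                           # (image, label) keyed by extension
)

loader = DataLoader(dataset, batch_size=64, num_workers=4)
```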
Pre-tokenise or preprocess data offline.
Switch from CSV/JSON to Arrow, Parquet, or memory-mapped formats.
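For the format switch, a minimal one-off conversion sketch with pandas (assumes a Parquet engine such as pyarrow is installed):

```python
import pandas as pd

df = pd.read_csv("train.csv")                 # slow: text parsing on every run
df.to_parquet("train.parquet")                # one-off conversion to a columnar format

df_fast = pd.read_parquet("train.parquet")    # subsequent loads are much quicker
```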
How Do You Know When You’re Done?
Trick question. Are we really ever done? Maybe not, but we should know when to stop.
Your system is “good enough” when:
- Training/inference time doesn’t block experimentation
- Cost is within acceptable bounds
- Users (or your boss/team) are no longer complaining
Don’t optimise endlessly for fun (unless that’s your actual job). Speed is a means to an end, not the goal.
Real Talk
Building performant ML systems is a complex search through an unpredictable landscape, and it will cost you a lot in compute before you find the issue.
But with a looped, deliberate process (measure, reduce, fix, repeat), you can cut through the chaos and ship faster, cheaper, cleaner systems.
TL;DR Cheatsheet
Profile first (macro, then micro) → strip the system down to the smallest version that still shows the problem → fix the dominant bottleneck (compute, communication, or memory) → repeat until it is good enough, then stop.
Final Thought
You don’t need to be a genius to optimise ML systems, you just need a plan, some patience, and a healthy fear of large training bills.
Just stay calm, and carry on.
References
[1] Pierre Vilar Dantas, Waldir Sabino da Silva, Lucas Carvalho Cordeiro, and Celso Barbosa Carvalho. A comprehensive review of model compression techniques in machine learning. Applied Intelligence, 54(22):11804–11844, September 2024. ISSN 1573-7497. doi: 10.1007/s10489-024-05747-w. URL http://dx.doi.org/10.1007/s10489-024-05747-w
[2] Jesus Rios, Pierre Dognin, Ronny Luss, and Karthikeyan N. Ramamurthy. Sparsity may be all you need: Sparse random parameter adaptation, 2025. URL https://arxiv.org/abs/2502.15975
[3] Torsten Hoefler, Dan Alistarh, Tal Ben-Nun, Nikoli Dryden, and Alexandra Peste. Sparsity in deep learning: Pruning and growth for efficient inference and training in neural networks, 2021. URL https://arxiv.org/abs/2102.00554.
[4] Gerasimos Gerogiannis, Dimitrios Merkouriadis, Charles Block, Annus Zulfiqar, Filippos Tofalos, Muhammad Shahbaz, and Josep Torrellas. Netsparse: In-network acceleration of distributed sparse kernels. In Proceedings of the 2025 58th IEEE/ACM International Symposium on Microarchitecture, MICRO 2025, page 958–974. ACM, October 2025. doi: 10.1145/3725843.3756076. URL http://dx.doi.org/10.1145/3725843.3756076.
[5] Jennifer Scott and Miroslav Tůma. An Introduction to Sparse Matrices, pages 1–18. Springer International Publishing, 2023. ISBN 9783031258206. doi: 10.1007/978-3-031-25820-6_1. URL http://dx.doi.org/10.1007/978-3-031-25820-6_1.
[6] NDC Conferences. (2025, March 14). Performance loop — A practical guide to profiling and benchmarking — Daniel Marbach — NDC London 2025. YouTube. https://www.youtube.com/watch?v=EY6BNuVKeRY
[7] PyTorch. torch.profiler — PyTorch documentation. Available at: https://docs.pytorch.org/docs/stable/profiler.html. Accessed: 30 Oct 2025.
[8] PyTorch. Dynamo Overview — PyTorch documentation. Available at: https://docs.pytorch.org/docs/stable/torch.compiler_dynamo_overview.html. Accessed: 30 Oct 2025.
[9] Paul Elvinger, Foteini Strati, Natalie Enright Jerger, and Ana Klimovic. Measuring GPU utilization one level deeper. 2025.
[10] Aston Zhang, Zachary C. Lipton, Mu Li, and Alexander J. Smola. Dive into deep learning. 2021.
[11] Hess, P. Why a decades-old architecture decision is impeding the power of AI computing. IBM Research Blog, 2025. Available at: https://research.ibm.com/blog/why-von-neumann-architecture-is-impeding-the-power-of-ai-computing. Accessed: 30 Oct 2025.
[12] Why is vectorization faster in general than loops? Stack Overflow, 2016. Available at: https://stackoverflow.com/questions/35091979/why-is-vectorization-faster-in-general-than-loops. Accessed: 30 Oct 2025.
[13] NVIDIA Corporation. Train with Mixed Precision: User Guide. Available at: https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html. Accessed: 30 Oct 2025.
[14] Alexandre Benoit. Speeding up MACE: Low-precision tricks for equivariant force fields. October 2025.
[15] Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. 2017.
[16] Dominic Masters and Carlo Luschi. Revisiting small batch training for deep neural networks, 2018. URL https://arxiv.org/abs/1804.07612.
[17] Xin Qian and Diego Klabjan. The impact of the mini-batch size on the variance of gradients in stochastic gradient descent, 2020. URL https://arxiv.org/abs/2004.13146.