Original version: 2023/07/27, last updated: 2025/12/16
Over the years, we (Jeff & Sanjay) have done a fair bit of diving into performance tuning of various pieces of code, and improving the performance of our software has been important from the very earliest days of Google, since it lets us do more for more users. We wrote this document as a way of identifying some general principles and specific techniques that we use when doing this sort of work, and tried to pick illustrative source code changes (change lists, or CLs) that provide examples of the various approaches and techniques. Most of the concrete suggestions below reference C++ types and CLs, but the general principles apply to other languages. The document focuses on general performance tuning in the context of a single binary, and does not cover distributed systems or machine learning (ML) hardware performance tuning (huge areas unto themselves). We hope others will find this useful.
Many of the examples in this document include code fragments that demonstrate the techniques. Note that some of these code fragments mention various internal Google codebase abstractions. We have included them anyway when we felt the examples were self-contained enough to be understandable to those unfamiliar with the details of those abstractions.
The importance of thinking about performance
Knuth is often quoted out of context as saying premature optimization is the root of all evil. The full quote reads: “We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.” This document is about that critical 3%, and a more compelling quote, again from Knuth, reads:
The improvement in speed from Example 2 to Example 2a is only about 12%, and many people would pronounce that insignificant. The conventional wisdom shared by many of today’s software engineers calls for ignoring efficiency in the small; but I believe this is simply an overreaction to the abuses they see being practiced by penny-wise-and-pound-foolish programmers, who can’t debug or maintain their “optimized” programs. In established engineering disciplines a 12% improvement, easily obtained, is never considered marginal; and I believe the same viewpoint should prevail in software engineering. Of course I wouldn’t bother making such optimizations on a one-shot job, but when it’s a question of preparing quality programs, I don’t want to restrict myself to tools that deny me such efficiencies.
Many people will say “let’s write down the code in as simple a way as possible and deal with performance later when we can profile”. However, this approach is often wrong:
- If you disregard all performance concerns when developing a large system, you will end up with a flat profile where there are no obvious hotspots because performance is lost all over the place. It will be difficult to figure out how to get started on performance improvements.
- If you are developing a library that will be used by other people, the people who run into performance problems are likely to be people who cannot easily make performance improvements (they will have to understand the details of code written by other people/teams, and have to negotiate with them about the importance of performance optimizations).
- It is harder to make significant changes to a system when it is in heavy use.
- It is also hard to tell whether there are performance problems that could be solved easily, so we end up with potentially expensive workarounds like over-replication or severe overprovisioning of a service to handle load problems.
Instead, we suggest that when writing code, you try to choose the faster alternative if it does not significantly impact the readability or complexity of the code.
Estimation
If you can develop an intuition for how much performance might matter in the code you are writing, you can make a more informed decision (e.g., how much extra complexity is warranted in the name of performance). Some tips on estimating performance while you are writing code:
- Is it test code? If so, you need to worry mostly about the asymptotic complexity of your algorithms and data structures. (Aside: development cycle time matters, so avoid writing tests that take a long time to run.)
- Is it code specific to an application? If so, try to figure out how much performance matters for this piece of code. This is typically not very hard: just figuring out whether the code is initialization/setup code vs. code that will end up on hot paths (e.g., processing every request in a service) is often sufficient.
- Is it library code that will be used by many applications? In this case it is hard to tell how performance-sensitive it might become. This is where it becomes especially important to follow some of the simple techniques described in this document. For example, if you need to store a vector that usually has a small number of elements, use an absl::InlinedVector instead of std::vector. Such techniques are not very hard to follow and don’t add any non-local complexity to the system. And if it turns out that the code you are writing does end up using significant resources, it will perform better from the start, and it will be easier to find the next thing to focus on when looking at a profile.
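For example, a minimal sketch of the InlinedVector suggestion above (the Attribute type and function name here are made up for illustration):

#include "absl/container/inlined_vector.h"

struct Attribute {
  int id;
  double value;
};

// Most call sites have only a handful of attributes, so keep up to 4 inline;
// the common case then needs no heap allocation at all.
using AttributeList = absl::InlinedVector<Attribute, 4>;

AttributeList MakeDefaultAttributes() {
  AttributeList attrs;
  attrs.push_back({/*id=*/1, /*value=*/0.5});  // still within inline capacity
  return attrs;
}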
You can do a slightly deeper analysis when picking between options with potentially different performance characteristics by relying on back-of-the-envelope calculations. Such calculations can quickly give a very rough estimate of the performance of different alternatives, and the results can be used to discard some of the alternatives without having to implement them.
Here is how such an estimation might work:
- Estimate how many low-level operations of various kinds are required, e.g., number of disk seeks, number of network round-trips, bytes transmitted etc.
- Multiply each kind of expensive operation by its rough cost, and add the results together.
- The preceding gives the cost of the system in terms of resource usage. If you are interested in latency, and if the system has any concurrency, some of the costs may overlap and you may have to do slightly more complicated analysis to estimate the latency.
The following table, which is an updated version of a table from a 2007 talk at Stanford University (video of the 2007 talk no longer exists, but there is a video of a related 2011 Stanford talk that covers some of the same content), may be useful since it lists the types of operations to consider and their rough costs:
L1 cache reference                                   0.5 ns
L2 cache reference                                     3 ns
Branch mispredict                                      5 ns
Mutex lock/unlock (uncontended)                       15 ns
Main memory reference                                 50 ns
Compress 1K bytes with Snappy                      1,000 ns
Read 4KB from SSD                                 20,000 ns
Round trip within same datacenter                 50,000 ns
Read 1MB sequentially from memory                 64,000 ns
Read 1MB over 100 Gbps network                   100,000 ns
Read 1MB from SSD                              1,000,000 ns
Disk seek                                      5,000,000 ns
Read 1MB sequentially from disk               10,000,000 ns
Send packet CA->Netherlands->CA              150,000,000 ns
The preceding table contains rough costs for some basic low-level operations. You may find it useful to also track estimated costs for higher-level operations relevant to your system. E.g., you might want to know the rough cost of a point read from your SQL database, the latency of interacting with a Cloud service, or the time to render a simple HTML page. If you don’t know the relevant cost of different operations, you can’t do decent back-of-the-envelope calculations!
Example: Time to quicksort a billion 4-byte numbers
As a rough approximation, a good quicksort algorithm makes log(N) passes over an array of size N. On each pass, the array contents will be streamed from memory into the processor cache, and the partition code will compare each element once to a pivot element. Let’s add up the dominant costs:
- Memory bandwidth: the array occupies 4 GB (4 bytes per number times a billion numbers). Let’s assume ~16GB/s of memory bandwidth per core. That means each pass will take ~0.25s. N is ~2^30, so we will make ~30 passes, so the total cost of memory transfer will be ~7.5 seconds.
- Branch mispredictions: we will do a total of N*log(N) comparisons, i.e., ~30 billion comparisons. Let’s assume that half of them (i.e., 15 billion) are mispredicted. Multiplying by 5 ns per misprediction, we get a misprediction cost of 75 seconds. We assume for this analysis that correctly predicted branches are free.
- Adding up the previous numbers, we get an estimate of ~82.5 seconds.
If necessary, we could refine our analysis to account for processor caches. This refinement is probably not needed since branch mispredictions are the dominant cost according to the analysis above, but we include it here anyway as another example. Let’s assume we have a 32MB L3 cache, and that the cost of transferring data from the L3 cache to the processor is negligible. The L3 cache can hold 2^23 numbers, and therefore the last 22 passes can operate on data resident in the L3 cache (the 23rd-last pass brings data into the L3 cache and the remaining passes operate on that data). That cuts down the memory transfer cost to ~2 seconds (8 memory transfers of 4GB at 16GB/s) instead of ~7.5 seconds (30 memory transfers).
Example: Time to generate a web page with 30 image thumbnails
Let’s compare two potential designs where the original images are stored on disk, and each image is approximately 1MB in size.
- Read the contents of the 30 images serially and generate a thumbnail for each one. Each read takes one disk seek plus one 1MB transfer: 5ms for the seek and 10ms for the transfer, i.e., 15ms per image. 30 images times 15ms per image adds up to 450ms.
- Read in parallel, assuming the images are spread evenly across K disks. The previous resource usage estimate still holds, but latency will drop by roughly a factor of K, ignoring variance (e.g., we will sometimes get unlucky and one disk will have more than 1/Kth of the images we are reading). Therefore, if we are running on a distributed filesystem with hundreds of disks, the expected latency will drop to ~15ms.
- Let’s consider a variant where all images are on a single SSD. This changes the sequential read performance to 20µs + 1ms per image, which adds up to ~30 ms overall.
Measurement
The preceding section gives some tips about how to think about performance when writing code without worrying too much about how to measure the performance impact of your choices. However, before you actually start making improvements, or whenever you run into a tradeoff between performance and other concerns such as simplicity, you will want to measure or estimate the potential performance benefits. Being able to measure things effectively is the number one tool you’ll want to have in your arsenal when doing performance-related work.
As an aside, it’s worth pointing out that profiling code you’re unfamiliar with can also be a good way of getting a general sense of the structure of the codebase and how it operates. Examining the source code of routines that feature prominently in the dynamic call graph of a program can give you a high-level sense of “what happens” when running the code, which can then build your confidence in making performance-improving changes in slightly unfamiliar code.
Profiling tools and tips
Many useful profiling tools are available. A useful tool to reach for first is pprof since it gives good high level performance information and is easy to use both locally and for code running in production. Also try perf if you want more detailed insight into performance.
Some tips for profiling:
- Build production binaries with appropriate debugging information and optimization flags.
- If you can, write a microbenchmark that covers the code you are improving. Microbenchmarks improve turnaround time when making performance improvements, help verify the impact of performance improvements, and can help prevent future performance regressions. However, microbenchmarks can have pitfalls that make them non-representative of full system performance. Useful microbenchmark libraries exist for C++, Go, and Java (see the C++ sketch after this list).
- Use a benchmark library to emit performance counter readings, both for better precision and to get more insight into program behavior.
- Lock contention can often artificially lower CPU usage. Some mutex implementations provide support for profiling lock contention.
- Use ML profilers for machine learning performance work.
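As a concrete illustration of the microbenchmark tip above, here is a minimal sketch using the open-source C++ benchmark library; the DecodeBlock function is a made-up stand-in for the code under test:

#include <numeric>
#include <vector>
#include "benchmark/benchmark.h"

// Hypothetical function under test: decodes a block of `size` values.
static std::vector<int> DecodeBlock(int size) {
  std::vector<int> out(size);
  std::iota(out.begin(), out.end(), 0);
  return out;
}

static void BM_DecodeBlock(benchmark::State& state) {
  const int size = state.range(0);
  for (auto _ : state) {
    // DoNotOptimize keeps the compiler from discarding the result.
    benchmark::DoNotOptimize(DecodeBlock(size));
  }
}
BENCHMARK(BM_DecodeBlock)->Arg(64)->Arg(1024);
BENCHMARK_MAIN();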
What to do when profiles are flat
You will often run into situations where your CPU profile is flat (there is no obvious big contributor to slowness). This can often happen when all low-hanging fruit has been picked. Here are some tips to consider if you find yourself in this situation:
- Don’t discount the value of many small optimizations! Making twenty separate 1% improvements in some subsystem is often eminently possible, and collectively they add up to a pretty sizable improvement (work of this flavor often relies on having stable and high quality microbenchmarks). Some examples of these sorts of changes are in the changes that demonstrate multiple techniques section.
- Find loops closer to the top of call stacks (flame graph view of a CPU profile can be helpful here). Potentially, the loop or the code it calls could be restructured to be more efficient. Some code that initially built a complicated graph structure incrementally by looping over nodes and edges of the input was changed to build the graph structure in one shot by passing it the entire input. This removed a bunch of internal checks that were happening per edge in the initial code.
- Take a step back and look for structural changes higher up in the call stacks instead of concentrating on micro-optimizations. The techniques listed under algorithmic improvements can be useful when doing this.
- Look for overly general code. Replace it with a customized or lower-level implementation. E.g., if an application is repeatedly using a regular expression match where a simple prefix match would suffice, consider dropping the use of the regular expression (a sketch follows this list).
- Attempt to reduce the number of allocations: get an allocation profile, and pick away at the highest contributor to the number of allocations. This will have two effects: (1) It will provide a direct reduction of the amount of time spent in the allocator (and garbage collector for GC-ed languages) (2) There will often be a reduction in cache misses since in a long running program using tcmalloc, every allocation tends to go to a different cache line.
- Gather other types of profiles, especially ones based on hardware performance counters. Such profiles may point out functions that are encountering a high cache miss rate. Techniques described in the profiling tools and tips section can be helpful.
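To make the “overly general code” point above concrete, here is a small sketch (the function name and pattern are hypothetical) of replacing a regular expression with a plain prefix check:

#include "absl/strings/match.h"
#include "absl/strings/string_view.h"
#include "re2/re2.h"

// Before: the regular expression engine runs on every call.
bool IsDebugPathSlow(absl::string_view path) {
  return RE2::PartialMatch(path, "^/debug/");
}

// After: a simple prefix comparison with no regexp machinery.
bool IsDebugPath(absl::string_view path) {
  return absl::StartsWith(path, "/debug/");
}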
API considerations
Some of the techniques suggested below require changing data structures and function signatures, which may be disruptive to callers. Try to organize code so that the suggested performance improvements can be made inside an encapsulation boundary without affecting public interfaces. This will be easier if your modules are deep (significant functionality accessed via a narrow interface).
Widely used APIs come under heavy pressure to add features. Be careful when adding new features since these will constrain future implementations and increase cost unnecessarily for users who don’t need the new features. E.g., many C++ standard library containers promise iterator stability, which in typical implementations increases the number of allocations significantly, even though many users do not need pointer stability.
Some specific techniques are listed below. Consider carefully the performance benefits vs. any API usability issues introduced by such changes.
Bulk APIs
Provide bulk ops to reduce expensive API boundary crossings or to take advantage of algorithmic improvements.
Add bulk MemoryManager::LookupMany interface.
In addition to adding a bulk interface, this also simplified the signature for the new bulk variant: it turns out clients only needed to know if all the keys were found, so we can return a bool rather than a Status object.
memory_manager.h
class MemoryManager {
public:
...
util::StatusOr<LiveTensor> Lookup(const TensorIdProto& id);
class MemoryManager {
public:
...
util::StatusOr<LiveTensor> Lookup(const TensorIdProto& id);
// Lookup the identified tensors
struct LookupKey {
ClientHandle client;
uint64 local_id;
};
bool LookupMany(absl::Span<const LookupKey> keys,
absl::Span<tensorflow::Tensor> tensors);
Add bulk ObjectStore::DeleteRefs API to amortize locking overhead.
object_store.h
template <typename T>
class ObjectStore {
public:
...
absl::Status DeleteRef(Ref);
template <typename T>
class ObjectStore {
public:
...
absl::Status DeleteRef(Ref);
// Delete many references. For each ref, if no other Refs point to the same
// object, the object will be deleted. Returns non-OK on any error.
absl::Status DeleteRefs(absl::Span<const Ref> refs);
...
template <typename T>
absl::Status ObjectStore<T>::DeleteRefs(absl::Span<const Ref> refs) {
absl::Status result;
absl::MutexLock l(&mu_);
for (auto ref : refs) {
result.Update(DeleteRefLocked(ref));
}
return result;
}
memory_tracking.cc
void HandleBatch(int, const plaque::Batch& input) override {
for (const auto& t : input) {
auto in = In(t);
PLAQUE_OP_ASSIGN_OR_RETURN(const auto& handles, in.handles());
for (const auto handle : handles.value->handles()) {
PLAQUE_OP_RETURN_IF_ERROR(in_buffer_store_
? bstore_->DeleteRef(handle)
: tstore_->DeleteRef(handle));
}
}
}
void HandleBatch(int, const plaque::Batch& input) override {
for (const auto& t : input) {
auto in = In(t);
PLAQUE_OP_ASSIGN_OR_RETURN(const auto& handles, in.handles());
if (in_buffer_store_) {
PLAQUE_OP_RETURN_IF_ERROR(
bstore_->DeleteRefs(handles.value->handles()));
} else {
PLAQUE_OP_RETURN_IF_ERROR(
tstore_->DeleteRefs(handles.value->handles()));
}
}
}
Use Floyd’s heap construction for efficient initialization.
Bulk initialization of a heap can be done in O(N) time, whereas adding one element at a time and updating the heap property after each addition requires O(N lg(N)) time.
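In C++, std::make_heap performs this kind of O(N) bulk construction, whereas restoring the heap property after each individual insertion costs O(N lg(N)) overall. A minimal sketch:

#include <algorithm>
#include <vector>

void BuildHeap(std::vector<int>& v) {
  // O(N): bulk (Floyd-style) heap construction over the whole range.
  std::make_heap(v.begin(), v.end());
}

void BuildHeapSlowly(std::vector<int>& v) {
  // O(N lg(N)): re-establish the heap property after every single element.
  for (auto it = v.begin(); it != v.end(); ++it) {
    std::push_heap(v.begin(), it + 1);
  }
}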
Sometimes it is hard to change callers to use a new bulk API directly. In that case it might be beneficial to use a bulk API internally and cache the results for use in future non-bulk API calls:
Cache block decode results for use in future calls.
Each lookup needs to decode a whole block of K entries. Store the decoded entries in a cache and consult the cache on future lookups.
lexicon.cc
void GetTokenString(int pos, std::string* out) const {
...
absl::FixedArray<LexiconEntry, 32> entries(pos + 1);
// Decode all lexicon entries up to and including pos.
for (int i = 0; i <= pos; ++i) {
p = util::coding::TwoValuesVarint::Decode32(p, &entries[i].remaining,
&entries[i].shared);
entries[i].remaining_str = p;
p += entries[i].remaining; // remaining bytes trail each entry.
}
mutable std::vector<absl::InlinedVector<std::string, 16>> cache_;
...
void GetTokenString(int pos, std::string* out) const {
...
DCHECK_LT(skentry, cache_.size());
if (!cache_[skentry].empty()) {
*out = cache_[skentry][pos];
return;
}
...
// Init cache.
...
const char* prev = p;
for (int i = 0; i < block_sz; ++i) {
uint32 shared, remaining;
p = TwoValuesVarint::Decode32(p, &remaining, &shared);
auto& cur = cache_[skentry].emplace_back();
gtl::STLStringResizeUninitialized(&cur, remaining + shared);
std::memcpy(cur.data(), prev, shared);
std::memcpy(cur.data() + shared, p, remaining);
prev = cur.data();
p += remaining;
}
*out = cache_[skentry][pos];
View types
Prefer view types (e.g., std::string_view, absl::Span<T>, absl::FunctionRef<R(Args...)>) for function arguments (unless ownership of the data is being transferred). These types reduce copying, and allow callers to pick their own container types (e.g., one caller might use std::vector whereas another one uses absl::InlinedVector).
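A minimal sketch of view-based signatures (the function names here are made up):

#include <cstdint>
#include <numeric>
#include <vector>
#include "absl/container/inlined_vector.h"
#include "absl/strings/string_view.h"
#include "absl/types/span.h"

// Accepts std::string, string literals, etc. without copying the characters.
bool HasPrefix(absl::string_view name, absl::string_view prefix) {
  return name.substr(0, prefix.size()) == prefix;
}

// Works with std::vector, absl::InlinedVector, plain arrays, ...
int64_t Sum(absl::Span<const int> values) {
  return std::accumulate(values.begin(), values.end(), int64_t{0});
}

void Example() {
  std::vector<int> a = {1, 2, 3};
  absl::InlinedVector<int, 4> b = {4, 5};
  Sum(a);  // no copy of the container contents
  Sum(b);  // the same function works for a different container type
}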
Pre-allocated/pre-computed arguments
For frequently called routines, sometimes it is useful to allow higher-level callers to pass in a data structure that they own or information that the called routine needs that the client already has. This can avoid the low-level routine being forced to allocate its own temporary data structure or recompute already-available information.
Add RPC_Stats::RecordRPC variant allowing client to pass in already available WallTime value.
rpc-stats.h
static void RecordRPC(const Name &name, const RPC_Stats_Measurement& m);
static void RecordRPC(const Name &name, const RPC_Stats_Measurement& m,
WallTime now);
clientchannel.cc
const WallTime now = WallTime_Now();
...
RPC_Stats::RecordRPC(stats_name, m);
const WallTime now = WallTime_Now();
...
RPC_Stats::RecordRPC(stats_name, m, now);
Thread-compatible vs. Thread-safe types
A type may be either thread-compatible (synchronized externally) or thread-safe (synchronized internally). Most generally used types should be thread-compatible. This way callers who do not need thread-safety don’t pay for it.
Make a class thread-compatible since callers are already synchronized.
hitless-transfer-phase.cc
TransferPhase HitlessTransferPhase::get() const {
static CallsiteMetrics cm("HitlessTransferPhase::get");
MonitoredMutexLock l(&cm, &mutex_);
return phase_;
}
TransferPhase HitlessTransferPhase::get() const { return phase_; }
hitless-transfer-phase.cc
bool HitlessTransferPhase::AllowAllocate() const {
static CallsiteMetrics cm("HitlessTransferPhase::AllowAllocate");
MonitoredMutexLock l(&cm, &mutex_);
return phase_ == TransferPhase::kNormal || phase_ == TransferPhase::kBrownout;
}
bool HitlessTransferPhase::AllowAllocate() const {
return phase_ == TransferPhase::kNormal || phase_ == TransferPhase::kBrownout;
}
However if the typical use of a type needs synchronization, prefer to move the synchronization inside the type. This allows the synchronization mechanism to be tweaked as necessary to improve performance (e.g., sharding to reduce contention) without affecting callers.
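For instance, here is a rough sketch (not taken from any of the CLs above) of a thread-safe counter whose internal synchronization is sharded to reduce contention; callers see a simple Add/Total interface while the locking strategy remains an implementation detail that can be tuned later:

#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>
#include "absl/synchronization/mutex.h"

class ShardedCounter {
 public:
  void Add(int64_t delta) {
    // Hash the calling thread to a shard so concurrent callers usually
    // touch different mutexes (and different cache lines).
    size_t shard =
        std::hash<std::thread::id>{}(std::this_thread::get_id()) % kNumShards;
    Shard& s = shards_[shard];
    absl::MutexLock l(&s.mu);
    s.value += delta;
  }

  int64_t Total() const {
    int64_t total = 0;
    for (const Shard& s : shards_) {
      absl::MutexLock l(&s.mu);
      total += s.value;
    }
    return total;
  }

 private:
  static constexpr int kNumShards = 16;
  // Padded to a cache line to avoid false sharing between shards.
  struct alignas(64) Shard {
    mutable absl::Mutex mu;
    int64_t value = 0;
  };
  Shard shards_[kNumShards];
};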
Algorithmic improvements
The most critical opportunities for performance improvements come from algorithmic improvements, e.g., turning an O(N²) algorithm to O(N lg(N)) or O(N), avoiding potentially exponential behavior, etc. These opportunities are rare in stable code, but are worth paying attention to when writing new code. A few examples that show such improvements to pre-existing code:
Add nodes to cycle detection structure in reverse post-order.
We were previously adding graph nodes and edges one at a time to a cycle-detection data structure, which required expensive work per edge. We now add the entire graph in reverse post-order, which makes cycle-detection trivial.
graphcycles.h
class GraphCycles : public util_graph::Graph {
public:
GraphCycles();
~GraphCycles() override;
using Node = util_graph::Node;
class GraphCycles : public util_graph::Graph {
public:
GraphCycles();
~GraphCycles() override;
using Node = util_graph::Node;
// InitFrom adds all the nodes and edges from src, returning true if
// successful, false if a cycle is encountered.
// REQUIRES: no nodes and edges have been added to GraphCycles yet.
bool InitFrom(const util_graph::Graph& src);
graphcycles.cc
bool GraphCycles::InitFrom(const util_graph::Graph& src) {
...
// Assign ranks in topological order so we don't need any reordering during
// initialization. For an acyclic graph, DFS leaves nodes in reverse
// topological order, so we assign decreasing ranks to nodes as we leave them.
Rank last_rank = n;
auto leave = [&](util_graph::Node node) {
DCHECK(r->rank[node] == kMissingNodeRank);
NodeInfo* nn = &r->nodes[node];
nn->in = kNil;
nn->out = kNil;
r->rank[node] = --last_rank;
};
util_graph::DFSAll(src, std::nullopt, leave);
// Add all the edges (detect cycles as we go).
bool have_cycle = false;
util_graph::PerEdge(src, [&](util_graph::Edge e) {
DCHECK_NE(r->rank[e.src], kMissingNodeRank);
DCHECK_NE(r->rank[e.dst], kMissingNodeRank);
if (r->rank[e.src] >= r->rank[e.dst]) {
have_cycle = true;
} else if (!HasEdge(e.src, e.dst)) {
EdgeListAddNode(r, &r->nodes[e.src].out, e.dst);
EdgeListAddNode(r, &r->nodes[e.dst].in, e.src);
}
});
if (have_cycle) {
return false;
} else {
DCHECK(CheckInvariants());
return true;
}
}
graph_partitioner.cc
absl::Status MergeGraph::Init() {
const Graph& graph = *compiler_->graph();
clusters_.resize(graph.NodeLimit());
graph.PerNode([&](Node node) {
graph_->AddNode(node);
NodeList* n = new NodeList;
n->push_back(node);
clusters_[node] = n;
});
absl::Status s;
PerEdge(graph, [&](Edge e) {
if (!s.ok()) return;
if (graph_->HasEdge(e.src, e.dst)) return; // already added
if (!graph_->InsertEdge(e.src, e.dst)) {
s = absl::InvalidArgumentError("cycle in the original graph");
}
});
return s;
}
absl::Status MergeGraph::Init() {
const Graph& graph = *compiler_->graph();
if (!graph_->InitFrom(graph)) {
return absl::InvalidArgumentError("cycle in the original graph");
}
clusters_.resize(graph.NodeLimit());
graph.PerNode([&](Node node) {
NodeList* n = new NodeList;
n->push_back(node);
clusters_[node] = n;
});
return absl::OkStatus();
}
Replace the deadlock detection system built into a mutex implementation with a better algorithm.
Replaced the deadlock detection algorithm with one that is ~50x as fast and scales to millions of mutexes without problems (the old algorithm relied on a 2K limit to avoid a performance cliff). The new code is based on the following paper: David J. Pearce and Paul H. J. Kelly, “A Dynamic Topological Sort Algorithm for Directed Acyclic Graphs”, Journal of Experimental Algorithmics (JEA), Volume 11, 2006, Article No. 1.7.
The new algorithm takes O(|V|+|E|) space (instead of the O(|V|^2) bits needed by the older algorithm). Lock-acquisition order graphs are very sparse, so this is much less space. The algorithm is also quite simple: the core of it is ~100 lines of C++. Since the code now scales to a much larger number of Mutexes, we were able to relax an artificial 2K limit, which uncovered a number of latent deadlocks in real programs.
Benchmark results: these were run in DEBUG mode since deadlock detection is mainly enabled in debug mode. The benchmark argument (/2k etc.) is the number of tracked nodes. At the default 2k limit of the old algorithm, the new algorithm takes only 0.5 microseconds per InsertEdge compared to 22 microseconds for the old algorithm. The new algorithm also easily scales to much larger graphs without problems whereas the old algorithm keels over quickly.
DEBUG: Benchmark              Time(ns)    CPU(ns)  Iterations
--------------------------------------------------------------
Old algorithm:
DEBUG: BM_StressTest/2k          23553      23566       29086
DEBUG: BM_StressTest/4k          45879      45909       15287
DEBUG: BM_StressTest/16k        776938     777472         817
New algorithm:
DEBUG: BM_StressTest/2k            392        393    10485760
DEBUG: BM_StressTest/4k            392        393    10485760
DEBUG: BM_StressTest/32k           407        407    10485760
DEBUG: BM_StressTest/256k          456        456    10485760
DEBUG: BM_StressTest/1M            534        534    10485760
Replace an IntervalMap (with O(lg N) lookups) with a hash table (O(1) lookups).
The initial code was using IntervalMap because it seemed like the right data structure to support coalescing of adjacent blocks, but a hash table suffices since the adjacent block can be found by a hash table lookup. This (plus other changes in the CL) improves the performance of tpu::BestFitAllocator by ~4X.
best_fit_allocator.h
using Block = gtl::IntervalMap<int64, BlockState>::Entry;
...
// Map of pairs (address range, BlockState) with one entry for each allocation
// covering the range [0, allocatable_range_end_). Adjacent kFree and
// kReserved blocks are coalesced. Adjacent kAllocated blocks are not
// coalesced.
gtl::IntervalMap<int64, BlockState> block_list_;
// Set of all free blocks sorted according to the allocation policy. Adjacent
// free blocks are coalesced.
std::set<Block, BlockSelector> free_list_;
// A faster hash function for offsets in the BlockTable
struct OffsetHash {
ABSL_ATTRIBUTE_ALWAYS_INLINE size_t operator()(int64 value) const {
uint64 m = value;
m *= uint64_t{0x9ddfea08eb382d69};
return static_cast<uint64_t>(m ^ (m >> 32));
}
};
// Hash table maps from block start address to block info.
// We include the length of the previous block in this info so we
// can find the preceding block to coalesce with.
struct HashTableEntry {
BlockState state;
int64 my_length;
int64 prev_length; // Zero if there is no previous block.
};
using BlockTable = absl::flat_hash_map<int64, HashTableEntry, OffsetHash>;
Replace sorted-list intersection (O(N log N)) with hash table lookups (O(N)).
Old code to detect whether or not two nodes share a common source would get the sources for each node in sorted order and then do a sorted intersection. The new code places the sources for one node in a hash-table and then iterates over the other node’s sources checking the hash-table.
name old time/op new time/op delta
BM_CompileLarge 28.5s ± 2% 22.4s ± 2% -21.61% (p=0.008 n=5+5)
Implement a good hash function so that lookups are O(1) instead of O(N).
location.h
// Hasher for Location objects.
struct LocationHash {
size_t operator()(const Location* key) const {
return key != nullptr ? util_hash::Hash(key->address()) : 0;
}
};
size_t HashLocation(const Location& loc);
...
struct LocationHash {
size_t operator()(const Location* key) const {
return key != nullptr ? HashLocation(*key) : 0;
}
};
location.cc
size_t HashLocation(const Location& loc) {
util_hash::MurmurCat m;
// Encode some simpler features into a single value.
m.AppendAligned((loc.dynamic() ? 1 : 0) //
| (loc.append_shard_to_address() ? 2 : 0) //
| (loc.is_any() ? 4 : 0) //
| (!loc.any_of().empty() ? 8 : 0) //
| (loc.has_shardmap() ? 16 : 0) //
| (loc.has_sharding() ? 32 : 0));
if (loc.has_shardmap()) {
m.AppendAligned(loc.shardmap().output() |
static_cast<uint64_t>(loc.shardmap().stmt()) << 20);
}
if (loc.has_sharding()) {
uint64_t num = 0;
switch (loc.sharding().type_case()) {
case Sharding::kModShard:
num = loc.sharding().mod_shard();
break;
case Sharding::kRangeSplit:
num = loc.sharding().range_split();
break;
case Sharding::kNumShards:
num = loc.sharding().num_shards();
break;
default:
num = 0;
break;
}
m.AppendAligned(static_cast<uint64_t>(loc.sharding().type_case()) |
(num << 3));
}
auto add_string = [&m](absl::string_view s) {
if (!s.empty()) {
m.Append(s.data(), s.size());
}
};
add_string(loc.address());
add_string(loc.lb_policy());
// We do not include any_of since it is complicated to compute a hash
// value that is not sensitive to order and duplication.
return m.GetHash();
}
Better memory representation
Careful consideration of the memory footprint and cache footprint of important data structures can often yield big savings. The data structures below focus on supporting common operations by touching fewer cache lines. Care taken here can (a) avoid expensive cache misses, and (b) reduce memory bus traffic, which speeds up both the program in question and anything else running on the same machine. They rely on some common techniques you may find useful when designing your own data structures.
Compact data structures
Use compact representations for data that will be accessed often or that comprises a large portion of the application’s memory usage. A compact representation can significantly reduce memory usage and improve performance by touching fewer cache lines and reducing memory bus bandwidth usage. However, watch out for cache-line contention.
Memory layout
Carefully consider the memory layout of types that have a large memory or cache footprint.
- Reorder fields to reduce padding between fields with different alignment requirements (see class layout discussion, and the sketch after this list).
- Use smaller numeric types where the stored data will fit in the smaller type.
- Enum values sometimes take up a whole word unless you’re careful. Consider using a smaller representation (e.g., use enum class OpType : uint8_t { ... } instead of enum class OpType { ... }).
- Order fields so that fields that are frequently accessed together are closer to each other – this will reduce the number of cache lines touched on common operations.
- Place hot read-only fields away from hot mutable fields so that writes to the mutable fields do not cause the read-only fields to be evicted from nearby caches.
- Move cold data so it does not live next to hot data, either by placing the cold data at the end of the struct, or behind a level of indirection, or in a separate array.
- Consider packing things into fewer bytes by using bit and byte-level encoding. This can be complicated, so only do this when the data under question is encapsulated inside a well-tested module, and the overall reduction of memory usage is significant. Furthermore, watch out for side effects like under-alignment of frequently used data, or more expensive code for accessing packed representations. Validate such changes using benchmarks.
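The following sketch (with made-up field names) illustrates the reordering and smaller-type points from the list above; exact sizes depend on the ABI, but on a typical 64-bit platform the first layout is 40 bytes and the second is 24:

#include <cstdint>

enum class OpType : uint8_t { kAdd, kMul };

// Before: padding after the bool and the int32_t, plus a full int for the
// type, inflate the struct to 40 bytes.
struct OpBefore {
  bool is_live;      // 1 byte + 7 bytes padding
  const char* name;  // 8 bytes
  int32_t id;        // 4 bytes + 4 bytes padding
  double weight;     // 8 bytes
  int type;          // 4 bytes + 4 bytes tail padding
};

// After: 8-byte fields first, then 4-byte, then 1-byte fields, and a 1-byte
// underlying type for the enum: 24 bytes total.
struct OpAfter {
  const char* name;  // 8 bytes
  double weight;     // 8 bytes
  int32_t id;        // 4 bytes
  OpType type;       // 1 byte
  bool is_live;      // 1 byte + 2 bytes tail padding
};

static_assert(sizeof(OpAfter) < sizeof(OpBefore),
              "reordering and narrowing shrink the struct");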
Indices instead of pointers
On modern 64-bit machines, pointers take up 64 bits. If you have a pointer-rich data structure, you can easily chew up lots of memory with indirections of T*. Instead, consider using integer indices into an array T[] or other data structure. Not only will the references be smaller (if the number of indices is small enough to fit in 32 or fewer bits), but the storage for all the T[] elements will be contiguous, often leading to better cache locality.
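A rough sketch of the idea (the Graph and Node types here are hypothetical): store all nodes in one vector and refer to them by 32-bit indices instead of pointers:

#include <cstdint>
#include <vector>

// Pointer version: each edge reference costs 8 bytes and nodes may be
// scattered across the heap:
//   struct Node { std::vector<Node*> out_edges; ... };

// Index version: references are 4 bytes, and all nodes live contiguously in
// one vector, which improves cache locality.
using NodeId = uint32_t;

struct Node {
  std::vector<NodeId> out_edges;  // indices into Graph::nodes
  // ... node payload ...
};

struct Graph {
  std::vector<Node> nodes;

  NodeId AddNode() {
    nodes.push_back(Node{});
    return static_cast<NodeId>(nodes.size() - 1);
  }
  void AddEdge(NodeId from, NodeId to) {
    nodes[from].out_edges.push_back(to);
  }
};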
Batched storage
Avoid data structures that allocate a separate object per stored element (e.g., std::map, std::unordered_map in C++). Instead, consider types that use chunked or flat representations to store multiple elements in close proximity in memory (e.g., std::vector, absl::flat_hash_{map,set} in C++). Such types tend to have much better cache behavior. Furthermore, they encounter less allocator overhead.
One useful technique is to partition elements into chunks where each chunk can hold a fixed number of elements. This technique can reduce the cache footprint of a data structure significantly while preserving good asymptotic behavior.
For some data structures, a single chunk suffices to hold all elements (e.g., strings and vectors). Other types (e.g., absl::flat_hash_map) also use this technique.
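A toy sketch of the chunking idea (ChunkedVector is a made-up, append-only container, not a library type):

#include <array>
#include <cstddef>
#include <memory>
#include <vector>

// Elements are stored in fixed-size chunks: neighbors usually share cache
// lines, and the allocator is invoked once per kChunkSize elements instead of
// once per element.
template <typename T, size_t kChunkSize = 64>
class ChunkedVector {
 public:
  void push_back(const T& value) {
    if (size_ % kChunkSize == 0) {
      chunks_.push_back(std::make_unique<std::array<T, kChunkSize>>());
    }
    (*chunks_.back())[size_ % kChunkSize] = value;
    ++size_;
  }
  const T& operator[](size_t i) const {
    return (*chunks_[i / kChunkSize])[i % kChunkSize];
  }
  size_t size() const { return size_; }

 private:
  std::vector<std::unique_ptr<std::array<T, kChunkSize>>> chunks_;
  size_t size_ = 0;
};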
Inlined storage
Some container types are optimized for storing a small number of elements. These types provide space for a small number of elements at the top level and completely avoid allocations when the number of elements is small. This can be very helpful when instances of such types are constructed often (e.g., as stack variables in frequently executed code), or if many instances are live at the same time. If a container will typically contain a small number of elements consider using one of the inlined storage types, e.g., InlinedVector.
Caveat: if sizeof(T) is large, inlined storage containers may not be the best choice since the inlined backing store will be large.
Unnecessarily nested maps
Sometimes a nested map data structure can be replaced with a single-level map with a compound key. This can reduce the cost of lookups and insertions significantly.
Reduce allocations and improve cache footprint by converting btree<a,btree<b,c>> to btree<pair<a,b>,c>.
graph_splitter.cc
absl::btree_map<std::string, absl::btree_map<std::string, OpDef>> ops;
// The btree maps from {package_name, op_name} to its const OpDef*.
absl::btree_map<std::pair<absl::string_view, absl::string_view>,
const OpDef*>
ops;
Caveat: if the first map key is big, it might be better to stick with nested maps:
Switching to a nested map leads to a 76% performance improvement in a microbenchmark.
We previously had a single-level hash table where the key consisted of a (string) path and some other numeric sub-keys. Each path occurred in approximately 1000 keys on average. We split the hash table into two levels where the first level was keyed by the path and each second level hash table kept just the sub-key to data mapping for a particular path. This reduced the memory usage for storing paths by a factor of 1000, and also sped up accesses where many sub-keys for the same path were accessed together.
Arenas
Arenas can help reduce memory allocation cost, but they also have the benefit of packing together independently allocated items next to each other, typically in fewer cache lines, and eliminating most destruction costs. They are likely most effective for complex data structures with many sub-objects. Consider providing an appropriate initial size for the arena since that can help reduce allocations.
Caveat: it is easy to misuse arenas by putting too many short-lived objects in a long-lived arena, which can unnecessarily bloat memory footprint.
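For protocol buffer messages, arena allocation might look like the following sketch (MyRequest is a placeholder for some generated message type):

#include <string>
#include "absl/types/span.h"
#include "google/protobuf/arena.h"

void ProcessBatch(absl::Span<const std::string> serialized_requests) {
  google::protobuf::ArenaOptions options;
  options.start_block_size = 64 * 1024;  // pre-size to reduce block allocations
  google::protobuf::Arena arena(options);
  for (const std::string& bytes : serialized_requests) {
    // The message, its sub-messages, and its strings are carved out of the
    // arena's blocks, close together in memory.
    MyRequest* req = google::protobuf::Arena::Create<MyRequest>(&arena);
    if (!req->ParseFromString(bytes)) continue;
    // ... process *req ...
  }
}  // Destroying the arena frees everything at once, with no per-message work.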
Arrays instead of maps
If the domain of a map can be represented by a small integer or is an enum, or if the map will have very few elements, the map can sometimes be replaced by an array or a vector of some form.
Use an array instead of flat_map.
rtp_controller.h
const gtl::flat_map<int, int> payload_type_to_clock_frequency_;
// A map (implemented as a simple array) indexed by payload_type to clock freq
// for that payload type (or 0)
struct PayloadTypeToClockRateMap {
int map[128];
};
...
const PayloadTypeToClockRateMap payload_type_to_clock_frequency_;
Bit vectors instead of sets
If the domain of a set can be represented by a small integer, the set can be replaced with a bit vector (InlinedBitVector is often a good choice). Set operations can also be nicely efficient on these representations using bitwise boolean operations (OR for union, AND for intersection, etc.).
Spanner placement system. Replace dense_hash_set<ZoneId> with a bit-vector with one bit per zone.
zone_set.h
class ZoneSet: public dense_hash_set<ZoneId> {
public:
...
bool Contains(ZoneId zone) const {
return count(zone) > 0;
}
class ZoneSet {
...
// Returns true iff "zone" is contained in the set
bool ContainsZone(ZoneId zone) const {
return zone < b_.size() && b_.get_bit(zone);
}
...
private:
int size_; // Number of zones inserted
util::bitmap::InlinedBitVector<256> b_;
Benchmark results:
CPU: AMD Opteron (4 cores) dL1:64KB dL2:1024KB
Benchmark Base (ns) New (ns) Improvement
------------------------------------------------------------------
BM_Evaluate/1 960 676 +29.6%
BM_Evaluate/2 1661 1138 +31.5%
BM_Evaluate/3 2305 1640 +28.9%
BM_Evaluate/4 3053 2135 +30.1%
BM_Evaluate/5 3780 2665 +29.5%
BM_Evaluate/10 7819 5739 +26.6%
BM_Evaluate/20 17922 12338 +31.2%
BM_Evaluate/40 36836 26430 +28.2%
Use bit matrix to keep track of reachability properties between operands instead of hash table.
hlo_computation.h
using TransitiveOperandMap =
std::unordered_map<const HloInstruction*,
std::unordered_set<const HloInstruction*>>;
class HloComputation::ReachabilityMap {
...
// dense id assignment from HloInstruction* to number
tensorflow::gtl::FlatMap<const HloInstruction*, int> ids_;
// matrix_(a,b) is true iff b is reachable from a
tensorflow::core::Bitmap matrix_;
};
Reduce allocations
Memory allocation adds costs:
- It increases the time spent in the allocator.
- Newly-allocated objects may require expensive initialization and sometimes corresponding expensive destruction when no longer needed.
- Every allocation tends to be on a new cache line and therefore data spread across many independent allocations will have a larger cache footprint than data spread across fewer allocations.
Garbage-collection runtimes sometimes obviate the third issue by placing consecutive allocations sequentially in memory.
Avoid unnecessary allocations
Reducing allocations increases benchmark throughput by 21%.
memory_manager.cc
LiveTensor::LiveTensor(tf::Tensor t, std::shared_ptr<const DeviceInfo> dinfo,
bool is_batched)
: tensor(std::move(t)),
device_info(dinfo ? std::move(dinfo) : std::make_shared<DeviceInfo>()),
is_batched(is_batched) {
static const std::shared_ptr<DeviceInfo>& empty_device_info() {
static std::shared_ptr<DeviceInfo>* result =
new std::shared_ptr<DeviceInfo>(new DeviceInfo);
return *result;
}
LiveTensor::LiveTensor(tf::Tensor t, std::shared_ptr<const DeviceInfo> dinfo,
bool is_batched)
: tensor(std::move(t)), is_batched(is_batched) {
if (dinfo) {
device_info = std::move(dinfo);
} else {
device_info = empty_device_info();
}
Use statically-allocated zero vector when possible rather than allocating a vector and filling it with zeroes.
embedding_executor_8bit.cc
// The actual implementation of the EmbeddingLookUpT using template parameters
// instead of object members to improve the performance.
template <bool Mean, bool SymmetricInputRange>
static tensorflow::Status EmbeddingLookUpT(...) {
...
std::unique_ptr<tensorflow::quint8[]> zero_data(
new tensorflow::quint8[max_embedding_width]);
memset(zero_data.get(), 0, sizeof(tensorflow::quint8) * max_embedding_width);
// A size large enough to handle most embedding widths
static const int kTypicalMaxEmbedding = 256;
static tensorflow::quint8 static_zero_data[kTypicalMaxEmbedding]; // All zeroes
...
// The actual implementation of the EmbeddingLookUpT using template parameters
// instead of object members to improve the performance.
template <bool Mean, bool SymmetricInputRange>
static tensorflow::Status EmbeddingLookUpT(...) {
...
std::unique_ptr<tensorflow::quint8[]> zero_data_backing(nullptr);
// Get a pointer to a memory area with at least
// "max_embedding_width" quint8 zero values.
tensorflow::quint8* zero_data;
if (max_embedding_width <= ARRAYSIZE(static_zero_data)) {
// static_zero_data is big enough so we don't need to allocate zero data
zero_data = &static_zero_data[0];
} else {
// static_zero_data is not big enough: we need to allocate zero data
zero_data_backing =
absl::make_unique<tensorflow::quint8[]>(max_embedding_width);
memset(zero_data_backing.get(), 0,
sizeof(tensorflow::quint8) * max_embedding_width);
zero_data = zero_data_backing.get();
}
Also, prefer stack allocation over heap allocation when object lifetime is bounded by the scope (although be careful with stack frame sizes for large objects).
Resize or reserve containers
When the maximum or expected maximum size of a vector (or some other container types) is known in advance, pre-size the container’s backing store (e.g., using resize or reserve in C++).
Pre-size a vector and fill it in, rather than N push_back operations.
indexblockdecoder.cc
for (int i = 0; i < ndocs-1; i++) {
uint32 delta;
ERRORCHECK(b->GetRice(rice_base, &delta));
docs_.push_back(DocId(my_shard_ + (base + delta) * num_shards_));
base = base + delta + 1;
}
docs_.push_back(last_docid_);
docs_.resize(ndocs);
DocId* docptr = &docs_[0];
for (int i = 0; i < ndocs-1; i++) {
uint32 delta;
ERRORCHECK(b.GetRice(rice_base, &delta));
*docptr = DocId(my_shard_ + (base + delta) * num_shards_);
docptr++;
base = base + delta + 1;
}
*docptr = last_docid_;
Caveat: Do not use resize or reserve to grow one element at a time since that may lead to quadratic behavior. Also, if element construction is expensive, prefer an initial reserve call followed by several push_back or emplace_back calls instead of an initial resize since that will double the number of constructor calls.
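A small sketch contrasting the two patterns mentioned in the caveat:

#include <string>
#include <vector>

std::vector<std::string> MakeNames(int n) {
  std::vector<std::string> names;

  // Good: one allocation for the backing store; each element constructed once.
  names.reserve(n);
  for (int i = 0; i < n; ++i) {
    names.push_back("name-" + std::to_string(i));
  }

  // Avoid: resize(n) default-constructs n strings that are then overwritten,
  // roughly doubling the element construction/assignment work.
  //   names.resize(n);
  //   for (int i = 0; i < n; ++i) names[i] = "name-" + std::to_string(i);

  return names;
}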
Avoid copying when possible
- Prefer moving to copying data structures when possible.
- If lifetime is not an issue, store pointers or indices instead of copies of objects in transient data structures. E.g., if a local map is used to select a set of protos from an incoming list of protos, we can make the map store just pointers to the incoming protos instead of copying potentially deeply nested data. Another common example is sorting a vector of indices rather than sorting a vector of large objects directly since the latter would incur significant copying/moving costs.
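A sketch of the sort-the-indices pattern mentioned above (the Record type is made up):

#include <algorithm>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

struct Record {
  std::string payload;  // large; expensive to copy or move repeatedly
  int key;
};

// Returns the order of `records` by key without moving the records themselves;
// only small integer indices are swapped during the sort.
std::vector<size_t> SortedOrder(const std::vector<Record>& records) {
  std::vector<size_t> order(records.size());
  std::iota(order.begin(), order.end(), 0);
  std::sort(order.begin(), order.end(), [&](size_t a, size_t b) {
    return records[a].key < records[b].key;
  });
  return order;
}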
Avoid an extra copy when receiving a tensor via gRPC.
A benchmark that sends around 400KB tensors speeds up by ~10-15%:
Before:
Benchmark              Time(ns)     CPU(ns)  Iterations
-----------------------------------------------------
BM_RPC/30/98k_mean    148764691  1369998944        1000
After:
Benchmark              Time(ns)     CPU(ns)  Iterations
-----------------------------------------------------
BM_RPC/30/98k_mean    131595940  1216998084        1000
Move large options structure rather than copying it.
index.cc
return search_iterators::DocPLIteratorFactory::Create(opts);
return search_iterators::DocPLIteratorFactory::Create(std::move(opts));
Use std::sort instead of std::stable_sort, which avoids an internal copy inside the stable sort implementation.
encoded-vector-hits.h
std::stable_sort(hits_.begin(), hits_.end(),
gtl::OrderByField(&HitWithPayloadOffset::docid));
struct HitWithPayloadOffset {
search_iterators::LocalDocId64 docid;
int first_payload_offset; // offset into the payload vector.
int num_payloads;
bool operator<(const HitWithPayloadOffset& other) const {
return (docid < other.docid) ||
(docid == other.docid &&
first_payload_offset < other.first_payload_offset);
}
};
...
std::sort(hits_.begin(), hits_.end());
Reuse temporary objects
A container or an object declared inside a loop will be recreated on every loop iteration. This can lead to expensive construction, destruction, and resizing. Hoisting the declaration outside the loop enables reuse and can provide a significant performance boost. (Compilers are often unable to do such hoisting on their own due to language semantics or their inability to ensure program equivalence.)
Hoist variable definition outside of loop iteration.
autofdo_profile_utils.h
auto iterator = absl::WrapUnique(sstable->GetIterator());
while (!iterator->done()) {
T profile;
if (!profile.ParseFromString(iterator->value_view())) {
return absl::InternalError(
"Failed to parse mem_block to specified profile type.");
}
...
iterator->Next();
}
auto iterator = absl::WrapUnique(sstable->GetIterator());
T profile;
while (!iterator->done()) {
if (!profile.ParseFromString(iterator->value_view())) {
return absl::InternalError(
"Failed to parse mem_block to specified profile type.");
}
...
iterator->Next();
}
Define a protobuf variable outside a loop so that its allocated storage can be reused across loop iterations.
stats-router.cc
for (auto& r : routers_to_update) {
...
ResourceRecord record;
{
MutexLock agg_lock(r.agg->mutex());
r.agg->AddResourceRecordUsages(measure_indices, &record);
}
...
}
ResourceRecord record;
for (auto& r : routers_to_update) {
...
record.Clear();
{
MutexLock agg_lock(r.agg->mutex());
r.agg->AddResourceRecordUsages(measure_indices, &record);
}
...
}
Serialize to same std::string repeatedly.
program_rep.cc
std::string DeterministicSerialization(c