Logging Without the Lock Contention Tax
Greg Law demonstrated a logging library achieving 1-nanosecond latency in the single-threaded case by eliminating string manipulation at log time. A traditional fprintf call takes 133 nanoseconds and takes a lock that serializes all threads; spdlog suffers from the same mutex contention. Law’s L3 library uses fixed-size records with string pointers, writing only to memory-mapped buffers with atomic operations. Post-processing resolves the pointers using read-only data (.rodata) section offsets. The approach scales dramatically better with thread count because writers never block each other.
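The hot path is easier to see in code. Here is a minimal sketch of the idea, with a hypothetical record layout and helper of my own (not L3’s actual API): the log call claims a slot with one atomic increment and stores a pointer to the format string plus raw arguments, and all text formatting happens offline.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Hypothetical record layout: the hot path stores a fixed-size record
// and defers all string formatting to a post-processing step.
struct LogRecord {
    const char* fmt;     // pointer into .rodata; resolved to text offline
    uint64_t    args[2]; // raw argument values, no formatting yet
};

constexpr std::size_t kCapacity = 1 << 16;  // must be a power of two
LogRecord g_buffer[kCapacity];              // stands in for an mmap'd file
std::atomic<uint64_t> g_cursor{0};

inline void log_event(const char* fmt, uint64_t a0, uint64_t a1) {
    // One atomic fetch_add claims a slot; writers never take a lock,
    // so threads never serialize behind one another.
    uint64_t slot = g_cursor.fetch_add(1, std::memory_order_relaxed);
    LogRecord& r = g_buffer[slot & (kCapacity - 1)];
    r.fmt     = fmt;  // store the pointer, not the text
    r.args[0] = a0;
    r.args[1] = a1;
}
```

An offline tool then walks the buffer and converts each stored fmt pointer back into its string literal via the executable’s read-only data section offsets, which is where the post-processing step in the talk comes in.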
Exception Performance: From Microseconds to Hundreds of Nanoseconds
Khalil Estell cut C++ exception propagation time by 90% on embedded systems by optimizing the unwind-table search. Standard implementations use a linear search through the exception-handling tables; by implementing a near-point search algorithm and eliminating dynamic memory allocation, he took exception handling from microseconds down to hundreds of nanoseconds on a 64 MHz ARM Cortex-M4. Benchmarks used 50-deep call stacks with varying cleanup percentages, measured with a logic analyzer to avoid OS interference. The work also removes the thread-local storage requirement and makes exceptions viable for hard real-time systems.
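Estell’s near-point search itself isn’t reproduced here; as a stand-in, the sketch below shows the flavor of the win with a plain binary search over a sorted, allocation-free unwind index (the entry layout is hypothetical, loosely modeled on ARM EHABI-style tables):

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative only: a toy unwind index entry, sorted by function start PC.
struct UnwindEntry {
    uint32_t fn_start;   // first program counter covered by this entry
    uint32_t unwind_op;  // encoded unwind instructions / handler info
};

// A linear scan of the table is O(n) per frame. Because entries are sorted
// by fn_start, a binary search finds the entry covering `pc` in O(log n),
// with no dynamic allocation anywhere on the path.
const UnwindEntry* find_entry(const UnwindEntry* table, std::size_t n,
                              uint32_t pc) {
    if (n == 0 || pc < table[0].fn_start) return nullptr;
    std::size_t lo = 0, hi = n;  // invariant: answer index is in [lo, hi)
    while (hi - lo > 1) {
        std::size_t mid = lo + (hi - lo) / 2;
        if (table[mid].fn_start <= pc) lo = mid;
        else                           hi = mid;
    }
    return &table[lo];
}
```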
Lock-Free Synchronization with Hazard Pointers
Denis Yaroshevskiy explained hazard pointers for lock-free configuration access at Meta. The standard approach, a shared_mutex guarding a shared_ptr, forces every reader to write to shared state and bump reference counts. An atomic shared_ptr improves on this but still contends. Hazard pointers, coming in C++26, use thread-local storage instead: each reader writes the pointer into its own TLS slot to signal “don’t delete this,” and the reclamation thread checks every slot before deleting. The read path becomes just TLS writes, with no shared-state modification. The protect operation atomically loads the pointer and registers protection in one step.
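A condensed model of the mechanism is below, with hypothetical names and a fixed array of per-thread slots standing in for real TLS; this is a sketch of the concept, not the C++26 std::hazard_pointer interface.

```cpp
#include <atomic>
#include <cstddef>

constexpr int kMaxThreads = 64;
std::atomic<void*> g_hazard[kMaxThreads];  // one "don't delete" slot per thread

struct Config { int timeout_ms; };
std::atomic<Config*> g_config{nullptr};

// The protect step: load, publish to our slot, then re-check that the source
// still holds what we published; if not, it may already be queued for
// deletion and we must retry. Load + registration behave as one atomic step.
Config* protect(int tid) {
    Config* p;
    do {
        p = g_config.load(std::memory_order_acquire);
        g_hazard[tid].store(p, std::memory_order_seq_cst);
    } while (p != g_config.load(std::memory_order_acquire));
    return p;  // safe to dereference until the slot is cleared
}

void release(int tid) {
    g_hazard[tid].store(nullptr, std::memory_order_release);
}

// The reclaimer may free `old` only if no reader still advertises it.
bool safe_to_delete(Config* old) {
    for (auto& slot : g_hazard)
        if (slot.load(std::memory_order_seq_cst) == old) return false;
    return true;
}
```

Note that readers never write to shared state: the protect loop writes only to the thread’s own slot, which is what makes the read side scale.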
When Domain Knowledge Unlocks Million-Times Speedups
Andrew Drakeford demonstrated a million-times speedup in leave-one-out regression by understanding the problem domain, not just cache locality and SIMD. Working through George Polya’s problem-solving framework, he took the computation from seconds down to 18 milliseconds. The key insight: don’t just optimize data layout and execution; exploit domain-specific mathematical properties and constraints. Most data-oriented design focuses on spatial/temporal layout and vectorization, missing the algorithmic wins that come from recognizing recurring mathematical structure in the data. Drawing diagrams and decomposing the problem reveals transformations that make hard problems tractable; a toy analogue of this kind of win is sketched below.
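Drakeford’s actual regression transform isn’t reproduced here, but the leave-one-out mean shows the same shape of insight on a much simpler problem: recomputing from scratch is quadratic, while a one-line algebraic identity makes it linear.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Naive version: for each i, recompute the mean of the other n-1 points.
// Two nested loops: O(n^2).
std::vector<double> loo_mean_naive(const std::vector<double>& x) {
    std::vector<double> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        double s = 0.0;
        for (std::size_t j = 0; j < x.size(); ++j)
            if (j != i) s += x[j];
        out[i] = s / double(x.size() - 1);
    }
    return out;
}

// Domain insight: the mean without x[i] is (total - x[i]) / (n - 1).
// One pass for the total, one pass for the answers: O(n).
std::vector<double> loo_mean_fast(const std::vector<double>& x) {
    const double total = std::accumulate(x.begin(), x.end(), 0.0);
    std::vector<double> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i)
        out[i] = (total - x[i]) / double(x.size() - 1);
    return out;
}
```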
Inter-Process Queue Design and Atomic Operations
Jody Hagins covered robust inter-process message queues built on shared memory and atomic counters. Unlike task queues, where workers compete for items, message queues deliver every message to all consumers. The core challenge: the processes have different address spaces, so raw pointers don’t work, and the read/write indices must be kept correct atomically, without locks. Relatedly, Ben Saks implemented atomics for the Arduino Mega 2560, which has no std::atomic support. Ring buffers use head/tail indices that wrap circularly, and operations must be both non-preemptable and synchronizable for correctness. The index type (uint8_t vs. uint16_t) affects how hard atomicity is to achieve on a given platform.
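Here is a minimal single-producer/single-consumer sketch of the shared-memory layout; it is my simplification (Hagins’ queues also fan messages out to multiple consumers), but it shows why indices replace pointers:

```cpp
#include <atomic>
#include <cstdint>

// Lives in a shared-memory segment mapped by both processes. Everything is
// addressed by index, never by raw pointer, because the segment sits at a
// different virtual address in each process.
struct Message { uint32_t type; char payload[60]; };

struct ShmQueue {
    static constexpr uint32_t kCapacity = 1024;  // power of two
    std::atomic<uint32_t> head{0};               // next slot to read
    std::atomic<uint32_t> tail{0};               // next slot to write
    Message slots[kCapacity];

    bool push(const Message& m) {
        uint32_t t = tail.load(std::memory_order_relaxed);
        if (t - head.load(std::memory_order_acquire) == kCapacity)
            return false;                        // full
        slots[t & (kCapacity - 1)] = m;
        // Publish the index only after the slot is fully written.
        tail.store(t + 1, std::memory_order_release);
        return true;
    }

    bool pop(Message& m) {
        uint32_t h = head.load(std::memory_order_relaxed);
        if (h == tail.load(std::memory_order_acquire))
            return false;                        // empty
        m = slots[h & (kCapacity - 1)];
        head.store(h + 1, std::memory_order_release);
        return true;
    }
};
```

On a platform without std::atomic, such as the Arduino case above, the same indices need hand-rolled non-preemptable reads and writes; on an 8-bit AVR a uint8_t index is naturally atomic while a uint16_t one is not, which is exactly why the index type matters.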
Memory Profiling with Ownership Semantics
Alecto Irene Perez introduced MemProfile, an ownership-aware memory profiler that tracks which objects hold memory, not just where allocations occur. Traditional tools like Valgrind and Heaptrack show allocation call stacks but don’t explain why memory isn’t freed. MemProfile understands C++ containers, template specialization, virtual inheritance, and placement new. It reports memory usage per member, per indexed array element, and even per lambda capture. The profiler uses debug info to resolve field names and offsets, enabling queries like “which members of this type consume the most memory” rather than just “where was this allocated.”
Automatic Differentiation Performance
Steve Bronder optimized reverse-mode automatic differentiation for statistical models, where gradient calculations consume 60-70% of runtime. FastAD achieves near-hand-written performance by refusing the usual flexibility-for-performance tradeoff. The key idea: break functions into sub-expressions and apply the chain rule systematically over them. Memory-intensive operations benefit from modern C++ features. COVID models in the UK used autodiff-powered MCMC to forecast infection rates and guide lockdown decisions, and the same technique underpins all LLM training. Performance varies 10x between implementations depending on whether they prioritize flexibility or optimization.
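To make the sub-expression/chain-rule idea concrete, here is a bare-bones reverse-mode tape. It is a generic illustration under simplifications of my own (two-parent nodes, hand-fed local partials), not FastAD’s design:

```cpp
#include <cstddef>
#include <vector>

// Each tape node records its parent nodes and the local partial derivatives
// of its sub-expression; one backward sweep applies the chain rule.
struct Tape {
    struct Node { std::size_t a, b; double da, db; };
    std::vector<Node> nodes;

    std::size_t var() {                       // an input; no parents
        nodes.push_back({0, 0, 0.0, 0.0});
        return nodes.size() - 1;
    }
    std::size_t add(std::size_t x, std::size_t y) {
        nodes.push_back({x, y, 1.0, 1.0});    // d(x+y)/dx = d(x+y)/dy = 1
        return nodes.size() - 1;
    }
    std::size_t mul(std::size_t x, std::size_t y, double vx, double vy) {
        nodes.push_back({x, y, vy, vx});      // product rule: d(xy)/dx = y
        return nodes.size() - 1;
    }

    // Seed the output with 1 and sweep backwards once; a single pass yields
    // the gradient with respect to every input.
    std::vector<double> grad(std::size_t out) {
        std::vector<double> g(nodes.size(), 0.0);
        g[out] = 1.0;
        for (std::size_t i = nodes.size(); i-- > 0;) {
            g[nodes[i].a] += nodes[i].da * g[i];
            g[nodes[i].b] += nodes[i].db * g[i];
        }
        return g;
    }
};
```

For f(x, y) = x·y + x at x = 3, y = 4 (built as var, var, mul, add), the single backward sweep returns df/dx = 5 and df/dy = 3, and the cost stays proportional to the tape length no matter how many inputs there are.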
Also Worth Watching
Sean Parent on local reasoning principles and binary search correctness; Lucian on senders/receivers framework complexity comparable to std::function; Hari Prasad Manoharan on binary parsing with C++23 template metaprogramming; Timur Doumler on real-time audio programming constraints and C++ standards work; Vittorio Romeo on data-oriented design for particle systems with dramatic framerate improvements.
This Week’s Takeaway
The common thread across these talks: understanding your problem domain deeply matters more than applying generic optimization patterns. Whether it’s eliminating locks through clever data structures, exploiting mathematical properties for algorithmic wins, or choosing the right profiling abstraction, the biggest gains come from questioning assumptions. Today’s actionable item: next time you optimize, spend 30 minutes understanding what your code actually does at the domain level before touching cache lines or adding SIMD.