Choosing our Benchmarking Strategy
We are going to use pytest-codspeed because it integrates seamlessly with pytest, the most popular Python testing framework. Your benchmarks live right alongside your tests, use the same familiar syntax, and require no separate infrastructure to maintain. Plus, the whole pytest ecosystem (parametrization, fixtures, plugins) works with your benchmarks out of the box. You can even turn an existing test into a benchmark by adding a single decorator.
Your First Benchmark
Let's start by creating a simple benchmark for a recursive Fibonacci function.
Installation
First, add pytest-codspeed to your project's dependencies using uv:
uv add --dev pytest-codspeed
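If you manage dependencies with pip instead of uv, the equivalent install is:
pip install pytest-codspeed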
Writing the Benchmark
Create a new file tests/test_benchmarks.py:
tests/test_benchmarks.py
import pytest

# Define the function we want to benchmark
def fibonacci(n: int) -> int:
    if n <= 1:
        return n
    else:
        return fibonacci(n - 2) + fibonacci(n - 1)

# Register a simple benchmark using the pytest marker
@pytest.mark.benchmark
def test_fib_bench():
    result = fibonacci(30)
    assert result == 832040
A few things to note:
- @pytest.mark.benchmark is a standard pytest marker that marks this test as a benchmark.
- The entire test function is measured, including both the computation and the assertion.
- It's just a regular pytest test, so you can run it with pytest as usual.
- The test validates correctness (via assertions) and tracks performance at the same time.
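Note that without the --codspeed flag, pytest-codspeed leaves the test to run as an ordinary pytest test and no measurement is collected, so the benchmark can stay in your regular test suite:
uv run pytest tests/test_benchmarks.py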
Running the Benchmark
Now run your benchmark:
uv run pytest tests/ --codspeed
You should see output like this:
=============================== test session starts ===============================
platform darwin -- Python 3.13.3, pytest-8.4.2, pluggy-1.6.0
codspeed: 4.2.0 (enabled, mode: walltime, callgraph: not supported, timer_resolution: 41.7ns)
CodSpeed had to disable the following plugins: pytest-benchmark
benchmark: 5.2.1 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /Users/user/projects/CodSpeedHQ/docs-guides/python
configfile: pyproject.toml
plugins: benchmark-5.2.1, codspeed-4.2.0
collected 1 item
tests/test_benchmarks.py . [100%]
Benchmark Results
┏━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━┓
┃ Benchmark      ┃ Time (best) ┃ Rel. StdDev ┃ Run time ┃ Iters ┃
┡━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━┩
│ test_fib_bench │ 73.1ms      │ 2.1%        │ 2.96s    │ 40    │
└────────────────┴─────────────┴─────────────┴──────────┴───────┘
================================== 1 benchmarked ==================================
================================ 1 passed in 4.09s ================================
Congratulations! You've created your first benchmark. In this output, you can see that test_fib_bench takes about 73 milliseconds to compute fibonacci(30). It ran 40 times in 2.96 seconds to get a reliable measurement.
Benchmarking with Arguments
So far, we've only tested our function with a single input value (30). But what if we want to see how performance changes with different input sizes? This is where pytest's @pytest.mark.parametrize comes in, and it works out of the box with benchmarks. Let's update our benchmark to test multiple input sizes:
tests/test_benchmarks.py
@pytest.mark.benchmark
@pytest.mark.parametrize("n", [5, 10, 15, 20, 30])
def test_fib_parametrized(n):
    result = fibonacci(n)
    assert result > 0
When you run this benchmark, pytest will create separate test instances for each parameter value, allowing you to compare performance across different inputs:
=============================== test session starts ===============================
platform darwin -- Python 3.13.3, pytest-8.4.2, pluggy-1.6.0
codspeed: 4.2.0 (enabled, mode: walltime, callgraph: not supported, timer_resolution: 41.7ns)
CodSpeed had to disable the following plugins: pytest-benchmark
benchmark: 5.2.1 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /Users/user/projects/CodSpeedHQ/docs-guides/python
configfile: pyproject.toml
plugins: benchmark-5.2.1, codspeed-4.2.0
collected 5 items
tests/test_benchmarks.py ..... [100%]
Benchmark Results
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark                 ┃ Time (best) ┃ Rel. StdDev ┃ Run time ┃ Iters     ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━┩
│ test_fib_parametrized[5]  │ 0ns         │ 1.7%        │ 2.92s    │ 1,026,802 │
│ test_fib_parametrized[10] │ 1ns         │ 1.7%        │ 2.89s    │ 395,754   │
│ test_fib_parametrized[15] │ 76ns        │ 0.8%        │ 2.94s    │ 52,256    │
│ test_fib_parametrized[20] │ 8.49µs      │ 3.6%        │ 3.00s    │ 4,970     │
│ test_fib_parametrized[30] │ 72.9ms      │ 0.7%        │ 2.94s    │ 40        │
└───────────────────────────┴─────────────┴─────────────┴──────────┴───────────┘
================================== 5 benchmarked ==================================
=============================== 5 passed in 19.88s ================================
Notice how parametrization creates five separate benchmarks, one for each input value. The results reveal the exponential time complexity of our recursive Fibonacci implementation: fibonacci(5) takes virtually no time (0ns) and runs over 1 million iterations, while fibonacci(30) takes 72.9ms and runs only 40 times. This dramatic difference (from nanoseconds to milliseconds) demonstrates how quickly recursive Fibonacci becomes expensive as the input grows.
Multiple Parameters
You can also benchmark across multiple dimensions:
tests/test_benchmarks.py
def fibonacci_iterative(n: int) -> int:
    if n <= 1:
        return n
    a, b = 0, 1
    for _ in range(n - 1):
        a, b = b, a + b
    return b

@pytest.mark.benchmark
@pytest.mark.parametrize("algorithm, n", [
    ("recursive", 10),
    ("recursive", 20),
    ("iterative", 100),
    ("iterative", 200),
])
def test_fib_algorithms(algorithm, n):
    if algorithm == "recursive":
        result = fibonacci(n)
    else:
        result = fibonacci_iterative(n)
    assert result > 0
Then run it:
=============================== test session starts ===============================
platform darwin -- Python 3.13.3, pytest-8.4.2, pluggy-1.6.0
codspeed: 4.2.0 (enabled, mode: walltime, callgraph: not supported, timer_resolution: 41.7ns)
CodSpeed had to disable the following plugins: pytest-benchmark
benchmark: 5.2.1 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /Users/user/projects/CodSpeedHQ/docs-guides/python
configfile: pyproject.toml
plugins: benchmark-5.2.1, codspeed-4.2.0
collected 4 items
tests/test_benchmarks.py .... [100%]
Benchmark Results
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┓
┃ Benchmark                          ┃ Time (best) ┃ Rel. StdDev ┃ Run time ┃ Iters    ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━┩
│ test_fib_algorithms[recursive-10]  │ 1ns         │ 1.0%        │ 2.93s    │ 614,789  │
│ test_fib_algorithms[recursive-20]  │ 8.49µs      │ 26.9%       │ 3.01s    │ 4,970    │
│ test_fib_algorithms[iterative-100] │ 0ns         │ 42.1%       │ 3.04s    │ 1,474,1… │
│ test_fib_algorithms[iterative-200] │ 0ns         │ 1.3%        │ 2.29s    │ 587,099  │
└────────────────────────────────────┴─────────────┴─────────────┴──────────┴──────────┘
================================== 4 benchmarked ==================================
=============================== 4 passed in 15.40s ================================
This benchmark creates four separate test cases, one for each combination of algorithm and input size. The output clearly shows the dramatic performance difference between the two implementations: the iterative version handles much larger inputs (100, 200) in virtually no time, while the recursive version takes 8.49µs just for n=20. Notice how fibonacci_iterative(200) runs over 500,000 iterations in the same time budget in which fibonacci(20) manages only about 5,000. This makes it easy to compare different algorithmic approaches and choose the most efficient implementation for your use case.
Benchmarking Only What Matters
Sometimes you have expensive setup that shouldn't be included in your benchmark measurements: generating large datasets, creating complex data structures, or preparing test data, for example. This is where the benchmark fixture comes in: it gives you precise control over what gets measured. Let's benchmark a data analysis function that identifies outliers in numerical data. The expensive part is generating the test dataset, but we only want to measure the outlier detection algorithm:
tests/test_outlier_detection.py
import pytest
import random

def generate_dataset(size: int) -> list[float]:
    """Generate a large dataset with some outliers (expensive operation)."""
    random.seed(42)  # Fixed seed for reproducibility
    data = []
    for _ in range(size):
        # 95% normal values from a normal distribution
        if random.random() < 0.95:
            data.append(random.gauss(100.0, 15.0))
        else:
            # 5% outliers
            data.append(random.uniform(200.0, 300.0))
    return data

def detect_outliers(data: list[float], threshold: float = 2.0) -> list[int]:
    """Detect outliers using the z-score method (what we want to benchmark)."""
    # Calculate mean
    mean = sum(data) / len(data)
    # Calculate standard deviation
    variance = sum((x - mean) ** 2 for x in data) / len(data)
    std_dev = variance ** 0.5
    # Find outliers
    outliers = []
    for i, value in enumerate(data):
        z_score = abs((value - mean) / std_dev) if std_dev > 0 else 0
        if z_score > threshold:
            outliers.append(i)
    return outliers

# Benchmark for dataset generation
@pytest.mark.benchmark
@pytest.mark.parametrize("size", [10_000, 100_000, 1_000_000])
def test_generate_dataset(size):
    generate_dataset(size)

# Benchmark for outlier detection only
@pytest.mark.parametrize("size", [10_000, 100_000, 1_000_000])
def test_outlier_detection(benchmark, size):
    # NOT MEASURED: Expensive setup - generate large dataset
    dataset = generate_dataset(size)
    # MEASURED: Only the outlier detection algorithm
    result = benchmark(detect_outliers, dataset)
    # NOT MEASURED: Assertions
    assert len(result) > 0  # We should find some outliers
    assert all(isinstance(idx, int) for idx in result)
The setup code (generating the dataset) runs once, and only the detect_outliers() call made through benchmark() is measured. This gives you accurate performance data without the noise of test setup. Let's run this benchmark, filtering the pytest command down to just this file:
uv run pytest tests/test_outlier_detection.py --codspeed
You should see output like this:
=============================== test session starts ===============================
platform darwin -- Python 3.13.3, pytest-8.4.2, pluggy-1.6.0
codspeed: 4.2.0 (enabled, mode: walltime, callgraph: not supported, timer_resolution: 41.7ns)
CodSpeed had to disable the following plugins: pytest-benchmark
benchmark: 5.2.1 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /Users/user/projects/CodSpeedHQ/docs-guides/python
configfile: pyproject.toml
plugins: benchmark-5.2.1, codspeed-4.2.0
collected 6 items
tests/test_outlier_detection.py ...... [100%]
Benchmark Results
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━┓
┃ Benchmark                       ┃ Time (best) ┃ Rel. StdDev ┃ Run time ┃ Iters ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━┩
│ test_generate_dataset[10000]    │ 124.29µs    │ 4.4%        │ 2.92s    │ 1,278 │
│ test_generate_dataset[100000]   │ 11.0ms      │ 6.2%        │ 2.96s    │ 130   │
│ test_generate_dataset[1000000]  │ 225.5ms     │ 29.6%       │ 3.04s    │ 13    │
│ test_outlier_detection[10000]   │ 46.01µs     │ 18.4%       │ 2.89s    │ 2,059 │
│ test_outlier_detection[100000]  │ 3.3ms       │ 6.1%        │ 3.04s    │ 220   │
│ test_outlier_detection[1000000] │ 132.4ms     │ 12.6%       │ 3.04s    │ 22    │
└─────────────────────────────────┴─────────────┴─────────────┴──────────┴───────┘
================================== 6 benchmarked ==================================
=============================== 6 passed in 24.84s ================================
The results reveal a crucial insight about what we're actually measuring. Notice the dramatic difference between the two benchmark groups:
Dataset generation (test_generate_dataset):
- 10k elements: 124.29Β΅s
- 100k elements: 11.0ms (88x slower)
- 1M elements: 225.5ms (1,814x slower than 10k)
Outlier detection (test_outlier_detection):
- 10k elements: 46.01Β΅s
- 100k elements: 3.3ms (72x slower)
- 1M elements: 132.4ms (2,878x slower than 10k)
This comparison shows that for the 1M-element dataset, dataset generation takes 225.5ms while outlier detection takes 132.4ms: the setup is actually slower than the algorithm we want to measure! Without the benchmark fixture to exclude the setup, our measurements would include both operations, making it impossible to understand the true performance of the outlier detection algorithm. The benchmark fixture ensures we measure only what matters: the algorithm itself, not the test infrastructure around it.
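To make the contrast concrete, here is a sketch of the anti-pattern (the test name is hypothetical): with only the marker, the whole test body is timed, so the expensive setup is measured alongside the algorithm:
# Anti-pattern: the marker times the entire test body, setup included
@pytest.mark.benchmark
@pytest.mark.parametrize("size", [1_000_000])
def test_outlier_detection_with_setup(size):
    dataset = generate_dataset(size)     # MEASURED: expensive setup noise
    outliers = detect_outliers(dataset)  # MEASURED: the algorithm we care about
    assert len(outliers) > 0             # MEASURED: assertion overhead too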
Additional Techniques
Marking an Entire Module
If you have a dedicated benchmarks file, you can mark all tests as benchmarks at once using pytestβs module-level marking:
tests/benchmarks/test_math_operations.py
import pytest

# Mark all tests in this module as benchmarks
pytestmark = pytest.mark.benchmark

def test_sum_squares():
    # MEASURED: Everything in this test
    result = sum(i**2 for i in range(1000))
    assert result > 0

def test_sum_cubes():
    # MEASURED: Everything in this test
    result = sum(i**3 for i in range(1000))
    assert result > 0
Now all tests in this file are automatically benchmarked without individual decorators. This is incredibly useful for benchmark-specific test files!
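If you keep such files in a dedicated directory (tests/benchmarks/ in this example), you can also run only that directory when you want benchmark results:
uv run pytest tests/benchmarks/ --codspeed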
Fine-Grained Control with Pedantic
For maximum control over your benchmarks, use benchmark.pedantic(). This allows you to specify custom setup and teardown functions, control the number of rounds and iterations, configure warmup behavior, and more:
tests/test_advanced.py
import json

import pytest

def parse_json_data(json_string: str) -> list[dict]:
    """Parse a JSON array of objects into a list of dicts."""
    return json.loads(json_string)

@pytest.mark.parametrize("size", [10_000, 30_000])
def test_json_parsing(benchmark, size):
    # NOT MEASURED: Setup to create test data
    items = [{"id": i, "name": f"item_{i}", "value": i * 10} for i in range(size)]
    json_string = json.dumps(items)

    # MEASURED: Only the parse_json_data() function
    result = benchmark.pedantic(
        parse_json_data,      # Function to benchmark
        args=(json_string,),  # Arguments to the function
        rounds=100,           # Number of benchmark rounds
        iterations=10,        # Iterations per round
        warmup_rounds=2,      # Warmup rounds before measuring
    )

    # NOT MEASURED: The assertion
    assert len(result) == size
Here is the output when you run this benchmark:
=============================== test session starts ===============================
platform darwin -- Python 3.13.3, pytest-8.4.2, pluggy-1.6.0
codspeed: 4.2.0 (enabled, mode: walltime, callgraph: not supported, timer_resolution: 41.7ns)
CodSpeed had to disable the following plugins: pytest-benchmark
benchmark: 5.2.1 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /Users/user/projects/CodSpeedHQ/docs-guides/python
configfile: pyproject.toml
plugins: benchmark-5.2.1, codspeed-4.2.0
collected 2 items
tests/test_advanced.py .. [100%]
Benchmark Results
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━┓
┃ Benchmark                 ┃ Time (best) ┃ Rel. StdDev ┃ Run time ┃ Iters ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━┩
│ test_json_parsing[10000]  │ 294.85µs    │ 0.9%        │ 2.99s    │ 1,000 │
│ test_json_parsing[30000]  │ 973.01µs    │ 0.7%        │ 9.88s    │ 1,000 │
└───────────────────────────┴─────────────┴─────────────┴──────────┴───────┘
================================== 2 benchmarked ==================================
=============================== 2 passed in 13.20s ================================
We can see that, as expected, each benchmark ran 100 rounds of 10 iterations each, for a total of 1,000 iterations. Using benchmark.pedantic() is especially useful for bigger benchmarks where you need precise control over rounds, iterations, and warmup behavior.
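The setup hook mentioned above is handy when the benchmarked code mutates its input and each round needs fresh data. Here is a minimal sketch, assuming a pytest-benchmark-compatible pedantic signature where the setup callable runs before every round and is not measured:
def test_sort_in_place(benchmark):
    data: list[int] = []

    def setup():
        # Rebuild the unsorted input before every round (not measured)
        data[:] = range(1000, 0, -1)

    # MEASURED: only the in-place sort, on freshly rebuilt data each round
    benchmark.pedantic(lambda: data.sort(), setup=setup, rounds=50)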
Best Practices
Use Assertions to Verify Correctness
Since benchmarks are just pytest tests, they should include assertions to verify correctness:
# ❌ BAD: No verification
@pytest.mark.benchmark
def test_computation():
    result = expensive_computation()
    # Oops, forgot to check if result is correct!

# ✅ GOOD: Verify the result without measuring the assertion
def test_computation(benchmark):
    result = benchmark(expensive_computation)
    assert result == expected_value
This ensures you're benchmarking correct code, not broken code that happens to be fast. Or, as mentioned in the introduction, you can turn an existing test into a benchmark, either with the benchmark fixture or by adding the @pytest.mark.benchmark decorator:
# Existing correctness test
def test_sorting_algorithm():
    data = [5, 2, 9, 1]
    result = sorting_algorithm(data)
    assert result == [1, 2, 5, 9]

# Turn it into a benchmark using the benchmark fixture
def test_sorting_algorithm(benchmark):
    data = [5, 2, 9, 1]
    result = benchmark(sorting_algorithm, data)
    assert result == [1, 2, 5, 9]
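Alternatively, the decorator approach from the introduction keeps the original test body untouched; note that the whole function, assertion included, is then measured (the renamed test below is just for illustration):
# Or keep the original test and promote it with the marker alone
@pytest.mark.benchmark
def test_sorting_algorithm_marked():
    data = [5, 2, 9, 1]
    result = sorting_algorithm(data)
    assert result == [1, 2, 5, 9]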
Keep Benchmarks Deterministic
Your benchmarks should produce consistent results across runs:
# ❌ BAD: Non-deterministic due to random data
def test_sort_random(benchmark):
    import random
    data = [random.randint(1, 1000) for _ in range(100)]
    benchmark(sorted, data)

# ✅ GOOD: Use a fixed seed or deterministic data
def test_sort_deterministic(benchmark):
    import random
    random.seed(42)  # Fixed seed for reproducibility
    data = [random.randint(1, 1000) for _ in range(100)]
    benchmark(sorted, data)

# ✅ EVEN BETTER: Use deterministic data
def test_sort_worst_case(benchmark):
    data = list(range(100, 0, -1))  # Always the same
    benchmark(sorted, data)
Benchmarking Your Own Package
Following Python best practices, your source code should live in a src/ directory. Here's a typical project structure:
my_project/
├── pyproject.toml
├── src/
│   └── mylib/
│       ├── __init__.py
│       └── algorithms.py
└── tests/
    ├── test_algorithms.py                 # Regular unit tests
    └── benchmarks/                        # Performance benchmarks
        └── test_algorithm_performance.py
Your source code in src/mylib/algorithms.py:
src/mylib/algorithms.py
def quick_sort(arr: list[int]) -> list[int]:
    if len(arr) <= 1:
        return arr
    pivot = arr[len(arr) // 2]
    left = [x for x in arr if x < pivot]
    middle = [x for x in arr if x == pivot]
    right = [x for x in arr if x > pivot]
    return quick_sort(left) + middle + quick_sort(right)
Then benchmark it in your tests:
tests/benchmarks/test_algorithm_performance.py
import pytest

from mylib.algorithms import quick_sort

@pytest.mark.parametrize("size", [10, 100, 1000])
def test_quick_sort_performance(benchmark, size):
    # NOT MEASURED: Create test data
    data = list(range(size, 0, -1))
    # MEASURED: The sorting algorithm
    result = benchmark(quick_sort, data)
    # NOT MEASURED: Verify correctness
    assert result == list(range(1, size + 1))
Make sure your package is installed in development mode:
uv pip install -e .
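If your package is declared in pyproject.toml and managed by uv, uv sync typically installs the project itself in editable mode as well, so this works too (an alternative, assuming a standard uv project setup):
uv sync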
Running Benchmarks Continuously with CodSpeed
So far, you've been running benchmarks locally. But local benchmarking has limitations:
- Inconsistent hardware: Different developers get different results
- Manual process: Easy to forget to run benchmarks before merging
- No historical tracking: Hard to spot gradual performance degradation
- No PR context: Can't see performance impact during code review
This is where CodSpeed comes in. It runs your benchmarks automatically in CI and provides:
- Automated performance regression detection in PRs
- Consistent metrics with reliable measurements across all runs
- Historical tracking to see performance over time with detailed charts
- Flamegraph profiles to see exactly what changed in your code's execution
How to set up CodSpeed with pytest-codspeed
Here's how the integration works with your pytest-codspeed benchmarks: connect your repository to CodSpeed, then run your benchmark suite in CI with the same --codspeed flag you used locally. CodSpeed collects the measurements and reports results and regressions directly on your pull requests.
Next Steps
Check out these resources to continue your Python benchmarking journey: