Concurrency is omnipresent in modern software development. Even small applications often run on systems with multiple cores, interact with databases, wait for network responses, or share resources such as files and memory regions. In distributed systems and embedded software, processes must additionally react to one another, often under real-time constraints. Practice shows that as soon as several things can happen at the same time, new classes of errors arise that would never appear in a serial program.
Golo Roden is the founder and CTO of the native web GmbH. He works on the design and development of web and cloud applications and APIs, with a focus on event-driven and service-based distributed architectures. His guiding principle is that software development is not an end in itself but must always serve an underlying domain purpose. This article is part of the series "Learning from software errors".
Pattern 3: Concurrency and scheduling: when processes block each other
A famous example is the Mars Pathfinder, a NASA mission from 1997. The landing itself was a spectacular success: the probe touched down gently on Mars and began transmitting data. Shortly afterward, however, sporadic system crashes and automatic resets put the ground team on alert.
The cause was a priority inversion—a classic concurrency problem. In a real-time operating system, tasks run at different priorities. High priority means this task should run as soon as it becomes ready; lower-priority tasks must not block it.
One such high-priority task on the Pathfinder processed data from the weather sensor. It required access to a shared resource—protected by a mutex that was held by a low-priority task. That low-priority task was in turn repeatedly preempted by a medium-priority task. The result: the high-priority task was indirectly waiting on a low-priority task that never got a chance to run.
This "inversion" of priorities meant that the system got stuck under certain load conditions and eventually restarted itself. The solution was simple in principle: the developers activated priority inheritance in the VxWorks real-time operating system. As a result, the blocking low-priority task temporarily inherited the high priority as soon as a higher-priority task was waiting on it. The blockage cleared, and the crashes disappeared.
This example is instructive because it illustrates several typical patterns:
- Concurrency errors are difficult to reproduce; they often only occur under certain load profiles.
- Redundancy or repetitions do not automatically help: if the error is in the design, it affects all instances equally.
- The smallest details in scheduling can make all the difference: the software can run correctly a thousand times and fail on the thousand-and-first.
In modern applications, similar problems can occur in the form of deadlocks, race conditions, or livelocks. These usually do not show up in the local test run, but only in production, when real load and real parallelism take effect. But how can such errors be avoided?
- Clear lock hierarchies: If several resources are locked, they should always be locked in the same order.
- Use priority protocols: Mechanisms such as priority inheritance or priority ceiling are available in many real-time operating systems and even in modern frameworks.
- Decouple concurrency: Instead of locking common states directly, architectures with message passing or actor models can avoid race conditions.
- Deterministic tests and simulations: Specialized test frameworks can deliberately delay processes or manipulate schedulers to make rare conflicts reproducible.
- Telemetry and monitoring: It should also be visible during operation if locks are held for an unusually long time.
For teams that develop web backends or cloud services, the same danger exists, only in a slightly different form: database transactions, distributed caches, or competing API requests can have similar effects. A slow background process blocks a lock, while a flood of parallel requests escalates this situation.
The lesson learned from the Pathfinder incident is therefore timeless: concurrency is not a free performance booster but a complex system that developers must explicitly design and monitor. Anyone who treats concurrency as a side issue will sooner or later encounter errors that are difficult to reproduce and potentially catastrophic.
Learning from software errors—the series
This series of articles presents nine typical error classes that occur again and again in practice, regardless of industry or technology. In each category, the series will present a specific example, analyze its causes, and deduce what software developers can learn from it in the long term.
In the next installment, you can read: Time, calendar, and geography: When the clock doesn’t measure what you think it does.
(mack)
This article was originally published in German. It was translated with technical assistance and editorially reviewed before publication.