Part 1. Hybrid Solution for Dynamic Vehicle Routing — Context and Architecture
Introduction
Logistics is a giant business that often operates with surprising inefficiency: manual processes, piles of paperwork, legal complexities. Many companies still run on paper or Excel and don’t even collect data on their shipments.
But what if a company is large enough to save millions — or even hundreds of millions — of dollars through optimization (to say nothing of the environmental impact)? Or what if a company is small, but poised for rapid growth?
Shipment Movements in Logistics Network Simulation
Optimization is often non-existent or rudimentary — designed for operational convenience rather than maximizing savings. The industry is clearly lagging behind, yet there is a TON of money on the table. Shipment networks span the globe, from Alaska to Sydney. I won’t bore you with market size statistics here. Insiders already know the scale, and outsiders can make an educated (or not so educated) guess.
And that is where I came in. As a Data Science and Machine Learning specialist, I found myself in a large, fast-growing logistics company. Crucially, the team there wasn’t just going through the motions; they genuinely wanted to optimize. This led to the creation of a line-haul optimization project that I led for two years — and that is the story I am here to tell.
This project will always hold a warm spot in my heart, even though it never fully made it to production. I believe it holds massive potential — specifically in the combination of logistics and RL’s unique ability to generalize decision-making.
While traditional optimization projects usually focus on maximizing the objective function or execution speed, the most interesting metric here is how many unseen cases we can solve with the same model (zero-shot or few-shot).
In other words, we are aiming for a generalizable zero-shot policy.
Ideally, we train an agent, drop it into new conditions (ones it has never seen), and it just works — without any retraining or with only minimal fine-tuning. We don’t need perfection; we just need it to perform ‘good enough’ to not breach the SLA.
Then we can say: ‘Cool, the agent generalized this case, too.’
I am confident that this approach can yield models capable of ever-increasing generalization over time. I believe this is the future of the industry.
And as one of my favorite stand-up comedians once said:
Eventually, somebody will do it anyway. Let it be us.
Business Context
The company had scaled rapidly, growing into a network of over 100 line-haul terminals. At this magnitude, manual scheduling reached its operational limit. Once established, a schedule — along with its underlying business contracts and arrangements — would often remain static for months without a single change.
We observed a consistent inefficiency: trucks were frequently dispatched with suboptimal loads — either underutilized (driving up unit costs) or bottlenecked by last-minute overflows.
The financial impact of this inefficiency was significant. In a network of this size, even a 1% increase in vehicle utilization translates to millions of dollars in annual savings. Therefore, maximizing vehicle utilization became the primary lever for cost reduction.
Big Picture Problem
We had access to historical shipment data. While the storage format was far from convenient, the volume was sufficient for modeling. Thanks to the efforts of my data engineering and data science colleagues, this raw data was transformed into a clean, usable state (I will cover the specific data engineering challenges in a separate article).
My initial goal was to generate a ‘good’ schedule. A Schedule is defined here as a tabular dataset where every row represents a physical movement (shipment); a minimal schema sketch follows the field list:
- Timestamp: Hourly precision.
- Origin & Destination: The specific edge in the graph.
- Vehicle Type: The discrete asset class (e.g., 20-ton semi, 5-ton van, etc.).
- Load Manifest: The particular set of aggregated ‘pallets’ packed inside.
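To make the row structure concrete, here is a minimal sketch of one schedule row as a Python dataclass. The field names, types, and example values are illustrative, not the production schema.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ScheduleRow:
    """One physical movement (shipment) in the schedule. Names are illustrative."""
    timestamp: datetime        # hourly precision
    origin: str                # origin terminal ID (edge start)
    destination: str           # destination terminal ID (edge end)
    vehicle_type: str          # discrete asset class, e.g. "semi_20t" or "van_5t"
    load_manifest: list        # IDs of the aggregated 'pallets' packed inside


# Example row (hypothetical IDs and values):
row = ScheduleRow(
    timestamp=datetime(2024, 3, 1, 14, 0),
    origin="HUB_A",
    destination="HUB_B",
    vehicle_type="semi_20t",
    load_manifest=["pallet_001", "pallet_017", "pallet_042"],
)
```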
Therefore, building a schedule requires four distinct decisions (a composite-action sketch follows this list):
- Choose what packages to send. What can go wrong: if low-priority packages are sent first, valuable or urgent cargo might get stranded at the warehouse. We don’t want that, because the penalty is higher for the more valuable packages.
- Choose the next warehouse (where to ship). Essentially, this is a routing problem: selecting the optimal ‘next edge’ on the graph for every single package.
- Choose vehicle types and their quantity. This is a balancing act. What can go wrong: sending multiple small vehicles instead of one large one creates fleet inefficiency, while dispatching large trucks that drive mostly empty means paying for air. Conversely, under-provisioning the fleet leads to delays, costing us in both SLA penalties and reputation.
- Finally, inaction is also an action. For any given time step, the optimal move might be to send no trucks at all. To create an optimized schedule, the system must perfectly balance active shipments with ‘doing nothing’.
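Taken together, the four decisions above amount to one composite action per terminal per time step. A rough sketch of how such an action could be represented is below; the names are hypothetical, and later sections split these decisions between the RL agent and the LP solver rather than letting one model output all of them.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DispatchDecision:
    """Composite action for one terminal at one time step (illustrative only)."""
    package_ids: list = field(default_factory=list)   # what to send (empty = send nothing)
    next_hop: Optional[str] = None                    # where to ship: the next edge on the graph
    vehicle_mix: dict = field(default_factory=dict)   # vehicle types and quantities, e.g. {"semi_20t": 2}

    @property
    def is_idle(self) -> bool:
        # "Inaction is also an action": no packages and no vehicles means wait this hour.
        return not self.package_ids and not self.vehicle_mix
```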
However, reality introduces additional complexities and constraints into the problem space:
- **Pace of Change:** Business rules are numerous, **complex**, and **evolve rapidly**. The real world can be far more complex and messier than a basic simulation. And changes in the real world lead to expensive and time-consuming code updates.
- **Stochastic Demand:** Demand is non-deterministic, unknown in advance, and dynamic (e.g., multiple visits to a customer within a window).
- **Multi-Objective Optimization:** We aren’t just minimizing cost; we are balancing cost against SLA penalties (lateness) and fleet expenses.
So now we understand that we not only need to create a good schedule, but also need to build a system that respects dynamic demand, truck capacity, and numerous custom business rules that often change. This crystallized into the following wish-list.
Wish-List
- Low-Cost Reusability. We need the ability to reuse the mechanism for new tasks and contexts cheaply. Since real-world problems shift quickly, the solution must be versatile — adaptable to new settings without requiring us to retrain the model from scratch every time.
- Fast Inference. While slow training is acceptable if it yields stronger generalization, the inference (decision-making) must be fast.
- ‘Good Enough’ Effectiveness. The system doesn’t need to be perfect, but it must strictly adhere to the baseline SLA levels.
- Global Optimization. We need to optimize the system as a whole, rather than optimizing its individual components in isolation.
System Specifications
- Topology: Custom graph containing 2 to 100 nodes
- Decision frequency: 1-hour intervals, 480 steps/episode (representing 20 days)
- Agents: Decentralized hubs acting as independent decision-makers
- Constraints: Hard physical limits on vehicle volume (m³) and weight (kg). Hard limit on the number of vehicles dispatched from a terminal per hour.
- Objective: Minimize global cost while adhering to dynamic SLA windows.
- **Primary metrics:** Shipment cost, percentage of late packages (SLA violations), count of dispatched vehicles by type
- **Secondary “long-term” metrics:** Average transit time and vehicle capacity utilization.
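For reference, the specifications above could be collected into a single configuration object along these lines. The capacity and dispatch-cap values shown are placeholders, not the real limits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimulationConfig:
    num_nodes: int = 100                    # custom graph with 2 to 100 terminals
    step_hours: int = 1                     # decision frequency: 1-hour intervals
    episode_steps: int = 480                # 480 steps per episode (~20 days)
    max_vehicle_volume_m3: float = 90.0     # hard physical limit (assumed value)
    max_vehicle_weight_kg: float = 20_000.0 # hard physical limit (assumed value)
    max_dispatches_per_hour: int = 10       # per-terminal dispatch cap (assumed value)
```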
Why Not Standard Solvers?
Spoiler: they can’t cut it.
Naturally, we started by exploring standard solvers and off-the-shelf tools like Google OR-Tools. However, the consensus was discouraging: these tools would either solve our actual problem poorly, or they would perfectly solve a different, imaginary version of the problem. Ultimately, I concluded that this approach was a dead end.
Linear Optimization
This is the simplest and cheapest approach, but it has a fatal flaw: a linear formulation fails to account for temporal dynamics. Essentially, LP assumes the entire optimization problem fits into a single, static snapshot, ignoring the fact that every step depends on the previous one. This is fundamentally divorced from reality, where every movement in the network creates ripple effects elsewhere.
Furthermore, the sheer volume of business rules makes it practically impossible to cram them all into a “flat” solver. In short, while Linear Programming is a great tool, it is simply too rigid for a problem of this magnitude.
Genetic Algorithms
Genetic Algorithms (GA) were closer in philosophy to what we needed. While they do work, they come with significant drawbacks of their own.
First, slow inference. To get a result, you essentially have to run the optimization from scratch every time (evolving the population). You cannot simply “train” a model and freeze the weights, because there are no weights to freeze. Consequently, the system’s reaction time is measured in seconds or even minutes, not the milliseconds typical of a neural network or a heuristic. In a production environment dealing with hundreds of hubs in real time, this becomes a major bottleneck.
Second, lack of determinism. If you run the scheduler twice on the same dataset, a GA can yield two completely different schedules. Business customers usually don’t like that very much, which can lead to trust issues.
Why Not Pure RL?
Theoretically, one could try to solve the entire problem end-to-end using pure Reinforcement Learning. But that is definitely the hard way.
A potential pure RL solution would take one of two forms: either a single “God Mode” Agent that sees everything and allocates every package to every truck on every route at every step, or a team of Sequential Agents acting one after another.
God Mode Agent
In the first case, the action space becomes unmanageable. You aren’t just selecting a route — you have to choose every truck (from N types) K times for every direction. With packages, it gets even worse: you don’t just need to select a subset of cargo — you have to assign specific packages to specific trucks. Plus, you retain the option to leave a package at the warehouse.
Even with a small fleet, the number of ways to assign specific packages to specific trucks is astronomical. Asking a neural network to explore this entire space from scratch is inefficient. It would spend eons just trying to figure out which package fits into which bin.
Sequential Agents
A chain of agents passing packages down the line would create a non-stationarity nightmare.
While Agent 1 is learning, its behavior is essentially random. Agent 2 tries to adapt to Agent 1, but since Agent 1 keeps changing its strategy, Agent 2 can never stabilize. Instead of solving logistics, each agent is forced to infinitely adapt to its neighbor’s instability. It becomes a case of the blind leading the blind, unlikely to converge in any reasonable time.
Furthermore, pure RL struggles to learn hard constraints (like maximum weight limits) without incurring massive penalties. It tends to “hallucinate” solutions — outputs that look efficient but are physically impossible.
On the other hand, we have Linear Programming (LP): a fast, simple solver that handles hard constraints natively. The temptation to carve out a sub-problem and offload it to LP was too great to resist.
And that is why I chose a hybrid approach.
Implemented Solution
MARL + LP Hybrid Architecture
Let’s build an RL agent that observes the state of the logistics network and orchestrates the flow of packages — deciding exactly what volume of cargo moves between warehouses at any given moment. Ideally, this agent makes decisions strategically, factoring in the global state of the system rather than just optimizing individual warehouses in isolation.
In this setup, an Agent represents a specific warehouse responsible for shipping packages to its neighbors. We connect these agents into a multi-agent network. Since every action taken by an agent corresponds to a shipment to one or more destinations, the aggregate sequence of these actions constitutes the final schedule.
Technically, we implemented a Multi-Agent Reinforcement Learning (MARL) framework. The RL environment trains the algorithms to generate viable transportation schedules for real-world shipments. Crucially, this project includes both the environment creation and the agent training pipelines, ensuring that the solution can adapt (via continual learning) to increasingly complex scenarios with minimal human intervention.
What agents see
Below are the key observations (model inputs) fed into the agent (I will cover more of the implementation details in Part 2).
- Local Inventory: The quantity of packages at each warehouse.
- In-Transit Volume: The quantity of packages currently traveling on the edges between warehouses.
- Cargo Value: The total financial value of the inventory (crucial for risk management) at each warehouse.
- SLA Heatmap: The nearest deadlines for the current stock (identifying urgent cargo).
- Inbound Forecast: The volume of packages expected to arrive within the next 24 hours.
- Heuristic Hints: Used exclusively during the imitation learning stage to bootstrap training.
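As a rough illustration (not the actual feature pipeline, which Part 2 covers), these observations could be assembled into a flat per-agent vector like so. All inputs, bin choices, and example numbers are assumptions; the heuristic hints used during imitation learning are omitted.

```python
import numpy as np


def build_observation(local_inventory, in_transit, cargo_value,
                      sla_heatmap, inbound_forecast) -> np.ndarray:
    """Concatenate one agent's observation features into a flat vector (sketch)."""
    parts = [local_inventory, in_transit, [cargo_value], sla_heatmap, inbound_forecast]
    return np.concatenate(
        [np.atleast_1d(np.asarray(p, dtype=np.float32)) for p in parts]
    )


# Toy usage for a node with 3 neighbors (all values hypothetical):
obs = build_observation(
    local_inventory=[12, 4, 7],       # packages on hand, per destination
    in_transit=[3, 0, 9],             # packages traveling on inbound edges
    cargo_value=15_300.0,             # total financial value of stock at the node
    sla_heatmap=[2, 5, 10, 6],        # packages due in <6h, <12h, <24h, later (assumed bins)
    inbound_forecast=[8, 1, 4],       # expected arrivals within the next 24 hours
)
```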
Version 1. Agents Slicing a PriorityQueue
In this version, packages are lined up in a priority queue, sorted in descending order based on a simple formula: *Priority* = *Value* × *Urgency* (proximity to deadline). The RL agent “slices” a portion of this queue by selecting a fraction of the top packages and deciding **which warehouse** to send them to.
We use heuristics to pre-filter the options — discarding packages we definitely don’t want to send yet, or ruling out nonsensical destinations (e.g., shipping a package in the opposite direction of its destination).
Once the RL agent selects the what and the where, the Linear Programming solver steps in to pick the quantity and type of vehicles. The LP enforces hard constraints on weight, volume, and fleet availability to ensure the simulation doesn’t violate the laws of physics.
In Version 1, a single action consists of sending packages to one neighbor only. The volume is determined by the “fraction” (0.0 to 1.0) selected by the agent. “Doing nothing” is simply choosing a fraction of 0.
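A condensed sketch of the V1 mechanics described above, with the priority formula and the fraction-based slicing. The Package fields and helper names are illustrative, and the destination choice and heuristic pre-filtering are omitted for brevity.

```python
from dataclasses import dataclass


@dataclass
class Package:
    id: str
    value: float      # financial value of the package
    urgency: float    # grows as the SLA deadline approaches (e.g. 1 / hours_left)


def priority(pkg: Package) -> float:
    # Priority = Value x Urgency, as in the formula above.
    return pkg.value * pkg.urgency


def v1_slice(queue: list, fraction: float) -> list:
    """V1 agent action: take the top `fraction` of the priority queue.

    fraction == 0.0 is the 'do nothing' action. The LP solver then picks the
    vehicle quantity and types for the selected packages.
    """
    ranked = sorted(queue, key=priority, reverse=True)
    return ranked[: int(len(ranked) * fraction)]
```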
Figure 1: V1 Architecture — The Agent tries to micromanage the queue
But then, it hit me!
Version 2. Agents Sending Trucks
TL;DR: Instead of selecting packages, we built an agent that selects how many trucks to dispatch to each destination. The Linear Programming (LP) solver then decides exactly which packages to pack into those trucks.
What if the agent controlled the fleet capacity directly? This allows the LP solver to handle the low-level “bin packing” work, while the RL agent focuses purely on high-level flow management. This is exactly what we needed!
Here is the new division of labor (a toy LP sketch follows the list):
RL Agent — Fleet Manager. Decides the quantity of vehicles and their destinations.
- Intuition: It looks at the map, checks the calendar, and shouts: “Send 5 trucks to the North Hub!” It handles the flow management.
- Skill: Strategy, foresight, and balancing.
LP Solver — Dock Worker. Selects the specific vehicle types (optimizing the fleet mix) and picks the specific packages to pack.
- Intuition: It takes the “5 trucks” order and the pile of boxes, then packs them perfectly to maximize value density.
- Skill: Tetris, algebra, and physical validity.
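To give a flavour of the “Dock Worker” step, here is a deliberately simplified sketch using the open-source PuLP library (not necessarily the solver we used): a single-truck binary knapsack that maximizes loaded value under weight and volume limits. The production LP is richer, since it also optimizes the fleet mix across vehicle types for the truck count the agent ordered, so treat this as an assumption-laden toy.

```python
import pulp


def pack_truck(packages, max_weight_kg: float, max_volume_m3: float):
    """Toy 'dock worker': pick packages for ONE truck, maximizing loaded value.

    `packages` is a list of objects with .id, .value, .weight_kg and .volume_m3
    attributes (hypothetical). The real LP also chooses the vehicle types.
    """
    prob = pulp.LpProblem("pack_one_truck", pulp.LpMaximize)
    take = {p.id: pulp.LpVariable(f"take_{p.id}", cat="Binary") for p in packages}

    # Objective: maximize total package value on board.
    prob += pulp.lpSum(p.value * take[p.id] for p in packages)

    # Hard physical constraints: the laws of physics the LP is there to enforce.
    prob += pulp.lpSum(p.weight_kg * take[p.id] for p in packages) <= max_weight_kg
    prob += pulp.lpSum(p.volume_m3 * take[p.id] for p in packages) <= max_volume_m3

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [p for p in packages if (take[p.id].value() or 0) > 0.5]
```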
Previously, the agent controlled a “fraction of the queue,” which determined the package count, which determined the truck count, which finally determined the reward. Now, the agent controls the truck count directly. The link between Action and Reward became much shorter and more predictable, making training faster and more stable. In technical terms, we significantly reduced the stochastic noise in the reward signal. The LP now optimizes only the packaging and fleet mix after the strategic capacity decision has already been made.
But the engineering benefits didn’t stop there. Since the LP now selects the packages, we no longer need to maintain a sorted Priority Queue. This simplified the architecture in three critical ways:
- Concurrency: we eliminated the technical multiprocessing headaches associated with sharing complex PriorityQueue objects between processes.
- Vectorization: we no longer have to iterate through a queue item by item (a slow Python loop). Everything can be rewritten using matrix operations, which unlocked massive potential for speed optimization. Plus, the code became significantly shorter and cleaner.
- Multi-destination actions: the agent can now dispatch X trucks to N different warehouses in a single step (unlike V1, which was limited to one destination per step).

It became immediately clear that this was the winning architecture.
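A tiny illustration of the vectorization point: if per-node inventories live in a matrix keyed by (current node, destination), a dispatch becomes one bulk array update instead of a per-package loop. The shapes and random data here are, of course, made up.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
num_nodes = 5

# inventory[i, j] = packages currently sitting at node i and destined for node j
inventory = rng.integers(0, 50, size=(num_nodes, num_nodes)).astype(float)
np.fill_diagonal(inventory, 0)  # nothing is "destined" for the node it already sits at

# dispatch[i, j] = packages the action moves out of node i toward node j this hour;
# a single action can touch several destinations (several columns of row i) at once.
dispatch = np.minimum(inventory, rng.integers(0, 20, size=(num_nodes, num_nodes)))

# One bulk matrix update replaces a per-package Python loop over a priority queue.
inventory -= dispatch
outbound_per_node = dispatch.sum(axis=1)  # total volume each node ships this step
```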
Figure 2: V2 Architecture — The “Fleet Manager” Approach
Scale-Invariant Observation Space and Generalization
TL;DR: I use histogram state representations normalized to 0–1 instead of absolute values to make the agents transferable to new cases.
A core pillar of this project’s philosophy is **universality** — the ability to reuse the solution across different tasks and new conditions without retraining. However, standard RL requires a rigidly fixed action and observation space.
To reconcile this, we normalized the observation space to make it scale-invariant. Instead of tracking raw counts (e.g., “how many packages were sent”), we track ratios (e.g., “what percentage of the total backlog was sent”). This allows the agent to operate on a higher level of abstraction where absolute numbers are irrelevant.
The result is a model capable of generalizing across different scenarios, enabling zero-shot transfer across nodes with vastly different capacities.
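A minimal example of the ratio trick (a hypothetical helper; the real observation space has many more features than a single count vector):

```python
import numpy as np


def to_ratios(counts) -> np.ndarray:
    """Convert raw per-destination counts into scale-invariant ratios in [0, 1].

    A hub holding 10 packages and a hub holding 10,000 produce observations on
    the same scale, which is what lets one policy transfer across nodes.
    """
    counts = np.asarray(counts, dtype=np.float32)
    total = counts.sum()
    return counts / total if total > 0 else np.zeros_like(counts)


print(to_ratios([30, 10, 60]))        # [0.3 0.1 0.6]
print(to_ratios([3000, 1000, 6000]))  # identical: [0.3 0.1 0.6]
```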
A Glimpse of the Performance
Agents Learned “LTL Consolidation” Behavior
TL;DR: Increased shipment cost led to more idle actions and fewer vehicles.
One of the most impressive emergent behaviors was the agents’ ability to perform LTL (Less-Than-Truckload) Consolidation. At the beginning of training, the agents were trigger-happy, dispatching many partially filled trucks at every step. Over time, their behavior shifted.
The shipment cost is calculated as a product of the vehicle cost and the shipment cost multiplier. When the shipment cost multiplier increases, a shipment costs more in relation to the value of the packages. That gives us a simple way to adjust the shipment cost part of the reward manually.
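In code, that knob is roughly the following. The reward composition beyond the cost term is a simplification on my part: the real reward also accounts for SLA (lateness) penalties and other terms.

```python
def shipment_cost(vehicle_cost: float, cost_multiplier: float) -> float:
    # The knob used in the experiments: raising the multiplier makes every
    # dispatch more expensive relative to the value of the packages it carries.
    return vehicle_cost * cost_multiplier


def step_reward(delivered_value: float, vehicle_costs: list,
                cost_multiplier: float) -> float:
    # Sketch only: shows how the cost term enters the reward with a negative sign.
    return delivered_value - sum(
        shipment_cost(c, cost_multiplier) for c in vehicle_costs
    )
```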
Figure 3: Total number of vehicles sent by an agent. One point — one “20-day” episode
As we increased the shipment cost multiplier (making logistics more expensive relative to the package value), the agents learned to be patient. They began choosing more “idle” actions, effectively accumulating inventory to send fewer, fuller trucks.
Figure 4: Total agent reward. One point — one “20-day” episode
Because it is costly to send a truck half-empty (or half-full, depending on your worldview), agents started waiting to fill the trucks closer to 100% capacity. In other words, the agents learned to optimize vehicle utilization indirectly, purely as a byproduct of the cost/reward function.
On the other hand, sending fewer vehicles led to a higher number of overdue packages. I believe this kind of trade-off — cost vs. speed — should be decided by each business independently, based on their specific strategy and SLAs. In our specific case, we had a hard cap on the percentage of allowed delays, hence we could optimize by staying below that cap.
More results and experiments will be shown in the upcoming Part 3.
Constraints and Benefits
As I mentioned earlier, high-quality data is crucial for this engine. If you don’t have data, you have no simulation, no schedules, and no package flow forecasts — the very foundation of the entire system.
You also need the willingness to adapt your business processes. In practice, this is often met with resistance. And, of course, you need the raw compute power (substantial RAM + CPU) to run the simulations.
But if you can overcome these hurdles, you might find that your logistics network has transformed into something much more powerful — a network that:
- Can withstand overloads, peak seasons, and sudden events. This is because you have a fast, reliable way to generate a new schedule instantly by simply applying your pre-trained agents to the new data.
- Is more efficient than the competition. MARL has the potential to achieve not just local optimization, but global optimization of the entire network over a continuous time horizon.
- Can rapidly expand or contract as needed. This flexibility is achieved precisely through the model’s **generalization** capabilities.
All the best to everyone, and may your shipments always be fast and reliable!
See the upcoming Part 2 for the implementation specifics and tricks I used to make this work!