The PyTorch team at Meta, stewards of the PyTorch open source machine learning framework, has unveiled Monarch, a distributed programming framework intended to bring the simplicity of PyTorch to entire clusters. Monarch pairs a Python-based front end, which supports integration with existing code and libraries such as PyTorch, with a Rust-based back end, which the team said provides performance, scalability, and robustness.
Announced October 22, Monarch is a framework based on scalable actor messaging that lets users program distributed systems as if they were a single machine, hiding the complexity of distributed computing, the PyTorch team said. Monarch is currently in an experimental stage; installation instructions can be found at meta-pytorch.org.
Monarch organizes processes, actors, and hosts into a scalable multidimensional array, or mesh, that can be manipulated directly. Users can operate on entire meshes, or on slices of them, through simple APIs, while Monarch handles distribution and vectorization automatically. Developers can write code as if nothing fails, according to the PyTorch team. When something does fail, Monarch fails fast by stopping the whole program; users can later add fine-grained fault handling where needed, catching and recovering from failures.
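In practice that programming model looks like ordinary Python. The sketch below is illustrative only: the module path and names (`monarch.actor`, `Actor`, `endpoint`, `this_host`, `spawn_procs`) are assumptions based on Monarch's early examples and, given the project's experimental status, may not match the current API.

```python
# Illustrative sketch only -- the monarch.actor names below are assumptions
# drawn from early examples and may differ in the current experimental API.
from monarch.actor import Actor, endpoint, this_host


class Greeter(Actor):
    @endpoint
    def greet(self, name: str) -> str:
        return f"hello, {name}"


# Spawn a mesh of processes on the local host (one per GPU, for example),
# then spawn a Greeter actor in each process of that mesh.
procs = this_host().spawn_procs(per_host={"gpus": 8})
greeters = procs.spawn("greeter", Greeter)

# A single call is broadcast to every actor in the mesh; Monarch handles
# the distribution and collects the results.
print(greeters.greet.call("world").get())
```

Slicing the mesh would narrow the same call to a subset of processes, and under the fault model described above, a failure anywhere in the mesh stops the program unless the caller explicitly catches and handles it.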
Monarch splits control plane messaging from data plane transfers, enabling direct GPU-to-GPU memory transfers across a cluster: commands travel through one path while data moves through another. Monarch also integrates with PyTorch to provide tensors that are sharded across clusters of GPUs. Tensor operations look local but execute across large distributed clusters, with Monarch handling the complexity of coordinating thousands of GPUs, the PyTorch team said.
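To make the "looks local, runs distributed" idea concrete, the runnable snippet below is a plain-PyTorch analogue, not Monarch code: a logical tensor is split into shards and an operation is applied shard by shard, which is the kind of bookkeeping Monarch's tensor integration aims to hide behind ordinary tensor syntax.

```python
# A local, runnable analogue of sharded tensors (plain PyTorch, not Monarch).
# Monarch's integration aims to hide this shard-by-shard bookkeeping so the
# same computation can be written as if the tensor lived on one device.
import torch

full = torch.arange(16.0)              # the logical tensor
shards = list(full.chunk(4))           # as it might live across 4 GPUs

# Each "device" applies the same operation to its own shard; the shard
# results compose to exactly what the single-device computation produces.
sharded_result = torch.cat([s * 2 for s in shards])
assert torch.equal(sharded_result, full * 2)
```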
The PyTorch team warned that in Monarch’s current stage of development, users should expect bugs, incomplete features, and APIs that may change in future versions.