The TensorPool Agent is an autonomous monitoring and recovery system for long-running distributed training jobs on Kubernetes, Slurm, or TensorPool Jobs. It’s designed for large multi-node training jobs that run for days to weeks. When the TensorPool Agent detects a runtime error, it attempts to autonomously recover your training job from its last checkpoint. You explicitly whitelist the actions the TensorPool Agent can take on your behalf. Best case: The TensorPool Agent recovers your training job when you are AFK, letting you get more iteration cycles and avoid burning GPU hours. Worst case: The TensorPool Agent delivers a preliminary root cause analysis and the actions it would have taken.
Target Failures
The TensorPool Agent is designed to address runtime errors that occur deep into training:
- GPU hardware faults: Xid errors (79, 63, 48, etc.)
- Distributed communication failures, NCCL errors
- Infrastructure problems: hardware failures, kernel panics
- Storage problems: I/O errors, checkpoint corruption, S3 timeouts
- Network problems: mounted object storage bucket issues
- GPU memory problems: CUDA out of memory, memory leaks, gradient explosion
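Recovery from the last checkpoint assumes your training script already writes checkpoints it can resume from. Below is a minimal sketch of that pattern in PyTorch (illustrative only, not TensorPool code; the path and helper names are placeholders), using an atomic rename to guard against the checkpoint-corruption failure mode listed above:

```python
# Minimal sketch (not TensorPool code): periodic checkpointing that makes
# resume-from-last-checkpoint possible after a crash. Paths are placeholders.
import os
import torch

CKPT_PATH = "checkpoints/latest.pt"  # hypothetical location

def save_checkpoint(model, optimizer, step):
    tmp = CKPT_PATH + ".tmp"
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "step": step}, tmp)
    os.replace(tmp, CKPT_PATH)  # atomic rename avoids half-written checkpoints

def load_checkpoint(model, optimizer):
    if not os.path.exists(CKPT_PATH):
        return 0  # fresh start
    ckpt = torch.load(CKPT_PATH, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"] + 1  # resume after the last saved step
```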
How It Works
- Registration: Provide credentials to access your job scheduler of choice (Slurm, K8s, or TensorPool Jobs) on the TensorPool Agent dashboard. Whitelist the actions the agent is allowed to take on your behalf.
- Monitoring: The training job is continuously monitored for failure.
- Recovery (if the job fails): The TensorPool Agent analyzes logs and attempts to diagnose and fix the issue. The job enters a recovering state.
- Resolution: If recovery succeeds, monitoring resumes. You’re alerted about the failure, the actions taken, and the recovery status. If the TensorPool Agent lacks permissions, it provides a list of the actions it attempted and would have tried.
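Conceptually, the flow above looks something like the sketch below. This is not the TensorPool Agent's implementation; every function is a hypothetical stand-in for scheduler and agent calls.

```python
# Conceptual sketch of the monitor -> recover -> resolve flow described above.
# NOT the TensorPool Agent's implementation; all helpers are hypothetical.
import time

def get_job_status(job_id):                      # stand-in for a scheduler query
    return "running"

def analyze_logs(job_id):                        # stand-in for log analysis
    return {"suspected_cause": "unknown"}

def attempt_recovery(job_id, diagnosis, allowed_actions):
    # Only whitelisted actions are applied; anything else is reported instead.
    return "recovered" if "resubmit_from_checkpoint" in allowed_actions else "needs_user"

def monitor(job_id, allowed_actions, poll_seconds=60):
    while True:
        status = get_job_status(job_id)
        if status == "running":
            time.sleep(poll_seconds)             # keep monitoring
        elif status == "succeeded":
            return "completed"
        else:                                    # failure detected -> "recovering"
            diagnosis = analyze_logs(job_id)
            if attempt_recovery(job_id, diagnosis, allowed_actions) == "recovered":
                continue                         # recovery succeeded, resume monitoring
            return "completed"                   # report diagnosis and planned actions instead
```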
TensorPool Agent Status Lifecycle
| Status | Description |
|---|---|
| pending | TensorPool Agent created, credentials being validated |
| enabled | TensorPool Agent is monitoring the job |
| credential_error | Credential validation failed; the job is not accessible to the TensorPool Agent. Fix the credentials and resubmit |
| recovering | Job failure detected, TensorPool Agent is attempting to recover it |
| completed | Job finished (succeeded or unrecoverable) |
You will be notified via text or email whenever the TensorPool Agent enters the recovering state.
Failure Detection
The TensorPool Agent has the following definitions of failure for each job scheduler:
TensorPool Jobs
Only jobs in the ERROR state trigger the TensorPool Agent.
Kubernetes
| K8s Resource Kind | Failure Condition |
|---|---|
| Job | status.failed >= 1 |
| Deployment | status.availableReplicas == 0 |
| StatefulSet | status.readyReplicas == 0 |
| DaemonSet | status.numberReady == 0 |
| Pod | status.phase in (Failed, Unknown) or resource not found |
| KubeFlow PyTorchJob/TFJob/MPIJob | Kubeflow Failed condition or no active/succeeded replicas |
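As an illustration, the Job condition in the table (status.failed >= 1) can be checked with the official Kubernetes Python client. This is a sketch, not how the TensorPool Agent performs the check; the job name and namespace are placeholders:

```python
# Sketch: checking the Job failure condition from the table above
# (status.failed >= 1) with the official Kubernetes Python client.
from kubernetes import client, config

config.load_kube_config()                       # uses the kubeconfig you provide
batch = client.BatchV1Api()
job = batch.read_namespaced_job(name="my-training-job", namespace="default")

failed_pods = job.status.failed or 0            # None when no pods have failed
if failed_pods >= 1:
    print(f"Job has {failed_pods} failed pod(s) -> counts as failed")
else:
    print("Job is not in a failed state")
```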
Slurm
Failure states (trigger the TensorPool Agent): FAILED, TIMEOUT, NODE_FAIL, OUT_OF_MEMORY
Non-failure states (do not trigger the TensorPool Agent): CANCELLED, PREEMPTED, DEADLINE
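For reference, the same classification can be reproduced with sacct on the login node (requires Slurm accounting to be enabled). A sketch; the job ID is a placeholder:

```python
# Sketch: classifying a Slurm job's state the same way as the lists above.
import subprocess

FAILURE_STATES = {"FAILED", "TIMEOUT", "NODE_FAIL", "OUT_OF_MEMORY"}
NON_FAILURE_STATES = {"CANCELLED", "PREEMPTED", "DEADLINE"}

def slurm_job_state(job_id: str) -> str:
    out = subprocess.run(
        ["sacct", "-j", job_id, "-X", "--noheader", "--format=State"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    # CANCELLED can appear as "CANCELLED by <uid>", so keep the first token.
    return out.split()[0] if out else "UNKNOWN"

state = slurm_job_state("123456")
if state in FAILURE_STATES:
    print(f"{state}: would trigger the TensorPool Agent")
elif state in NON_FAILURE_STATES:
    print(f"{state}: would not trigger the TensorPool Agent")
```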
Setup Requirements
The information you must provide for the TensorPool Agent to monitor a job depends on the job scheduler.
TensorPool Jobs
The simplest option: just provide your TensorPool job ID.
| Field | Description |
|---|---|
| Job ID | Your TensorPool job ID |
Kubernetes
| Field | Description |
|---|---|
| Kubeconfig | Valid Kubernetes kubeconfig file |
| Job YAML | Valid Kubernetes resource manifest |
Kubeconfig Requirements:
```yaml
apiVersion: v1
kind: Config
clusters:
- name: my-cluster
  cluster:
    server: https://...
contexts:
- name: my-context
  context:
    cluster: my-cluster
    user: my-user
users:
- name: my-user
  user:
    token: ...
```
Job YAML Requirements:
```yaml
apiVersion: batch/v1 # matches kind: Job; other kinds use their own apiVersion
kind: Job # Supported: Job, Deployment, StatefulSet, DaemonSet, Pod, PyTorchJob, TFJob, MPIJob
metadata:
  name: my-training-job # generateName is NOT supported
  namespace: default
spec:
  # ...
```
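A quick local sanity check of the manifest against these requirements might look like the following. This is illustrative only (uses PyYAML; the file name is a placeholder):

```python
# Sketch: a local sanity check of the manifest before submitting it, mirroring
# the requirements above (explicit metadata.name, no generateName).
import yaml  # PyYAML

with open("job.yaml") as f:
    manifest = yaml.safe_load(f)

metadata = manifest.get("metadata", {})
assert "name" in metadata, "metadata.name is required"
assert "generateName" not in metadata, "generateName is not supported"
assert manifest.get("kind") in {
    "Job", "Deployment", "StatefulSet", "DaemonSet", "Pod",
    "PyTorchJob", "TFJob", "MPIJob",
}, "unsupported resource kind"
print("manifest looks OK for the TensorPool Agent")
```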
Slurm
| Field | Description |
|---|---|
| Login node IP | IP address of the Slurm login node |
| SSH port | SSH port for the Slurm login node |
| SSH username | Username for SSH access to the Slurm login node |
| SSH private key | Private key for SSH authentication to the Slurm login node |
| Slurm job ID | ID of the job to monitor |
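Before registering these credentials, you may want to confirm they actually work. A sketch using paramiko (host, port, username, key path, and job ID are placeholders):

```python
# Sketch: verifying the Slurm credentials above before handing them to the
# TensorPool Agent. All connection details and the job ID are placeholders.
import os
import paramiko

ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect(hostname="login.example.com", port=22, username="trainer",
            key_filename=os.path.expanduser("~/.ssh/id_ed25519"))

# Confirm the job is visible to this account.
_, stdout, stderr = ssh.exec_command("squeue -j 123456 --noheader")
print(stdout.read().decode() or stderr.read().decode())
ssh.close()
```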
Next Steps
- Set up the TensorPool Agent on the dashboard
- Learn about TensorPool Jobs for running training workloads