The GPU rental market is a bit of a mess.
First, not even big cloud providers can guarantee GPU availability at all times and places. Sure, you could keep the VM running for months, but that would burn your budget quite quickly, especially if you are a solo developer or work in a small team.
Second, permanent storage is severely limited. Some providers offer persistent drives whose lifecycles are independent of the VMs. But these drives are still usually bound to a specific region. What if that region is currently out of GPUs? That drive is now essentially useless.
I’ve been thinking about how to organize my workflow around these limitations. Here is what I have come up with so far.
Rsync to the rescue
At my job, I have to fine-tune and train a lot of models. My company has a Lambda.ai account, which I can use to access some heavy-duty compute. Unfortunately, Lambda suffers from the two issues I just described. GPU availability is volatile, their persistent storage is bound to a specific region, and at $0.20 per GB a month, it is also quite expensive.
I don’t want to burn my company’s money by keeping the VM running permanently, but I also need to quickly access my code and data on Lambda.
After some exploration, I came up with a solution based on Rsync. The process is simple:
- I boot up a machine with a GPU in any region where it’s currently available
- I use Rsync to copy the codebase and required training data to the machine
- I install UV and use it to create the virtual environment (see the sketch after this list)
- I run whatever task I need
- I run Rsync again to copy the results back to my local computer
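On the remote side, the environment setup is only a couple of commands. Here is a rough sketch, assuming a pyproject.toml-based project; train.py stands in for whatever task I actually need to run:
# On the Lambda VM: install uv and build the project environment
curl -LsSf https://astral.sh/uv/install.sh | sh
export PATH="$HOME/.local/bin:$PATH"   # make uv available in the current shell
cd ~/code
uv sync                  # creates .venv and installs dependencies from pyproject.toml / uv.lock
uv run python train.py   # placeholder for the actual task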
Compared to alternatives like scp, Rsync offers an important advantage. If I change my code on my local machine, Rsync only transfers files that were actually modified. If something doesn’t work as expected, I can quickly edit my code, sync it, and run it again.
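A convenient way to see this in action is rsync's dry-run mode, which itemizes what would be transferred without actually sending anything (the IP below is a placeholder):
rsync -avzn --itemize-changes --exclude='.venv' code/ ubuntu@<lambda-ip>:~/code/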
I asked Claude to write a simple bash script that takes the IP of the Lambda VM as an argument and then:
- Syncs the codebase in the local machine -> lambda direction
- Syncs data in the local machine -> lambda direction
- Syncs results in the lambda -> local machine direction
Here is how it looks:
#!/bin/bash
# Check if IP address argument is provided
if [ $# -eq 0 ]; then
    echo "Error: IP address required"
    echo "Usage: $0 <ip-address>"
    exit 1
fi

IP=$1
REMOTE_USER="ubuntu"

# Sync code directory (local -> remote), excluding .venv
echo "Syncing code to remote..."
rsync -avz --exclude='.venv' code/ ${REMOTE_USER}@${IP}:~/code/
if [ $? -ne 0 ]; then
    echo "Error: Failed to sync code directory"
    exit 1
fi

# Sync data directory (local -> remote)
echo "Syncing data to remote..."
rsync -avz data/ ${REMOTE_USER}@${IP}:~/data/
if [ $? -ne 0 ]; then
    echo "Error: Failed to sync data to remote"
    exit 1
fi

# Sync data directory (remote -> local) to pull results back
echo "Syncing data from remote..."
rsync -avz ${REMOTE_USER}@${IP}:~/data/ data/
if [ $? -ne 0 ]; then
    echo "Error: Failed to sync data from remote"
    exit 1
fi

echo "All syncs completed successfully!"
Whenever I need to sync something, I just invoke this script. I don’t need to remember the exact arguments, and it always transfers only files that actually changed.
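Invoking it is just the script plus the VM's IP address (the script name and IP here are placeholders):
./sync_lambda.sh 203.0.113.42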
Limitations
There are two limiting factors to consider when using this solution: the amount of data to transfer and your internet speed. I mostly work with datasets ranging from a few hundred megabytes to a few gigabytes, and my maximum upload speed is about 500 Mbit/s. The Rsync solution works quite well for such a use case. If your connection is slow or if you work with much larger datasets, you might need to find a different solution.
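As a rough back-of-envelope (assuming a 5 GB dataset and the full 500 Mbit/s of upload bandwidth, ignoring protocol overhead), a full first upload takes on the order of a minute and a half:
# 5 GB is roughly 40,000 Mbit; at 500 Mbit/s that is about 80 seconds
echo "$(( 5 * 8 * 1000 / 500 )) seconds"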