We’ve been running a microservices platform (mostly Node.js/Python services) across about 20 production instances, and our deployment process was becoming a real bottleneck. We were seeing failures maybe 3-4 times per week, usually human error or inconsistent processes.
I spent some time over the past quarter building out better automation around our deployment pipeline. Nothing revolutionary, but it’s made a significant difference in reliability.
The main issues we were hitting:
Services getting deployed when system resources were already strained
Inconsistent rollback procedures when things went sideways
Poor visibility into deployment health until customers complained
Manual verification steps that people would skip under pressure
Approach:
Built this into our existing CI/CD pipeline (we’re using GitLab CI). The core improvement was making deployment verification automatic rather than manual.
Pre-deployment resource check:
#!/bin/bash
# Gate the deploy: bail out if the target host is already under resource pressure.
cpu_usage=$(ps -eo pcpu | awk 'NR>1 {sum+=$1} END {print sum}')
memory_usage=$(free | awk 'NR==2{printf "%.1f", $3*100/$2}')
disk_usage=$(df / | awk 'NR==2{print $5}' | sed 's/%//')
# cpu_usage and memory_usage are floats, so both comparisons go through bc
if (( $(echo "$cpu_usage > 75" | bc -l) )) || (( $(echo "$memory_usage > 80" | bc -l) )) || [ "$disk_usage" -gt 85 ]; then
  echo "System resources too high for safe deployment"
  echo "CPU: ${cpu_usage}% | Memory: ${memory_usage}% | Disk: ${disk_usage}%"
  exit 1
fi
The deployment script handles blue-green switching with automatic rollback on health check failure:
#!/bin/bash
# Args: service name, image tag to deploy, service port
SERVICE_NAME=$1
NEW_VERSION=$2
SERVICE_PORT=${SERVICE_PORT:-$3}   # from the environment or the third argument
STAGING_PORT=$((SERVICE_PORT + 1))

# Start the new version on the alternate port alongside the running container
docker run -d --name ${SERVICE_NAME}_staging \
  -p ${STAGING_PORT}:${SERVICE_PORT} \
  ${SERVICE_NAME}:${NEW_VERSION}

# Wait for startup, then health-check the staging container up to three times
sleep 20
for i in {1..3}; do
  if curl -sf http://localhost:${STAGING_PORT}/health; then
    echo "Health check passed"
    break
  fi
  if [ $i -eq 3 ]; then
    echo "Health check failed, cleaning up"
    docker stop ${SERVICE_NAME}_staging
    docker rm ${SERVICE_NAME}_staging
    exit 1
  fi
  sleep 10
done

# Switch traffic to the new container (we're using nginx upstream)
sed -i "s/localhost:${SERVICE_PORT}/localhost:${STAGING_PORT}/" /etc/nginx/conf.d/${SERVICE_NAME}.conf
nginx -s reload

# Final verification against the new container, then retire the old one
sleep 5
if curl -sf http://localhost:${STAGING_PORT}/health; then
  docker stop ${SERVICE_NAME}_prod 2>/dev/null || true
  docker rm ${SERVICE_NAME}_prod 2>/dev/null || true
  docker rename ${SERVICE_NAME}_staging ${SERVICE_NAME}_prod
  echo "Deployment completed successfully"
else
  # Rollback: point nginx back at the old port and remove the failed container
  sed -i "s/localhost:${STAGING_PORT}/localhost:${SERVICE_PORT}/" /etc/nginx/conf.d/${SERVICE_NAME}.conf
  nginx -s reload
  docker stop ${SERVICE_NAME}_staging
  docker rm ${SERVICE_NAME}_staging
  echo "Deployment failed, rolled back"
  exit 1
fi
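A deploy is then kicked off with the service name, image tag, and port, e.g. (the script name and values here are purely illustrative):
./deploy.sh orders-api 2.4.1 8080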
Post-deployment verification runs a few smoke tests against critical endpoints:
#!/bin/bash
SERVICE_URL=$1
CRITICAL_ENDPOINTS=("/api/status" "/api/users/health" "/api/orders/health")
echo "Running post-deployment verification..."

# Every critical endpoint must return 200
for endpoint in "${CRITICAL_ENDPOINTS[@]}"; do
  response=$(curl -s -o /dev/null -w "%{http_code}" "${SERVICE_URL}${endpoint}")
  if [ "$response" != "200" ]; then
    echo "Endpoint ${endpoint} returned ${response}"
    exit 1
  fi
done

# Check response times
response_time=$(curl -o /dev/null -s -w "%{time_total}" "${SERVICE_URL}/api/status")
if (( $(echo "$response_time > 2.0" | bc -l) )); then
  echo "Response time too high: ${response_time}s"
  exit 1
fi
echo "All verification checks passed"
Results:
Deployment failures down to maybe once a month, usually actual code issues rather than process problems
Mean time to recovery improved significantly because rollbacks are automatic
Team is much more confident about deploying, especially late in the day
The biggest win was making the health checks and rollback completely automatic. Before this, someone had to remember to check if the deployment actually worked, and rollbacks were manual.
We’re still iterating on this - thinking about adding some basic load testing to the verification step, and better integration with our monitoring stack for deployment event correlation.
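For the load-testing piece, even something crude like firing a few hundred concurrent requests at the status endpoint and checking the failure rate would probably catch the worst regressions. A rough sketch, with the request count, concurrency, and failure threshold all as placeholder numbers:
#!/bin/bash
# Rough load check: fire REQUESTS requests (CONCURRENCY at a time) at the
# status endpoint and fail if too many come back as something other than 200.
SERVICE_URL=$1
REQUESTS=200
CONCURRENCY=20
MAX_FAILURES=5

failures=$(seq "$REQUESTS" | xargs -P "$CONCURRENCY" -I{} \
  curl -s -o /dev/null -w "%{http_code}\n" "${SERVICE_URL}/api/status" \
  | grep -cv '^200$')

if [ "$failures" -gt "$MAX_FAILURES" ]; then
  echo "Load check failed: ${failures}/${REQUESTS} requests did not return 200"
  exit 1
fi
echo "Load check passed (${failures}/${REQUESTS} failures)"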
Anyone else working on similar deployment reliability improvements? Curious what approaches have worked for other teams.