If you’ve worked with data science or machine learning, you already know this part is not fun:
- installing Python packages
- fixing dependency conflicts
- matching library versions
- repeating the same setup on every new machine

Before you even write your first line of actual ML code, you’ve already burned an hour.
In this post, I’ll walk through:
- what a practical data science environment actually needs
- common mistakes people make during setup
- one clean way to avoid the whole mess on cloud VMs

This is written from a hands-on infrastructure perspective.
What a Real Data Science Environment Needs
A usable data science setup is more than “Python installed”.
At minimum, you usually need:
- Core data & numerical stack: NumPy, Pandas, SciPy
- Visualization: Matplotlib, Seaborn, Plotly
- Machine learning: Scikit-learn, XGBoost / LightGBM / CatBoost
- Deep learning (CPU or GPU): PyTorch, TensorFlow / Keras
- Notebooks & dev tools: JupyterLab, IPython, Requests, tqdm, etc.
- Database connectivity: most real projects also pull data from PostgreSQL / MySQL and MongoDB, which means you need the client libraries, not just Python itself (there’s a connection sketch further down).
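A quick way to confirm a fresh environment actually covers this list is a small import check. Here’s a minimal sketch; the package list is just an example, so swap in whatever your project really uses:

```python
# env_check.py - sanity-check that the core stack is importable
import importlib

# Example package list; adjust to your project's actual stack.
PACKAGES = [
    "numpy", "pandas", "scipy",         # core data & numerical
    "matplotlib", "seaborn", "plotly",  # visualization
    "sklearn", "xgboost",               # machine learning
    "torch",                            # deep learning
    "psycopg2", "pymongo",              # database connectivity
]

for name in PACKAGES:
    try:
        module = importlib.import_module(name)
        print(f"{name:<12} {getattr(module, '__version__', 'ok')}")
    except ImportError as err:
        print(f"{name:<12} MISSING ({err})")
```

Run it once after setup and you immediately know which of the usual suspects is missing.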
Missing any of these usually leads to:

- “ModuleNotFoundError”
- “Version conflict”
- “Works on my machine”
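On the database point specifically: a client library being installed doesn’t mean it can reach your database. A rough sketch of checking both PostgreSQL and MongoDB connectivity, assuming `psycopg2` and `pymongo` are installed (hostnames and credentials here are placeholders):

```python
# db_check.py - verify database connectivity (placeholder credentials)
import psycopg2                  # PostgreSQL client
from pymongo import MongoClient  # MongoDB client

# PostgreSQL: open a connection and run a trivial query.
pg = psycopg2.connect(
    host="db.example.com", dbname="analytics",
    user="analyst", password="secret",
)
with pg.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone())
pg.close()

# MongoDB: ping the server to confirm the client can reach it.
mongo = MongoClient("mongodb://db.example.com:27017")
print(mongo.admin.command("ping"))
```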
Why Local Setup Becomes Painful
Local environments break down fast when:
- you switch machines
- you collaborate with others
- you need more RAM or CPU
- you reinstall your OS

Conda helps, Docker helps, but both still require:

- learning curves
- maintenance
- debugging broken environments

For many people, the problem isn’t coding; it’s environment reliability.
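Whatever tool you land on, one cheap step toward reliability is pinning exact versions, so an environment can be rebuilt instead of debugged. A stdlib-only sketch that snapshots whatever is currently installed:

```python
# freeze_env.py - write exact version pins for everything installed
from importlib.metadata import distributions

# A pinned file lets you rebuild the same environment elsewhere,
# e.g. with: pip install -r requirements.lock.txt
with open("requirements.lock.txt", "w") as f:
    for dist in sorted(distributions(), key=lambda d: d.metadata["Name"].lower()):
        f.write(f"{dist.metadata['Name']}=={dist.version}\n")
```

(`pip freeze > requirements.txt` gets you most of the way too; this is just the same idea spelled out in code.)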
A Cleaner Approach: Pre-Configured Cloud VMs
One approach that’s worked well for me is using a pre-configured cloud VM where:
- the OS is already set up
- common data science and ML libraries are pre-installed
- database connectors are ready
- SSH access works out of the box

You spin it up, SSH in, and start coding.
No fighting pip. No rebuilding environments. No “let me install this first”.
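For illustration, the “SSH in and start coding” step can be a single port-forwarded JupyterLab session. Here it is wrapped in Python; the host address is a placeholder:

```python
# remote_lab.py - start JupyterLab on the VM and tunnel it to localhost
import subprocess

HOST = "user@203.0.113.10"  # placeholder address; use your VM's IP

# Forward the VM's port 8888 to localhost:8888, then launch JupyterLab
# remotely. Open http://localhost:8888 in your browser once it starts.
subprocess.run([
    "ssh", "-L", "8888:localhost:8888", HOST,
    "jupyter lab --no-browser --port=8888",
])
```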
This is especially useful when:
- experimenting quickly
- onboarding new teammates
- running heavier workloads than a laptop can handle
What to Look for in a Data Science VM
If you go this route, make sure the VM actually provides:
- 20+ commonly used Python data science and ML libraries
- SQL and MongoDB client connectors
- SSH access with full control
- scalable CPU and RAM
- no forced managed services you didn’t ask for

GPU support is a bonus, but it only matters if you truly need it.
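Once you’re on the box, it takes a few seconds to confirm the resources match what was promised. A minimal, Linux-only sketch:

```python
# vm_resources.py - report the VM's CPU count and total RAM (Linux)
import os

print("CPUs:", os.cpu_count())

# /proc/meminfo is Linux-specific; MemTotal is reported in kB.
with open("/proc/meminfo") as f:
    for line in f:
        if line.startswith("MemTotal"):
            total_kb = int(line.split()[1])
            print(f"RAM:  {total_kb / 1024 / 1024:.1f} GiB")
            break
```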
A Practical Example (Disclosure)
I recently set this up internally as a Data Science VM with:
- a pre-installed Python data science and machine learning stack
- SQL and MongoDB connectors
- SSH access
- scalable resources

If you’re curious what that looks like in practice, here’s a reference implementation:
https://manage.digirdp.com/store/data-science-vm
Disclosure: this is a product from my own infrastructure setup. Linking for reference, not as a requirement.
Final Thoughts
Tooling should get out of your way, not slow you down.
Whether you:
- build your own base image,
- use a managed platform, or
- run a pre-configured VM,

the goal is the same:
Spend time on data and models, not environment firefighting.
If you’ve found cleaner ways to manage data science environments, I’d love to hear them in the comments.