If you’ve ever opened a scientific code repository and found yourself lost in a maze of analysis files, scripts, data, and outputs all jumbled together, you’re not alone. While line-of-business developers have established conventions (like David Fowler’s .NET project structure), scientific computing has its own unique needs that require a different approach.
Scientific code repositories face challenges that traditional software projects don’t: managing raw and processed data, organizing interactive analyses alongside automated scripts, handling computational experiments, and ensuring reproducibility across different computing environments and languages.
This post proposes a starting point for discussion. I’d love to hear from the community: How do you organize your scientific codebases? What works? What doesn’t? What am I missing?
Why Organization Matters in Scientific Computing
Poor organization doesn’t just create inconvenience—it creates real problems. Lost data, irreproducible analyses, and hours wasted searching for files are common consequences. A well-organized repository accelerates science, facilitates collaboration, and makes your research reproducible.
When everyone on your team can predict where files should be, collaboration becomes intuitive rather than frustrating. But what’s the right structure?
The Fundamental Principle: Separate by Type and Purpose
The key organizing principle for scientific repositories is to structure by file type and purpose. This creates consistency across projects and makes it intuitive to find what you need. Here’s a proposed structure:
project-name/
├── data/
│   ├── raw/
│   └── processed/
├── src/
│   ├── data_processing/
│   ├── analysis/
│   └── visualization/
├── notebooks/
├── scripts/
├── results/
│   ├── figures/
│   └── output/
├── docs/
├── tests/
├── environment/
├── README.md
├── LICENSE
└── .gitignore
Let’s break down each component and when to use it. But remember: this is a starting point for discussion, not a rigid prescription.
Core Directories
data/ - The Sacred Ground
Your data directory should have two subdirectories:
- data/raw/: Original, unmodified data files. Treat this as immutable. Never, ever modify files here. You can even set these files as read-only to prevent accidents.
- data/processed/: Cleaned, transformed, or analyzed versions of your data.
This separation is crucial for reproducibility. Anyone should be able to run your processing code on the raw data and regenerate your processed data.
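For illustration, here is a minimal Python sketch of such a processing step. The file name measurements.csv, the value column, and the cleaning rule are all hypothetical; the point is that the script reads only from data/raw/ and writes only to data/processed/.

```python
# Hypothetical processing step: read the immutable raw file, write the
# derived file to data/processed/. File and column names are illustrative.
from pathlib import Path

import pandas as pd

RAW = Path("data/raw")
PROCESSED = Path("data/processed")

def clean_measurements() -> None:
    """Regenerate data/processed/measurements.csv from the raw file."""
    df = pd.read_csv(RAW / "measurements.csv")  # raw input, never written to
    df = df.dropna(subset=["value"])            # example cleaning rule
    PROCESSED.mkdir(parents=True, exist_ok=True)
    df.to_csv(PROCESSED / "measurements.csv", index=False)

if __name__ == "__main__":
    clean_measurements()
```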
Important note: For large datasets, you may want to store only metadata or download scripts in version control rather than the actual data files. Consider using services like OSF, Zenodo, or institutional repositories for large data files.
Question for the community: Do you organize data differently? How do you handle intermediate processing stages? Do you have a data/interim/ directory?
src/ - Your Core Analysis Code
This is where reusable, production-quality code lives. Think of this as the "scientific guts" of your project. Code here should be:
- Organized into logical modules or packages
- Well-documented
- Testable and (ideally) tested
- Importable from interactive environments or scripts
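As a toy Python illustration, a module along these lines (the path src/analysis/effects.py and the statistic are made up for the example) ticks all four boxes:

```python
# src/analysis/effects.py (hypothetical): a small, documented, testable
# function that notebooks and scripts can both import.
import numpy as np

def treatment_effect(treated: np.ndarray, control: np.ndarray) -> float:
    """Difference in group means, as a plain float."""
    return float(np.mean(treated) - np.mean(control))
```

For the import to work from notebooks and scripts alike, src/ needs to be on the import path, for example by structuring it as a package and installing it with pip install -e . from the project root.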
Question for the community: How do you organize multi-language projects? Do you have separate directories for different languages, or do you mix them in src/?
notebooks/ - Exploration and Prototyping
Interactive environments (Jupyter notebooks, R Markdown, Pluto.jl notebooks, MATLAB Live Scripts, Mathematica notebooks) are fantastic for exploration, but they come with challenges: they can encourage non-modular code, become difficult to test, and tend to accumulate cruft over time.
Best practices:
- Use interactive environments for exploration, visualization, and prototyping
- Keep them focused on specific questions or analyses
- Name them clearly with numbers for ordering: 01-data-exploration.ipynb, 02-initial-modeling.Rmd, 03-sensitivity-analysis.jl
- Transition mature code to the src/ directory
- Import functions from src/ rather than copying code between notebooks
When an interactive session starts to feel unwieldy, extract the reusable parts into modules in src/.
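Continuing the hypothetical module from above, the notebook side of that refactoring can shrink to a single cell that imports the function instead of redefining it:

```python
# Notebook cell after refactoring: the logic lives in src/, the notebook
# only calls it and inspects the result.
import numpy as np

from analysis.effects import treatment_effect  # hypothetical module in src/

rng = np.random.default_rng(42)
treated = rng.normal(loc=1.0, size=100)
control = rng.normal(loc=0.0, size=100)
print(treatment_effect(treated, control))
```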
Question for the community: Some researchers prefer to keep all work in notebooks/scripts. Others prefer to move everything to modules. What’s your philosophy? Does it depend on the project stage?
scripts/ - Automation and Batch Processing
Scripts are for automated, reproducible workflows. These files:
- Run without interaction
- Accept command-line arguments or configuration files
- Can be executed on computing clusters or in pipelines
- Orchestrate complete analyses from start to finish
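As a sketch, a small Python script built on argparse fits all four properties; the script name and every flag here are hypothetical:

```python
# scripts/fit_model.py (hypothetical): runs without interaction and takes
# its parameters from the command line, so it works in pipelines and on
# computing clusters.
import argparse
from pathlib import Path

def main() -> None:
    parser = argparse.ArgumentParser(description="Fit the model on one dataset.")
    parser.add_argument("--input", type=Path, required=True,
                        help="processed data file to fit on")
    parser.add_argument("--output-dir", type=Path, default=Path("results/output"))
    parser.add_argument("--seed", type=int, default=0)
    args = parser.parse_args()

    args.output_dir.mkdir(parents=True, exist_ok=True)
    # ... load args.input, fit with args.seed, write to args.output_dir ...

if __name__ == "__main__":
    main()
```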
Example use cases:
- Data download and preprocessing pipelines
- Running models with different parameters
- Generating all figures for a paper
- Batch processing multiple datasets
A controller script (e.g., run_all.sh, Makefile, Snakefile) that executes the entire analysis workflow in order is extremely valuable for reproducibility.
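If you would rather stay in one language than reach for Make, even a plain Python controller helps. This sketch assumes the four stage scripts exist and are named as shown:

```python
# scripts/run_all.py (hypothetical): execute every stage in order and stop
# at the first failure, so a single command reproduces the analysis.
import subprocess
import sys

STAGES = [  # stage scripts are hypothetical; list them in execution order
    ["python", "scripts/download_data.py"],
    ["python", "scripts/preprocess.py"],
    ["python", "scripts/fit_model.py", "--input", "data/processed/measurements.csv"],
    ["python", "scripts/make_figures.py"],
]

for cmd in STAGES:
    print(">>", " ".join(cmd))
    if subprocess.run(cmd).returncode != 0:
        sys.exit(1)
```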
Question for the community: Do you use workflow managers (Make, Snakemake, Nextflow, Drake, Luigi)? How do you organize pipeline definitions?
results/ - Analysis Outputs
Store generated outputs here, not in version control (add to .gitignore):
results/
├── figures/   # Plots, visualizations
├── output/    # Tables, statistics, processed results
└── models/    # Trained models, fitted parameters
Why separate from data? Results are generated by code and should be reproducible. If you lose them, you can regenerate them. Data (especially raw data) cannot be regenerated.
Question for the community: Do you version control any results? How do you handle results that take days/weeks to generate?
docs/ - Documentation
Include:
- Project documentation
- Analysis notes or lab notebook entries
- Manuscript drafts
- Supplementary materials
- API documentation (if auto-generated)
Question for the community: Where do you keep your manuscript? In the repo? In a separate repo? In Overleaf or Google Docs?
tests/ - Test Code
Yes, even scientific code should have tests! At minimum, include:
- Unit tests for core functions
- Integration tests for workflows
- Validation tests that check against known results
Testing helps ensure correctness and catches bugs when you modify code.
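As a concrete sketch using pytest, a small test file for the hypothetical treatment_effect function from earlier might combine a unit test with a validation test against a hand-computed value:

```python
# tests/test_effects.py (hypothetical): unit and validation tests for the
# example function sketched for src/analysis/effects.py.
import numpy as np
import pytest

from analysis.effects import treatment_effect

def test_zero_effect_for_identical_groups():
    x = np.array([1.0, 2.0, 3.0])
    assert treatment_effect(x, x) == 0.0

def test_against_hand_computed_result():
    treated = np.array([1.0, 2.0, 3.0])   # mean 2.0
    control = np.array([0.0, 1.0, 2.0])   # mean 1.0
    assert treatment_effect(treated, control) == pytest.approx(1.0)
```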
Question for the community: What’s your testing philosophy for scientific code? Do you test everything, just critical functions, or not at all?
Essential Files in the Root
README.md
Your README is the front door to your project. It should include:
- Project overview and goals
- Installation instructions
- Quick start guide
- Project structure explanation
- How to reproduce key results
- Dependencies and requirements
- Citation information
- Contact information
Environment Management
Include files that specify your computational environment. The specifics depend on your language and tools:
Python:
- environment.yml (for conda)
- requirements.txt (for pip)
- pyproject.toml (for modern Python packaging)
R:
- renv.lock (for renv)
- DESCRIPTION (for R packages)
- install.R (installation script)
Julia:
- Project.toml and Manifest.toml
MATLAB:
- Dependency list in README or separate documentation
Multi-language projects:
- Dockerfile (for containerized environments)
- Separate environment files for each language
- Shell script that sets up entire environment
Question for the community: How do you handle dependencies that span multiple languages? Do you use containers? Virtual machines? Detailed documentation?
LICENSE
If your project is open source, include a license. Common choices for scientific code include MIT, BSD, or GPL licenses.
.gitignore
Prevent clutter by ignoring generated files. Language-specific examples can be found at https://github.com/github/gitignore/tree/main.
All projects:
data/raw/*
results/*
.DS_Store (macOS metadata)
Organizing for Different Project Scales
Small Projects (Single Analysis → Single Paper)
For small projects, a simplified structure works well:
project/
├── data/
├── analysis/
├── results/
├── environment/
└── README.md
This is fine for exploratory work or class projects where you don’t expect significant expansion.
Large Projects (Multiple Studies, Multiple Papers)
For field studies, multi-year projects, or projects with multiple outputs:
project/
├── data/
│   ├── study1/
│   ├── study2/
│   └── shared/
├── src/
│   ├── preprocessing/
│   ├── analysis_core/
│   └── utils/
├── analyses/
│   ├── paper1/
│   ├── paper2/
│   └── exploratory/
├── docs/
└── manuscripts/
    ├── paper1/
    └── paper2/
The key is organizing analyses by the output (paper, report) they support while keeping shared code in a central src/ directory.
Question for the community: How do you organize multi-year, multi-paper projects? One repo or many? How do you handle shared code?
Self-Contained Projects
Strive for self-containedness: everything needed to reproduce your analysis should be in the project directory. This sometimes conflicts with avoiding duplication (e.g., when multiple projects share data), but for publishable research, self-containedness usually wins.
A colleague should be able to:
- Clone your repository
- Set up the environment
- Run your analysis scripts
- Reproduce your results
Question for the community: How do you balance self-containedness with sharing code/data between projects?
Version Control Best Practices
What to Track
- All code (scripts, interactive analyses, source files)
- Documentation
- Environment specifications
- Small data files (< 100MB)
- Metadata about large data files
What NOT to Track
- Large data files (use data repositories instead)
- Generated results and figures
- Temporary files
- Language-specific artifacts (compiled binaries, caches)
- IDE-specific settings (unless shared intentionally)
Commit Messages
Good commit messages should describe what changed scientifically, not just what changed in the code:
❌ "Updated analysis"
✅ "Added bootstrap confidence intervals to treatment effect estimates"
Question for the community: Do you have commit message conventions for scientific projects?
Common Pitfalls and Solutions
Pitfall 1: The Analysis File Sprawl
Problem: 50 files with names like "Untitled1", "Copy of Final_Analysis_v3"
Solution:
- Name files descriptively with numbered prefixes
- Regularly archive or delete obsolete files
- Transition stable code to modules
Pitfall 2: Data Soup
Problem: Raw data, processed data, results all mixed together
Solution:
- Strict separation of raw/processed/results
- Document data provenance
- Use metadata files to track processing steps
Pitfall 3: Hardcoded Paths
Problem: Absolute paths that only work on one person’s computer
Solution:
- Use relative paths from the project root
- Use path manipulation libraries (pathlib in Python, file.path in R, etc.)
- Consider configuration files for environment-specific settings
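A minimal sketch of that pattern in Python, assuming the file lives one directory below the project root (say, in src/):

```python
# src/paths.py (hypothetical): one place that resolves the project root,
# so every other module builds paths relative to it.
from pathlib import Path

# parents[1] assumes this file sits one directory below the project root.
PROJECT_ROOT = Path(__file__).resolve().parents[1]
DATA_RAW = PROJECT_ROOT / "data" / "raw"
RESULTS = PROJECT_ROOT / "results"
```

Every other module can then import these constants instead of hardcoding absolute paths.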
Pitfall 4: Missing Dependencies
Problem: "It works on my machine" but fails everywhere else
Solution:
- Maintain updated environment specifications
- Document system dependencies
- Consider using containers (Docker, Singularity) for complex environments
Question for the community: What other pitfalls do you encounter? What solutions work for you?
Making This Real: A Call for Examples
Theory is great, but examples are better. I’d love to see:
- Links to well-organized public scientific repositories
- Templates or cookiecutter projects for different fields
- Adaptations of this structure for specific use cases
- Counter-arguments: When does this structure NOT work?
Conclusion: The Conversation Starts Here
Good code organization isn’t about following rules dogmatically—it’s about making your life easier and your science better. The structure I’ve proposed here is a starting point, not a final answer.
Core principles that seem universal:
- Separate concerns: Data, code, and results in different directories
- Preserve raw data: Never modify original data files
- Make code modular: Extract reusable functionality
- Document everything: Future you will thank present you
- Use version control: Track changes and enable collaboration
- Enable reproduction: Someone else should be able to reproduce your work
But the implementation of these principles will vary based on:
- Your programming language(s)
- Your field’s conventions
- Your team’s preferences
- Your project’s scale and complexity
- Your computing environment (laptop, HPC cluster, cloud)
Your Turn: Join the Discussion
I’d love to hear from you:
- What works? How do you organize your scientific code? What’s your directory structure?
- What doesn’t work? What have you tried that failed? What pain points remain?
- What’s missing? What essential aspects of scientific code organization did I overlook?
- Language-specific tips? What works particularly well in your language of choice?
- Field-specific conventions? What are the norms in your discipline?
Please share your experiences in the comments. Let’s build a community knowledge base of what actually works in practice.
Further Resources
"Good Enough Practices in Scientific Computing" by Wilson et al.1 - Comprehensive guide to scientific computing practices
Software Carpentry - Workshops on version control, testing, and project organization2
"Ten Simple Rules" series in PLOS Computational Biology3 - Including rules for taking advantage of Git and GitHub
Cookie Cutter Data Science - A standardized project structure template4
The Turing Way - Handbook for reproducible, ethical, and collaborative data science5
Remember: the goal isn’t perfection, it’s progress. Start organizing better today, and iterate as you learn what works for you and your team.
What’s your approach to organizing scientific code? Share your structure, tips, or questions in the comments below!
1. Wilson G, Bryan J, Cranston K, Kitzes J, Nederbragt L, Teal TK (2017) Good enough practices in scientific computing. PLOS Computational Biology 13(6): e1005510. https://doi.org/10.1371/journal.pcbi.1005510
2. Software Carpentry. "Lessons." https://software-carpentry.org/lessons/
3. Perez-Riverol Y, Gatto L, Wang R, Sachsenberg T, Uszkoreit J, Leprevost FdV, et al. (2016) Ten Simple Rules for Taking Advantage of Git and GitHub. PLOS Computational Biology 12(7): e1004947. https://doi.org/10.1371/journal.pcbi.1004947
4. DrivenData. "Cookiecutter Data Science." https://cookiecutter-data-science.drivendata.org/
5. The Turing Way Community. (2022). The Turing Way: A handbook for reproducible, ethical and collaborative research. Zenodo. https://doi.org/10.5281/zenodo.3233853