Running Apache Airflow for a year with ~10 DAGs is enough time to learn what hurts: inconsistent folder structures, copy-pasted business logic, "mystery DAGs" no one remembers writing, and configuration hard-coded all over the place. My own Airflow environment started exactly this way: it worked, it ran daily, and it delivered results, but it became very difficult to maintain. Adding new datasets meant adding more boilerplate. Bringing collaborators in only amplified the pain.
This guide documents how I refactored my Airflow project into a clean, modular, team-ready framework. It includes real folder structures, a sample config.yaml, a reusable transform module, and a production-ready DAG example.
If you're planning to evolve your Airflow project from "whatever works" to "engineered for scale," this walkthrough will save you weeks.
1. Why Refactor an Existing Airflow Project?
After a year of daily operation, I realized:
1. The original folder layout could not scale
Different DAGs dumped logic into dags/, creating an unmaintainable mix of:
- scrapers,
- ETL logic,
- business rules,
- helper utilities.
2. Adding a new DAG required copying another DAG
Hard-coding URLs, S3 paths, filenames, and API keys made every script unique.
3. No central configuration
Each DAG carried its own constants, leading to:
- inconsistent naming conventions,
- hard-to-update pipelines,
- unclear data lineage.
4. Solo developer → team development
What works for one person does not work for four. Team readiness requires:
- clear conventions,
- shared modules,
- safe refactoring boundaries,
- testable logic outside Airflow.
2. Final Refactored Folder Structure
Here is the clean, layered, and highly extensible structure I migrated to:
airflow-project/
├── common/
│   ├── config_loader.py
│   ├── paths.py
│   └── settings.py
├── config/
│   ├── base.yaml
│   ├── sports.yaml
│   ├── github.yaml
│   └── social_media.yaml
├── dags/
│   ├── sports_stats_update.py
│   ├── github_repository_metrics.py
│   └── social_media_publisher.py
├── plugins/
│   └── operators/
│       ├── http_operator.py
│       └── s3_upload_operator.py
├── transforms/
│   ├── sports/
│   │   └── sports_scraper.py
│   ├── github/
│   │   └── github_analytics.py
│   └── social/
│       └── tweet_processor.py
└── docker-compose.yml
Key principles:
- dags/ contains only orchestration.
- transforms/ holds business logic.
- config/ provides per-pipeline configuration.
- common/ contains reusable utilities (paths, loaders, env settings; a sketch follows below).
- Operators live under plugins/.
This separation turns DAGs into thin orchestration layers.
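The common/ helpers are not shown in full in this guide, so here is a minimal sketch of what a paths.py could look like. The AIRFLOW_DATA_ROOT variable and the dated_path() helper are illustrative names I am assuming for this example, not a fixed API:

common/paths.py (illustrative sketch)

import os
from datetime import datetime
from pathlib import Path

# Root data directory inside the containers; the env var name is illustrative.
DATA_ROOT = Path(os.getenv("AIRFLOW_DATA_ROOT", "/opt/airflow/data"))


def dated_path(subdir: str, suffix: str = ".json") -> Path:
    """Build a date-stamped path under DATA_ROOT, e.g. sports/raw/2024-01-15.json."""
    target_dir = DATA_ROOT / subdir
    target_dir.mkdir(parents=True, exist_ok=True)
    return target_dir / f"{datetime.utcnow().date()}{suffix}"

With a helper like this, the f-string path building in the DAG example later in this guide collapses to something like dated_path(CONFIG["storage"]["raw_dir"]).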
3. The Power of Central Configuration (config/*.yaml)
One of the most effective improvements was switching to YAML-driven DAG configuration.
Here's a simplified example for a sports data pipeline:
config/sports.yaml
---
dataset: "daily_sports_stats"

fetch:
  url: "https://api.sportsdata.io/v3/scores/json/GamesByDate"
  params:
    api_key: "{{ env.SPORTS_API_KEY }}"
    date_format: "%Y-%m-%d"

storage:
  raw_dir: "sports/raw/"
  cleaned_dir: "sports/cleaned/"
  suffix: "_sports_stats.json"

schedule:
  cron: "0 3 * * *"
Why YAML?
- No hard-coded credentials or endpoints in DAGs.
- Ability to clone a DAG and reconfigure it simply by swapping YAML files.
- New developers understand pipelines just by reading config files.
- CI/CD can validate configs before deployment.
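The DAG in section 5 calls load_config("sports.yaml") from common/config_loader.py, which is not shown in full in this guide. A minimal sketch of that loader, assuming a shallow merge of base.yaml with the pipeline file and a config directory mounted at /opt/airflow/config (both assumptions about my setup, not a prescribed API):

common/config_loader.py (illustrative sketch)

import os
from pathlib import Path
from typing import Any, Dict

import yaml  # PyYAML


# Assumed location of the mounted config/ directory.
CONFIG_DIR = Path(os.getenv("AIRFLOW_CONFIG_DIR", "/opt/airflow/config"))


def load_config(filename: str) -> Dict[str, Any]:
    """Load shared defaults from base.yaml, then overlay the pipeline-specific file."""
    config: Dict[str, Any] = {}

    base_path = CONFIG_DIR / "base.yaml"
    if base_path.exists():
        config.update(yaml.safe_load(base_path.read_text()) or {})

    config.update(yaml.safe_load((CONFIG_DIR / filename).read_text()) or {})
    return config

Env-var placeholders such as {{ env.SPORTS_API_KEY }} are resolved later, in the transform layer, so the loader itself stays trivial.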
4. Example of a Reusable Transform Module
This is where real refactoring value appears.
transforms/sports/sports_scraper.py
import json
import os
from datetime import datetime
from typing import Any, Dict, List

import requests


def fetch_sports_data(config: Dict[str, Any]) -> List[Dict[str, Any]]:
    url = config["fetch"]["url"]
    params = config["fetch"]["params"].copy()

    # Resolve "{{ env.VAR }}" placeholders against environment variables
    for k, v in params.items():
        if isinstance(v, str) and v.startswith("{{ env."):
            env_key = v.replace("{{ env.", "").replace(" }}", "")
            params[k] = os.getenv(env_key)

    # date_format is config metadata, not an API parameter, so pop it before the request
    date_format = params.pop("date_format", "%Y-%m-%d")
    params["date"] = datetime.utcnow().strftime(date_format)

    response = requests.get(url, params=params, timeout=30)
    response.raise_for_status()
    return response.json()


def clean_sports_data(data: List[Dict[str, Any]]) -> Dict[str, Any]:
    # Example cleaning operation: keep only the fields downstream consumers need
    return {
        "games": [
            {
                "id": game.get("GameID"),
                "home_team": game.get("HomeTeam"),
                "away_team": game.get("AwayTeam"),
                "status": game.get("Status"),
            }
            for game in data
        ]
    }


def save_json(data: Any, output_path: str) -> str:
    # Persist any JSON-serializable payload and return the path for downstream tasks
    with open(output_path, "w") as f:
        json.dump(data, f, indent=2)
    return output_path
This is fully unit-testable without Airflow.
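For example, a pytest case for clean_sports_data() needs nothing more than a hand-built list of fake game records. The file name and sample values below are made up for illustration:

tests/test_sports_scraper.py (illustrative sketch)

from transforms.sports.sports_scraper import clean_sports_data


def test_clean_sports_data_keeps_only_expected_fields():
    # Fake payload shaped like the games list the API returns
    raw = [
        {"GameID": 1, "HomeTeam": "AAA", "AwayTeam": "BBB", "Status": "Final", "Extra": "dropped"},
        {"GameID": 2, "HomeTeam": "CCC", "AwayTeam": "DDD", "Status": "Scheduled"},
    ]

    cleaned = clean_sports_data(raw)

    assert set(cleaned) == {"games"}
    assert len(cleaned["games"]) == 2
    assert cleaned["games"][0] == {
        "id": 1,
        "home_team": "AAA",
        "away_team": "BBB",
        "status": "Final",
    }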
5. A Clean, Minimal DAG Example
Now the DAG becomes simple and declarative.
dags/sports_stats_update.py
import json
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

from common.config_loader import load_config
from transforms.sports.sports_scraper import (
    fetch_sports_data,
    clean_sports_data,
    save_json,
)

CONFIG = load_config("sports.yaml")


def task_fetch(**_):
    data = fetch_sports_data(CONFIG)
    path = f"/opt/airflow/data/{CONFIG['storage']['raw_dir']}{datetime.utcnow().date()}.json"
    return save_json(data, path)


def task_clean(**_):
    raw_path = f"/opt/airflow/data/{CONFIG['storage']['raw_dir']}{datetime.utcnow().date()}.json"
    with open(raw_path) as f:
        data = json.load(f)
    cleaned = clean_sports_data(data)
    output_path = f"/opt/airflow/data/{CONFIG['storage']['cleaned_dir']}{datetime.utcnow().date()}_cleaned.json"
    return save_json(cleaned, output_path)


with DAG(
    dag_id="sports_stats_update",
    schedule=CONFIG["schedule"]["cron"],
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    fetch_task = PythonOperator(
        task_id="fetch_sports_data",
        python_callable=task_fetch,
    )
    clean_task = PythonOperator(
        task_id="clean_sports_data",
        python_callable=task_clean,
    )

    fetch_task >> clean_task
This DAG contains:
- no URLs
- no S3 paths
- no API keys
- no transform logic
- very little Python
All complexity is handled in:
- config/
- transforms/
This is how a team scales.
6. Benefits Observed After Refactoring
1. Adding new DAGs takes minutes, not hours
Copy a config file → write a new small DAG → reuse existing transforms.
2. Team onboarding became smooth
Developers understand the project through:
- folder structure,
- YAML files,
- self-contained transform modules.
3. Fewer mistakes
Configuration mistakes appear early since each pipeline has predictable structure.
4. CI/CD enforcement
We could now run:
- ruff for linting
- black for formatting
- yamllint for config validation
- pre-commit hooks for consistency (a sample configuration sketch follows below)
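A sketch of the kind of .pre-commit-config.yaml this implies; the rev values below are placeholders and should be pinned to whatever releases you actually use:

.pre-commit-config.yaml (illustrative sketch)

repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.6.9  # placeholder; pin to a current release
    hooks:
      - id: ruff
  - repo: https://github.com/psf/black
    rev: 24.8.0  # placeholder
    hooks:
      - id: black
  - repo: https://github.com/adrienverge/yamllint
    rev: v1.35.1  # placeholder
    hooks:
      - id: yamllint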
7. Final Recommendations for Anyone Maintaining Multiple Airflow DAGs
1. Keep business logic out of DAGs. They should only orchestrate tasks.
2. Use per-DAG config files. YAML is the simplest and most team-friendly.
3. Create a transforms/ directory for all processing code. This is the heart of maintainability.
4. Adopt a common/ module. Shared utilities reduce duplication.
5. Ensure your DAGs are "thin". A DAG file should not exceed ~150 lines.
6. Run pre-commit with:
- ruff
- black
- yamllint
- isort
7. Make Docker your default local environment. Airflow dependency management becomes painless. (A minimal docker-compose excerpt follows below.)
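To illustrate the Docker point, this is roughly the volume and PYTHONPATH wiring that makes the layout above importable inside the containers. It is an excerpt under my assumptions (image tag, service name, paths), not a complete compose file:

docker-compose.yml (illustrative excerpt)

services:
  airflow-scheduler:
    image: apache/airflow:2.9.3  # placeholder tag
    environment:
      # Let `common` and `transforms` resolve as top-level packages
      PYTHONPATH: /opt/airflow
    volumes:
      - ./dags:/opt/airflow/dags
      - ./plugins:/opt/airflow/plugins
      - ./common:/opt/airflow/common
      - ./transforms:/opt/airflow/transforms
      - ./config:/opt/airflow/config
      - ./data:/opt/airflow/data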
Closing Thoughts
Refactoring an Airflow project after a year of operation is not just cleanup; it is an investment in team scalability, developer happiness, and long-term reliability. Moving to a layered structure with reusable transforms and YAML-driven configuration was the most impactful improvement I made.
If your Airflow codebase feels "just good enough," consider standardizing now. It pays off immediately when the first new teammate joins.