Running Apache Airflow for a year with ~10 DAGs is enough time to learn what hurts: inconsistent folder structures, copy-pasted business logic, "mystery DAGs" no one remembers writing, and configuration hard-coded all over the place. My own Airflow environment started exactly this way: it worked, it ran daily, and it delivered results, but it became very difficult to maintain. Adding new datasets meant adding more boilerplate. Bringing collaborators in only amplified the pain.
This guide documents how I refactored my Airflow project into a clean, modular, team-ready framework. It includes real folder structures, a sample config.yaml, a reusable transform module, and a production-ready DAG example.
If you're planning to evolve your Airflow project from "whatever works" to "engineered for scale," this walkthrough will save you weeks.
1. Why Refactor an Existing Airflow Project?
After a year of daily operation, I realized:
1. The original folder layout could not scale
Different DAGs dumped logic into dags/, creating an unmaintainable mix of:
- scrapers,
- ETL logic,
- business rules,
- helper utilities.
2. Adding a new DAG required copying another DAG
Hard-coding URLs, S3 paths, filenames, and API keys made every script unique.
3. No central configuration
Each DAG carried its own constants, leading to:
- inconsistent naming conventions,
- hard-to-update pipelines,
- unclear data lineage.
4. Solo developer → team development
What works for one person does not work for four. Team readiness requires:
- clear conventions,
- shared modules,
- safe refactoring boundaries,
- testable logic outside Airflow.
2. Final Refactored Folder Structure
Here is the clean, layered, and highly extensible structure I migrated to:
airflow-project/
├── common/
│   ├── config_loader.py
│   ├── paths.py
│   └── settings.py
├── config/
│   ├── base.yaml
│   ├── sports.yaml
│   ├── github.yaml
│   └── social_media.yaml
├── dags/
│   ├── sports_stats_update.py
│   ├── github_repository_metrics.py
│   └── social_media_publisher.py
├── plugins/
│   └── operators/
│       ├── http_operator.py
│       └── s3_upload_operator.py
├── transforms/
│   ├── sports/
│   │   └── sports_scraper.py
│   ├── github/
│   │   └── github_analytics.py
│   └── social/
│       └── tweet_processor.py
└── docker-compose.yml
Key principles:
- dags/ contains only orchestration.
- transforms/ holds business logic.
- config/ provides per-pipeline configuration.
- common/ contains reusable utilities (paths, loaders, env settings; a sketch follows below).
- Operators live under plugins/.
This separation turns DAGs into thin orchestration layers.
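The common/ helpers are not shown in full in this guide, so here is a minimal sketch of what a paths.py could look like. The AIRFLOW_DATA_ROOT variable and the dated_path() helper are illustrative names I am assuming for this example, not a fixed API:

common/paths.py (illustrative sketch)

import os
from datetime import datetime
from pathlib import Path

# Root data directory inside the containers; the env var name is illustrative.
DATA_ROOT = Path(os.getenv("AIRFLOW_DATA_ROOT", "/opt/airflow/data"))


def dated_path(subdir: str, suffix: str = ".json") -> Path:
    """Build a date-stamped path under DATA_ROOT, e.g. sports/raw/2024-01-15.json."""
    target_dir = DATA_ROOT / subdir
    target_dir.mkdir(parents=True, exist_ok=True)
    return target_dir / f"{datetime.utcnow().date()}{suffix}"

With a helper like this, the f-string path building in the DAG example later in this guide collapses to something like dated_path(CONFIG["storage"]["raw_dir"]).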
3. The Power of Central Configuration (config/*.yaml)
One of the most effective improvements was switching to YAML-driven DAG configuration.
Here's a simplified example for a sports data pipeline:
config/sports.yaml
---
dataset: "daily_sports_stats"

fetch:
  url: "https://api.sportsdata.io/v3/scores/json/GamesByDate"
  params:
    api_key: "{{ env.SPORTS_API_KEY }}"
    date_format: "%Y-%m-%d"

storage:
  raw_dir: "sports/raw/"
  cleaned_dir: "sports/cleaned/"
  suffix: "_sports_stats.json"

schedule:
  cron: "0 3 * * *"
Why YAML?
- No hard-coded credentials or endpoints in DAGs.
- Ability to clone a DAG and reconfigure it simply by swapping YAML files.
- New developers understand pipelines just by reading config files.
- CI/CD can validate configs before deployment.
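The DAG in section 5 calls load_config("sports.yaml") from common/config_loader.py, which is not shown in full in this guide. A minimal sketch of that loader, assuming a shallow merge of base.yaml with the pipeline file and a config directory mounted at /opt/airflow/config (both assumptions about my setup, not a prescribed API):

common/config_loader.py (illustrative sketch)

import os
from pathlib import Path
from typing import Any, Dict

import yaml  # PyYAML


# Assumed location of the mounted config/ directory.
CONFIG_DIR = Path(os.getenv("AIRFLOW_CONFIG_DIR", "/opt/airflow/config"))


def load_config(filename: str) -> Dict[str, Any]:
    """Load shared defaults from base.yaml, then overlay the pipeline-specific file."""
    config: Dict[str, Any] = {}

    base_path = CONFIG_DIR / "base.yaml"
    if base_path.exists():
        config.update(yaml.safe_load(base_path.read_text()) or {})

    config.update(yaml.safe_load((CONFIG_DIR / filename).read_text()) or {})
    return config

Env-var placeholders such as {{ env.SPORTS_API_KEY }} are resolved later, in the transform layer, so the loader itself stays trivial.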
4. Example of a Reusable Transform Module
This is where real refactoring value appears.
transforms/sports/sports_scraper.py
import json
import os
from datetime import datetime
from typing import Any, Dict, List

import requests


def fetch_sports_data(config: Dict[str, Any]) -> List[Dict[str, Any]]:
    url = config["fetch"]["url"]
    params = config["fetch"]["params"].copy()

    # Resolve "{{ env.VAR }}" placeholders against environment variables
    for k, v in params.items():
        if isinstance(v, str) and v.startswith("{{ env."):
            env_key = v.replace("{{ env.", "").replace(" }}", "")
            params[k] = os.getenv(env_key)

    # date_format is config metadata, not an API parameter, so pop it before the request
    date_format = params.pop("date_format", "%Y-%m-%d")
    params["date"] = datetime.utcnow().strftime(date_format)

    response = requests.get(url, params=params, timeout=30)
    response.raise_for_status()
    return response.json()


def clean_sports_data(data: List[Dict[str, Any]]) -> Dict[str, Any]:
    # Example cleaning operation: keep only the fields downstream consumers need
    return {
        "games": [
            {
                "id": game.get("GameID"),
                "home_team": game.get("HomeTeam"),
                "away_team": game.get("AwayTeam"),
                "status": game.get("Status"),
            }
            for game in data
        ]
    }


def save_json(data: Any, output_path: str) -> str:
    # Persist any JSON-serializable payload and return the path for downstream tasks
    with open(output_path, "w") as f:
        json.dump(data, f, indent=2)
    return output_path
This is fully unit-testable without Airflow.
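For example, a pytest case for clean_sports_data() needs nothing more than a hand-built list of fake game records. The file name and sample values below are made up for illustration:

tests/test_sports_scraper.py (illustrative sketch)

from transforms.sports.sports_scraper import clean_sports_data


def test_clean_sports_data_keeps_only_expected_fields():
    # Fake payload shaped like the games list the API returns
    raw = [
        {"GameID": 1, "HomeTeam": "AAA", "AwayTeam": "BBB", "Status": "Final", "Extra": "dropped"},
        {"GameID": 2, "HomeTeam": "CCC", "AwayTeam": "DDD", "Status": "Scheduled"},
    ]

    cleaned = clean_sports_data(raw)

    assert set(cleaned) == {"games"}
    assert len(cleaned["games"]) == 2
    assert cleaned["games"][0] == {
        "id": 1,
        "home_team": "AAA",
        "away_team": "BBB",
        "status": "Final",
    }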
5. A Clean, Minimal DAG Example
Now the DAG becomes simple and declarative.
dags/sports_stats_update.py
import json
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

from common.config_loader import load_config
from transforms.sports.sports_scraper import (
    fetch_sports_data,
    clean_sports_data,
    save_json,
)

CONFIG = load_config("sports.yaml")


def task_fetch(**_):
    data = fetch_sports_data(CONFIG)
    path = f"/opt/airflow/data/{CONFIG['storage']['raw_dir']}{datetime.utcnow().date()}.json"
    return save_json(data, path)


def task_clean(**_):
    raw_path = f"/opt/airflow/data/{CONFIG['storage']['raw_dir']}{datetime.utcnow().date()}.json"
    with open(raw_path) as f:
        data = json.load(f)
    cleaned = clean_sports_data(data)
    output_path = f"/opt/airflow/data/{CONFIG['storage']['cleaned_dir']}{datetime.utcnow().date()}_cleaned.json"
    return save_json(cleaned, output_path)


with DAG(
    dag_id="sports_stats_update",
    schedule=CONFIG["schedule"]["cron"],
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    fetch_task = PythonOperator(
        task_id="fetch_sports_data",
        python_callable=task_fetch,
    )
    clean_task = PythonOperator(
        task_id="clean_sports_data",
        python_callable=task_clean,
    )

    fetch_task >> clean_task
This DAG contains:
- no URLs
- no S3 paths
- no API keys
- no transform logic
- very little Python
All complexity is handled in:
- config/
- transforms/
This is how a team scales.
6. Benefits Observed After Refactoring
1. Adding new DAGs takes minutes, not hours
Copy a config file → write a new small DAG → reuse existing transforms.
2. Team onboarding became smooth
Developers understand the project through:
- folder structure,
- YAML files,
- self-contained transform modules.
3. Fewer mistakes
Configuration mistakes appear early since each pipeline has predictable structure.
4. CI/CD enforcement
We could now run:
- ruff for linting
- black for formatting
- yamllint for config validation
- pre-commit hooks for consistency (a sample configuration sketch follows below)
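A sketch of the kind of .pre-commit-config.yaml this implies; the rev values below are placeholders and should be pinned to whatever releases you actually use:

.pre-commit-config.yaml (illustrative sketch)

repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.6.9  # placeholder; pin to a current release
    hooks:
      - id: ruff
  - repo: https://github.com/psf/black
    rev: 24.8.0  # placeholder
    hooks:
      - id: black
  - repo: https://github.com/adrienverge/yamllint
    rev: v1.35.1  # placeholder
    hooks:
      - id: yamllint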
7. Final Recommendations for Anyone Maintaining Multiple Airflow DAGs
1. Keep business logic out of DAGs. They should only orchestrate tasks.
2. Use per-DAG config files. YAML is the simplest and most team-friendly.
3. Create a transforms/ directory for all processing code. This is the heart of maintainability.
4. Adopt a common/ module. Shared utilities reduce duplication.
5. Ensure your DAGs are "thin". A DAG file should not exceed ~150 lines.
6. Run pre-commit with:
- ruff
- black
- yamllint
- isort
7. Make Docker your default local environment. Airflow dependency management becomes painless. (A minimal docker-compose excerpt follows below.)
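To illustrate the Docker point, this is roughly the volume and PYTHONPATH wiring that makes the layout above importable inside the containers. It is an excerpt under my assumptions (image tag, service name, paths), not a complete compose file:

docker-compose.yml (illustrative excerpt)

services:
  airflow-scheduler:
    image: apache/airflow:2.9.3  # placeholder tag
    environment:
      # Let `common` and `transforms` resolve as top-level packages
      PYTHONPATH: /opt/airflow
    volumes:
      - ./dags:/opt/airflow/dags
      - ./plugins:/opt/airflow/plugins
      - ./common:/opt/airflow/common
      - ./transforms:/opt/airflow/transforms
      - ./config:/opt/airflow/config
      - ./data:/opt/airflow/data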
Closing Thoughts
Refactoring an Airflow project after a year of operation is not just cleanup; it is an investment in team scalability, developer happiness, and long-term reliability. Moving to a layered structure with reusable transforms and YAML-driven configuration was the most impactful improvement I made.
If your Airflow codebase feels "just good enough," consider standardizing now. It pays off immediately when the first new teammate joins.