⚡ Introduction
In today’s data-driven world, access to reliable and structured energy data is critical for decision-making, research, and policy planning.
However, most open data platforms in Africa — such as the Africa Energy Portal (AEP) — present information in dashboard views, which makes large-scale analysis tedious.
To address this challenge, I built a fully automated ETL (Extract, Transform, Load) pipeline that:
- Scrapes energy indicators for all African countries (2000–2024),
- Formats and validates the data for consistency,
- And stores it in a MongoDB database for easy access and analysis.
This project uses Python, Playwright, and MongoDB, with environments and dependencies managed by uv, a fast Python package and project manager.
🧩 Problem Statement
While the Africa Energy Portal provides valuable country-level datasets, it does not offer a bulk download option.
Researchers, analysts, and energy planners need historical time-series data — such as:
- Electricity generation and consumption
- Renewable energy contribution
- Access to clean cooking
- Population electrification (rural vs urban)
Manually downloading data for 50+ African countries and 20+ years would take days — not counting inconsistencies in data formats and missing years.
The solution: automate it end-to-end.
🧠 Project Goals
- Extract data for all African countries directly from the AEP website.
- Transform it into a structured, tabular format for analysis.
- Store it efficiently in MongoDB for scalability and retrieval.
- Validate data completeness and consistency across countries and indicators.
- Export the final cleaned dataset for analysis and sharing.
⚙️ Tools & Technologies
| Purpose | Tool / Library | Role |
|---|---|---|
| Web scraping | Playwright | Automates browser-based data capture |
| Environment & Dependency Management | uv | Manages virtual environment and packages |
| Data storage | MongoDB | Stores country-wise metrics and year data |
| Data validation & analysis | pandas, pydantic | Cleans and structures data |
| Export | openpyxl | Saves Excel files |
| Scripting | Python | Glue for the entire ETL process |
🔄 ETL Pipeline Overview
The pipeline consists of four modular stages:
Stage 1 – Data Extraction
- Uses Playwright to navigate to each country’s profile page.
- Intercepts the `/get-country-data` XHR response.
- Extracts JSON payloads containing all available indicators and yearly values.
Each JSON record includes:
```json
{
  "country": "Kenya",
  "metric": "Population with access to electricity - National",
  "sector": "ELECTRICITY ACCESS",
  "yearly": {
    "2015": 19.65,
    "2016": 25.73,
    "2022": 42.62
  }
}
```
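The interception step can be sketched with Playwright's `expect_response` helper. This is a minimal sketch rather than the project's actual code: the base URL and slug format are assumptions inferred from the endpoint named above.

```python
# Sketch of Stage 1: capture the /get-country-data XHR payload for one country.

def is_country_data_xhr(url: str) -> bool:
    """Match the XHR the dashboard fires when a country profile loads."""
    return "get-country-data" in url

def fetch_country_payload(country_slug: str) -> dict:
    # Imported lazily so the module still loads without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        # Register the response listener *before* navigating, then wait for it.
        with page.expect_response(lambda r: is_country_data_xhr(r.url)) as resp:
            page.goto(
                f"https://africa-energy-portal.org/country/{country_slug}",
                timeout=60_000,  # generous timeout; some pages take >30 s
            )
        payload = resp.value.json()
        browser.close()
        return payload
```

Registering the listener before `page.goto` matters: the XHR fires during page load, and a listener attached afterwards would miss it.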
Stage 2 – Data Formatting
- Converts raw JSON into a tabular schema: `["country", "country_serial", "metric", "unit", "sector", "sub_sector", "sub_sub_sector", "source_link", "source", "2000", ..., "2024"]`
- Ensures each row represents one metric for one country.
- Fills missing years with `null` values to maintain consistency.
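The flattening step can be sketched as a small function that turns one raw record into one row of the schema above. Only a subset of the fields is shown to keep the sketch short; missing years come out as `None`, which `pymongo` stores as `null`.

```python
# Sketch of Stage 2: flatten one raw JSON record into the tabular schema.
YEARS = [str(y) for y in range(2000, 2025)]  # the 2000-2024 window

def format_record(raw: dict) -> dict:
    """One row = one metric for one country, with a column per year."""
    row = {
        "country": raw.get("country"),
        "metric": raw.get("metric"),
        "sector": raw.get("sector"),
        "source": raw.get("source"),
    }
    yearly = raw.get("yearly", {})
    for year in YEARS:
        row[year] = yearly.get(year)  # None when the year is absent
    return row
```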
Stage 3 – Data Storage
- Inserts formatted records into MongoDB using `pymongo`.
- Adds a unique index on `(country, metric, source)` to prevent duplicates.
- Upserts records, ensuring updates don't create duplicates.
Each MongoDB document looks like this:
```json
{
  "country": "Kenya",
  "metric": "Access to Clean Cooking%",
  "source": "Tracking SDG7/WBG",
  "2000": null,
  "2015": 11.9,
  "2020": 23.6,
  "2024": null
}
```
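The index-plus-upsert pattern can be sketched as below. The functions take any `pymongo` collection (database and collection names are left to the caller), and `1` is pymongo's `ASCENDING` flag written literally so the sketch has no hard dependency on the driver being installed:

```python
# Sketch of Stage 3: a unique compound index plus upserts makes re-runs
# idempotent. `collection` is expected to behave like a pymongo Collection.

def ensure_unique_index(collection) -> None:
    # 1 == pymongo.ASCENDING; the unique constraint rejects duplicates
    # at the database level even if application logic slips.
    collection.create_index(
        [("country", 1), ("metric", 1), ("source", 1)],
        unique=True,
    )

def upsert_record(collection, record: dict) -> None:
    # Match on the natural key, then overwrite the remaining fields.
    key = {k: record[k] for k in ("country", "metric", "source")}
    collection.update_one(key, {"$set": record}, upsert=True)
```

With `upsert=True`, a record that matches the key is updated in place and a new one is inserted otherwise, so re-running the pipeline refreshes data instead of duplicating it.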
Stage 4 – Validation
- Identifies missing years or inconsistent units.
- Detects countries with incomplete datasets.
- Exports a detailed `validation_report.csv` that flags issues automatically.
Sample output:

| issue_type | country | metric | details |
|---|---|---|---|
| MISSING_YEARS | Kenya | Access to Clean Cooking% | 2000–2014, 2023–2024 |
| UNIT_INCONSISTENCY | ALL | Electricity Access | %; MW |
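The MISSING_YEARS check can be sketched as collecting the years a record lacks and compressing consecutive runs into ranges, which is how the details column above reports them. The function name is illustrative:

```python
# Sketch of the Stage 4 missing-years check: gather absent years and
# compress consecutive runs into "start-end" range strings.

YEARS = list(range(2000, 2025))  # the 2000-2024 window scraped above

def missing_year_ranges(record: dict) -> list[str]:
    missing = [y for y in YEARS if record.get(str(y)) is None]
    runs: list[list[int]] = []
    for y in missing:
        if runs and y == runs[-1][1] + 1:
            runs[-1][1] = y          # extend the current consecutive run
        else:
            runs.append([y, y])      # start a new run
    return [f"{a}-{b}" if a != b else str(a) for a, b in runs]
```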
🧾 Data Export
Once the ETL pipeline finishes, data is exported to both CSV and Excel formats for analysis.
```bash
uv run python export_to_csv.py
```
Output files:
- `reports/exports/energy_data.csv`
- `reports/exports/energy_data.xlsx`
⚠️ Challenges Faced
| Challenge | Description |
|---|---|
| Cloudflare protection | The AEP website blocked simple HTTP requests (403, 500). Solved by using Playwright’s browser simulation to mimic human behavior. |
| Slow response times | Some pages took >30 seconds to return data. Added retry logic and longer timeouts. |
| Inconsistent URL naming | Country URLs (like cote-d’ivoire vs cote-divoire) required slug normalization logic. |
| Incomplete datasets | Some countries lacked data for certain years, handled via validation. |
| Browser resource use | Playwright’s real browser automation was resource-heavy; introduced throttling to manage load. |
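The slug-normalization fix from the table above can be sketched as a small helper; the exact slugs the AEP expects are an assumption inferred from the `cote-divoire` example:

```python
# Sketch of slug normalization: strip accents and apostrophes, then
# collapse everything non-alphanumeric into single hyphens.
import re
import unicodedata

def to_slug(name: str) -> str:
    # Drop accents (Côte -> Cote); non-ASCII characters that do not
    # decompose, such as curly apostrophes, are discarded by "ignore".
    ascii_name = (
        unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()
    )
    ascii_name = ascii_name.replace("'", "")  # straight apostrophes survive encoding
    return re.sub(r"[^a-z0-9]+", "-", ascii_name.lower()).strip("-")
```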
📊 Results
- ✅ Successfully extracted data for 50 African countries
- ✅ Collected 500+ indicators covering 2000–2024
- ✅ All records stored in MongoDB with proper schema
- ✅ Automated validation caught missing and inconsistent data
- ✅ Exportable formats ready for visualization and analysis
💡 Key Takeaways
- Automating data extraction from protected websites is possible using browser-level automation (Playwright).
- Designing modular ETL stages makes maintenance and debugging easier.
- Data validation is just as important as extraction — raw data is rarely clean.
- Storing data in MongoDB offers flexibility for hierarchical (nested) data structures.
🧠 Future Work
- Extend scraping to additional AEP datasets (energy pricing, CO₂ emissions).
- Build an interactive dashboard using Streamlit or Power BI.
- Automate periodic updates (monthly/quarterly).
- Add country-level time-series visualization modules.