⚡ Introduction
In today’s data-driven world, access to reliable and structured energy data is critical for decision-making, research, and policy planning.
However, most open data platforms in Africa — such as the Africa Energy Portal (AEP) — present information in dashboard views, which makes large-scale analysis tedious.
To address this challenge, I built a fully automated ETL (Extract, Transform, Load) pipeline that:
- Scrapes energy indicators for all African countries (2000–2024),
- Formats and validates the data for consistency,
- And stores it in a MongoDB database for easy access and analysis.
This project uses Python, Playwright, and MongoDB, with environments and dependencies managed by uv, a fast Python package and project manager.
🧩 Problem Statement
While the Africa Energy Portal provides valuable country-level datasets, it does not offer a bulk download option.
Researchers, analysts, and energy planners need historical time-series data — such as:
- Electricity generation and consumption
- Renewable energy contribution
- Access to clean cooking
- Population electrification (rural vs urban)
Manually downloading data for 50+ African countries and 20+ years would take days — not counting inconsistencies in data formats and missing years.
The solution: automate it end-to-end.
🧠 Project Goals
- Extract data for all African countries directly from the AEP website.
- Transform it into a structured, tabular format for analysis.
- Store it efficiently in MongoDB for scalability and retrieval.
- Validate data completeness and consistency across countries and indicators.
- Export the final cleaned dataset for analysis and sharing.
⚙️ Tools & Technologies
| Purpose | Tool / Library | Role |
|---|---|---|
| Web scraping | Playwright | Automates browser-based data capture |
| Environment & Dependency Management | uv | Manages virtual environment and packages |
| Data storage | MongoDB | Stores country-wise metrics and year data |
| Data validation & analysis | pandas, pydantic | Cleans and structures data |
| Export | openpyxl | Saves Excel files |
| Scripting | Python | Glue for the entire ETL process |
🔄 ETL Pipeline Overview
The pipeline consists of four modular stages:
Stage 1 – Data Extraction
- Uses Playwright to navigate to each country’s profile page.
- Intercepts the `/get-country-data` XHR response.
- Extracts JSON payloads containing all available indicators and yearly values.
Each JSON record includes:
```json
{
  "country": "Kenya",
  "metric": "Population with access to electricity - National",
  "sector": "ELECTRICITY ACCESS",
  "yearly": {
    "2015": 19.65,
    "2016": 25.73,
    "2022": 42.62
  }
}
```
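The interception step can be sketched with Playwright's `expect_response` helper. This is a minimal sketch rather than the project's actual code: the base URL and slug format are assumptions inferred from the endpoint named above.

```python
# Sketch of Stage 1: capture the /get-country-data XHR payload for one country.

def is_country_data_xhr(url: str) -> bool:
    """Match the XHR the dashboard fires when a country profile loads."""
    return "get-country-data" in url

def fetch_country_payload(country_slug: str) -> dict:
    # Imported lazily so the module still loads without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        # Register the response listener *before* navigating, then wait for it.
        with page.expect_response(lambda r: is_country_data_xhr(r.url)) as resp:
            page.goto(
                f"https://africa-energy-portal.org/country/{country_slug}",
                timeout=60_000,  # generous timeout; some pages take >30 s
            )
        payload = resp.value.json()
        browser.close()
        return payload
```

Registering the listener before `page.goto` matters: the XHR fires during page load, and a listener attached afterwards would miss it.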
Stage 2 – Data Formatting
- Converts raw JSON into a tabular schema: `["country", "country_serial", "metric", "unit", "sector", "sub_sector", "sub_sub_sector", "source_link", "source", "2000", ..., "2024"]`
- Ensures each row represents one metric for one country.
- Fills missing years with `null` values to maintain consistency.
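The flattening step can be sketched as a small function that turns one raw record into one row of the schema above. Only a subset of the fields is shown to keep the sketch short; missing years come out as `None`, which `pymongo` stores as `null`.

```python
# Sketch of Stage 2: flatten one raw JSON record into the tabular schema.
YEARS = [str(y) for y in range(2000, 2025)]  # the 2000-2024 window

def format_record(raw: dict) -> dict:
    """One row = one metric for one country, with a column per year."""
    row = {
        "country": raw.get("country"),
        "metric": raw.get("metric"),
        "sector": raw.get("sector"),
        "source": raw.get("source"),
    }
    yearly = raw.get("yearly", {})
    for year in YEARS:
        row[year] = yearly.get(year)  # None when the year is absent
    return row
```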
Stage 3 – Data Storage
- Inserts formatted records into MongoDB using `pymongo`.
- Adds a unique index on `(country, metric, source)` to prevent duplicates.
- Upserts records, ensuring updates don't create duplicates.
Each MongoDB document looks like this:
```json
{
  "country": "Kenya",
  "metric": "Access to Clean Cooking%",
  "source": "Tracking SDG7/WBG",
  "2000": null,
  "2015": 11.9,
  "2020": 23.6,
  "2024": null
}
```
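The index-plus-upsert pattern can be sketched as below. The functions take any `pymongo` collection (database and collection names are left to the caller), and `1` is pymongo's `ASCENDING` flag written literally so the sketch has no hard dependency on the driver being installed:

```python
# Sketch of Stage 3: a unique compound index plus upserts makes re-runs
# idempotent. `collection` is expected to behave like a pymongo Collection.

def ensure_unique_index(collection) -> None:
    # 1 == pymongo.ASCENDING; the unique constraint rejects duplicates
    # at the database level even if application logic slips.
    collection.create_index(
        [("country", 1), ("metric", 1), ("source", 1)],
        unique=True,
    )

def upsert_record(collection, record: dict) -> None:
    # Match on the natural key, then overwrite the remaining fields.
    key = {k: record[k] for k in ("country", "metric", "source")}
    collection.update_one(key, {"$set": record}, upsert=True)
```

With `upsert=True`, a record that matches the key is updated in place and a new one is inserted otherwise, so re-running the pipeline refreshes data instead of duplicating it.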
Stage 4 – Validation
- Identifies missing years or inconsistent units.
- Detects countries with incomplete datasets.
- Exports a detailed `validation_report.csv` that flags issues automatically.
Sample output:

| issue_type | country | metric | details |
|---|---|---|---|
| MISSING_YEARS | Kenya | Access to Clean Cooking% | 2000–2014, 2023–2024 |
| UNIT_INCONSISTENCY | ALL | Electricity Access | %; MW |
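The MISSING_YEARS check can be sketched as collecting the years a record lacks and compressing consecutive runs into ranges, which is how the details column above reports them. The function name is illustrative:

```python
# Sketch of the Stage 4 missing-years check: gather absent years and
# compress consecutive runs into "start-end" range strings.

YEARS = list(range(2000, 2025))  # the 2000-2024 window scraped above

def missing_year_ranges(record: dict) -> list[str]:
    missing = [y for y in YEARS if record.get(str(y)) is None]
    runs: list[list[int]] = []
    for y in missing:
        if runs and y == runs[-1][1] + 1:
            runs[-1][1] = y          # extend the current consecutive run
        else:
            runs.append([y, y])      # start a new run
    return [f"{a}-{b}" if a != b else str(a) for a, b in runs]
```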
🧾 Data Export
Once the ETL pipeline finishes, data is exported to both CSV and Excel formats for analysis.
```bash
uv run python export_to_csv.py
```
Output files:
- `reports/exports/energy_data.csv`
- `reports/exports/energy_data.xlsx`
⚠️ Challenges Faced
| Challenge | Description |
|---|---|
| Cloudflare protection | The AEP website blocked simple HTTP requests (403, 500). Solved by using Playwright’s browser simulation to mimic human behavior. |
| Slow response times | Some pages took >30 seconds to return data. Added retry logic and longer timeouts. |
| Inconsistent URL naming | Country URLs (like cote-d’ivoire vs cote-divoire) required slug normalization logic. |
| Incomplete datasets | Some countries lacked data for certain years, handled via validation. |
| Browser resource use | Playwright’s real browser automation was resource-heavy; introduced throttling to manage load. |
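The slug-normalization fix from the table above can be sketched as a small helper; the exact slugs the AEP expects are an assumption inferred from the `cote-divoire` example:

```python
# Sketch of slug normalization: strip accents and apostrophes, then
# collapse everything non-alphanumeric into single hyphens.
import re
import unicodedata

def to_slug(name: str) -> str:
    # Drop accents (Côte -> Cote); non-ASCII characters that do not
    # decompose, such as curly apostrophes, are discarded by "ignore".
    ascii_name = (
        unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()
    )
    ascii_name = ascii_name.replace("'", "")  # straight apostrophes survive encoding
    return re.sub(r"[^a-z0-9]+", "-", ascii_name.lower()).strip("-")
```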
📊 Results
- ✅ Successfully extracted data for 50 African countries
- ✅ Collected 500+ indicators covering 2000–2024
- ✅ All records stored in MongoDB with proper schema
- ✅ Automated validation caught missing and inconsistent data
- ✅ Exportable formats ready for visualization and analysis
💡 Key Takeaways
- Automating data extraction from protected websites is possible using browser-level automation (Playwright).
- Designing modular ETL stages makes maintenance and debugging easier.
- Data validation is just as important as extraction — raw data is rarely clean.
- Storing data in MongoDB offers flexibility for hierarchical (nested) data structures.
🧠 Future Work
- Extend scraping to additional AEP datasets (energy pricing, CO₂ emissions).
- Build an interactive dashboard using Streamlit or Power BI.
- Automate periodic updates (monthly/quarterly).
- Add country-level time-series visualization modules.