On Monday last week, a paper was published announcing the OpenBuildingMap (OBM) dataset. With varying degrees of coverage, it contains building footprints, heights, 8 categories of usage and floorspace for 2.7B buildings across the globe. Below is a heatmap of its building footprints.
Existing data from OpenStreetMap (OSM), Google's Open Buildings, Microsoft Global ML Building Footprints and the Global Human Settlement Characteristics Layer were used to build this dataset. OBM's paper states the authors didn't use the CLSM dataset for East Asia due to uncertain licensing regarding the imagery that dataset was built from.
Efforts were made to clean up the data, including removing buildings located over water bodies and excluding height information from any building claiming to be taller than the world's tallest tower, the Burj Khalifa in Dubai.
OBM is broken up into 1,270 BZip2-compressed GeoPackage (GPKG) files. I downloaded these and converted them into ZStandard-compressed, spatially-sorted Parquet files. These are now being hosted on AWS S3 thanks to the kind generosity of Source Cooperative and Taylor Geospatial Engine.
GPKG files are great if you're going to edit the data but for analysis, surgical downloading and archiving, it's really hard to beat Parquet. Parquet stores data as columns and has statistics for groups of rows in each column, often every 10-100K rows. This means some analytical queries only need a few MBs of bandwidth to analyse GBs or TBs of data stored remotely on AWS S3 or HTTPS-based CDNs like Cloudflare's.
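As a rough sketch of what that looks like in DuckDB (the bounding box below is my own approximation of central Tallinn, not a value from the dataset), a filter on the bbox column lets DuckDB skip every row group whose statistics fall outside of it:

```sql
-- Count buildings in a small bounding box. Only the row groups whose
-- bbox statistics overlap the filter need to be fetched from S3; the
-- rest of the 194 GB stays untouched.
SELECT COUNT(*)
FROM   's3://us-west-2.opendata.source.coop/tge-labs/openbuildingmap/*.parquet'
WHERE  bbox.xmin BETWEEN 24.73 AND 24.76
AND    bbox.ymin BETWEEN 59.43 AND 59.45;
```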
GPKG files are based on SQLite and are row-oriented so compressing them rarely reduces the file size by more than a 5:1 ratio. With columnar formats like Parquet, 15:1 reductions are the norm. Decompressing the BZip2 files in this post maxed out around 10MB/s on my system. ZStandard runs circles around this.
In this post, I'll walk through converting this dataset into Parquet and examining some of its features.
My Workstation
I'm using a 5.7 GHz AMD Ryzen 9 9950X CPU. It has 16 cores and 32 threads and 1.2 MB of L1, 16 MB of L2 and 64 MB of L3 cache. It has a liquid cooler attached and is housed in a spacious, full-sized Cooler Master HAF 700 computer case.
The system has 96 GB of DDR5 RAM clocked at 4,800 MT/s and a 5th-generation, Crucial T700 4 TB NVMe M.2 SSD which can read at speeds up to 12,400 MB/s. There is a heatsink on the SSD to help keep its temperature down. This is my system's C drive.
The system is powered by a 1,200-watt, fully modular Corsair Power Supply and is sat on an ASRock X870E Nova 90 Motherboard.
I'm running Ubuntu 24 LTS via Microsoft's Ubuntu for Windows on Windows 11 Pro. In case you're wondering why I don't run a Linux-based desktop as my primary work environment, I'm still using an Nvidia GTX 1080 GPU which has better driver support on Windows, and ArcGIS Pro only supports Windows natively.
Installing Prerequisites
I'll use GDAL 3.9.3 and a few other tools to help analyse the data in this post.
$ sudo add-apt-repository ppa:ubuntugis/ubuntugis-unstable
$ sudo apt update
$ sudo apt install \
    gdal-bin \
    jq
I'll use DuckDB, along with its H3, JSON, Lindel, Parquet and Spatial extensions in this post.
$ cd ~
$ wget -c https://github.com/duckdb/duckdb/releases/download/v1.4.1/duckdb_cli-linux-amd64.zip
$ unzip -j duckdb_cli-linux-amd64.zip
$ chmod +x duckdb
$ ~/duckdb
INSTALL h3 FROM community;
INSTALL lindel FROM community;
INSTALL json;
INSTALL parquet;
INSTALL spatial;
I'll set up DuckDB to load every installed extension each time it launches.
$ vi ~/.duckdbrc
.timer on
.width 180
LOAD h3;
LOAD lindel;
LOAD json;
LOAD parquet;
LOAD spatial;
The maps in this post were rendered with QGIS version 3.44. QGIS is a desktop application that runs on Windows, macOS and Linux. It has grown in popularity in recent years and sees ~15M application launches from users all around the world each month.
The dark heatmaps in this post are mostly made up of vector data from Natural Earth and Overture.
Downloading the Buildings
In QGIS, click the "Plugins" Menu and then the "Manage and Install Plugins" item. Click the "All" filter in the top left of the dialog and then search for "QuickMapServices". Click to install or upgrade the plugin in the bottom right of the dialog.
Click the "Web" Menu, then "QuickMapServices" and then the "Settings" item. Click the "More Services" tab at the top of the dialog. Click the "Get contributed pack" button.
In the "Web" menu under "QuickMapServices" you should now see a list of several basemap providers.
Select "OSM" and then "OSM Standard" to add a world map to your scene.
Under the "Plugins" menu, select "Python Console". Paste in the following line of Python. It'll ensure you've got the latest version of DuckDB installed in QGIS' Python Environment.
import pip; pip.main(['install', '--upgrade', 'duckdb'])
In QGIS, click the "Plugins" Menu and then the "Manage and Install Plugins" item. Click the "All" filter in the top left of the dialog and then search for "GeoParquet Downloader". Click to install or upgrade the plugin in the bottom right of the dialog.
Zoom into a city of interest somewhere in the world. It's important to make sure you're not looking at an area larger than a major city as the viewport will set the boundaries for how much data will be downloaded.
If you can see the whole earth, GeoParquet Downloader will end up downloading 194 GB of data. Downloading a city's worth of data will likely only need a few MB.
In your toolbar, click on the GeoParquet Downloader icon. It's the one with blue, rotated rectangles.
Click the "Custom URL" option and paste the following URL in.
s3://us-west-2.opendata.source.coop/tge-labs/openbuildingmap/*.parquet
Hit "OK". You'll be prompted to save a Parquet file onto your computer. Once you've done so, the building data for your map's footprint should appear shortly afterward.
Once the buildings have loaded, select the layer styling for the downloaded buildings layer. The combo box at the top of the styling panel should be switched from "Single Symbol" to "2.5D".
I've set the roof colour to #b13327, wall colour to #4f090b, angle to 230 degrees and added a drop shadow effect to the layer.
Change the "Height" field to the following expression:
if(string_to_array(height, ':')[0] = 'H', string_to_array(height, ':')[1], 1) / 50000
You should now see a 2.5D representation of the buildings you downloaded.
There is an "Identify Features" icon in the toolbar.
This tool lets you select any building and bring up its metadata.
The Parquet file you saved can also be read via DuckDB and even dropped into Esri's ArcGIS Pro 3.5 or newer if you have it installed on your machine.
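As a quick sketch of reading it back with DuckDB (buildings.parquet is a hypothetical name; use whatever filename you picked in the save dialog), a group-by on the occupancy column gives a feel for what you downloaded:

```sql
-- Tally the downloaded buildings by occupancy classification.
SELECT   occupancy,
         COUNT(*) AS num_buildings
FROM     'buildings.parquet'
GROUP BY 1
ORDER BY 2 DESC;
```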
Downloading OpenBuildingMap
Below I'll show how I turned the GPKG files into Parquet. These steps don't need to be repeated as the resulting files are now on S3.
OBM has a website where you can see the footprints of their GPKG files and download them individually. I found a GeoJSON-formatted manifest file used by the site. It contains the URLs for all of their BZ2-compressed GPKG files.
$ mkdir -p ~/obm
$ cd ~/obm
$ curl -S 'https://umap.openstreetmap.de/de/datalayer/81684/ac9788a9-c215-417d-a580-cb545acd78b9/' \
    > manifest.json
Below is an example record from this manifest.
$ echo "SELECT a.properties::JSON
        FROM (SELECT UNNEST(features) a
              FROM   READ_JSON('manifest.json'))
        LIMIT 1" \
    | ~/duckdb -json \
    | jq -S .
[
  {
    "CAST(a.properties AS \"JSON\")": {
      "URL": "https://datapub.gfz.de/download/10.5880.GFZ.LKUT.2025.002-Caweb/2025-002_Oostwegel-et-al_data/building.002202.gpkg.bz2",
      "completeness_index": 92.93,
      "fid": 1,
      "filename": "building.002202.gpkg",
      "license": "ODbL",
      "number_of_buildings": 127,
      "percentage_known_floorspace": 0.0,
      "percentage_known_height": 3.94,
      "percentage_known_occupancy": 47.24,
      "percentage_source_google": 0.0,
      "percentage_source_microsoft": 0.0,
      "percentage_source_openstreetmap": 100.0,
      "quadkey": "002202"
    }
  }
]
I'll build a list of URLs and use wget to download them with a concurrency of four.
$ echo "SELECT a.properties.URL
        FROM (SELECT UNNEST(features) a
              FROM   READ_JSON('manifest.json'))" \
    | ~/duckdb -csv \
               -noheader \
    > manifest.txt

$ cat manifest.txt \
    | xargs -P4 \
            -I% \
            wget -c "%"
GPKG to Parquet
The following converted 278 GB worth of BZip2 files containing ~722 GB worth of GPKG data into 194 GB of Parquet containing 2,693,211,269 records. Even with 24 concurrent processes running, it took the better part of a day to complete.
$ ls building.*.gpkg.bz2 \
    | xargs -P24 \
            -I% \
            bash -c "
    BASENAME=\`echo \"%\" | cut -f2 -d.\`
    PQ_FILE=\"building.\$BASENAME.parquet\"

    if [ ! -f \$PQ_FILE ]; then
        echo \"Building \$PQ_FILE.\"
        bzip2 -dk \"%\" --stdout > \"working.\$BASENAME.gpkg\"

        echo \"COPY (
                  SELECT   * EXCLUDE (geom,
                                      source_id),
                           geometry: geom,
                           bbox:     {'xmin': ST_XMIN(ST_EXTENT(geom)),
                                      'ymin': ST_YMIN(ST_EXTENT(geom)),
                                      'xmax': ST_XMAX(ST_EXTENT(geom)),
                                      'ymax': ST_YMAX(ST_EXTENT(geom))},
                           source:   CASE WHEN source_id = 0 THEN 'OSM'
                                          WHEN source_id = 1 THEN 'Google'
                                          WHEN source_id = 2 THEN 'Microsoft'
                                     END
                  FROM     ST_READ('working.\$BASENAME.gpkg')
                  ORDER BY HILBERT_ENCODE([
                               ST_Y(ST_CENTROID(geom)),
                               ST_X(ST_CENTROID(geom))]::double[2])
              ) TO '\$PQ_FILE' (
                  FORMAT            'PARQUET',
                  CODEC             'ZSTD',
                  COMPRESSION_LEVEL 22,
                  ROW_GROUP_SIZE    15000);\" \
            | ~/duckdb

        rm \"working.\$BASENAME.gpkg\"
    else
        echo \"\$PQ_FILE already exists, skipping..\"
    fi"
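As a sanity check once the conversion finishes, a count across the resulting files should match the 2,693,211,269-record figure above:

```sql
-- Total record count across every converted Parquet file.
SELECT COUNT(*) AS num_recs
FROM   'building.*.parquet';
```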
Below is what my system's CPU usage looked like while the above was running.
Data Fluency
The following is an example record from this dataset.
$ echo "SELECT * EXCLUDE(bbox),
               bbox::JSON AS bbox
        FROM   'building.*.parquet'
        WHERE  floorspace IS NOT NULL
        LIMIT  1" \
    | ~/duckdb -json \
    | jq -S .
[
  {
    "bbox": {
      "xmax": -160.03150641447675,
      "xmin": -160.03228425509127,
      "ymax": 70.63789626901166,
      "ymin": 70.63765269090544
    },
    "floorspace": "719.5123100951314",
    "geometry": "POLYGON ((-160.03228425509127 70.63772552964403, -160.03180145746848 70.63789626901166, -160.03150641447675 70.63781446163671, -160.0315654230751 70.6377735579492, -160.0318229151406 70.63767574113916, -160.0319946603366 70.63765269090544, -160.03228425509127 70.63772552964403))",
    "height": "H:2",
    "id": 500236864,
    "last_update": "2024-09-25 17:23:12.186+03",
    "occupancy": "RES3",
    "quadkey": "002213322233003100",
    "relation_id": null,
    "source": "OSM"
  }
]
Below are the field names, data types, percentages of NULLs per column, number of unique values and minimum and maximum values for each column.
$ ~/duckdb
SELECT   column_name,
         column_type,
         null_percentage,
         approx_unique,
         min,
         max
FROM     (SUMMARIZE
          FROM 'building.*.parquet')
WHERE    column_name != 'geometry'
AND      column_name != 'bbox'
ORDER BY 1;
┌─────────────┬──────────────────────────┬─────────────────┬───────────────┬────────────────────────────┬───────────────────────────┐
│ column_name │       column_type        │ null_percentage │ approx_unique │            min             │            max            │
│   varchar   │         varchar          │  decimal(9,2)   │     int64     │          varchar           │          varchar          │
├─────────────┼──────────────────────────┼─────────────────┼───────────────┼────────────────────────────┼───────────────────────────┤
│ floorspace  │ VARCHAR                  │           98.81 │      31981048 │ -4.910890285223722         │ 9999.999575164324         │
│ height      │ VARCHAR                  │           24.78 │         31689 │ H:1                        │ HHT:99.95                 │
│ id          │ BIGINT                   │            0.00 │    1985306599 │ -17787084                  │ 2304182979                │
│ last_update │ TIMESTAMP WITH TIME ZONE │            0.00 │      18555352 │ 2024-09-25 17:23:12.185+03 │ 2024-10-09 15:08:17.53+03 │
│ occupancy   │ VARCHAR                  │            0.00 │            42 │ AGR                        │ UNK                       │
│ quadkey     │ VARCHAR                  │            0.00 │     354530921 │ 002202022321313300         │ 331301200330323102        │
│ relation_id │ VARCHAR                  │           99.99 │         57371 │ -10000436.0                │ 998911012.0               │
│ source      │ VARCHAR                  │            0.00 │             3 │ Google                     │ OSM                       │
└─────────────┴──────────────────────────┴─────────────────┴───────────────┴────────────────────────────┴───────────────────────────┘
I'll generate a heatmap of the building footprints in this dataset.
$ ~/duckdb obm.duckdb
CREATE OR REPLACE TABLE h3_4_stats AS
    SELECT   H3_LATLNG_TO_CELL(bbox.ymin,
                               bbox.xmin,
                               4) AS h3_4,
             COUNT(*) num_buildings
    FROM     'building.*.parquet'
    GROUP BY 1;

COPY (
    SELECT ST_ASWKB(H3_CELL_TO_BOUNDARY_WKT(h3_4)::geometry) geometry,
           num_buildings
    FROM   h3_4_stats
    WHERE  ST_XMIN(geometry::geometry) BETWEEN -179 AND 179
    AND    ST_XMAX(geometry::geometry) BETWEEN -179 AND 179
) TO 'h3_4_stats.parquet' (
    FORMAT 'PARQUET',
    CODEC  'ZSTD',
    COMPRESSION_LEVEL 22,
    ROW_GROUP_SIZE 15000);
There are very few Chinese buildings in this dataset.
Below is the TUM dataset, which used satellite imagery of China to detect as many building footprints as possible. The heatmap shows that China's south east has a building density on par with most of its neighbouring countries.
Building Uses
There are eight different building use classifications in this dataset. The uses and their acronyms are residential (RES), commercial and public (COM), mixed use (MIX), industrial (IND), agricultural (AGR), assembly (ASS), government (GOV) and education (EDU). UNK stands for unknown and accounts for ~60% of the buildings in this dataset.
After the 3-letter building use acronym there can be suffixes that provide more detail, such as RES2, which means the residential building is an apartment building. The "Building occupancy type" section of the paper goes into more detail.
Below are the building counts by use and suffix.
WITH a AS (
    SELECT   num_recs: COUNT(*),
             use:      occupancy[:3],
             use_ext:  occupancy[4:]
    FROM     'building.*.parquet'
    GROUP BY 2, 3
    ORDER BY 2, 3
)
PIVOT    a
ON       use
USING    SUM(num_recs)
GROUP BY use_ext
ORDER BY TRY_CAST(REGEXP_REPLACE(use_ext, '[A-Z]', '') AS INT);
┌─────────┬──────────┬─────────┬─────────┬─────────┬─────────┬──────────┬─────────┬───────────┬────────────┐
│ use_ext │   AGR    │   ASS   │   COM   │   EDU   │   GOV   │   IND    │   MIX   │    RES    │    UNK     │
│ varchar │  int128  │ int128  │ int128  │ int128  │ int128  │  int128  │ int128  │  int128   │   int128   │
├─────────┼──────────┼─────────┼─────────┼─────────┼─────────┼──────────┼─────────┼───────────┼────────────┤
│ 1       │   785006 │ 1982076 │ 4053587 │   93893 │  516140 │   934295 │ 4072344 │  64826183 │       NULL │
│ 2A      │     NULL │    NULL │    NULL │    NULL │    NULL │     NULL │    NULL │    120885 │       NULL │
│ 2       │   189285 │  182617 │  161023 │ 5513273 │  190597 │    15282 │    9524 │   8874805 │       NULL │
│ 3       │  1199083 │   22159 │  422796 │ 1813089 │    NULL │     NULL │    NULL │  16615755 │       NULL │
│ 4       │     NULL │  106813 │ 1774918 │    8714 │    NULL │     NULL │ 1833277 │     32207 │       NULL │
│ 5       │     NULL │    NULL │  643386 │    NULL │    NULL │     NULL │  724050 │      NULL │       NULL │
│ 6       │     NULL │    NULL │  200419 │    NULL │    NULL │     NULL │    NULL │      NULL │       NULL │
│ 7       │     NULL │    NULL │ 1641089 │    NULL │    NULL │     NULL │    NULL │      NULL │       NULL │
│ 8       │     NULL │    NULL │   18887 │    NULL │    NULL │     NULL │    NULL │      NULL │       NULL │
│ 9       │     NULL │    NULL │   99485 │    NULL │    NULL │     NULL │    NULL │      NULL │       NULL │
│ 10      │     NULL │    NULL │  122578 │    NULL │    NULL │     NULL │    NULL │      NULL │       NULL │
│ 11      │     NULL │    NULL │ 1662269 │    NULL │    NULL │     NULL │    NULL │      NULL │       NULL │
│         │ 38664645 │     222 │ 4062284 │   53091 │ 2786516 │ 22215578 │    NULL │ 861941989 │ 1642025155 │
├─────────┴──────────┴─────────┴─────────┴─────────┴─────────┴──────────┴─────────┴───────────┴────────────┤
│ 13 rows                                                                                       10 columns │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────┘
Below are the building use categories for each building in Tallinn's Old Town.
Building Sources
Google is the source for almost 60% of the building footprints. OSM accounts for 23% and Microsoft for 18%.
SELECT   COUNT(*),
         source
FROM     'building.*.parquet'
GROUP BY 2
ORDER BY 1 DESC;
┌──────────────┬───────────┐
│ count_star() │  source   │
│    int64     │  varchar  │
├──────────────┼───────────┤
│   1597737974 │ Google    │
│    612804756 │ OSM       │
│    482668539 │ Microsoft │
└──────────────┴───────────┘
OSM had 657M buildings in its dataset a few weeks ago, so it's interesting to see that ~45M of its buildings haven't made it into OBM.
Below I've highlighted the most common source for each H3 hexagon at zoom level 5.
CREATE OR REPLACE TABLE h3_5s AS
    WITH b AS (
        WITH a AS (
            SELECT   H3_LATLNG_TO_CELL(bbox.ymin,
                                       bbox.xmin,
                                       5) h3_5,
                     source,
                     COUNT(*) num_recs
            FROM     'building.*.parquet'
            GROUP BY 1, 2
        )
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY h3_5
                                  ORDER BY     num_recs DESC) AS rn
        FROM   a
    )
    FROM     b
    WHERE    rn = 1
    ORDER BY num_recs DESC;

COPY (
    SELECT geom: H3_CELL_TO_BOUNDARY_WKT(h3_5)::GEOMETRY,
           source
    FROM   h3_5s
    WHERE  ST_XMIN(geom::GEOMETRY) BETWEEN -179 AND 179
    AND    ST_XMAX(geom::GEOMETRY) BETWEEN -179 AND 179
) TO 'obm.top_sources.parquet' (
    FORMAT 'PARQUET',
    CODEC  'ZSTD',
    COMPRESSION_LEVEL 22,
    ROW_GROUP_SIZE 15000);
Microsoft is red, Google is green and OSM is yellow.
For spatial context, you'd only need a dozen or so of these zoom level 5 hexagons to cover Dubai's urban areas.
Building Heights
Building height and/or the number of stories are contained within the height field. In some cases, ranges are used instead of absolute values. The field can also express how many stories below ground a building extends. The "Building height" section in the paper goes into detail.
Below are the building counts by the initial height prefix and whether the multiple-measurements delimiter is present.
SELECT   prefix:        SPLIT(height, ':')[1],
         multi:         height LIKE '%+%',
         num_buildings: COUNT(*)
FROM     'building.*.parquet'
GROUP BY 1, 2
ORDER BY 3 DESC;
┌─────────┬─────────┬───────────────┐
│ prefix  │  multi  │ num_buildings │
│ varchar │ boolean │     int64     │
├─────────┼─────────┼───────────────┤
│ HBET    │ false   │    1879363563 │
│ NULL    │ NULL    │     667500586 │
│ HHT     │ false   │     114324028 │
│ H       │ false   │      27654116 │
│ H       │ true    │       4360694 │
│ HBEX    │ false   │          8036 │
│ HBEX    │ true    │           246 │
└─────────┴─────────┴───────────────┘
~25% of the buildings in this dataset don't have any height data.
Below I've colour-coded the buildings in Tallinn's Old Town by their height prefix. The "H" records in red contain the number of stories, the "HBET" records in orange contain a range of stories, the "HHT" records in light green contain the height in meters and the blue buildings have no height information.
Below are examples of combination values, delimited by a plus sign.
.maxrows 10
SELECT   height,
         COUNT(*)
FROM     'building.*.parquet'
WHERE    height LIKE '%+%'
GROUP BY 1
ORDER BY 2 DESC;
┌───────────────────────┬──────────────┐
│        height         │ count_star() │
│        varchar        │    int64     │
├───────────────────────┼──────────────┤
│ H:1+HHT:3.50          │       903295 │
│ H:2+HHT:7.00          │       395789 │
│ H:1+HHT:3.00          │       261288 │
│ H:2+HHT:6.00          │       136564 │
│ H:1+HHT:4.00          │       115217 │
│ H:2+HBEX:1            │        67691 │
│ H:1+HBEX:1            │        59680 │
│ H:1+HHT:5.00          │        51128 │
│ H:1+HHT:2.80          │        50296 │
│ H:2+HHT:8.00          │        48202 │
│ H:1+HHT:3.30          │        48074 │
│ H:1+HHT:3.40          │        47996 │
│ H:1+HHT:3.60          │        47382 │
│ H:3+HBEX:1            │        46986 │
│ H:1+HHT:3.20          │        43824 │
│ H:1+HHT:3.70          │        43422 │
│ H:1+HHT:3.80          │        40676 │
│ H:1+HHT:3.10          │        39174 │
│ H:3+HHT:9.00          │        38437 │
│ H:1+HHT:6.00          │        37171 │
│      ·                │            · │
│      ·                │            · │
│      ·                │            · │
│ H:2+HBEX:1+HHT:7.40   │            1 │
│ H:33+HHT:10.00        │            1 │
│ H:17+HBEX:1+HHT:49.60 │            1 │
│ H:15+HHT:4.80         │            1 │
│ H:15+HHT:71.50        │            1 │
│ H:1+HHT:15.51         │            1 │
│ H:42+HHT:136.55       │            1 │
│ H:60+HHT:239.12       │            1 │
│ H:53+HHT:229.00       │            1 │
│ H:39+HHT:98.00        │            1 │
│ H:8+HHT:2.90          │            1 │
│ H:35+HHT:82.00        │            1 │
│ H:7+HHT:12.20         │            1 │
│ H:4+HBEX:1+HHT:13.50  │            1 │
│ H:25+HHT:96.36        │            1 │
│ H:3+HBEX:1+HHT:10.55  │            1 │
│ H:32+HHT:142.00       │            1 │
│ H:43+HHT:147.70       │            1 │
│ H:11+HHT:35.09        │            1 │
│ H:65+HHT:258.00       │            1 │
├───────────────────────┴──────────────┤
│ 22423 rows (40 shown)      2 columns │
└──────────────────────────────────────┘
It would be nice to have separate columns for height in meters and stories both going above and below ground. I'll have to analyse this column further to figure out how complex that task would be.
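As a first sketch of how that extraction might look (the regular expressions, and my reading of HBEX as below-ground storeys, are my own assumptions, tried only against the example values shown above; the paper's "Building height" section is the authority on the encoding):

```sql
-- Pull storeys and metres out of a few example height values.
-- TRY_CAST turns the empty string from a non-matching regex into NULL.
WITH examples(height) AS (
    VALUES ('H:2+HHT:7.00'),
           ('H:1+HBEX:1'),
           ('H:4+HBEX:1+HHT:13.50')
)
SELECT height,
       TRY_CAST(REGEXP_EXTRACT(height, 'H:(\d+)', 1)       AS INT)    AS storeys_above,
       TRY_CAST(REGEXP_EXTRACT(height, 'HBEX:(\d+)', 1)    AS INT)    AS storeys_below,
       TRY_CAST(REGEXP_EXTRACT(height, 'HHT:([0-9.]+)', 1) AS DOUBLE) AS height_metres
FROM   examples;
```

Range values like the HBET records would need extra handling, so this is far from the whole task.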
Canadian Coverage
I'll download a rough outline of Canada's provinces and territories in GeoJSON format. I'll then convert it into GPKG as GPKG files require less syntax to work with in DuckDB.
$ wget -c https://gist.github.com/Thiago4breu/6ba01976161aa0be65e0a289412dc54c/raw/8ec57d8317a2abe5bae18e5fd86f777fab649f84/canada-provinces.geojson
$ ogr2ogr \
    -f GPKG \
    canada-provinces.gpkg \
    canada-provinces.geojson
$ ~/duckdb
CREATE OR REPLACE TABLE canada AS
    FROM ST_READ('canada-provinces.gpkg');

SELECT    COUNT(*)
FROM      'building.*.parquet' b
LEFT JOIN canada c ON ST_COVEREDBY(b.geometry, c.geom)
WHERE     c.name IS NOT NULL;
The above returned a count of 13,239,095 Canadian buildings in OBM.
For comparison, the ODB dataset I reviewed last month contains 14.4M buildings in Canada, PSC has 13.7M, TUM has 13.4M and the Layercake Project, which produces weekly exports of OSM data into Parquet and hosts them via Cloudflare's CDN, has 7.8M Canadian buildings.
Below, on top of Google's Satellite Map, OBM's footprints are in yellow and Overture's October building footprints are in purple for South East Calgary. There are several areas where OBM has gaps in its coverage.
With that said, cities like Calgary are growing fast and new neighbourhoods can wait a while before they appear in any public datasets.
I spotted a few cases where an OBM building footprint covers two or three properties.
This mall, just north of Calgary's Airport, has OBM building footprints on top of one another.
For reference, Overture's October release contains 13,453,441 buildings for Canada.
SELECT    COUNT(*)
FROM      's3://overturemaps-us-west-2/release/2025-10-22.0/theme=buildings/type=building/part-000*.parquet' b
LEFT JOIN canada c ON ST_COVEREDBY(ST_POINT(bbox.xmin,
                                            bbox.ymin),
                                   c.geom)
WHERE     c.name IS NOT NULL;
Note: Overture has 236 Parquet files for its building footprints dataset. If I scan all of those files, the query above will exhaust my system's RAM capacity. To get around this, the URL glob limits the query to just the first 100 Parquet files. The Parquet files are spatially sorted, so Canada's data should usually live in the lower file numbers.
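If you did want to scan all 236 files, one possible workaround (a sketch, with an arbitrary example limit rather than a recommendation) is to cap DuckDB's memory usage and relax result ordering so it can stream rather than buffer:

```sql
SET memory_limit = '32GB';            -- spill to disk past this point instead of exhausting RAM
SET preserve_insertion_order = false; -- lets DuckDB stream results without buffering them in order
```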
Thank you for taking the time to read this post. I offer both consulting and hands-on development services to clients in North America and Europe. If you'd like to discuss how my offerings can help your business please contact me via LinkedIn.