From Text to Map: A Reproducible Geocoding Pipeline for Ottoman Studies

This paper presents a computational pipeline for working with spatial data in Ottoman Turkish, from extracting place names via named-entity recognition (NER) to geocoding and, finally, mapping toponyms.

Introduction

With the “spatial turn,” Geographic Information Systems (GIS) have been increasingly used in the Digital Humanities, opening new directions for the study of historical texts. Previous studies (e.g., Emiralioğlu 2019) have underscored the depth of Ottoman geographic knowledge. In this context, using GIS to investigate the spatial limits of that knowledge computationally can yield new insights and corroborate qualitative findings with quantitative evidence. Although GIS has been applied to Ottoman Turkish texts before (Ma 2021, Yaycıoğlu et al. 2022), this paper contributes a reproducible pipeline for analysing such texts.

By presenting an end-to-end automated pipeline that extracts entities, geocodes them, and maps them, this paper addresses the question of how to obtain the raw data needed for mapping in the first place. The scripts used in this study are shared openly so that readers can apply the same pipeline to their own data.
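
To make the three stages concrete, the following is a minimal sketch of how extraction, geocoding, and map export can be chained. It is not the code released with this paper: the names Place, run_pipeline, to_geojson, toy_extract, and toy_gazetteer are hypothetical, the extractor is a naive stand-in for an NER model, and the single gazetteer entry uses approximate coordinates for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple

Coords = Tuple[float, float]  # (latitude, longitude)


@dataclass
class Place:
    """A toponym found in the text, with coordinates if geocoding succeeded."""
    name: str
    lat: Optional[float] = None
    lon: Optional[float] = None


def run_pipeline(
    text: str,
    extract: Callable[[str], Iterable[str]],
    geocode: Callable[[str], Optional[Coords]],
) -> List[Place]:
    """Stage 1 (extract) and stage 2 (geocode); unresolved names are kept."""
    places = []
    for name in extract(text):
        coords = geocode(name)
        if coords is None:
            places.append(Place(name))  # keep for manual review
        else:
            places.append(Place(name, coords[0], coords[1]))
    return places


def to_geojson(places: Iterable[Place]) -> dict:
    """Stage 3: export resolved places as a GeoJSON FeatureCollection."""
    features = [
        {
            "type": "Feature",
            # GeoJSON orders coordinates as [longitude, latitude].
            "geometry": {"type": "Point", "coordinates": [p.lon, p.lat]},
            "properties": {"name": p.name},
        }
        for p in places
        if p.lat is not None and p.lon is not None
    ]
    return {"type": "FeatureCollection", "features": features}


# Hypothetical stand-ins for the real stages (assumptions, not the paper's tools).
def toy_extract(text: str) -> List[str]:
    # Naive placeholder for NER: treat capitalised tokens as candidate toponyms.
    return [w for w in text.split() if w[:1].isupper()]


toy_gazetteer = {"İstanbul": (41.01, 28.98)}  # approximate, illustration only

places = run_pipeline("İstanbul ve Edirne", toy_extract, toy_gazetteer.get)
collection = to_geojson(places)
unresolved = [p.name for p in places if p.lat is None]
```

Passing the stages in as functions keeps the sketch independent of any particular NER model or gazetteer, and retaining unresolved names makes it easy to see which toponyms still need manual geocoding.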

Data and Resources

Although Ottoman Turkish is no longer a living language, a substantial amount of data is openly available online. However, the degree of standardization across these data sources varies considerably. Ottoman Turkish was originally written in the Perso-Arabic script, but texts are often transliterated into the Latin alphabet, and in practice no single standard is consistently applied: some texts mark only long vowels and render consonants with the modern Turkish alphabet, whereas others follow the IJMES transliteration chart. The IJMES system minimizes information loss when mapping the Ottoman Turkish script onto Latin characters. In this study, I therefore use texts transliterated according to the IJMES standard. Since the NER model discussed below was also trained on texts in this format, input in other transliteration schemes may not yield reliable results.
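
Because the transliteration scheme of an input text affects downstream results, a quick check of the input can be useful. The sketch below is a rough heuristic, not part of the pipeline described in this paper: it measures how often a text uses diacritics that are typical of scholarly transliteration (macron, dot below, macron below, breve below, and the signs ʿ and ʾ) but absent from the modern Turkish alphabet. The function name scholarly_mark_ratio and the choice of marks are assumptions made for illustration, and a non-zero ratio does not by itself prove that a text follows the IJMES chart.

```python
import unicodedata

# Assumption: combining marks taken as signs of a scholarly transliteration.
# Marks used by modern Turkish letters (cedilla, breve above, diaeresis,
# dot above) are deliberately left out.
SCHOLARLY_MARKS = {
    "\u0304",  # combining macron, as in ā, ī, ū
    "\u0323",  # combining dot below, as in ḥ, ṣ, ṭ
    "\u0331",  # combining macron below, as in ẕ
    "\u032e",  # combining breve below, as in ḫ
}
SCHOLARLY_SIGNS = {"\u02bf", "\u02be"}  # ʿ and ʾ


def scholarly_mark_ratio(text: str) -> float:
    """Return scholarly marks per alphabetic character (0.0 for empty input)."""
    # NFD splits precomposed letters into base letter + combining marks,
    # so "ā" and "a" + U+0304 are counted the same way.
    decomposed = unicodedata.normalize("NFD", text)
    letters = sum(1 for ch in decomposed if ch.isalpha())
    if letters == 0:
        return 0.0
    marks = sum(
        1 for ch in decomposed if ch in SCHOLARLY_MARKS or ch in SCHOLARLY_SIGNS
    )
    return marks / letters


modern_ratio = scholarly_mark_ratio("İstanbul şehri")  # modern Turkish spelling
scholarly_ratio = scholarly_mark_ratio("ʿOs̱mān")      # scholarly spelling
```

A ratio of zero over a long text suggests that the input uses modern Turkish orthography or a simplified scheme, in which case the caveat above about reliability applies.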

Readers looking for data can consult the Latin-transliterated manuscripts held by the Presidency of the Manuscripts Institution of Turkey, the largest source of Latin-transliterated Ottoman Turkish texts. Another source is the DUDU treebank, the largest annotated Latin-transcribed Ottoman Turkish treebank, with 1,782 sentences and 17,125 words (Yılandiloğlu and Siewert 2025).
