Why I Built a Pure Python Library for Legacy Office Files (And Why RAG Pipelines Need One)
If you’re building RAG pipelines or document ingestion for LLM agents, you’ve probably solved the easy part already. Modern Office files? No problem. python-docx, openpyxl, python-pptx — pick your library, extract your text, move on.
Then someone points your pipeline at an enterprise SharePoint.
The Legacy File Problem
Enterprise SharePoints are digital archaeology sites. Marketing uploaded PowerPoints in 2008. Legal has Word documents from 2005. Finance runs on Excel files that predate most of your team’s careers.
These aren’t edge cases. In my experience, legacy .doc, .xls, and .ppt files make up a significant chunk of any long-running enterprise document store. And if you’re building a system that needs to ingest "all the documents," you can’t just skip them.
Why Existing Solutions Didn’t Work for Me
I needed to process these files in AWS Lambda functions for a RAG pipeline. My options were:
LibreOffice
The standard answer. Install LibreOffice, run it headless, convert files to text. It works, but it adds over 1GB to your container image. Lambda caps unzipped deployment packages at 250MB (container images can go up to 10GB, but shipping a gigabyte of office suite is still painful). Plus, configuring headless LibreOffice is its own adventure.
Apache Tika
Solid tool, widely used. But it requires a Java runtime and typically runs as a separate server. That’s another service to deploy, monitor, and secure. For a document extraction step in a pipeline, it felt like overkill.
Subprocess calls to command-line tools
Various tools exist that you can shell out to. But subprocess calls are a security concern, they break in restricted environments, and they make your code platform-dependent.
I wanted something simpler: a Python library I could pip install and call.
Building sharepoint-to-text
So I built sharepoint-to-text.
The core idea: parse both legacy Office binary formats (OLE2) and modern XML-based formats (OOXML) directly in Python. No external dependencies. No subprocess calls. Just text extraction.
import sharepoint2text
# Works the same for legacy or modern files
result = sharepoint2text.read_file("ancient_report.doc")
result = sharepoint2text.read_file("modern_report.docx")
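To make "parsing OOXML directly in Python" concrete: a .docx file is just a ZIP archive whose body text lives in word/document.xml, so the modern formats are readable with nothing but the standard library. This is an illustrative sketch of the idea, not the library's actual implementation (docx_text is a hypothetical helper):

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by every .docx body element
WORD_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def docx_text(path: str) -> str:
    """Extract paragraph text from a .docx using only the stdlib."""
    with zipfile.ZipFile(path) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(f"{WORD_NS}p"):
        # Each paragraph's visible text is split across <w:t> run nodes
        runs = [node.text or "" for node in para.iter(f"{WORD_NS}t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

The legacy OLE2 formats need far more work than this (they're binary record streams, not XML), which is exactly the part most libraries skip.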
It handles:
- Legacy formats: .doc, .xls, .ppt
- Modern formats: .docx, .xlsx, .pptx
- Plus .pdf and plain text
One interface, no conditional logic, no format detection boilerplate in your code.
Why This Matters for RAG and LLM Agents
If you’re building document ingestion for RAG, you’re probably dealing with heterogeneous input. Users upload files. Pipelines crawl document stores. You can’t control what formats show up.
The typical approach is a cascade of if-statements and multiple libraries:
# The ugly version
if path.endswith('.docx'):
    text = extract_with_python_docx(path)
elif path.endswith('.doc'):
    text = extract_with_libreoffice(path)  # hope it's installed
elif path.endswith('.xlsx'):
    text = extract_with_openpyxl(path)
# ... and so on
With sharepoint-to-text, it’s just:
import sharepoint2text
result = sharepoint2text.read_file(path)
The library figures out the format and handles it appropriately.
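How might that detection work? The legacy formats all share the OLE2 compound-file container, and the modern ones are ZIP archives, so a few leading bytes are enough to tell the families apart regardless of the file extension. A minimal sketch (sniff_container is a hypothetical function; the library's actual detection logic may differ):

```python
# Well-known magic numbers for the container formats involved
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # legacy .doc/.xls/.ppt
ZIP_MAGIC = b"PK\x03\x04"                          # .docx/.xlsx/.pptx
PDF_MAGIC = b"%PDF"

def sniff_container(path: str) -> str:
    """Classify a file by its leading bytes rather than its extension."""
    with open(path, "rb") as fh:
        head = fh.read(8)
    if head.startswith(OLE2_MAGIC):
        return "ole2"
    if head.startswith(ZIP_MAGIC):
        return "ooxml-zip"
    if head.startswith(PDF_MAGIC):
        return "pdf"
    return "plain-text"
```

Sniffing content instead of trusting extensions matters in practice: enterprise document stores are full of .doc files that are secretly .docx (or vice versa) after careless renames.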
Deployment Benefits
Because it’s pure Python with no system dependencies:
- Container images stay small — no LibreOffice bloat
- Serverless-friendly — works in Lambda, Cloud Functions, Azure Functions
- No security concerns — no subprocess calls, no shell execution
- Cross-platform — Windows, macOS, Linux, whatever
When You Might Need This
- You’re building RAG pipelines against enterprise document stores
- Your LLM agent needs to process user-uploaded files of unknown vintage
- You’re deploying to serverless with size constraints
- Your security team doesn’t allow subprocess execution
- You’re tired of maintaining LibreOffice in containers
Try It Out
pip install sharepoint-to-text
GitHub: https://github.com/Horsmann/sharepoint-to-text
I’d appreciate feedback, especially if you hit edge cases with specific file types. Legacy Office formats are notoriously inconsistent, and real-world files are the best test suite.