Optical Character Recognition (OCR) is a powerful AI/ML technology that recognizes and extracts text from images and scanned documents.
Creating a scalable, event-driven web OCR service comes with challenges. This write-up details the problems, lessons, and solutions uncovered while building a FastAPI + Celery + Redis + PaddleOCR OCR service aimed at integration with Paperless-ngx, an open-source document management system.
What I Wanted to Build (and Why)
Our objective was to build an event-driven service that efficiently converts PDFs or images into searchable PDFs with a selectable and searchable text layer. The focus was on:
- Handling PDFs of arbitrary length and complexity.
- Delivering results asynchronously due to CPU-heavy OCR tasks.
- Creating outputs integrable with Paperless-ngx for document archiving and retrieval.
Why a Simple Script Just Isn't Good Enough
OCR workloads demand significant compute power, especially on large or image-heavy PDFs. The process involves:
- OCR inference: Detecting and recognizing text from images - the most CPU intensive part.
- Collating results: Combining recognized text from many pages.
- Embedding a text layer: Creating PDFs with searchable text overlay, crucial for usability.
Making this scalable and responsive requires moving beyond a simple blocking script into an asynchronous, event-driven architecture. Multiprocessing seems like a natural fit at first, but as you'll see below, Celery takes care of the workload distribution and PaddleOCR takes care of performance. Keep reading.
The Architecture (How All the Pieces Fit Together)
Flow:
1. Client uploads PDF → FastAPI returns task ID immediately.
2. FastAPI enqueues the task in the Redis broker.
3. Celery workers pick up tasks and use PaddleOCR (cached per process).
4. Workers store searchable PDFs in file storage.
5. The Redis backend tracks task status.
6. Client polls FastAPI → gets status + download link.
This architecture scales by adding more Celery Workers and handles OCR’s CPU intensity through async processing.
FastAPI ──► Redis Broker ──► Celery Workers ──► PaddleOCR ──► File Storage
   ▲              ▲                  │         (cached models)
   │       Result Backend            │
   └──────────────┴──────────────────┘
How Celery Actually Works in this Setup (And Surprises)
Celery orchestrates asynchronous OCR processing - points 3 and 4 in the flow above - and here it gets very interesting:
1. Orchestrate (orchestrate_pdf_ocr):
  - Takes in the PDF/image input file.
  - Converts the PDF to a list of images (OCR needs images).
  - Decides on the size of the task (files with more than 5 pages get delegated to a chord).
  - Calls process_single_page for each page.
  - Finally calls assemble_final_pdf, which returns the results.
2. Process single page (process_single_page):
  - Creates the engine via ocr_engine = get_ocr_engine() and gets the OCR results.
  - Creates text files with the OCRed text (we need the raw text as well).
  - Creates a single-page PDF file with a selectable and searchable text layer.
  - Returns the page index and file path.
3. Final assembly (assemble_final_pdf):
  - Receives the paths of all single-page PDFs.
  - Collates/merges them all into one resulting PDF.
  - Merges all text files into one.
  - Cleans up temp PDF/text files.
  - Returns the final PDF and text file URLs.
Visual Overview of the Celery Pipeline
+------------------+
Upload PDF → | FastAPI |
+------------------+
|
v
[Redis Message Broker]
|
v
+---------------------------+
| Orchestrator Task |
| (orchestrate_pdf_ocr) |
+---------------------------+
|
+-----------------+-----------------+
| | |
v v v
+-----------+ +-----------+ +---------------+
| Page 0 | | Page 1 | | Page N |
| OCR Task | | OCR Task | | OCR Task |
+-----------+ +-----------+ +---------------+
\ | /
\ | /
+-----------+--------------+
|
v
+-----------------------------------+
| assemble_final_pdf (Callback) |
+-----------------------------------+
|
v
Searchable PDF + merged text file saved
Pitfall: Some might argue - why not pass the OCR results on to the final assembly step (#3 above) and let it build the final PDF and text file? I considered that and found that PaddleOCR results are big nested data structures containing numpy.ndarray objects, which would need custom recursive serialization to pass through Redis.
I briefly experimented with passing lightweight structured results (page text + bounding boxes), but even that ballooned in size on longer PDFs. I concluded serialization was a headache, and creating single-page PDFs appealed to me more for several reasons:
Tiny Payload Size: Instead of serializing huge, complex nested lists of coordinates and text (which stresses Redis/Celery result backend), you just pass a tiny string: "/tmp/page_5_ocr.pdf".
Solves Serialization: The complex OCR data stays in memory, gets written to PDF immediately, and is discarded.
Retries/Checkpoints: If the final assembly task fails, you still have the individual page PDFs on disk. You could technically inspect or re-assemble them manually, and retry only the page that failed.
Assembly: The final assemble_final_pdf task becomes extremely cheap.
Redis: No memory pressure on Redis
Cons:
- Cleanup: Requires careful temp file management
- Shared Volume: If you are on a cluster (K8s/multiple VMs), you need a shared volume.
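For the cleanup con, one simple pattern is best-effort removal in the assembly step. This is a stdlib-only sketch; the helper name and call site are illustrative:

```python
import os

def cleanup_page_files(page_paths: list) -> None:
    """Best-effort removal of per-page temp artifacts after final assembly."""
    for path in page_paths:
        try:
            os.remove(path)
        except FileNotFoundError:
            pass  # already gone (retried task, earlier sweep); nothing to do
```

Calling this from a finally block in assemble_final_pdf keeps temp files from leaking even when merging fails; a periodic sweep of the temp directory can catch orphans left behind by crashed workers.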
PaddleOCR: Model Caching and Threading
⚠️ Lessons learnt: PaddleOCR has a known issue with singleton objects - initializing PaddleOCR once and reusing the same instance will almost certainly fail on subsequent OCR requests. The solution is to cache the model files and re-initialize the engine for every call - a slight overhead. I lost quite a few hairs scratching my head over this 😉
Caching the models still gives a big speedup: rather than fetching the PaddleOCR models again for every engine creation, each process downloads them once into a local cache on disk, and every new engine re-uses them from there.
import logging
import os

from paddleocr import PaddleOCR

logger = logging.getLogger(__name__)

model_cache = os.environ.get("PPDX_HOME", "/app/model-cache")
logger.info(f"✨ Using model cache from: {model_cache}")
def get_ocr_engine() -> PaddleOCR:
return PaddleOCR(
text_recognition_model_dir=f"{model_cache}/.paddlex/official_models/PP-OCRv5_server_rec",
text_detection_model_dir=f"{model_cache}/.paddlex/official_models/PP-OCRv5_server_det",
textline_orientation_model_dir=f"{model_cache}/.paddlex/official_models/PP-LCNet_x1_0_textline_ori",
doc_orientation_classify_model_dir=f"{model_cache}/.paddlex/official_models/PP-LCNet_x1_0_doc_ori",
doc_unwarping_model_dir=f"{model_cache}/.paddlex/official_models/UVDoc",
use_doc_unwarping=False,
use_doc_orientation_classify=False,
use_textline_orientation=False,
)
PaddleOCR itself is inherently multi-threaded via its native C++ inference engine, relying on optimized libraries like MKL and oneDNN. These libraries run on multiple CPU threads internally and bypass Python's GIL, so you can get the most out of the CPU cores available to you using the cpu_threads init option.
Redis: The Glue
Redis acts as both:
- The message broker queuing tasks.
- The result backend tracking task status and storing outputs.
This decouples FastAPI from OCR workers, enabling scalability and fault tolerance.
FastAPI: The Front Door to the OCR Service
FastAPI:
- Accepts PDF uploads and immediately returns a task ID.
- Provides endpoints to poll for task status and download results.
- Delegates heavy processing to the event-driven Celery workers.
What Didn’t Work (and Why)
PaddleOCR singleton failing: A known issue with PaddleOCR - it fails on subsequent OCR calls, most likely because it retains state from previous calls and needs a reset. And a reset carries almost the same overhead as re-creating the object.
Serialization of large numpy structures: Recursive serialization of nested numpy data types was an option but seemed like too much of a headache.
Shared filesystem: This is at the top of the todo list, as it's necessary to make the service horizontally scalable.
What I Learned While Building This
- Keep the Celery payload small. I was surprised how easy it was to create multiple files and re-assemble them across different Celery workers.
- PaddleOCR is good, but it has quirks. Don’t fight the library - work around it.
- Celery chords turned out to be the perfect fit for multi-page PDFs, but it took me a while to get the signatures right.
Final notes / Next Steps
Core Pipeline Improvements
- Retries & Error Handling
- Per-page retries with backoff
- Custom exceptions for OCR failures
- Fail-fast if >N pages fail
- Cleanup orphan files
- Task Timeouts
- Timeout for each OCR task
- Timeout for orchestration/chord
- Deadline propagation
- Progress Reporting
- Track completed_pages / total_pages
- Publish progress to Redis
- FastAPI poll endpoint or SSE/WebSocket
- Distributed Pipeline (in progress)
- Add shared volume or S3/MinIO
- Convert file paths to storage URIs
- Remove reliance on local disk per worker
Building this thing reminded me that OCR isn’t just text extraction - it’s a messy mix of CPU bottlenecks, weird library quirks, and architectural decisions that don’t show up in tutorials.
Turns out, building a ‘simple OCR service’ is anything but simple — but now it’s fast, scalable, and plays nicely with Paperless-ngx.