oxidize-pdf
A pure Rust PDF generation and manipulation library with zero external PDF dependencies. Production-ready for basic PDF functionality. Generate PDFs 2x faster than PDFSharp, with memory safety guarantees and a 5.2MB binary size.
Features
- 100% Pure Rust - No C dependencies or external PDF libraries
- PDF Generation - Create multi-page documents with text, graphics, and images
- PDF Parsing - Read and extract content from existing PDFs (tested on 759 real-world PDFs*)
- Corruption Recovery - Robust error recovery for damaged or malformed PDFs (98.8% success rate)
- PDF Operations - Split, merge, and rotate PDFs while preserving basic content
- Image Support - Embed JPEG and PNG images with automatic compression
- Transparency & Blending - Full alpha channel, SMask, blend modes for watermarking and overlays
- CJK Text Support - Chinese, Japanese, and Korean text rendering and extraction with ToUnicode CMap
- Rich Graphics - Vector graphics with shapes, paths, colors (RGB/CMYK/Gray)
- Advanced Text - Custom TTF/OTF fonts, standard fonts, text flow with automatic wrapping, alignment
- Custom Fonts - Load and embed TrueType/OpenType fonts with full Unicode support
- OCR Support - Extract text from scanned PDFs using Tesseract OCR (v0.1.3+)
- AI/RAG Integration - Document chunking for LLM pipelines with sentence boundaries and metadata (v1.3.0+)
- Compression - Built-in FlateDecode compression for smaller files
- Type Safe - Leverage Rust's type system for safe PDF manipulation
What's New
Latest: v1.3.0 - AI/RAG Integration:
- Document Chunking for LLMs - Production-ready chunking that processes 100 pages in 0.62 ms
- Rich Metadata - Page tracking, position info, confidence scores
- Smart Boundaries - Sentence boundary detection for semantic coherence
- Exceptional Performance - 161x better than target, 160K pages/second throughput
- Complete Examples - RAG pipeline with embeddings and vector store integration
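The throughput figure follows from the timing by simple arithmetic. The sketch below reproduces it; the helper names are illustrative, and the 100 ms target is an assumption inferred from the "161x" figure rather than a number stated in this README.

```rust
/// Convert a batch timing into pages per second.
fn pages_per_second(pages: u32, elapsed_ms: f64) -> f64 {
    f64::from(pages) / (elapsed_ms / 1000.0)
}

/// How many times faster a measured time is than a target time.
/// The target passed in by a caller is an assumption, not a documented value.
fn speedup(target_ms: f64, measured_ms: f64) -> f64 {
    target_ms / measured_ms
}
```

With these helpers, pages_per_second(100, 0.62) is roughly 161,000 pages per second, and speedup(100.0, 0.62) is roughly 161.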
Production-Ready Features (v1.2.3-v1.2.5):
- Corruption Recovery - Comprehensive error recovery system (v1.1.0+, polished in v1.2.3)
  - Automatic XRef table rebuild for broken cross-references
  - Lenient parsing mode with multiple recovery strategies
  - Partial content extraction from damaged files
  - 98.8% success rate on 759 real-world PDFs
- PNG Transparency - Full transparency support (v1.2.3)
  - PNG images with alpha channels
  - SMask (Soft Mask) generation
  - 16 blend modes (Normal, Multiply, Screen, Overlay, etc.)
  - Opacity control and watermarking capabilities
- CJK Text Support - Complete Asian language support (v1.2.3-v1.2.4)
  - Chinese (Simplified & Traditional), Japanese, Korean
  - CMap parsing and ToUnicode generation
  - Type0 fonts with CID mapping
  - UTF-16BE encoding with Adobe-Identity-0
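To make the UTF-16BE item concrete: the Unicode values in a ToUnicode CMap are written as hexadecimal strings of UTF-16BE code units. The following is a minimal std-only sketch of that encoding step. The function name is hypothetical and this is not taken from the library's source.

```rust
/// Encode text as a PDF hexadecimal string of UTF-16BE code units,
/// for example "<0041>" for "A". Characters outside the Basic
/// Multilingual Plane become surrogate pairs (two code units).
fn utf16be_hex_string(text: &str) -> String {
    let mut out = String::with_capacity(text.len() * 4 + 2);
    out.push('<');
    for unit in text.encode_utf16() {
        // Big-endian: high byte first, then low byte.
        for byte in unit.to_be_bytes() {
            out.push_str(&format!("{:02X}", byte));
        }
    }
    out.push('>');
    out
}
```

For instance, the two-character string U+4F60 U+597D encodes to <4F60597D>.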
Major features (v1.1.6+):
- Custom Font Support - Load TTF/OTF fonts from files or memory
- Advanced Text Formatting - Character spacing, word spacing, text rise, rendering modes
- Clipping Paths - Both EvenOdd and NonZero winding rules
- In-Memory Generation - Generate PDFs without file I/O using to_bytes()
- Compression Control - Enable/disable compression with set_compress()
Significant improvements in PDF compatibility:
- Better parsing - Handles circular references, XRef streams, object streams
- Stack overflow protection - Production-ready resilience against malformed PDFs
- Performance - 215+ PDFs/second processing speed
- Error recovery - Multiple fallback strategies for corrupted files
- Lenient parsing - Graceful handling of malformed structures
- Memory optimization - OptimizedPdfReader with LRU cache
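For readers unfamiliar with the term, an LRU (least-recently-used) cache keeps a bounded number of parsed objects and evicts the one that has gone unused the longest. The sketch below is a generic std-only illustration. The LruCache type here is hypothetical and is not the implementation inside OptimizedPdfReader.

```rust
use std::collections::{HashMap, VecDeque};

/// A tiny least-recently-used cache keyed by object number.
struct LruCache {
    capacity: usize,
    entries: HashMap<u32, Vec<u8>>,
    order: VecDeque<u32>, // front = least recently used
}

impl LruCache {
    fn new(capacity: usize) -> Self {
        LruCache { capacity, entries: HashMap::new(), order: VecDeque::new() }
    }

    /// Mark `key` as the most recently used entry.
    fn touch(&mut self, key: u32) {
        self.order.retain(|k| *k != key);
        self.order.push_back(key);
    }

    /// Look up an entry; a hit refreshes its position.
    fn get(&mut self, key: u32) -> Option<&Vec<u8>> {
        if self.entries.contains_key(&key) {
            self.touch(key);
        }
        self.entries.get(&key)
    }

    /// Insert an entry, evicting the least recently used one when full.
    fn put(&mut self, key: u32, value: Vec<u8>) {
        if self.capacity == 0 {
            return;
        }
        if !self.entries.contains_key(&key) && self.entries.len() >= self.capacity {
            if let Some(oldest) = self.order.pop_front() {
                self.entries.remove(&oldest);
            }
        }
        self.entries.insert(key, value);
        self.touch(key);
    }
}
```

The touch step is linear in the number of entries, which keeps the sketch short; a production cache would typically use a linked structure for constant-time updates.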
Note: *Success rates apply only to non-encrypted PDFs with basic features. The library provides basic PDF functionality. See Known Limitations for a transparent assessment of current capabilities and planned features.
Why oxidize-pdf?
Performance & Efficiency
- 2x faster than PDFSharp - Process 215 PDFs/second
- 5.2 MB binary - 3x smaller than PDFSharp, 40x smaller than IronPDF
- Zero dependencies - No runtime, no Chrome, just a single binary
- Low memory usage - Efficient streaming for large PDFs
Safety & Reliability
- Memory safe - Guaranteed by Rust compiler (no null pointers, no buffer overflows)
- Type safe API - Catch errors at compile time
- 3,000+ tests - Comprehensive test suite with real-world PDFs
- Fewer vulnerability classes - Memory safety eliminates entire classes of vulnerabilities, such as buffer overflows
Developer Experience
- Modern API - Designed in 2024, not ported from 2005
- True cross-platform - Single binary runs on Linux, macOS, Windows, ARM
- Easy deployment - One file to ship, no dependencies to manage
- Fast compilation - Incremental builds in seconds
Quick Start
Add oxidize-pdf to your Cargo.toml:
[dependencies]
oxidize-pdf = "1.1.0"
# For OCR support (optional), use this line instead of the one above:
# oxidize-pdf = { version = "1.1.0", features = ["ocr-tesseract"] }
Basic PDF Generation
use oxidize_pdf::{Document, Page, Font, Color, Result};

fn main() -> Result<()> {
    // Create a new document
    let mut doc = Document::new();
    doc.set_title("My First PDF");
    doc.set_author("Rust Developer");

    // Create a page
    let mut page = Page::a4();

    // Add text
    page.text()
        .set_font(Font::Helvetica, 24.0)
        .at(50.0, 700.0)
        .write("Hello, PDF!")?;

    // Add graphics
    page.graphics()
        .set_fill_color(Color::rgb(0.0, 0.5, 1.0))
        .circle(300.0, 400.0, 50.0)
        .fill();

    // Add the page and save
    doc.add_page(page);
    doc.save("hello.pdf")?;
    Ok(())
}
AI/RAG Document Chunking (v1.3.0+)
use oxidize_pdf::ai::DocumentChunker;
use oxidize_pdf::parser::{PdfReader, PdfDocument};
use oxidize_pdf::Result;

fn main() -> Result<()> {
    // Load and parse PDF
    let reader = PdfReader::open("document.pdf")?;
    let pdf_doc = PdfDocument::new(reader);
    let text_pages = pdf_doc.extract_text()?;

    // Prepare page texts with 1-based page numbers
    let page_texts: Vec<(usize, String)> = text_pages
        .iter()
        .enumerate()
        .map(|(idx, page)| (idx + 1, page.text.clone()))
        .collect();

    // Create chunker: 512 tokens per chunk, 50 tokens overlap
    let chunker = DocumentChunker::new(512, 50);
    let chunks = chunker.chunk_text_with_pages(&page_texts)?;

    // Process chunks for RAG pipeline
    for chunk in chunks {
        println!("Chunk {}: {} tokens", chunk.id, chunk.tokens);
        println!("  Pages: {:?}", chunk.page_numbers);
        println!("  Position: chars {}-{}",
            chunk.metadata.position.start_char,
            chunk.metadata.position.end_char);
        println!("  Sentence boundary: {}",
            chunk.metadata.sentence_boundary_respected);

        // Send to embedding API, store in vector DB, etc.
        // let embedding = openai.embed(&chunk.content)?;
        // vector_db.insert(chunk.id, embedding, chunk.content)?;
    }
    Ok(())
}
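The chunker's internals are not shown in this README. As a rough, std-only illustration of what "chunk size, overlap, and sentence boundaries" can mean, the sketch below packs whole sentences into chunks and repeats trailing sentences as overlap. Everything here is an assumption for illustration: the function names are hypothetical, tokens are approximated by whitespace-separated words, and sentence detection is a simple punctuation rule.

```rust
/// Split text into sentences at '.', '!' or '?' followed by whitespace
/// or end of input. The terminating punctuation stays with the sentence.
fn split_sentences(text: &str) -> Vec<&str> {
    let mut sentences = Vec::new();
    let mut start = 0;
    let mut chars = text.char_indices().peekable();
    while let Some((i, c)) = chars.next() {
        let at_end = chars.peek().map_or(true, |&(_, next)| next.is_whitespace());
        if matches!(c, '.' | '!' | '?') && at_end {
            let end = i + c.len_utf8();
            let sentence = text[start..end].trim();
            if !sentence.is_empty() {
                sentences.push(sentence);
            }
            start = end;
        }
    }
    let tail = text[start..].trim();
    if !tail.is_empty() {
        sentences.push(tail);
    }
    sentences
}

/// Approximate a token count by counting whitespace-separated words.
fn token_count(s: &str) -> usize {
    s.split_whitespace().count()
}

/// Pack whole sentences into chunks of up to `max_tokens`. Each new chunk
/// starts by repeating trailing sentences of the previous one, up to
/// `overlap_tokens`. A chunk exceeds the limit only when a single
/// sentence is longer than `max_tokens`.
fn chunk_sentences(text: &str, max_tokens: usize, overlap_tokens: usize) -> Vec<String> {
    let mut chunks = Vec::new();
    let mut current: Vec<&str> = Vec::new();
    let mut current_tokens = 0;
    for sentence in split_sentences(text) {
        let n = token_count(sentence);
        if !current.is_empty() && current_tokens + n > max_tokens {
            chunks.push(current.join(" "));
            // Collect trailing sentences that fit into the overlap budget.
            let mut carried: Vec<&str> = Vec::new();
            let mut carried_tokens = 0;
            for &s in current.iter().rev() {
                let t = token_count(s);
                if carried_tokens + t > overlap_tokens {
                    break;
                }
                carried.push(s);
                carried_tokens += t;
            }
            carried.reverse();
            // Drop the overlap if it would push the new chunk over the limit.
            if carried_tokens + n > max_tokens {
                carried.clear();
                carried_tokens = 0;
            }
            current = carried;
            current_tokens = carried_tokens;
        }
        current.push(sentence);
        current_tokens += n;
    }
    if !current.is_empty() {
        chunks.push(current.join(" "));
    }
    chunks
}
```

Because chunks only ever end at sentence boundaries, no sentence is split across two chunks, which is the property the sentence_boundary_respected flag in the example above appears to report.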
Custom Fonts Example
use oxidize_pdf::{Document, Page, Font, Color, Result};

fn main() -> Result<()> {
    let mut doc = Document::new();
    doc.set_title("Custom Fonts Demo");

    // Load a custom font from file
    doc.add_font("MyFont", "/path/to/font.ttf")?;

    // Or load from bytes
    let font_data = std::fs::read("/path/to/font.otf")?;
    doc.add_font_from_bytes("MyOtherFont", font_data)?;

    let mut page = Page::a4();

    // Use standard font
    page.text()
        .set_font(Font::Helvetica, 14.0)
        .at(50.0, 700.0)
        .write("Standard Font: Helvetica")?;

    // Use custom font
    page.text()
        .set_font(Font::Custom("MyFont".to_string()), 16.0)
        .at(50.0, 650.0)
        .write("Custom Font: This is my custom font!")?;

    // Advanced text formatting with custom font
    page.text()
        .set_font(Font::Custom("MyOtherFont".to_string()), 12.0)
        .set_character_spacing(2.0)
        .set_word_spacing(5.0)
        .at(50.0, 600.0)
        .write("Spaced text with custom font")?;

    doc.add_page(page);
    doc.save("custom_fonts.pdf")?;
    Ok(())
}
Parse Existing PDF
use oxidize_pdf::{PdfReader, Result};

fn main() -> Result<()> {
    // Open and parse a PDF
    let mut reader = PdfReader::open("document.pdf")?;

    // Get document info
    println!("PDF Version: {}", reader.version());
    println!("Page Count: {}", reader.page_count()?);

    // Extract text from all pages
    let document = reader.into_document();
    let text = document.extract_text()?;
    for (page_num, page_text) in text.iter().enumerate() {
        println!("Page {}: {}", page_num + 1, page_text.content);
    }
    Ok(())
}
Working with Images & Transparency
use oxidize_pdf::{Document, Page, Image, Result};
use oxidize_pdf::graphics::TransparencyGroup;

fn main() -> Result<()> {
    let mut doc = Document::new();
    let mut page = Page::a4();

    // Load a JPEG image
    let image = Image::from_jpeg_file("photo.jpg")?;

    // Add image to page
    page.add_image("my_photo", image);

    // Draw the image
    page.draw_image("my_photo", 100.0, 300.0, 400.0, 300.0)?;

    // Add watermark with transparency
    let watermark = TransparencyGroup::new().with_opacity(0.3);
    page.graphics()
        .begin_transparency_group(watermark)
        .set_font(oxidize_pdf::text::Font::HelveticaBold, 48.0)
        .begin_text()
        .show_text("CONFIDENTIAL")
        .end_text()
        .end_transparency_group();

    doc.add_page(page);
    doc.save("image_example.pdf")?;
    Ok(())
}
Advanced Text Flow
use oxidize_pdf::{Document, Page, Font, TextAlign, Result};

fn main() -> Result<()> {
    let mut doc = Document::new();
    let mut page = Page::a4();

    // Create text flow with automatic wrapping
    let mut flow = page.text_flow();
    flow.at(50.0, 700.0)
        .set_font(Font::Times, 12.0)
        .set_alignment(TextAlign::Justified)
        .write_wrapped("This is a long paragraph that will automatically wrap \
                        to fit within the page margins. The text is justified, \
                        creating clean edges on both sides.")?;

    page.add_text_flow(&flow);
    doc.add_page(page);
    doc.save("text_flow.pdf")?;
    Ok(())
}
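Automatic wrapping is commonly implemented as greedy line breaking: words are appended to the current line until the next word would overflow. The sketch below shows that technique in isolation. It is an illustration only, not the library's text flow code, and it measures width in characters where a real layout engine would sum glyph advances for the selected font and size.

```rust
/// Greedy line breaking: add words to a line until the next word
/// would exceed `max_width`, then start a new line.
/// Width is counted in characters as a stand-in for glyph widths.
/// A single word longer than `max_width` gets a line of its own.
fn wrap_words(text: &str, max_width: usize) -> Vec<String> {
    let mut lines = Vec::new();
    let mut line = String::new();
    let mut line_width = 0;
    for word in text.split_whitespace() {
        let word_width = word.chars().count();
        // +1 accounts for the space that would precede the word.
        if !line.is_empty() && line_width + 1 + word_width > max_width {
            lines.push(std::mem::take(&mut line));
            line_width = 0;
        }
        if !line.is_empty() {
            line.push(' ');
            line_width += 1;
        }
        line.push_str(word);
        line_width += word_width;
    }
    if !line.is_empty() {
        lines.push(line);
    }
    lines
}
```

Justified alignment would then distribute each line's leftover width across the gaps between its words, usually leaving the last line of a paragraph left-aligned.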
PDF Operations
use oxidize_pdf::operations::{PdfSplitter, PdfMerger, PageRange};
use oxidize_pdf::operations::{PdfRotator, RotationAngle};
use oxidize_pdf::Result;

fn main() -> Result<()> {
    // Split a PDF
    let splitter = PdfSplitter::new("input.pdf")?;
    splitter.split_by_pages("page_{}.pdf")?; // page_1.pdf, page_2.pdf, ...

    // Merge PDFs
    let mut merger = PdfMerger::new();
    merger.add_pdf("doc1.pdf", PageRange::All)?;
    merger.add_pdf("doc2.pdf", PageRange::Pages(vec![1, 3, 5]))?;
    merger.save("merged.pdf")?;

    // Rotate pages
    let rotator = PdfRotator::new("input.pdf")?;
    rotator.rotate_all(RotationAngle::Clockwise90, "rotated.pdf")?;
    Ok(())
}
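The comment in the split example suggests that the {} placeholder in the output pattern is replaced with a 1-based page number. A minimal sketch of that expansion follows; the function name is hypothetical, and the library's actual placeholder handling may differ.

```rust
/// Expand an output pattern such as "page_{}.pdf" into one file name
/// per page, numbering pages from 1 as in the comment above.
fn split_output_names(pattern: &str, page_count: usize) -> Vec<String> {
    (1..=page_count)
        .map(|n| pattern.replace("{}", &n.to_string()))
        .collect()
}
```

For a three-page input, split_output_names("page_{}.pdf", 3) yields page_1.pdf, page_2.pdf, and page_3.pdf.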
OCR Text Extraction
use oxidize_pdf::text::tesseract_provider::{TesseractOcrProvider, TesseractConfig};
use oxidize_pdf::text::ocr::{OcrOptions, OcrProvider};
use oxidize_pdf::operations::page_analysis::PageContentAnalyzer;
use oxidize_pdf::parser::PdfReader;
use oxidize_pdf::Result;

fn main() -> Result<()> {
    // Open a scanned PDF
    let document = PdfReader::open_document("scanned.pdf")?;
    let analyzer = PageContentAnalyzer::new(document);

    // Configure OCR provider
    let config = TesseractConfig::for_documents();
    let ocr_provider = TesseractOcrProvider::with_config(config)?;

    // Find and process scanned pages
    let scanned_pages = analyzer.find_scanned_pages()?;
    for page_num in scanned_pages {
        let result = analyzer.extract_text_from_scanned_page(page_num, &ocr_provider)?;
        println!("Page {}: {} (confidence: {:.1}%)",
            page_num, result.text, result.confidence * 100.0);
    }
    Ok(())
}
OCR Installation
Before using OCR features, install Tesseract on your system:
macOS:
brew install tesseract
brew install tesseract-lang # For additional languages
Ubuntu/Debian:
sudo apt-get install tesseract-ocr
sudo apt-get install tesseract-ocr-spa # For Spanish
sudo apt-get install tesseract-ocr-deu # For German
Windows: Download from: https://github.com/UB-Mannheim/tesseract/wiki
More Examples
Explore comprehensive examples in the examples/ directory:
- recovery_corrupted_pdf.rs - Handle damaged or malformed PDFs with robust error recovery
- png_transparency_watermark.rs - Create watermarks, blend modes, and transparent overlays
- cjk_text_extraction.rs - Work with Chinese, Japanese, and Korean text
- basic_chunking.rs - Document chunking for AI/RAG pipelines
- rag_pipeline.rs - Complete RAG workflow with embeddings
Run any example:
cargo run --example recovery_corrupted_pdf
cargo run --example png_transparency_watermark
cargo run --example cjk_text_extraction
Supported Features
PDF Generation
- Multi-page documents
- Vector graphics (rectangles, circles, paths, lines)
- Text rendering with standard fonts (Helvetica, Times, Courier)
- JPEG and PNG image embedding with transparency
- Transparency groups, blend modes, and opacity control
- RGB, CMYK, and Grayscale colors
- Graphics transformations (translate, rotate, scale)
- Text flow with automatic line wrapping
- FlateDecode compression
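For reference, the blend modes named in this README (Normal, Multiply, Screen, Overlay) are defined in the PDF specification as per-channel functions of a backdrop and a source color. The sketch below writes out those formulas for a single channel in the range 0.0 to 1.0, plus constant-alpha compositing over an opaque backdrop. The function names are illustrative and are not oxidize-pdf APIs.

```rust
/// Normal: the source color replaces the backdrop.
fn normal(_backdrop: f64, source: f64) -> f64 {
    source
}

/// Multiply: always darkens (or leaves unchanged).
fn multiply(backdrop: f64, source: f64) -> f64 {
    backdrop * source
}

/// Screen: always lightens (or leaves unchanged).
fn screen(backdrop: f64, source: f64) -> f64 {
    backdrop + source - backdrop * source
}

/// HardLight: Multiply or Screen, chosen by the source value.
fn hard_light(backdrop: f64, source: f64) -> f64 {
    if source <= 0.5 {
        multiply(backdrop, 2.0 * source)
    } else {
        screen(backdrop, 2.0 * source - 1.0)
    }
}

/// Overlay: HardLight with backdrop and source swapped.
fn overlay(backdrop: f64, source: f64) -> f64 {
    hard_light(source, backdrop)
}

/// Composite one channel over an opaque backdrop with constant opacity
/// `alpha`, using the given blend function.
fn composite(backdrop: f64, source: f64, alpha: f64, blend: fn(f64, f64) -> f64) -> f64 {
    (1.0 - alpha) * backdrop + alpha * blend(backdrop, source)
}
```

A black watermark at 30% opacity over a white page, composite(1.0, 0.0, 0.3, normal), leaves the channel at about 0.7, which is why such a watermark reads as light gray.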
PDF Parsing
- PDF 1.0 - 1.7 basic structure support
- Cross-reference table parsing with automatic recovery
- XRef streams (PDF 1.5+) and object streams
- Object and stream parsing with corruption tolerance
- Page tree navigation with circular reference detection
- Content stream parsing (basic operators)
- Text extraction with CJK (Chinese, Japanese, Korean) support
- CMap and ToUnicode parsing for complex encodings
- Document metadata extraction
- Filter support (FlateDecode, ASCIIHexDecode, ASCII85Decode, RunLengthDecode, LZWDecode, DCTDecode)
- Lenient parsing with multiple error recovery strategies
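A common way to recover from a broken cross-reference table is to ignore it and scan the file for "N G obj" headers, recording the byte offset of each. The sketch below shows that general technique using only the standard library. It is an assumption about the approach, not the library's recovery code, and the function names are hypothetical. A real implementation would also have to cope with false positives inside stream data.

```rust
use std::collections::BTreeMap;

/// Scan raw PDF bytes for "N G obj" headers and return a map of
/// object number -> (generation, byte offset of the header).
fn rebuild_xref(data: &[u8]) -> BTreeMap<u32, (u16, usize)> {
    let mut table = BTreeMap::new();
    let mut pos = 0;
    while pos < data.len() {
        // A header starts at the beginning of the data or after whitespace.
        let at_token_start = pos == 0 || data[pos - 1].is_ascii_whitespace();
        if at_token_start && data[pos].is_ascii_digit() {
            if let Some((number, generation, end)) = parse_header(data, pos) {
                // A later definition of the same object replaces an earlier one.
                table.insert(number, (generation, pos));
                pos = end;
                continue;
            }
        }
        pos += 1;
    }
    table
}

/// Try to parse "<number> <generation> obj" starting at `start`.
/// Returns the two numbers and the index just past "obj".
fn parse_header(data: &[u8], start: usize) -> Option<(u32, u16, usize)> {
    let (number, after_number) = parse_uint(data, start)?;
    let generation_start = skip_whitespace(data, after_number)?;
    let (generation, after_generation) = parse_uint(data, generation_start)?;
    let keyword = skip_whitespace(data, after_generation)?;
    if !data[keyword..].starts_with(b"obj") {
        return None;
    }
    let end = keyword + 3;
    // Reject longer words that merely begin with "obj".
    if data.get(end).map_or(false, |b| b.is_ascii_alphanumeric()) {
        return None;
    }
    if number > u64::from(u32::MAX) || generation > u64::from(u16::MAX) {
        return None;
    }
    Some((number as u32, generation as u16, end))
}

/// Parse a run of ASCII digits; returns the value and the index after it.
fn parse_uint(data: &[u8], start: usize) -> Option<(u64, usize)> {
    let mut end = start;
    let mut value: u64 = 0;
    while end < data.len() && data[end].is_ascii_digit() {
        value = value
            .checked_mul(10)?
            .checked_add(u64::from(data[end] - b'0'))?;
        end += 1;
    }
    if end == start { None } else { Some((value, end)) }
}

/// Require at least one whitespace byte; returns the index after the run.
fn skip_whitespace(data: &[u8], start: usize) -> Option<usize> {
    let mut end = start;
    while end < data.len() && data[end].is_ascii_whitespace() {
        end += 1;
    }
    if end == start { None } else { Some(end) }
}
```

The resulting map plays the role of the rebuilt table: each entry says where to seek in order to parse that object.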
PDF Operations
- Split by pages, ranges, or size
- Merge multiple PDFs
- Rotate pages (90°, 180°, 270°)
- Basic content preservation
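In the PDF format, page rotation is stored as a /Rotate entry that must be a multiple of 90 degrees, measured clockwise. Rotating an already rotated page therefore means adding angles and normalizing the result. The sketch below shows that arithmetic. The helper names are hypothetical and do not describe how PdfRotator is implemented.

```rust
/// Combine a page's existing /Rotate value with an additional clockwise
/// rotation. Both must be multiples of 90; the result is normalized to
/// 0, 90, 180 or 270. Returns None for any other angle.
fn combined_rotation(current: i32, delta: i32) -> Option<i32> {
    if current % 90 != 0 || delta % 90 != 0 {
        return None;
    }
    Some((current + delta).rem_euclid(360))
}

/// Page size as displayed: width and height swap for 90 and 270 degrees.
fn displayed_size(width: f64, height: f64, rotation: i32) -> (f64, f64) {
    if rotation.rem_euclid(180) == 90 {
        (height, width)
    } else {
        (width, height)
    }
}
```

For example, rotating a page that already has /Rotate 270 by another 90 degrees brings it back to 0.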
OCR Support (v0.1.3+)
- Tesseract OCR integration with feature flag
- Multi-language support (50+ languages)
- Page analysis and scanned page detection
- Configurable preprocessing (denoise, deskew, contrast)
- Layout preservation with position information
- Confidence scoring and filtering
- Multiple page segmentation modes (PSM)
- Character whitelisting/blacklisting
- Mock OCR provider for testing
- Parallel and batch processing
Performance
- Parsing: Fast for PDFs with basic features
- Generation: Efficient for simple documents
- Memory efficient: Streaming operations available
- Pure Rust: No external C dependencies
Examples
Check out the examples directory for more usage patterns:
- hello_world.rs - Basic PDF creation
- graphics_demo.rs - Vector graphics showcase
- text_formatting.rs - Advanced text features
- custom_fonts.rs - TTF/OTF font loading and embedding
- jpeg_image.rs - Image embedding
- parse_pdf.rs - PDF parsing and text extraction
- comprehensive_demo.rs - All features demonstration
- tesseract_ocr_demo.rs - OCR text extraction (requires --features ocr-tesseract)
- scanned_pdf_analysis.rs - Analyze PDFs for scanned content
- extract_images.rs - Extract embedded images from PDFs
- create_pdf_with_images.rs - Advanced image embedding examples
Run examples with:
cargo run --example hello_world
# For OCR examples
cargo run --example tesseract_ocr_demo --features ocr-tesseract
License
This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0) - see the LICENSE file for details.
Why AGPL-3.0?
AGPL-3.0 ensures that oxidize-pdf remains free and open source while protecting against proprietary use in SaaS without contribution back to the community. This license:
- Allows free use, modification, and distribution
- Requires sharing modifications if you provide the software as a service
- Ensures improvements benefit the entire community
- Supports sustainable open source development
Commercial Products & Licensing
oxidize-pdf-core is free and open source (AGPL-3.0). For commercial products and services:
Commercial Products:
- oxidize-pdf-pro: Enhanced library with advanced features
- oxidize-pdf-api: REST API server for PDF operations
- oxidize-pdf-cli: Command-line interface with enterprise capabilities
Commercial License Benefits:
- Commercial-friendly terms (no AGPL obligations)
- Advanced features (cloud OCR, batch processing, digital signatures)
- Priority support and SLAs
- Custom feature development
- Access to commercial products (API, CLI, PRO library)
For commercial licensing inquiries, please open an issue on the GitHub repository.
Known Limitations
oxidize-pdf provides basic PDF functionality. We prioritize transparency about what works and what doesn't.
Working Features
- Compression: FlateDecode, ASCIIHexDecode, ASCII85Decode, RunLengthDecode, LZWDecode, DCTDecode (JPEG)
- Color Spaces: DeviceRGB, DeviceCMYK, DeviceGray
- Fonts: Standard 14 fonts + TTF/OTF custom font loading and embedding
- Images: JPEG embedding, raw RGB/Gray data
- PNG Support: Basic functionality (7 tests failing - compression issues)
- Operations: Split, merge, rotate, page extraction, text extraction
- Graphics: Vector operations, clipping paths, transparency (CA/ca)
- Encryption: RC4 40/128-bit, AES-128/256 with permissions
- Forms: Basic text fields, checkboxes, radio buttons, combo boxes, list boxes
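Of the stream filters listed above, RunLengthDecode is the simplest to show in full. In the PDF specification, a length byte from 0 to 127 means "copy the next length + 1 bytes", a length byte from 129 to 255 means "repeat the next byte 257 - length times", and 128 marks end of data. The decoder below is a std-only sketch of that scheme, not the library's implementation, and the function name is hypothetical.

```rust
/// Decode RunLengthDecode data. Returns None if the input is truncated
/// in the middle of a run.
fn run_length_decode(input: &[u8]) -> Option<Vec<u8>> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < input.len() {
        let length = input[i];
        i += 1;
        match length {
            // End-of-data marker.
            128 => break,
            // Literal run: copy the next `length + 1` bytes unchanged.
            0..=127 => {
                let count = usize::from(length) + 1;
                let literal = input.get(i..i + count)?;
                out.extend_from_slice(literal);
                i += count;
            }
            // Repeated run: emit the next byte `257 - length` times.
            _ => {
                let count = 257 - usize::from(length);
                let byte = *input.get(i)?;
                out.extend(std::iter::repeat(byte).take(count));
                i += 1;
            }
        }
    }
    Some(out)
}
```

For example, the bytes 2, 'a', 'b', 'c', 254, 'x', 128 decode to "abcxxx".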
Known Issues & Missing Features
- PNG Compression: 7 tests consistently failing - use JPEG for now
- Form Interactions: Forms can be created but not edited interactively
- Rendering: No PDF to image conversion
- Advanced Compression: CCITTFaxDecode, JBIG2Decode, and JPXDecode are not supported
- Advanced Graphics: Complex patterns, shadings, gradients, and advanced blend modes are not supported
- Digital Signatures: Signature fields exist but no signing capability
- Tagged PDFs: No accessibility/structure support yet
- Advanced Color: ICC profiles, spot colors, and Lab color space are not supported
- JavaScript: No form calculations or validation scripts
- Multimedia: No sound, video, or 3D content support
Examples Status
We're actively adding more examples for core features. New examples include:
- merge_pdfs.rs - PDF merging with various options
- split_pdf.rs - Different splitting strategies
- extract_text.rs - Text extraction with layout preservation
- encryption.rs - RC4 and AES encryption demonstrations
Important Notes
- Parsing success doesn't mean full feature support
- Many PDFs will parse but advanced features will be ignored
- This is early beta software with significant limitations
Project Structure
oxidize-pdf/
├── oxidize-pdf-core/   # Core PDF library (AGPL-3.0)
├── test-suite/         # Comprehensive test suite
├── docs/               # Documentation
│   ├── technical/      # Technical docs and implementation details
│   └── reports/        # Analysis and test reports
├── tools/              # Development and analysis tools
├── scripts/            # Build and release scripts
└── test-pdfs/          # Test PDF files
Commercial Products (available separately under commercial license):
- oxidize-pdf-api: REST API server for PDF operations
- oxidize-pdf-cli: Command-line interface with advanced features
- oxidize-pdf-pro: Enhanced library with additional capabilities
See REPOSITORY_ARCHITECTURE.md for detailed information.
Testing
oxidize-pdf includes comprehensive test suites to ensure reliability:
# Run standard test suite (synthetic PDFs)
cargo test
# Run all tests including performance benchmarks
cargo test -- --ignored
# Run with local PDF fixtures (if available)
OXIDIZE_PDF_FIXTURES=on cargo test
# Run OCR tests (requires Tesseract installation)
cargo test tesseract_ocr_tests --features ocr-tesseract -- --ignored
Local PDF Fixtures (Optional)
For enhanced testing with real-world PDFs, you can optionally set up local PDF fixtures:
- Create a symbolic link: tests/fixtures -> /path/to/your/pdf/collection
- The test suite will automatically detect and use these PDFs
- Fixtures are never committed to the repository (excluded in .gitignore)
- Tests work fine without fixtures using synthetic PDFs
Note: CI/CD always uses synthetic PDFs only for consistent, fast builds.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
Roadmap
oxidize-pdf is under active development. Our focus areas include:
Current Focus
- Parsing & Compatibility: Improving support for diverse PDF structures
- Core Operations: Enhancing split, merge, and manipulation capabilities
- Performance: Optimizing memory usage and processing speed
- Stability: Addressing edge cases and error handling
Upcoming Areas
- Extended Format Support: Additional image formats and encodings
- Advanced Text Processing: Improved text extraction and layout analysis
- Enterprise Features: Features designed for production use at scale
- Developer Experience: Better APIs, documentation, and tooling
Long-term Vision
- Comprehensive PDF standard compliance for common use cases
- Production-ready reliability and performance
- Rich ecosystem of tools and integrations
- Sustainable open source development model
We prioritize features based on community feedback and real-world usage. Have a specific need? Open an issue to discuss!
Support
- Documentation
- Issue Tracker
- Discussions
Acknowledgments
Built with ❤️ using Rust. Special thanks to the Rust community and all contributors.