Why Rust?¶
Understanding why rs_document exists requires understanding the performance challenges in RAG (Retrieval Augmented Generation) applications and why Rust provides an ideal solution.
The Performance Problem¶
RAG applications need to process large document collections efficiently. A typical RAG pipeline involves two critical text processing steps, applied at scale:
- Cleaning - Removing artifacts from PDFs, OCR errors, formatting issues
- Splitting - Breaking documents into chunks that fit embedding models
- Scale - Processing thousands or millions of documents
These operations seem simple, but at scale they become significant bottlenecks.
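To make the two steps concrete, here is a minimal pure-Python version of a clean-then-split pipeline. It is an illustrative sketch only: the function names, parameters, and cleaning rules are assumptions, not rs_document's actual behavior.

```python
import re

def clean(text: str) -> str:
    # Drop null bytes left behind by bad PDF extraction,
    # then collapse runs of spaces/tabs into single spaces.
    text = text.replace("\x00", "")
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()

def split(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    # Fixed-size character chunks with a sliding overlap.
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text), 1), step)]

chunks = split(clean("An  example\t\tdocument " * 300), chunk_size=500, overlap=50)
assert all(len(c) <= 500 for c in chunks)
```

Even this toy version hints at the cost model: every document is scanned several times during cleaning and copied again during splitting, which is exactly where the overhead discussed below accumulates.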
Real-World Scenarios¶
Consider these common situations:
Initial Knowledge Base Creation¶
- You have 100,000 PDF documents to process
- Each needs cleaning (5 operations) and splitting (1 operation)
- Pure Python: ~3 hours of processing time
- rs_document: ~8 minutes of processing time
Experimentation with Chunk Sizes¶
- Testing different chunk sizes (500, 1000, 1500 characters) for retrieval quality
- Each experiment requires reprocessing the entire corpus
- Python: 3 hours × 3 experiments = 9 hours
- rs_document: 8 minutes × 3 experiments = 24 minutes
Real-Time Document Ingestion¶
- New documents arrive continuously and need immediate processing
- Python: Can process ~150 documents/second on 8 cores
- rs_document: Can process ~23,000 documents/second on 8 cores
The difference between minutes and hours fundamentally changes what's practical to do.
Why Python Is Slow for Text Processing¶
Pure Python implementations (like LangChain's text splitters and Unstructured.io's cleaners) work well for small datasets but become bottlenecks at scale. This isn't a criticism of those tools—Python has inherent limitations for text processing:
Interpreted Execution Overhead¶
Python code is interpreted at runtime rather than compiled to machine code. Every operation involves:
- Looking up methods in object dictionaries
- Type checking at runtime
- Reference counting for memory management
- Frame creation for function calls
For text processing with millions of operations, this overhead accumulates significantly.
String Immutability¶
Python strings are immutable; every modification allocates an entirely new string object.
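This is easy to observe directly. A minimal illustration:

```python
# Each "modification" of a str returns a brand-new object;
# the original is never changed in place.
text = "Hello\x00World"
cleaned = text.replace("\x00", " ")

assert text == "Hello\x00World"   # original untouched
assert cleaned == "Hello World"   # a separate allocation
assert id(text) != id(cleaned)    # two distinct objects in memory
```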
For cleaning operations that make multiple passes over text, this means:
- Constant memory allocation
- Copying entire strings repeatedly
- Memory fragmentation
- Garbage collection pressure
A document going through 5 cleaning operations creates 5 complete copies in memory.
Global Interpreter Lock (GIL)¶
The GIL prevents true parallelism in Python:
- Only one thread executes Python bytecode at a time
- Multi-core CPUs can't be fully utilized for Python-level operations
- Threading helps with I/O, but not with CPU-bound text processing
Without resorting to multiprocessing (and its process-spawning and serialization overhead), processing 10,000 documents on an 8-core machine uses only one core in pure Python.
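The effect is easy to demonstrate with a self-contained sketch (not from rs_document's codebase): a thread pool produces the same results as a serial loop, but on CPython the threads take turns holding the GIL rather than running on separate cores, so CPU-bound work gains little or no wall-clock time.

```python
from concurrent.futures import ThreadPoolExecutor

def cpu_bound(n: int) -> int:
    # Pure-Python busy work: the loop holds the GIL the whole time.
    total = 0
    for i in range(n):
        total += i * i
    return total

N = 200_000

# Serial baseline.
serial = [cpu_bound(N) for _ in range(4)]

# Four threads, same work. Correct results, but on CPython the
# bytecode still executes one thread at a time.
with ThreadPoolExecutor(max_workers=4) as pool:
    threaded = list(pool.map(cpu_bound, [N] * 4))

assert serial == threaded
```

If you time the two sections, the threaded version typically finishes no faster than the serial one for work like this; threads only help when the work releases the GIL (I/O, or extensions that drop it internally).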
Regex Performance¶
Python's re module, while good, can't match the performance of compiled regex engines:
- No Just-In-Time (JIT) compilation of patterns
- Less aggressive optimization
- Overhead of Python-C boundary for each match
Text cleaning heavily relies on regex for pattern matching and replacement.
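A representative cleaning pass, shown with Python's `re` module (the pattern and input here are illustrative, not rs_document's actual rules):

```python
import re

# Collapse runs of whitespace left over from PDF extraction
# into single spaces. Precompiling the pattern is the usual
# optimization available in pure Python.
WHITESPACE = re.compile(r"\s+")

raw = "Page  1\n\nIntroduction\tto   RAG"
cleaned = WHITESPACE.sub(" ", raw).strip()
assert cleaned == "Page 1 Introduction to RAG"
```

Each `sub` call crosses the Python-C boundary and allocates a new string, which is exactly the per-operation overhead the sections above describe.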
Why Rust Is the Solution¶
Rust provides the perfect combination of performance and safety for this problem domain:
Compiled Native Code¶
Rust compiles to native machine code with aggressive optimizations:
```rust
// This gets compiled to direct CPU instructions
pub fn clean_text(text: &mut String) {
    text.retain(|c| c.is_ascii());
}
```
- No interpretation overhead
- CPU can execute directly
- Compiler optimizations (inlining, loop unrolling, etc.)
- SIMD (Single Instruction Multiple Data) instructions automatically applied
The same operation in Python involves multiple layers of abstraction before reaching the CPU.
Efficient String Handling¶
Rust's String type allows safe in-place modification, which means:
- One allocation for the lifetime of cleaning operations
- No intermediate string copies
- Direct memory manipulation
- Minimal garbage collection
A document going through 5 cleaning operations uses the same memory buffer throughout.
True Parallelism with Rayon¶
Rust has no GIL. Rayon provides data parallelism that actually uses all cores:
```rust
documents.into_par_iter()      // Rayon parallel iterator over owned documents
    .flat_map(|mut doc| {
        doc.clean();           // clean each document in place
        doc.split(chunk_size)  // split it into chunks
    })
    .collect()
```
This code:
- Automatically splits work across available CPU cores
- Processes multiple documents simultaneously
- Scales linearly with core count
- Has minimal synchronization overhead
On an 8-core machine, you get close to 8x throughput for document processing.
High-Performance Regex¶
Rust's regex crate provides one of the fastest regex engines available:
- DFA (Deterministic Finite Automaton) compilation
- Literal optimizations for fast prefix/suffix matching
- SIMD acceleration for character classes
- Zero-copy operations where possible
Pattern matching and replacement operations are significantly faster than Python's re module.
The PyO3 Bridge¶
rs_document uses PyO3 to make Rust code callable from Python:
```rust
use std::collections::HashMap;

use pyo3::prelude::*;

#[pyclass]
pub struct Document {
    pub page_content: String,
    pub metadata: HashMap<String, String>,
}

#[pymethods]
impl Document {
    #[new]
    pub fn new(page_content: String, metadata: HashMap<String, String>) -> Self {
        Self { page_content, metadata }
    }
}
```
PyO3 generates the Python extension module that:
- Exposes Rust structs as Python classes
- Converts between Python and Rust types
- Handles reference counting and memory management
- Minimizes copying across the language boundary
You get Rust performance with Python convenience—no need to rewrite your entire application.
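From the Python side, the class above behaves like any ordinary Python class. The sketch below uses a plain-Python stand-in so the snippet is self-contained; in real code the class comes from the compiled extension (`from rs_document import Document`):

```python
# Stand-in mirroring the interface the #[pyclass] definition exposes.
class Document:
    def __init__(self, page_content: str, metadata: dict[str, str]):
        self.page_content = page_content
        self.metadata = metadata

doc = Document("Raw text extracted from a PDF", {"source": "report.pdf"})
assert doc.page_content.startswith("Raw text")
assert doc.metadata["source"] == "report.pdf"
```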
Measured Performance Gains¶
Typical performance improvements on modern hardware (8-core CPU):
| Operation | Python | rs_document | Speedup |
|---|---|---|---|
| Single document cleaning | ~20ms | <1ms | 20-25x |
| Single document splitting | ~100ms | <5ms | 20-25x |
| Batch processing (1000 docs) | ~20s | ~0.9s | 22x |
| Batch processing (1M docs) | ~6 hours | ~15 min | 24x |
The speedup is consistent across different dataset sizes—no performance degradation with large batches.
When Performance Matters¶
Not every use case needs this level of performance optimization:
Small Workloads (< 100 documents)¶
For small datasets, the speedup might not matter:
- Python: 2 seconds
- rs_document: 0.1 seconds
- Saved: 1.9 seconds
This difference is negligible in most workflows.
Large Workloads (> 10,000 documents)¶
For large datasets, the speedup is transformative:
- Python: 20 minutes
- rs_document: 50 seconds
- Saved: 19+ minutes
This changes what's practical:
- Enables rapid experimentation
- Makes real-time processing feasible
- Reduces infrastructure costs
Critical Factors¶
Performance matters when you have:
- Large document collections (thousands to millions)
- Frequent reprocessing (experimentation, updates)
- Real-time requirements (immediate ingestion)
- Cost constraints (reducing compute time)
If any of these apply, rs_document's performance advantage is significant.
Design Trade-offs¶
Achieving this performance required specific design choices:
- Opinionated Defaults - Fixed parameters enable aggressive optimization
- String-Only Metadata - Simple types avoid complex Python-Rust conversions
- Limited Customization - Specific behavior allows compiler optimizations
- Rust Dependency - Building from source requires the Rust toolchain
These trade-offs are detailed in Design Philosophy.
The Bottom Line¶
Python is excellent for many tasks, but CPU-bound text processing at scale isn't one of them. Rust's compiled execution, efficient memory handling, and true parallelism provide dramatic performance improvements while maintaining Python's ease of use through PyO3.
rs_document doesn't replace your entire stack—it replaces the specific bottleneck of text cleaning and splitting, letting you keep Python for everything else.
Next: Design Philosophy - Understanding the deliberate choices behind the API