Why Rust?¶
Understanding why rs_document exists requires understanding the performance challenges in RAG (Retrieval Augmented Generation) applications and why Rust provides an ideal solution.
The Performance Problem¶
RAG applications need to process large document collections efficiently. A typical RAG pipeline involves two critical text processing steps, applied at scale:
- Cleaning - Removing artifacts from PDFs, OCR errors, formatting issues
- Splitting - Breaking documents into chunks that fit embedding models
- Scale - Processing thousands or millions of documents
These operations seem simple, but at scale they become significant bottlenecks.
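To make the two steps concrete, here is a minimal pure-Python version of a clean-then-split pipeline. It is an illustrative sketch only: the function names, parameters, and cleaning rules are assumptions, not rs_document's actual behavior.

```python
import re

def clean(text: str) -> str:
    # Drop null bytes left behind by bad PDF extraction,
    # then collapse runs of spaces/tabs into single spaces.
    text = text.replace("\x00", "")
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()

def split(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    # Fixed-size character chunks with a sliding overlap.
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text), 1), step)]

chunks = split(clean("An  example\t\tdocument " * 300), chunk_size=500, overlap=50)
assert all(len(c) <= 500 for c in chunks)
```

Even this toy version hints at the cost model: every document is scanned several times during cleaning and copied again during splitting, which is exactly where the overhead discussed below accumulates.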
Real-World Scenarios¶
Consider these common situations:
Initial Knowledge Base Creation¶
- You have 100,000 PDF documents to process
- Each needs cleaning (5 operations) and splitting (1 operation)
- Pure Python: ~3 hours of processing time
- rs_document: ~8 minutes of processing time
Experimentation with Chunk Sizes¶
- Testing different chunk sizes (500, 1000, 1500 characters) for retrieval quality
- Each experiment requires reprocessing the entire corpus
- Python: 3 hours × 3 experiments = 9 hours
- rs_document: 8 minutes × 3 experiments = 24 minutes
Real-Time Document Ingestion¶
- New documents arrive continuously and need immediate processing
- Python: Can process ~150 documents/second on 8 cores
- rs_document: Can process ~23,000 documents/second on 8 cores
The difference between minutes and hours fundamentally changes what's practical to do.
Why Python Is Slow for Text Processing¶
Pure Python implementations (like LangChain's text splitters and Unstructured.io's cleaners) work well for small datasets but become bottlenecks at scale. This isn't a criticism of those tools—Python has inherent limitations for text processing:
Interpreted Execution Overhead¶
Python code is interpreted at runtime rather than compiled to machine code. Every operation involves:
- Looking up methods in object dictionaries
- Type checking at runtime
- Reference counting for memory management
- Frame creation for function calls
For text processing with millions of operations, this overhead accumulates significantly.
String Immutability¶
Python strings are immutable; every modification allocates an entirely new string object.
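This is easy to observe directly. A minimal illustration:

```python
# Each "modification" of a str returns a brand-new object;
# the original is never changed in place.
text = "Hello\x00World"
cleaned = text.replace("\x00", " ")

assert text == "Hello\x00World"   # original untouched
assert cleaned == "Hello World"   # a separate allocation
assert id(text) != id(cleaned)    # two distinct objects in memory
```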
For cleaning operations that make multiple passes over text, this means:
- Constant memory allocation
- Copying entire strings repeatedly
- Memory fragmentation
- Garbage collection pressure
A document going through 5 cleaning operations creates 5 complete copies in memory.
Global Interpreter Lock (GIL)¶
The GIL prevents true parallelism in Python:
- Only one thread executes Python bytecode at a time
- Multi-core CPUs can't be fully utilized for Python-level operations
- Threading helps with I/O, but not with CPU-bound text processing
Without resorting to multiprocessing (and its process-spawning and serialization overhead), processing 10,000 documents on an 8-core machine uses only one core in pure Python.
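The effect is easy to demonstrate with a self-contained sketch (not from rs_document's codebase): a thread pool produces the same results as a serial loop, but on CPython the threads take turns holding the GIL rather than running on separate cores, so CPU-bound work gains little or no wall-clock time.

```python
from concurrent.futures import ThreadPoolExecutor

def cpu_bound(n: int) -> int:
    # Pure-Python busy work: the loop holds the GIL the whole time.
    total = 0
    for i in range(n):
        total += i * i
    return total

N = 200_000

# Serial baseline.
serial = [cpu_bound(N) for _ in range(4)]

# Four threads, same work. Correct results, but on CPython the
# bytecode still executes one thread at a time.
with ThreadPoolExecutor(max_workers=4) as pool:
    threaded = list(pool.map(cpu_bound, [N] * 4))

assert serial == threaded
```

If you time the two sections, the threaded version typically finishes no faster than the serial one for work like this; threads only help when the work releases the GIL (I/O, or extensions that drop it internally).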
Regex Performance¶
Python's re module, while good, can't match the performance of compiled regex engines:
- No Just-In-Time (JIT) compilation of patterns
- Less aggressive optimization
- Overhead of Python-C boundary for each match
Text cleaning heavily relies on regex for pattern matching and replacement.
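A representative cleaning pass, shown with Python's `re` module (the pattern and input here are illustrative, not rs_document's actual rules):

```python
import re

# Collapse runs of whitespace left over from PDF extraction
# into single spaces. Precompiling the pattern is the usual
# optimization available in pure Python.
WHITESPACE = re.compile(r"\s+")

raw = "Page  1\n\nIntroduction\tto   RAG"
cleaned = WHITESPACE.sub(" ", raw).strip()
assert cleaned == "Page 1 Introduction to RAG"
```

Each `sub` call crosses the Python-C boundary and allocates a new string, which is exactly the per-operation overhead the sections above describe.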
Why Rust Is the Solution¶
Rust provides the perfect combination of performance and safety for this problem domain:
Compiled Native Code¶
Rust compiles to native machine code with aggressive optimizations:
```rust
// This gets compiled to direct CPU instructions
pub fn clean_text(text: &mut String) {
    text.retain(|c| c.is_ascii());
}
```
- No interpretation overhead
- CPU can execute directly
- Compiler optimizations (inlining, loop unrolling, etc.)
- SIMD (Single Instruction Multiple Data) instructions automatically applied
The same operation in Python involves multiple layers of abstraction before reaching the CPU.
Efficient String Handling¶
Rust's String type allows safe in-place modification, which means:
- One allocation for the lifetime of cleaning operations
- No intermediate string copies
- Direct memory manipulation
- Minimal garbage collection
A document going through 5 cleaning operations uses the same memory buffer throughout.
True Parallelism with Rayon¶
Rust has no GIL. Rayon provides data parallelism that actually uses all cores:
```rust
documents.into_par_iter()      // Rayon parallel iterator over owned documents
    .flat_map(|mut doc| {
        doc.clean();           // clean each document in place
        doc.split(chunk_size)  // split it into chunks
    })
    .collect()
```
This code:
- Automatically splits work across available CPU cores
- Processes multiple documents simultaneously
- Scales linearly with core count
- Has minimal synchronization overhead
On an 8-core machine, you get close to 8x throughput for document processing.
High-Performance Regex¶
Rust's regex crate provides one of the fastest regex engines available:
- DFA (Deterministic Finite Automaton) compilation
- Literal optimizations for fast prefix/suffix matching
- SIMD acceleration for character classes
- Zero-copy operations where possible
Pattern matching and replacement operations are significantly faster than Python's re module.
The PyO3 Bridge¶
rs_document uses PyO3 to make Rust code callable from Python:
```rust
use std::collections::HashMap;

use pyo3::prelude::*;

#[pyclass]
pub struct Document {
    pub page_content: String,
    pub metadata: HashMap<String, String>,
}

#[pymethods]
impl Document {
    #[new]
    pub fn new(page_content: String, metadata: HashMap<String, String>) -> Self {
        Self { page_content, metadata }
    }
}
```
PyO3 generates the Python extension module that:
- Exposes Rust structs as Python classes
- Converts between Python and Rust types
- Handles reference counting and memory management
- Minimizes copying across the language boundary
You get Rust performance with Python convenience—no need to rewrite your entire application.
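From the Python side, the class above behaves like any ordinary Python class. The sketch below uses a plain-Python stand-in so the snippet is self-contained; in real code the class comes from the compiled extension (`from rs_document import Document`):

```python
# Stand-in mirroring the interface the #[pyclass] definition exposes.
class Document:
    def __init__(self, page_content: str, metadata: dict[str, str]):
        self.page_content = page_content
        self.metadata = metadata

doc = Document("Raw text extracted from a PDF", {"source": "report.pdf"})
assert doc.page_content.startswith("Raw text")
assert doc.metadata["source"] == "report.pdf"
```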
Measured Performance Gains¶
Typical performance improvements on modern hardware (8-core CPU):
| Operation | Python | rs_document | Speedup |
|---|---|---|---|
| Single document cleaning | ~20ms | <1ms | 20-25x |
| Single document splitting | ~100ms | <5ms | 20-25x |
| Batch processing (1000 docs) | ~20s | ~0.9s | 22x |
| Batch processing (1M docs) | ~6 hours | ~15 min | 24x |
The speedup is consistent across different dataset sizes—no performance degradation with large batches.
When Performance Matters¶
Not every use case needs this level of performance optimization:
Small Workloads (< 100 documents)¶
For small datasets, the speedup might not matter:
- Python: 2 seconds
- rs_document: 0.1 seconds
- Saved: 1.9 seconds
This difference is negligible in most workflows.
Large Workloads (> 10,000 documents)¶
For large datasets, the speedup is transformative:
- Python: 20 minutes
- rs_document: 50 seconds
- Saved: 19+ minutes
This changes what's practical:
- Enables rapid experimentation
- Makes real-time processing feasible
- Reduces infrastructure costs
Critical Factors¶
Performance matters when you have:
- Large document collections (thousands to millions)
- Frequent reprocessing (experimentation, updates)
- Real-time requirements (immediate ingestion)
- Cost constraints (reducing compute time)
If any of these apply, rs_document's performance advantage is significant.
Design Trade-offs¶
Achieving this performance required specific design choices:
- Opinionated Defaults - Fixed parameters enable aggressive optimization
- String-Only Metadata - Simple types avoid complex Python-Rust conversions
- Limited Customization - Specific behavior allows compiler optimizations
- Rust Dependency - Building from source requires the Rust toolchain
These trade-offs are detailed in Design Philosophy.
The Bottom Line¶
Python is excellent for many tasks, but CPU-bound text processing at scale isn't one of them. Rust's compiled execution, efficient memory handling, and true parallelism provide dramatic performance improvements while maintaining Python's ease of use through PyO3.
rs_document doesn't replace your entire stack—it replaces the specific bottleneck of text cleaning and splitting, letting you keep Python for everything else.
Next: Design Philosophy - Understanding the deliberate choices behind the API