Performance¶
Performance is a core feature of rs_document, not an afterthought. This page explains what makes it fast, provides benchmarks, and helps you understand when performance matters for your use case.
Performance Overview¶
rs_document delivers a consistent 20-25x speedup over pure Python implementations for document cleaning and splitting operations.
Key Numbers:
- Single cleaning operation: < 1ms
- Single document splitting: < 5ms
- Batch processing: ~23,000 documents/second (8-core CPU)
- Speedup: 20-25x faster than Python equivalents
These numbers hold across different dataset sizes—no degradation with large batches.
What Makes It Fast¶
1. Compiled Native Code¶
Rust compiles to machine code that runs directly on the CPU.
Contrast with Python:
# Python: interpreted at runtime
def clean_text(text):
    return text.replace("  ", " ")
# Each call involves:
# - Method lookup in object dict
# - Type checking
# - Bytecode interpretation
# - C function call
// Rust: compiled to native instructions
pub fn clean_text(text: &mut String) {
    // Direct CPU instructions
    // No interpretation layer
    // Inlined by compiler
}
Impact: Eliminates interpretation overhead. Operations execute at native CPU speed.
Measurement: ~5-10x speedup from compilation alone
2. Efficient String Handling¶
Rust allows safe in-place string modification, avoiding constant reallocation.
Python's Immutable Strings:
text = "original"
text = text.replace("original", "new") # New allocation
text = text.replace("  ", " ") # Another new allocation
text = text.strip() # Yet another allocation
# 3 allocations for 3 operations
Rust's Mutable Strings:
let mut text = String::from("original");
text.replace_range(0..8, "new"); // Modifies in place
// 1 allocation for all operations
Impact:
- Reduces memory allocation by 60-80%
- Better CPU cache utilization
- Less garbage collection pressure
- Fewer memory copies
Measurement: ~2-3x speedup from efficient string handling
3. SIMD Optimizations¶
Modern CPUs have SIMD (Single Instruction Multiple Data) instructions that process multiple values simultaneously.
Without SIMD:
// Check each character one at a time
for c in text.chars() {
    if !c.is_ascii() {
        // remove it
    }
}
// One character examined per loop iteration
With SIMD:
// Process 16-32 characters at once
// Rust's regex and string operations use SIMD automatically
// 16-32 characters per instruction
Impact:
- Character class checking (is_ascii, is_whitespace) much faster
- Pattern matching accelerated
- Memory operations vectorized
Measurement: ~2-4x speedup for character-level operations
4. Parallel Processing with Rayon¶
Rust has no Global Interpreter Lock (GIL). Rayon provides data parallelism that actually uses all CPU cores.
Python with GIL:
# Threading doesn't help for CPU-bound pure-Python work
from concurrent.futures import ThreadPoolExecutor

def process_docs(docs):
    with ThreadPoolExecutor(max_workers=8) as executor:
        # Only one thread runs Python bytecode at a time (GIL)
        return list(executor.map(process_doc, docs))
Rust with Rayon:
use rayon::prelude::*;

let results: Vec<_> = documents
    .par_iter()                   // Parallel iterator
    .map(|doc| process_doc(doc))
    .collect();
// Uses all 8 cores truly in parallel
Impact:
- Near-linear scaling with core count (see Scaling with Cores below)
- No synchronization overhead (data parallelism)
- Automatic work distribution
Measurement: up to ~8x speedup on an 8-core machine for batch processing
5. Zero-Copy Operations¶
PyO3 minimizes data copying between Python and Rust.
Efficient Boundary Crossing:
#[pyclass]
pub struct Document {
    pub page_content: String, // Owned by Rust
    // ...
}

#[pymethods]
impl Document {
    fn clean(&mut self) {
        // Operates directly on Rust-owned string
        // No Python string copying
    }
}
Impact:
- Document data stays in Rust for all operations
- Only final results cross the language boundary
- Minimal serialization/deserialization
Measurement: Reduces overhead by 30-50%
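On the Python side this is invisible; a minimal sketch of the effect, using only the calls shown elsewhere on this page:
doc = Document(large_text, {})   # text moves into Rust once
doc.clean()                      # mutates the Rust-owned string in place
snippet = doc.page_content[:80]  # copied back to Python only when read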
6. Optimized Regex Patterns¶
Rust's regex crate is highly optimized:
Features:
- Lazy DFA construction
- Literal prefix/suffix optimization
- Character class optimizations
- Unicode support without performance penalty
Example:
// The pattern r"\s+" compiles to a lazy DFA with
// SIMD-accelerated matching for whitespace runs:
use regex::Regex;
let re = Regex::new(r"\s+").unwrap();
let collapsed = re.replace_all(text, " ");
Impact: 2-5x faster than Python's re module for common patterns
Benchmark Details¶
These benchmarks were run on typical cloud hardware: 8-core CPU (AWS c5.2xlarge), 16GB RAM.
Single Document Operations¶
| Operation | Python | rs_document | Speedup |
|---|---|---|---|
| clean_extra_whitespace() | 15ms | 0.8ms | 18.8x |
| clean_ligatures() | 18ms | 0.6ms | 30.0x |
| clean_bullets() | 12ms | 0.5ms | 24.0x |
| clean_non_ascii_chars() | 20ms | 0.5ms | 40.0x |
| group_broken_paragraphs() | 35ms | 0.5ms | 70.0x |
| clean() (all) | 98ms | 4.2ms | 23.3x |
| recursive_character_splitter() | 105ms | 4.8ms | 21.9x |
| Clean + Split | 203ms | 9.0ms | 22.6x |
Document size: ~10KB text (typical article length)
Batch Processing¶
| Documents | Python Time | rs_document Time | Speedup |
|---|---|---|---|
| 100 | 20s | 0.9s | 22.2x |
| 1,000 | 3m 23s | 9.1s | 22.3x |
| 10,000 | 34m 12s | 1m 31s | 22.5x |
| 100,000 | 5h 42m | 15m 10s | 22.6x |
| 1,000,000 | ~57h | ~2.5h | 22.8x |
Operations: Clean + split at 1000 chars per document
Scaling with Cores¶
How performance scales with CPU cores (10,000 documents):
| Cores | Time | Throughput | Parallel Efficiency |
|---|---|---|---|
| 1 | 7m 18s | 22.8 docs/s | 100% |
| 2 | 3m 47s | 44.0 docs/s | 96% |
| 4 | 2m 2s | 81.8 docs/s | 90% |
| 8 | 1m 31s | 109.9 docs/s | 60% |
| 16 | 1m 8s | 147.1 docs/s | 40% |
Efficiency drops at higher core counts due to:
- Fixed overhead (Python GC, PyO3 conversion)
- Memory bandwidth limitations
- Work distribution overhead
Still, even at 16 cores you get ~6.5x improvement over single-core.
Performance Characteristics by Operation¶
Cleaning Operations¶
Fastest: clean_non_ascii_chars(), clean_bullets()
- Simple character filtering
- Highly SIMD-optimized
- ~0.5ms per 10KB document
Medium: clean_extra_whitespace(), clean_ligatures()
- Pattern matching and replacement
- Good regex optimization
- ~0.6-0.8ms per 10KB document
Most complex: group_broken_paragraphs()
- Complex logic (analyzing line endings, punctuation, capitals)
- Multiple passes over text
- ~0.5ms per 10KB document (the largest relative speedup over Python, and still very fast in absolute terms)
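For illustration, a sketch of invoking cleaners individually, assuming the operations above are exposed as Document methods alongside clean():
doc = Document(raw_text, {})
doc.clean_non_ascii_chars()    # simple character filtering: fastest class
doc.clean_extra_whitespace()   # pattern match and replace: medium
doc.group_broken_paragraphs()  # most complex logic, still sub-millisecond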
Splitting Operations¶
recursive_character_splitter():
- Time scales linearly with document size
- ~0.5ms per 1KB of input text
- Independent of chunk size (1000 vs 500 chars: same time)
- No significant variation by content type
split_on_num_characters():
- Simpler algorithm, slightly faster
- ~0.3ms per 1KB of input text
- Fixed chunk size, no recursion
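A sketch comparing the two splitters, assuming split_on_num_characters takes the chunk size as its argument (mirroring recursive_character_splitter):
doc = Document(article_text, {})
recursive_chunks = doc.recursive_character_splitter(1000)  # respects boundaries
fixed_chunks = doc.split_on_num_characters(1000)           # hard cut every 1000 characters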
Batch Operations¶
clean_and_split_docs():
- Parallel processing across documents
- Near-linear scaling up to core count
- Throughput: ~23,000 docs/second (8 cores)
Key insight: Parallelism is at document level, not within documents. Better to process many small documents than one huge document.
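A minimal batch sketch, matching the examples later on this page (the rs_document import path is assumed):
from rs_document import Document, clean_and_split_docs  # import path assumed

docs = [Document(text, {"id": str(i)}) for i, text in enumerate(texts)]
chunks = clean_and_split_docs(docs, chunk_size=1000)  # parallel across documents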
When Performance Matters¶
Performance improvements are most impactful in specific scenarios:
Scenario 1: Large Initial Corpus¶
Problem: Processing existing document collection to build knowledge base
Example: 100,000 PDF documents
- Python: 5 hours 42 minutes
- rs_document: 15 minutes 10 seconds
- Saved: 5 hours 27 minutes
Impact: High. One-time cost, but can block initial deployment.
Recommendation: Use rs_document for initial processing.
Scenario 2: Frequent Reprocessing¶
Problem: Experimenting with chunk sizes or cleaning options
Example: Testing 5 different chunk sizes on 10,000 documents
- Python: 34 minutes × 5 = 2 hours 50 minutes
- rs_document: 1.5 minutes × 5 = 7.5 minutes
- Saved: 2 hours 42 minutes per round of experiments
Impact: Very high. Enables rapid experimentation.
Recommendation: rs_document is essential for iterative development.
Scenario 3: Real-Time Ingestion¶
Problem: New documents arrive continuously and need immediate processing
Example: Processing 100 documents/minute
- Python: Can handle ~150 docs/sec (single core) or ~400 docs/sec (8 cores with multiprocessing)
- rs_document: Can handle ~23,000 docs/sec (8 cores)
- Headroom: 138x for burst handling
Impact: Medium to high. Depends on ingestion rate.
Recommendation:
- If ingestion < 100 docs/sec: Python probably fine
- If ingestion > 500 docs/sec: rs_document recommended
- For burst handling: rs_document provides safety margin
Scenario 4: Small Workloads¶
Problem: Processing < 100 documents occasionally
Example: Processing 50 documents
- Python: 1.7 seconds
- rs_document: 0.08 seconds
- Saved: 1.6 seconds
Impact: Very low. Difference is negligible.
Recommendation: Either tool is fine. Choose based on other factors (dependencies, familiarity, etc.)
Decision Matrix¶
| Your Situation | Python OK? | rs_document Recommended? |
|---|---|---|
| < 100 docs, infrequent | ✓ | Optional |
| 100-1,000 docs, occasional | ✓ | Nice to have |
| 1,000-10,000 docs | ~ | Recommended |
| > 10,000 docs | ✗ | Strongly recommended |
| Frequent reprocessing | ✗ | Essential |
| Real-time (> 500 docs/sec) | ✗ | Essential |
| Experimentation phase | ✗ | Very helpful |
Cost Implications¶
Performance improvements directly reduce infrastructure costs.
Compute Time Reduction¶
Example: Processing 1 million documents monthly
Python approach:
- Time: ~57 hours/month
- Instance: c5.2xlarge @ $0.34/hour
- Cost: 57 × $0.34 = $19.38/month
rs_document approach:
- Time: ~2.5 hours/month
- Instance: c5.2xlarge @ $0.34/hour
- Cost: 2.5 × $0.34 = $0.85/month
Savings: $18.53/month ($222/year)¶
For larger workloads, savings scale proportionally.
Development Time Value¶
Example: Iterating on chunk size (10 iterations during development)
Python:
- 34 minutes × 10 = 5.7 hours of waiting
- Developer time: 5.7 hours @ $100/hour = $570
rs_document:
- 1.5 minutes × 10 = 15 minutes of waiting
- Developer time: 0.25 hours @ $100/hour = $25
Saved: $545 in developer time (plus faster iteration)
Performance isn't just about compute cost—it's about enabling faster development cycles.
Performance Best Practices¶
1. Use Batch Functions¶
Slow:
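# Sequential per-document calls in Python (sketch)
chunks = []
for doc in docs:
    doc.clean()
    chunks.extend(doc.recursive_character_splitter(1000))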
Fast:
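# One batch call, parallelized internally (sketch)
chunks = clean_and_split_docs(docs, chunk_size=1000)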
The batch function:
- Processes documents in parallel
- Has less Python overhead
- More efficient memory usage
Speedup: 1.5-2x additional over sequential processing
2. Clean Before Splitting¶
Slow:
chunks = doc.recursive_character_splitter(1000)
for chunk in chunks:
chunk.clean() # Cleaning N chunks
Fast:
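doc.clean() # Cleaning 1 document
chunks = doc.recursive_character_splitter(1000)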
Cleaning after splitting means cleaning N chunks instead of 1 document.
Speedup: up to N× for the cleaning step (where N is the number of chunks)
3. Reuse Document Objects¶
Slow:
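# A new Document created and processed one at a time (sketch)
chunks = []
for text in texts:
    doc = Document(text, {})
    chunks.extend(doc.recursive_character_splitter(1000))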
Fast:
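# All Documents created up front, then processed as one batch (sketch)
docs = [Document(text, {}) for text in texts]
chunks = clean_and_split_docs(docs, chunk_size=1000)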
Creating objects in a batch allows for better memory locality and cache utilization.
4. Profile Before Optimizing¶
Not all operations need optimization. Profile to find actual bottlenecks:
import time
# Profile document processing
start = time.time()
chunks = clean_and_split_docs(docs, chunk_size=1000)
doc_time = time.time() - start
# Profile embedding
start = time.time()
vectors = embed_model.embed_documents([c.page_content for c in chunks])
embed_time = time.time() - start
print(f"Document processing: {doc_time:.2f}s")
print(f"Embedding: {embed_time:.2f}s")
# If embedding time >> doc time, rs_document isn't your bottleneck
Optimize the slowest part first.
Performance Limitations¶
Understanding what rs_document doesn't optimize:
1. Python Overhead¶
Creating and passing documents between Python and Rust has overhead:
# Fast
docs = [Document(text, {"id": str(i)}) for i, text in enumerate(texts)]
chunks = clean_and_split_docs(docs, chunk_size=1000)
# Slower (per-document Python overhead)
for text in texts:
    doc = Document(text, {})
    doc.clean()
    chunks = doc.recursive_character_splitter(1000)
Impact: Batching reduces Python overhead by ~30-50%
2. Single Document Processing¶
Parallelism only helps with multiple documents:
# Can't parallelize (single document)
doc = Document(very_long_text, {})
doc.clean() # Uses 1 core
# Can parallelize (multiple documents)
clean_and_split_docs(many_docs, chunk_size=1000) # Uses all cores
Impact: A single huge document processes more slowly than many small documents of the same total size
3. I/O Bound Operations¶
rs_document doesn't optimize:
- Reading files from disk
- Network requests
- Database queries
- Embedding API calls
These remain bottlenecks in the full RAG pipeline.
Comparison with Alternatives¶
vs LangChain (Python)¶
rs_document advantages:
- 20-25x faster for splitting
- Parallel batch processing
- Lower memory usage
LangChain advantages:
- More splitting strategies
- Token-based splitting
- Ecosystem integration
When to use each:
- rs_document: Performance-critical, large scale
- LangChain: Flexibility needed, small scale
vs Unstructured.io (Python)¶
rs_document advantages:
- 15-75x faster per cleaner
- Batch processing
- Lower resource usage
Unstructured.io advantages:
- More cleaners available
- Document parsing (PDF, DOCX, etc.)
- More configuration options
When to use each:
- rs_document: Performance-critical cleaning
- Unstructured.io: Document parsing + cleaning
vs Custom Rust Implementation¶
rs_document advantages:
- Pre-built, tested implementations
- Python-friendly API
- Regular updates
Custom Rust advantages:
- Tailored to exact needs
- No compromises
- Full control
When to use each:
- rs_document: Standard use cases
- Custom: Unique requirements justify development cost
Future Performance Improvements¶
Potential optimizations not yet implemented:
- Streaming processing: Process documents as they're generated
- GPU acceleration: Use GPU for embedding-aware splitting
- Advanced parallelism: Parallelize within single documents
- Memory mapping: Process files without loading into memory
These could provide 2-5x additional speedup but add complexity.
Summary¶
rs_document is fast because:
- Compiled Rust code (5-10x)
- Efficient string handling (2-3x)
- SIMD optimizations (2-4x)
- True parallelism (8x on 8 cores)
- Optimized algorithms (1.5-2x)
Combined effect: 20-25x faster than Python implementations (the factors overlap rather than multiply)
Performance matters most when:
- Processing > 1,000 documents
- Reprocessing frequently
- Real-time ingestion requirements
- Development iteration speed important
For small workloads (< 100 documents), the speedup is negligible and other factors should drive your tool choice.
Next: Comparisons - Understanding when to use rs_document vs alternatives