RS Document¶
High-performance text document cleaning and splitting for RAG (Retrieval Augmented Generation) applications.
Overview¶
rs_document is a Rust-powered Python package that provides fast text processing operations for preparing documents for vector databases and embedding models. It reimplements common document processing functions from LangChain and Unstructured.io with significant performance improvements.
Key Features:
- 20-25x faster than pure Python implementations
- ~23,000 documents/second processing speed
- Parallel batch processing
- Compatible with LangChain's Document model
- Simple, opinionated API
Quick Start¶
Install from PyPI:
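The install command itself is missing here; assuming the package is published under its import name (pip treats `-` and `_` as equivalent in package names):

```shell
pip install rs_document
```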
Basic usage:
```python
from rs_document import Document, clean_and_split_docs

# Create a document
doc = Document(
    page_content="Your document text here...",
    metadata={"source": "example.txt"},
)

# Clean and split
doc.clean()
chunks = doc.recursive_character_splitter(1000)

# Or process many documents at once
documents = [doc]  # Your list of documents
chunks = clean_and_split_docs(documents, chunk_size=1000)
```
What Can You Do?¶
Clean Documents¶
Remove artifacts from PDFs, OCR, and web scraping:
```python
doc = Document(
    page_content="● Text with bullets, æ ligatures, and extra spaces",
    metadata={},
)

doc.clean()  # Runs all cleaners
```
Available cleaners:
- Remove non-ASCII characters
- Convert ligatures (æ → ae, œ → oe)
- Remove bullet symbols
- Normalize whitespace
- Group broken paragraphs
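For intuition, the listed cleaners behave roughly like the following pure-Python pipeline (a simplified sketch, not the library's Rust implementation; paragraph grouping is omitted, and the function name is hypothetical):

```python
import re

# Common ligatures and their ASCII expansions
LIGATURES = {"æ": "ae", "œ": "oe", "Æ": "AE", "Œ": "OE"}
# Bullet symbols commonly left behind by PDF/OCR extraction
BULLETS = "●•◦▪‣"

def clean_text(text: str) -> str:
    """Rough pure-Python equivalent of the cleaning pipeline."""
    # Convert ligatures before stripping non-ASCII, so they survive as ASCII
    for lig, ascii_form in LIGATURES.items():
        text = text.replace(lig, ascii_form)
    # Strip bullet symbols
    text = "".join(ch for ch in text if ch not in BULLETS)
    # Remove remaining non-ASCII characters
    text = text.encode("ascii", errors="ignore").decode("ascii")
    # Normalize runs of whitespace to a single space
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("● Text with  æ ligatures   and extra spaces"))
# → Text with ae ligatures and extra spaces
```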
Split Documents¶
Break large documents into chunks for embeddings:
```python
# Recursive splitting (respects paragraphs/sentences/words)
chunks = doc.recursive_character_splitter(1000)

# Simple character splitting
chunks = doc.split_on_num_characters(500)
```
The recursive splitter:
- Tries to split on paragraph breaks first
- Falls back to sentences, then words, then characters
- Creates ~33% overlap between chunks for better context
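The fallback behavior above can be sketched in pure Python (a simplified illustration of the general recursive-splitting algorithm, without the overlap logic; not the library's implementation):

```python
def recursive_split(text, chunk_size, separators=("\n\n", ". ", " ", "")):
    """Split text into chunks of at most chunk_size, preferring larger separators."""
    if len(text) <= chunk_size:
        return [text]
    sep = separators[0]
    rest = separators[1:] if len(separators) > 1 else ("",)
    if sep == "":
        # Last resort: hard character split
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = current + sep + part if current else part
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            if len(part) > chunk_size:
                # Piece still too big: fall back to the next separator
                chunks.extend(recursive_split(part, chunk_size, rest))
                current = ""
            else:
                current = part
    if current:
        chunks.append(current)
    return chunks

text = "First paragraph.\n\nSecond paragraph is noticeably longer than the first.\n\nThird."
print(recursive_split(text, 60))
```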
Batch Processing¶
Process many documents efficiently with parallel processing:
```python
from rs_document import clean_and_split_docs

# Process 1000s of documents in seconds
chunks = clean_and_split_docs(documents, chunk_size=1000)
```
Performance¶
Benchmarks show consistent performance improvements:
| Operation | Documents | Python Time | rs_document Time | Speedup |
|---|---|---|---|---|
| Clean + Split | 1,000 | 45s | 2s | 22.5x |
| Clean + Split | 10,000 | 7.5min | 20s | 22.5x |
| Clean + Split | 100,000 | 75min | 3.3min | 22.5x |
Processing rate: ~23,000 documents/second on typical hardware.
Documentation Structure¶
Following the Diataxis framework:
Tutorial¶
Learning-oriented - Start here if you're new to rs_document. Walk through basic concepts and operations with hands-on examples.
How-To Guides¶
Task-oriented - Practical solutions for specific tasks like integrating with LangChain, batch processing, or handling edge cases.
Reference¶
Information-oriented - Complete API documentation for all classes, methods, and functions. Look up exact signatures and parameters.
Explanation¶
Understanding-oriented - Learn about design decisions, performance characteristics, and how the recursive splitter algorithm works.
Use Cases¶
rs_document is designed for:
- RAG pipelines - Prepare documents for vector databases
- Document ingestion - Process large document collections efficiently
- Embedding preparation - Split documents for embedding models
- Text normalization - Clean messy text from various sources
Works with:
- LangChain and LlamaIndex
- OpenAI, Cohere, and other embedding providers
- Pinecone, Weaviate, Qdrant, and other vector databases
- Any Python RAG framework
Why Rust?¶
Text processing in Python is slow for large-scale operations. rs_document uses Rust for:
- Compiled native code performance
- Efficient string operations
- True parallelism (no GIL)
- Memory efficiency
You get Rust's performance with Python's convenience - no Rust knowledge required.
Compatibility¶
- Python: 3.10+
- Platforms: Linux, macOS, Windows (x86_64 and ARM)
- LangChain: Compatible with Document model (metadata must be strings)
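Because metadata must be strings, documents carrying mixed-type metadata (page numbers, scores) need coercion before conversion. A minimal sketch, assuming a hypothetical helper:

```python
def stringify_metadata(metadata: dict) -> dict[str, str]:
    """Coerce all metadata keys and values to strings (hypothetical helper)."""
    return {str(k): str(v) for k, v in metadata.items()}

meta = {"source": "report.pdf", "page": 3, "score": 0.87}
print(stringify_metadata(meta))
# → {'source': 'report.pdf', 'page': '3', 'score': '0.87'}
```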
Project Status¶
rs_document is production-ready and actively maintained. It's been tested with:
- 102 test cases including property-based tests
- CI testing across Python versions
- Performance benchmarks to prevent regressions
Contributing¶
This project welcomes contributions! See the developer documentation in the dev/ directory:
- `dev/contributing.md` - Development workflow and testing
- `dev/claude.md` - Project architecture and design
- `dev/coverage.md` - Testing and coverage strategy
Attribution & Credits¶
This project builds upon and is inspired by the following open source projects:
LangChain¶
- Source: https://github.com/langchain-ai/langchain
- Author: LangChain AI
- License: MIT
- Usage: The Document class is designed to be compatible with LangChain's Document model. The recursive character splitter is based on LangChain's RecursiveCharacterTextSplitter algorithm, reimplemented in Rust for performance.
Unstructured.io¶
- Source: https://github.com/Unstructured-IO/unstructured
- Author: Unstructured Technologies, Inc.
- License: Apache 2.0
- Usage: The text cleaning functions are Rust reimplementations of Unstructured.io's post-processor cleaners, maintaining compatible behavior while providing significant performance improvements.
Diataxis¶
- Source: https://diataxis.fr
- Author: Daniele Procida
- License: Creative Commons
- Usage: Documentation structure follows the Diataxis framework for organizing technical documentation into tutorials, how-to guides, reference, and explanation sections.
License¶
See LICENSE.md for details.