RS Document¶
High-performance text document cleaning and splitting for RAG (Retrieval Augmented Generation) applications.
Overview¶
rs_document is a Rust-powered Python package that provides fast text processing operations for preparing documents for vector databases and embedding models. It reimplements common document processing functions from LangChain and Unstructured.io with significant performance improvements.
Key Features:
- 20-25x faster than pure Python implementations
- ~23,000 documents/second processing speed
- Parallel batch processing
- Compatible with LangChain's Document model
- Simple, opinionated API
Quick Start¶
Install from PyPI:
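The install command itself is missing here; assuming the package is published under its import name (pip treats `-` and `_` as equivalent in package names):

```shell
pip install rs_document
```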
Basic usage:
```python
from rs_document import Document, clean_and_split_docs

# Create a document
doc = Document(
    page_content="Your document text here...",
    metadata={"source": "example.txt"},
)

# Clean and split
doc.clean()
chunks = doc.recursive_character_splitter(1000)

# Or process many documents at once
documents = [doc]  # Your list of documents
chunks = clean_and_split_docs(documents, chunk_size=1000)
```
What Can You Do?¶
Clean Documents¶
Remove artifacts from PDFs, OCR, and web scraping:
```python
doc = Document(
    page_content="● Text with bullets, æ ligatures, and extra spaces",
    metadata={},
)

doc.clean()  # Runs all cleaners
```
Available cleaners:
- Remove non-ASCII characters
- Convert ligatures (æ → ae, œ → oe)
- Remove bullet symbols
- Normalize whitespace
- Group broken paragraphs
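For intuition, the listed cleaners behave roughly like the following pure-Python pipeline (a simplified sketch, not the library's Rust implementation; paragraph grouping is omitted, and the function name is hypothetical):

```python
import re

# Common ligatures and their ASCII expansions
LIGATURES = {"æ": "ae", "œ": "oe", "Æ": "AE", "Œ": "OE"}
# Bullet symbols commonly left behind by PDF/OCR extraction
BULLETS = "●•◦▪‣"

def clean_text(text: str) -> str:
    """Rough pure-Python equivalent of the cleaning pipeline."""
    # Convert ligatures before stripping non-ASCII, so they survive as ASCII
    for lig, ascii_form in LIGATURES.items():
        text = text.replace(lig, ascii_form)
    # Strip bullet symbols
    text = "".join(ch for ch in text if ch not in BULLETS)
    # Remove remaining non-ASCII characters
    text = text.encode("ascii", errors="ignore").decode("ascii")
    # Normalize runs of whitespace to a single space
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("● Text with  æ ligatures   and extra spaces"))
# → Text with ae ligatures and extra spaces
```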
Split Documents¶
Break large documents into chunks for embeddings:
```python
# Recursive splitting (respects paragraphs/sentences/words)
chunks = doc.recursive_character_splitter(1000)

# Simple character splitting
chunks = doc.split_on_num_characters(500)
```
The recursive splitter:
- Tries to split on paragraph breaks first
- Falls back to sentences, then words, then characters
- Creates ~33% overlap between chunks for better context
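The fallback behavior above can be sketched in pure Python (a simplified illustration of the general recursive-splitting algorithm, without the overlap logic; not the library's implementation):

```python
def recursive_split(text, chunk_size, separators=("\n\n", ". ", " ", "")):
    """Split text into chunks of at most chunk_size, preferring larger separators."""
    if len(text) <= chunk_size:
        return [text]
    sep = separators[0]
    rest = separators[1:] if len(separators) > 1 else ("",)
    if sep == "":
        # Last resort: hard character split
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = current + sep + part if current else part
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            if len(part) > chunk_size:
                # Piece still too big: fall back to the next separator
                chunks.extend(recursive_split(part, chunk_size, rest))
                current = ""
            else:
                current = part
    if current:
        chunks.append(current)
    return chunks

text = "First paragraph.\n\nSecond paragraph is noticeably longer than the first.\n\nThird."
print(recursive_split(text, 60))
```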
Batch Processing¶
Process many documents efficiently with parallel processing:
```python
from rs_document import clean_and_split_docs

# Process 1000s of documents in seconds
chunks = clean_and_split_docs(documents, chunk_size=1000)
```
Performance¶
Benchmarks show consistent performance improvements:
| Operation | Documents | Python Time | rs_document Time | Speedup |
|---|---|---|---|---|
| Clean + Split | 1,000 | 45s | 2s | 22.5x |
| Clean + Split | 10,000 | 7.5min | 20s | 22.5x |
| Clean + Split | 100,000 | 75min | 3.3min | 22.5x |
Processing rate: ~23,000 documents/second on typical hardware.
Documentation Structure¶
Following the Diataxis framework:
Tutorial¶
Learning-oriented - Start here if you're new to rs_document. Walk through basic concepts and operations with hands-on examples.
How-To Guides¶
Task-oriented - Practical solutions for specific tasks like integrating with LangChain, batch processing, or handling edge cases.
Reference¶
Information-oriented - Complete API documentation for all classes, methods, and functions. Look up exact signatures and parameters.
Explanation¶
Understanding-oriented - Learn about design decisions, performance characteristics, and how the recursive splitter algorithm works.
Use Cases¶
rs_document is designed for:
- RAG pipelines - Prepare documents for vector databases
- Document ingestion - Process large document collections efficiently
- Embedding preparation - Split documents for embedding models
- Text normalization - Clean messy text from various sources
Works with:
- LangChain and LlamaIndex
- OpenAI, Cohere, and other embedding providers
- Pinecone, Weaviate, Qdrant, and other vector databases
- Any Python RAG framework
Why Rust?¶
Text processing in Python is slow for large-scale operations. rs_document uses Rust for:
- Compiled native code performance
- Efficient string operations
- True parallelism (no GIL)
- Memory efficiency
You get Rust's performance with Python's convenience - no Rust knowledge required.
Compatibility¶
- Python: 3.10+
- Platforms: Linux, macOS, Windows (x86_64 and ARM)
- LangChain: Compatible with Document model (metadata must be strings)
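Because metadata must be strings, documents carrying mixed-type metadata (page numbers, scores) need coercion before conversion. A minimal sketch, assuming a hypothetical helper:

```python
def stringify_metadata(metadata: dict) -> dict[str, str]:
    """Coerce all metadata keys and values to strings (hypothetical helper)."""
    return {str(k): str(v) for k, v in metadata.items()}

meta = {"source": "report.pdf", "page": 3, "score": 0.87}
print(stringify_metadata(meta))
# → {'source': 'report.pdf', 'page': '3', 'score': '0.87'}
```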
Project Status¶
rs_document is production-ready and actively maintained. It's been tested with:
- 102 test cases including property-based tests
- CI testing across Python versions
- Performance benchmarks to prevent regressions
Contributing¶
This project welcomes contributions! See the developer documentation in the dev/ directory:
- `dev/contributing.md` - Development workflow and testing
- `dev/claude.md` - Project architecture and design
- `dev/coverage.md` - Testing and coverage strategy
Attribution & Credits¶
This project builds upon and is inspired by the following open source projects:
LangChain¶
- Source: https://github.com/langchain-ai/langchain
- Author: LangChain AI
- License: MIT
- Usage: The Document class is designed to be compatible with LangChain's Document model. The recursive character splitter is based on LangChain's RecursiveCharacterTextSplitter algorithm, reimplemented in Rust for performance.
Unstructured.io¶
- Source: https://github.com/Unstructured-IO/unstructured
- Author: Unstructured Technologies, Inc.
- License: Apache 2.0
- Usage: The text cleaning functions are Rust reimplementations of Unstructured.io's post-processor cleaners, maintaining compatible behavior while providing significant performance improvements.
Diataxis¶
- Source: https://diataxis.fr
- Author: Daniele Procida
- License: Creative Commons
- Usage: Documentation structure follows the Diataxis framework for organizing technical documentation into tutorials, how-to guides, reference, and explanation sections.
License¶
See LICENSE.md for details.