# API Reference

Complete reference documentation for all public APIs in `rs_document`.
## Quick Navigation

### Core Components
- Document Class - Document constructor and attributes
    - Constructor and initialization
    - `page_content` attribute
    - `metadata` attribute
### Document Processing
- Cleaning Methods - Text cleaning and normalization
    - `clean()` - Run all cleaners
    - `clean_extra_whitespace()` - Normalize whitespace
    - `clean_ligatures()` - Convert typographic ligatures
    - `clean_bullets()` - Remove bullet characters
    - `clean_non_ascii_chars()` - Remove non-ASCII characters
    - `group_broken_paragraphs()` - Join split paragraphs
- Splitting Methods - Document chunking strategies
    - `recursive_character_splitter()` - Smart splitting with overlap
    - `split_on_num_characters()` - Fixed-size splitting
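The difference between the two strategies can be sketched in pure Python. This is an illustrative reimplementation with made-up function names, not the library's Rust code; `recursive_character_splitter()` additionally prefers natural boundaries such as paragraph and sentence breaks before falling back to hard cuts, which the plain sliding window below does not attempt.

```python
def fixed_size_split(text: str, chunk_size: int) -> list[str]:
    """Fixed-size splitting: cut every chunk_size characters, no overlap."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def sliding_window_split(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Overlapping chunks: each chunk repeats the last chunk_overlap
    characters of the previous one, so context survives chunk boundaries."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

print(fixed_size_split("abcdefghij", chunk_size=4))
# → ['abcd', 'efgh', 'ij']
print(sliding_window_split("abcdefghij", chunk_size=4, chunk_overlap=2))
# → ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

Overlap matters for RAG retrieval: a sentence that straddles a chunk boundary still appears whole in at least one chunk.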
### Utilities
- Utility Functions - Batch processing and helpers
    - `clean_and_split_docs()` - Parallel processing for multiple documents
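Conceptually, `clean_and_split_docs()` cleans and splits each document and flattens the resulting chunks into a single list, doing the per-document work in parallel. A minimal pure-Python analogy is sketched below; the helper names are illustrative and not part of the library, which performs this work in native Rust code across all cores.

```python
from concurrent.futures import ThreadPoolExecutor

def clean_text(text: str) -> str:
    # Stand-in for the library's cleaners: collapse runs of whitespace.
    return " ".join(text.split())

def split_text(text: str, chunk_size: int) -> list[str]:
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def clean_and_split_one(text: str, chunk_size: int) -> list[str]:
    return split_text(clean_text(text), chunk_size)

def batch_clean_and_split(texts: list[str], chunk_size: int = 1000) -> list[str]:
    # Fan the per-document work out to a pool, then flatten the results.
    with ThreadPoolExecutor() as pool:
        per_doc = pool.map(clean_and_split_one, texts, [chunk_size] * len(texts))
    return [chunk for chunks in per_doc for chunk in chunks]
```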
### Reference Information
- Types and Constants - Type hints, defaults, and error handling
    - Type signatures
    - Default values and constants
    - Error handling patterns
    - Compatibility notes
## Overview
`rs_document` is a high-performance Python library for document processing, with its core implemented in Rust for speed. The library provides:
- Document representation: Simple, LangChain-compatible document structure
- Text cleaning: Normalize whitespace, remove artifacts, fix ligatures
- Document splitting: Split large documents into chunks for RAG applications
- Parallel processing: Process thousands of documents efficiently
### Basic Usage
```python
from rs_document import Document, clean_and_split_docs

# Create a document
doc = Document(
    page_content="Your text content here",
    metadata={"source": "example.txt", "page": "1"},
)

# Clean and split
doc.clean()
chunks = doc.recursive_character_splitter(chunk_size=1000)

# Or process multiple documents in parallel
documents = [doc1, doc2, doc3]
all_chunks = clean_and_split_docs(documents, chunk_size=1000)
```
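To make the cleaner names concrete, here is a rough pure-Python sketch of the kind of transformation each cleaner performs. The real implementations live in Rust and may handle more characters and edge cases than shown; treat this purely as an illustration.

```python
import re

# Common typographic ligatures and their ASCII expansions (illustrative subset).
LIGATURES = {"ﬁ": "fi", "ﬂ": "fl", "ﬀ": "ff", "ﬃ": "ffi", "ﬄ": "ffl"}

def clean_ligatures(text: str) -> str:
    """Replace each typographic ligature with its plain-letter form."""
    for ligature, ascii_form in LIGATURES.items():
        text = text.replace(ligature, ascii_form)
    return text

def clean_extra_whitespace(text: str) -> str:
    """Collapse runs of spaces and tabs into a single space."""
    return re.sub(r"[ \t]+", " ", text).strip()

def clean_bullets(text: str) -> str:
    """Strip a leading bullet character and surrounding whitespace."""
    return re.sub(r"^\s*[•‣◦·*-]\s*", "", text)

print(clean_ligatures("eﬃcient ﬂow"))      # → "efficient flow"
print(clean_extra_whitespace("a   b\t c"))  # → "a b c"
print(clean_bullets("• item"))              # → "item"
```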
## Performance
- Fast: 20-25x faster than pure Python implementations
- Parallel: Automatically uses all CPU cores for batch processing
- Scalable: Process ~23,000 documents per second on typical hardware
## Requirements
- Python 3.10 or higher
- Pre-built wheels available for most platforms (Linux, macOS, Windows)