# API Reference

Complete reference documentation for all public APIs in `rs_document`.
## Quick Navigation

### Core Components
- Document Class - Document constructor and attributes
    - Constructor and initialization
    - `page_content` attribute
    - `metadata` attribute
### Document Processing
- Cleaning Methods - Text cleaning and normalization
    - `clean()` - Run all cleaners
    - `clean_extra_whitespace()` - Normalize whitespace
    - `clean_ligatures()` - Convert typographic ligatures
    - `clean_bullets()` - Remove bullet characters
    - `clean_non_ascii_chars()` - Remove non-ASCII characters
    - `group_broken_paragraphs()` - Join split paragraphs
- Splitting Methods - Document chunking strategies
    - `recursive_character_splitter()` - Smart splitting with overlap
    - `split_on_num_characters()` - Fixed-size splitting
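The difference between the two strategies can be sketched in pure Python. This is an illustrative reimplementation with made-up function names, not the library's Rust code; `recursive_character_splitter()` additionally prefers natural boundaries such as paragraph and sentence breaks before falling back to hard cuts, which the plain sliding window below does not attempt.

```python
def fixed_size_split(text: str, chunk_size: int) -> list[str]:
    """Fixed-size splitting: cut every chunk_size characters, no overlap."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def sliding_window_split(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Overlapping chunks: each chunk repeats the last chunk_overlap
    characters of the previous one, so context survives chunk boundaries."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

print(fixed_size_split("abcdefghij", chunk_size=4))
# → ['abcd', 'efgh', 'ij']
print(sliding_window_split("abcdefghij", chunk_size=4, chunk_overlap=2))
# → ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

Overlap matters for RAG retrieval: a sentence that straddles a chunk boundary still appears whole in at least one chunk.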
### Utilities
- Utility Functions - Batch processing and helpers
    - `clean_and_split_docs()` - Parallel processing for multiple documents
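Conceptually, `clean_and_split_docs()` cleans and splits each document and flattens the resulting chunks into a single list, doing the per-document work in parallel. A minimal pure-Python analogy is sketched below; the helper names are illustrative and not part of the library, which performs this work in native Rust code across all cores.

```python
from concurrent.futures import ThreadPoolExecutor

def clean_text(text: str) -> str:
    # Stand-in for the library's cleaners: collapse runs of whitespace.
    return " ".join(text.split())

def split_text(text: str, chunk_size: int) -> list[str]:
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def clean_and_split_one(text: str, chunk_size: int) -> list[str]:
    return split_text(clean_text(text), chunk_size)

def batch_clean_and_split(texts: list[str], chunk_size: int = 1000) -> list[str]:
    # Fan the per-document work out to a pool, then flatten the results.
    with ThreadPoolExecutor() as pool:
        per_doc = pool.map(clean_and_split_one, texts, [chunk_size] * len(texts))
    return [chunk for chunks in per_doc for chunk in chunks]
```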
### Reference Information
- Types and Constants - Type hints, defaults, and error handling
    - Type signatures
    - Default values and constants
    - Error handling patterns
    - Compatibility notes
## Overview
`rs_document` is a high-performance Python library for document processing, with its core implemented in Rust for speed. The library provides:
- Document representation: Simple, LangChain-compatible document structure
- Text cleaning: Normalize whitespace, remove artifacts, fix ligatures
- Document splitting: Split large documents into chunks for RAG applications
- Parallel processing: Process thousands of documents efficiently
### Basic Usage
```python
from rs_document import Document, clean_and_split_docs

# Create a document
doc = Document(
    page_content="Your text content here",
    metadata={"source": "example.txt", "page": "1"},
)

# Clean and split
doc.clean()
chunks = doc.recursive_character_splitter(chunk_size=1000)

# Or process multiple documents in parallel
documents = [doc1, doc2, doc3]
all_chunks = clean_and_split_docs(documents, chunk_size=1000)
```
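To make the cleaner names concrete, here is a rough pure-Python sketch of the kind of transformation each cleaner performs. The real implementations live in Rust and may handle more characters and edge cases than shown; treat this purely as an illustration.

```python
import re

# Common typographic ligatures and their ASCII expansions (illustrative subset).
LIGATURES = {"ﬁ": "fi", "ﬂ": "fl", "ﬀ": "ff", "ﬃ": "ffi", "ﬄ": "ffl"}

def clean_ligatures(text: str) -> str:
    """Replace each typographic ligature with its plain-letter form."""
    for ligature, ascii_form in LIGATURES.items():
        text = text.replace(ligature, ascii_form)
    return text

def clean_extra_whitespace(text: str) -> str:
    """Collapse runs of spaces and tabs into a single space."""
    return re.sub(r"[ \t]+", " ", text).strip()

def clean_bullets(text: str) -> str:
    """Strip a leading bullet character and surrounding whitespace."""
    return re.sub(r"^\s*[•‣◦·*-]\s*", "", text)

print(clean_ligatures("eﬃcient ﬂow"))      # → "efficient flow"
print(clean_extra_whitespace("a   b\t c"))  # → "a b c"
print(clean_bullets("• item"))              # → "item"
```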
## Performance
- Fast: 20-25x faster than pure Python implementations
- Parallel: Automatically uses all CPU cores for batch processing
- Scalable: Process ~23,000 documents per second on typical hardware
## Requirements
- Python 3.10 or higher
- Pre-built wheels available for most platforms (Linux, macOS, Windows)