Splitting Methods¶
Methods for splitting documents into smaller chunks. All splitting methods return a list of new Document instances, leaving the original document unchanged.
Overview¶
Splitting methods create new documents from the original:
from rs_document import Document
doc = Document(page_content="Long text...", metadata={"source": "file.txt"})
chunks = doc.recursive_character_splitter(1000)
# Original document is unchanged
print(doc.page_content) # Still "Long text..."
# New documents created
print(len(chunks)) # Number of chunks
print(chunks[0].metadata) # {"source": "file.txt"} - metadata copied
Methods¶
recursive_character_splitter()¶
Split document into chunks using recursive strategy with natural language boundaries.
Signature:
recursive_character_splitter(chunk_size: int) -> list[Document]
Description:
Splits a document into chunks of approximately chunk_size characters, attempting to split on natural language boundaries. Uses a recursive approach, trying multiple separators in order of preference.
Parameters:
chunk_size (int): Target size for each chunk in characters. Chunks will not exceed this size.
Returns:
list[Document] - List of Document instances, each with:
- page_content: A chunk of the original text (≤ chunk_size characters)
- metadata: Copy of the original document's metadata
Splitting Strategy:
The method tries separators in this order of preference:
- Paragraph breaks (\n\n) - Preferred for maintaining semantic coherence
- Line breaks (\n) - If paragraphs are too large
- Word boundaries (spaces) - If lines are too large
- Character boundaries - Last resort if words are too large
Chunk Overlap:
Creates approximately 33% overlap between consecutive chunks. This overlap is hardcoded and ensures context is preserved across chunk boundaries.
Example:
doc = Document(
page_content="A" * 5000,
metadata={"source": "file.txt"}
)
chunks = doc.recursive_character_splitter(1000)
print(len(chunks)) # Number of chunks created (more than 5000/1000 = 5, because of the ~33% overlap)
print(len(chunks[0].page_content)) # ~1000 or less
print(len(chunks[1].page_content)) # ~1000 or less
print(chunks[0].metadata) # {"source": "file.txt"}
print(chunks[1].metadata) # {"source": "file.txt"}
Paragraph Splitting Example:
doc = Document(
page_content="""First paragraph with some content.
Second paragraph with more content.
Third paragraph with even more content.""",
metadata={"doc_id": "123"}
)
chunks = doc.recursive_character_splitter(50)
# Splits on paragraph breaks when possible
for i, chunk in enumerate(chunks):
print(f"Chunk {i}: {chunk.page_content[:30]}...")
print(f"Length: {len(chunk.page_content)}")
Overlap Demonstration:
doc = Document(
page_content="Word " * 1000, # 5000 characters
metadata={}
)
chunks = doc.recursive_character_splitter(100)
# Check overlap between consecutive chunks
chunk1_end = chunks[0].page_content[-20:]
chunk2_start = chunks[1].page_content[:20]
print(f"End of chunk 1: '{chunk1_end}'")
print(f"Start of chunk 2: '{chunk2_start}'")
# Likely to see overlapping content
Edge Cases:
# Empty document
doc = Document(page_content="", metadata={})
chunks = doc.recursive_character_splitter(1000)
print(chunks) # []
# Short document (smaller than chunk_size)
doc = Document(page_content="Short", metadata={})
chunks = doc.recursive_character_splitter(1000)
print(len(chunks)) # 1
print(chunks[0].page_content) # "Short"
# Very long single word (no spaces)
doc = Document(page_content="A" * 5000, metadata={})
chunks = doc.recursive_character_splitter(1000)
# Will split by characters as last resort
Characteristics:
- Respects boundaries: Prefers paragraph, then line, then word boundaries
- Overlap: ~33% overlap between chunks (hardcoded)
- Metadata preservation: All chunks receive copy of original metadata
- Size guarantee: No chunk exceeds chunk_size
- Context preservation: Overlap ensures semantic context across boundaries
Use Cases:
- RAG applications: When context is important for retrieval (see the sketch after this list)
- Semantic search: Maintaining paragraph coherence
- Question answering: Overlapping chunks help answer questions at boundaries
- Document analysis: Preserving document structure
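For example, a typical RAG ingestion step chunks each document and keeps the copied metadata next to the text that will be indexed. The sketch below illustrates that pattern; prepare_for_indexing and embed_texts are hypothetical names, not part of rs_document:
from rs_document import Document

def prepare_for_indexing(docs: list[Document], chunk_size: int = 1000):
    # Collect (text, metadata) records, tagging each chunk with its index
    records = []
    for doc in docs:
        chunks = doc.recursive_character_splitter(chunk_size)
        for i, chunk in enumerate(chunks):
            chunk.metadata["chunk_index"] = str(i)
            records.append((chunk.page_content, chunk.metadata))
    return records

# Hypothetical downstream step, e.g. embedding the chunk texts for a vector store:
# vectors = embed_texts([text for text, _ in records])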
Performance:
import time
from rs_document import Document
doc = Document(page_content="A" * 1_000_000, metadata={})
start = time.time()
chunks = doc.recursive_character_splitter(1000)
elapsed = time.time() - start
print(f"Split into {len(chunks)} chunks in {elapsed:.3f} seconds")
# Fast even for large documents (Rust implementation)
Comparison with split_on_num_characters():
| Feature | recursive_character_splitter() | split_on_num_characters() |
|---|---|---|
| Boundary respect | Yes (paragraph → line → word → char) | No (exact character positions) |
| Overlap | Yes (~33%) | No |
| Chunk size | Target (may be smaller) | Exact (except last chunk) |
| Use case | RAG, semantic applications | Uniform processing |
split_on_num_characters()¶
Split document into chunks of exactly the specified size with no overlap.
Signature:
split_on_num_characters(num_chars: int) -> list[Document]
Description:
Splits a document into fixed-size chunks at exact character boundaries. Does not consider word, line, or paragraph boundaries. Creates no overlap between chunks.
Parameters:
num_chars (int): Number of characters per chunk
Returns:
list[Document] - List of Document instances, each with:
- page_content: Exactly num_chars characters (except possibly the last chunk)
- metadata: Copy of the original document's metadata
Example:
doc = Document(
page_content="ABCDEFGHIJ",
metadata={"id": "123"}
)
chunks = doc.split_on_num_characters(3)
print(len(chunks)) # 4
print([c.page_content for c in chunks]) # ["ABC", "DEF", "GHI", "J"]
print(chunks[0].metadata) # {"id": "123"}
print(chunks[1].metadata) # {"id": "123"}
Longer Example:
doc = Document(
page_content="The quick brown fox jumps over the lazy dog",
metadata={"source": "example"}
)
chunks = doc.split_on_num_characters(10)
for i, chunk in enumerate(chunks):
print(f"Chunk {i}: '{chunk.page_content}'")
# Output:
# Chunk 0: 'The quick '
# Chunk 1: 'brown fox '
# Chunk 2: 'jumps over'
# Chunk 3: ' the lazy '
# Chunk 4: 'dog'
Edge Cases:
# Empty document
doc = Document(page_content="", metadata={})
chunks = doc.split_on_num_characters(10)
print(chunks) # []
# Document smaller than chunk size
doc = Document(page_content="Hello", metadata={})
chunks = doc.split_on_num_characters(100)
print(len(chunks)) # 1
print(chunks[0].page_content) # "Hello"
# Exact multiple
doc = Document(page_content="ABCDEFGHIJKL", metadata={})
chunks = doc.split_on_num_characters(4)
print([c.page_content for c in chunks]) # ["ABCD", "EFGH", "IJKL"]
# Last chunk smaller
doc = Document(page_content="ABCDEFGHIJ", metadata={})
chunks = doc.split_on_num_characters(4)
print([c.page_content for c in chunks]) # ["ABCD", "EFGH", "IJ"]
Word Splitting Demonstration:
# This method WILL split words mid-character
doc = Document(
page_content="Supercalifragilisticexpialidocious",
metadata={}
)
chunks = doc.split_on_num_characters(10)
for chunk in chunks:
    print(chunk.page_content)
# Output:
# Supercalif
# ragilistic
# expialidoc
# ious
# Note: Words are split without regard for boundaries
Characteristics:
- Fixed size: All chunks are exactly num_chars characters (except the last)
- No overlap: Chunks are consecutive with no shared content
- No boundary respect: Splits at exact character positions
- Simple: Predictable, straightforward splitting
- Metadata preservation: All chunks receive copy of original metadata
Use Cases:
- Fixed-size processing: When exact chunk sizes are required
- Token limit compliance: Ensuring chunks fit within strict limits (see the sketch after this list)
- Uniform analysis: When all chunks should have same size
- Simple splitting: When semantic boundaries don't matter
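Note that this method counts characters, not tokens. A common rough heuristic (an assumption, not something rs_document provides) is about 4 characters per English token, which lets you turn a token budget into a character budget:
from rs_document import Document

CHARS_PER_TOKEN = 4  # rough heuristic, assumed here; tune for your tokenizer

def split_for_token_budget(doc: Document, max_tokens: int) -> list[Document]:
    # Fixed-size chunks sized to stay under an approximate token budget
    return doc.split_on_num_characters(max_tokens * CHARS_PER_TOKEN)

doc = Document(page_content="X" * 10_000, metadata={})
chunks = split_for_token_budget(doc, max_tokens=500)  # ~2000 characters per chunk
print(len(chunks))  # 5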
Performance:
import time
from rs_document import Document
doc = Document(page_content="X" * 1_000_000, metadata={})
start = time.time()
chunks = doc.split_on_num_characters(1000)
elapsed = time.time() - start
print(f"Split into {len(chunks)} chunks in {elapsed:.3f} seconds")
# Very fast - simpler algorithm than recursive splitter
When to Use:
Choose split_on_num_characters() when:
- Exact chunk sizes are required
- Semantic boundaries are not important
- You need predictable, uniform chunks
- You're processing text that doesn't have natural structure
Choose recursive_character_splitter() when:
- Semantic coherence matters
- You need context across chunks (overlap)
- Natural language boundaries should be preserved
- You're building RAG or search applications
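This decision can be wrapped in a small helper. The sketch below is a hypothetical convenience wrapper, not part of rs_document:
from rs_document import Document

def split_for(doc: Document, chunk_size: int, *, semantic: bool = True) -> list[Document]:
    # Pick the splitter based on whether semantic coherence matters
    if semantic:
        # RAG / search: boundary-aware chunks with overlap
        return doc.recursive_character_splitter(chunk_size)
    # Strict size limits / uniform processing: exact, non-overlapping chunks
    return doc.split_on_num_characters(chunk_size)

doc = Document(page_content="Some long text. " * 200, metadata={})
rag_chunks = split_for(doc, 1000, semantic=True)
uniform_chunks = split_for(doc, 1000, semantic=False)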
Method Comparison¶
Feature Comparison Table¶
| Feature | recursive_character_splitter() | split_on_num_characters() |
|---|---|---|
| Chunk Size | Target (may be smaller) | Exact (except last chunk) |
| Overlap | Yes (~33% hardcoded) | No |
| Boundary Respect | Yes (paragraph → line → word → char) | No |
| Speed | Fast | Very fast |
| Predictability | Chunk sizes vary | Chunk sizes fixed |
| Context Preservation | Good (overlap) | None |
| Use Case | RAG, semantic search, QA | Token limits, uniform processing |
| Best For | Natural language text | Any text, fixed requirements |
Visual Comparison¶
recursive_character_splitter(chunk_size=20):
Original: "The quick brown fox jumps over the lazy dog"
Chunk 1: "The quick brown" (15 chars)
Chunk 2: "brown fox jumps" (15 chars) - overlap: "brown"
Chunk 3: "jumps over the" (14 chars) - overlap: "jumps"
Chunk 4: "the lazy dog" (12 chars) - overlap: "the"
split_on_num_characters(num_chars=20):
Original: "The quick brown fox jumps over the lazy dog"
Chunk 1: "The quick brown fox " (20 chars)
Chunk 2: "jumps over the lazy " (20 chars)
Chunk 3: "dog" (3 chars)
Common Patterns¶
Basic Splitting¶
from rs_document import Document
# Create document
doc = Document(page_content=long_text, metadata={"source": "doc.txt"})
# Split for RAG
chunks = doc.recursive_character_splitter(1000)
# Or split uniformly
chunks = doc.split_on_num_characters(1000)
Clean Then Split¶
doc = Document(page_content=pdf_text, metadata={"source": "doc.pdf"})
# Clean first
doc.clean()
# Then split
chunks = doc.recursive_character_splitter(1000)
Preserving Original¶
# Original document is not modified by splitting
doc = Document(page_content="Original text", metadata={})
chunks = doc.recursive_character_splitter(100)
print(doc.page_content) # Still "Original text"
print(len(chunks)) # 1
Different Chunk Sizes¶
doc = Document(page_content=long_text, metadata={})
# Try different sizes
small_chunks = doc.recursive_character_splitter(500) # More chunks
medium_chunks = doc.recursive_character_splitter(1000) # Balanced
large_chunks = doc.recursive_character_splitter(2000) # Fewer chunks
print(f"500: {len(small_chunks)} chunks")
print(f"1000: {len(medium_chunks)} chunks")
print(f"2000: {len(large_chunks)} chunks")
Metadata Tracking¶
doc = Document(
    page_content=long_text,
    metadata={"source": "doc.txt", "page": "5", "section": "intro"}
)
chunks = doc.recursive_character_splitter(1000)
# All chunks have same metadata
for i, chunk in enumerate(chunks):
    # Could add chunk index to metadata
    chunk.metadata["chunk_index"] = str(i)
    chunk.metadata["total_chunks"] = str(len(chunks))
    print(chunk.metadata)
# {"source": "doc.txt", "page": "5", "section": "intro",
# "chunk_index": "0", "total_chunks": "8"}
Multiple Document Splitting¶
documents = [
    Document(page_content=text1, metadata={"id": "1"}),
    Document(page_content=text2, metadata={"id": "2"}),
    Document(page_content=text3, metadata={"id": "3"}),
]
all_chunks = []
for doc in documents:
    chunks = doc.recursive_character_splitter(1000)
    all_chunks.extend(chunks)
print(f"Split {len(documents)} documents into {len(all_chunks)} chunks")
Better approach: Use clean_and_split_docs() for parallel processing:
from rs_document import clean_and_split_docs
all_chunks = clean_and_split_docs(documents, chunk_size=1000)
# Faster - processes in parallel
See Also¶
- Document Class - Creating documents
- Cleaning Methods - Clean before splitting
- Utility Functions - Batch splitting with clean_and_split_docs()
- Types and Constants - Splitter defaults and constants