Text Cleaning¶
Text cleaning is the critical first step in document processing for RAG applications. Understanding why cleaning matters and what each operation does helps you use rs_document effectively.
Why Clean Documents?¶
Documents from various sources contain artifacts that don't carry semantic meaning but hurt embedding quality and retrieval accuracy.
The Problem: Noise in Embeddings¶
Embedding models encode text into vectors representing semantic meaning. Artifacts create noise:
# Original text from a PDF: contains the "ﬁ" ligature (U+FB01)
text1 = "The ﬁrst step is to ﬁnd the optimal solution."
# After cleaning: ligatures expanded to standard letters
text2 = "The first step is to find the optimal solution."
# model and similarity() are placeholders for your embedding model and metric
# Without cleaning
embedding1 = model.embed(text1)  # Encodes the "ﬁ" ligature as an unusual token
# With cleaning
embedding2 = model.embed(text2)  # Encodes standard "fi"
# Query: "first step"
query_embedding = model.embed("first step")
# Similarity (higher is better), illustrative values
similarity(query_embedding, embedding1)  # ~0.85, lower due to ligature
similarity(query_embedding, embedding2)  # ~0.95, higher, clean match
The cleaner text produces embeddings that better match queries.
The Problem: Token Inefficiency¶
Artifacts waste token budget:
# With extra whitespace and a bullet character
text = "●  Key point:   This is important.   "
# ≈ 12 tokens
# After cleaning
text = "Key point: This is important."
# ≈ 6 tokens
Cleaning reduces token count by 50%, allowing more content in each chunk.
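To measure the effect on your own data, count tokens before and after cleaning. A minimal sketch using the tiktoken library (not part of rs_document; any tokenizer works):
import tiktoken
from rs_document import Document
enc = tiktoken.get_encoding("cl100k_base")
doc = Document("●  Key point:   This is important.   ", {})
before = len(enc.encode(doc.page_content))
doc.clean()
after = len(enc.encode(doc.page_content))
print(f"Tokens before: {before}, after: {after}")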
Sources of Artifacts¶
Different document sources produce different artifacts:
PDF Extraction¶
PDFs are rendered for display, not text extraction. Common artifacts:
Ligatures: Typographic ligatures become special characters
- fi → ﬁ (single character, U+FB01)
- fl → ﬂ
- ae → æ
- oe → œ
Bullets: Rendered symbols become unicode characters
- List bullets: ●, ■, •, ◆
- Checkboxes: ☐, ☑, ☒
Extra Whitespace: Table formatting creates multiple spaces
Broken Paragraphs: Multi-column layouts split paragraphs
OCR Output¶
OCR (Optical Character Recognition) produces recognition errors:
Non-ASCII Artifacts: Characters misrecognized as symbols
- l (lowercase L) → | (pipe)
- 0 (zero) → O (letter O)
- Accent marks: e → é, n → ñ
Whitespace Issues: Inconsistent spacing from layout detection
Broken Words: Characters split incorrectly
Web Scraping¶
HTML conversion creates formatting artifacts:
HTML Entities: Special characters encoded
- &nbsp; → extra spaces
- &mdash; → —
- &quot; → "
List Markers: HTML lists become text bullets
Extra Whitespace: CSS spacing becomes text spaces
Available Cleaners¶
rs_document provides five cleaners, each targeting specific artifacts:
clean_ligatures()¶
What it does: Converts typographic ligatures to component letters
Transformations:
"fi" → "fi"
"fl" → "fl"
"ff" → "ff"
"ffi" → "ffi"
"ffl" → "ffl"
"æ" → "ae"
"œ" → "oe"
"Æ" → "AE"
"Œ" → "OE"
Why it matters: Ligatures are common in professionally typeset PDFs (books, papers, reports). Without cleaning:
# Query won't match because of ligature
query = "first"
text = "The first step" # Contains U+FB01 (fi ligature)
# String match: False
# Similarity: Low
After cleaning:
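query = "first"
text = "The first step"  # Ligature expanded to standard "fi"
# String match: True
# Similarity: High (illustrative)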
Use case: Essential for any PDF content, especially academic papers, books, and professional reports.
clean_bullets()¶
What it does: Removes bullet point characters while preserving list structure
Removes: bullet and checkbox characters such as ●, •, ■, ◆, ☐, ☑, ☒
Example:
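A plausible before/after (exact spacing handling may differ slightly):
# Before
"● First point\n● Second point\n● Third point"
# After (bullets removed, line breaks preserved)
"First point\nSecond point\nThird point"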
Why it matters: Bullet characters don't carry semantic meaning and waste tokens. The list structure (line breaks) is preserved, which is what matters for understanding.
Use case: Documents with lists (presentations, reports, documentation).
clean_extra_whitespace()¶
What it does: Normalizes all whitespace
Operations:
- Replace multiple spaces with a single space: "a  b" → "a b"
- Remove leading whitespace: " text" → "text"
- Remove trailing whitespace: "text " → "text"
- Normalize line endings: "\r\n" → "\n"
Example:
# Before
" This has extra spaces. \n And leading spaces. "
# After
"This has extra spaces.\nAnd leading spaces."
Why it matters:
- Reduces token count (multiple spaces are multiple tokens)
- Improves embedding consistency
- Removes visual formatting that doesn't carry meaning
Use case: All documents, but especially PDFs with table formatting and OCR output.
clean_non_ascii_chars()¶
What it does: Removes all characters outside the ASCII range (0-127)
Removes:
- Accented characters: é, ñ, ü
- Symbols: ™, ©, °
- Emoji: 😀, 👍, ❤️
- Special punctuation: —, …, '
Example:
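A plausible before/after (the cleaner simply drops offending characters, which is why it is aggressive):
# Before
"Café™ costs 5€ — très bien 👍"
# After (note the dropped letters and the doubled space)
"Caf costs 5  trs bien "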
⚠️ Warning: This cleaner is aggressive and removes useful information in many cases.
Use cases:
- Legacy systems requiring ASCII-only text
- Specific embedding models trained only on ASCII
- English-only content where non-ASCII is truly noise
Avoid for:
- Multilingual content (removes non-English characters)
- Modern systems (most support Unicode)
- Content with intentional symbols or emoji
Alternative: Only use if you have a specific requirement for ASCII-only text. Otherwise, skip this cleaner.
group_broken_paragraphs()¶
What it does: Rejoins paragraphs incorrectly split by PDF extraction
Algorithm:
- Find lines that end without punctuation
- Check whether the next line starts with a capital letter
- If it doesn't, merge the lines into a single paragraph
Example:
# Before (broken paragraph from 2-column PDF)
"This is a sentence that was\nsplit across lines because\nof the column layout."
# After
"This is a sentence that was split across lines because of the column layout."
Why it matters: Broken paragraphs create artificial semantic boundaries. Retrieval systems might treat them as separate thoughts.
Use case: PDFs with multi-column layouts, scanned documents with OCR.
Note: This cleaner is conservative—it only merges when confident the split was erroneous. It won't merge legitimate line breaks.
The .clean() Method¶
The .clean() method runs all cleaners in a specific order:
doc = Document(text, metadata)
doc.clean()
# Equivalent to:
# doc.clean_extra_whitespace()
# doc.clean_ligatures()
# doc.clean_bullets()
# doc.clean_non_ascii_chars()
# doc.group_broken_paragraphs()
Why This Order?¶
The sequence matters for best results:
1. clean_extra_whitespace() first
- Normalizes input for other cleaners
- Ensures consistent spacing for pattern matching
- Reduces noise before other operations
2. clean_ligatures() second
- Converts to standard ASCII letters
- Ensures subsequent cleaners work with normalized text
3. clean_bullets() third
- Removes symbols after ligatures are normalized
- Operates on clean whitespace
4. clean_non_ascii_chars() fourth
- Removes remaining non-ASCII after ligatures converted
- Operates on text with normalized spacing and bullets removed
5. group_broken_paragraphs() last
- Works with fully cleaned text
- Merges paragraphs after all character-level cleaning
- Benefits from normalized whitespace
Running the cleaners in a different order could miss artifacts or lead to incorrect merging decisions.
Selective Cleaning¶
You can run cleaners individually if you don't need all of them:
from rs_document import Document
doc = Document(text, metadata)
# Only clean whitespace and ligatures
doc.clean_extra_whitespace()
doc.clean_ligatures()
# Skip bullets, non-ASCII, and paragraph grouping
Common combinations:
Minimal cleaning (fastest, least aggressive):
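# A plausible minimal pass (assumption: whitespace normalization only)
doc.clean_extra_whitespace()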
Standard cleaning (good for most PDFs):
doc.clean_extra_whitespace()
doc.clean_ligatures()
doc.clean_bullets()
doc.group_broken_paragraphs()
# Skip clean_non_ascii_chars()
Aggressive cleaning (ASCII-only systems):
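# All cleaners, including non-ASCII removal
doc.clean()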
Performance Characteristics¶
Cleaning operations are fast in rs_document:
| Operation | Time (single doc) | Speedup vs Python |
|---|---|---|
| clean_extra_whitespace() | < 0.1ms | 15-20x |
| clean_ligatures() | < 0.1ms | 25-30x |
| clean_bullets() | < 0.1ms | 20-25x |
| clean_non_ascii_chars() | < 0.1ms | 30-40x |
| group_broken_paragraphs() | < 0.5ms | 50-75x |
| clean() (all) | < 1ms | 20-25x |
For batch processing:
# 10,000 documents
docs = [Document(text, {}) for text in texts]
# Clean all
for doc in docs:
    doc.clean()
# Total time: ~10 seconds
# Python equivalent: ~4 minutes
Best Practices¶
Always Clean Before Splitting¶
Clean first, then split:
# Correct
doc.clean()
chunks = doc.recursive_character_splitter(1000)
# Wrong (splits uncleaned text)
chunks = doc.recursive_character_splitter(1000)
for chunk in chunks:
    chunk.clean()  # Cleaning each chunk separately
Cleaning before splitting:
- More efficient (clean once vs clean N chunks)
- Better chunk boundaries (splitting operates on clean text)
- More consistent results
Use clean_and_split_docs() for Batches¶
For multiple documents, use the batch function:
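A sketch of how a call might look; the parameter list here is an assumption, not the documented signature:
from rs_document import Document, clean_and_split_docs
docs = [Document(text, {}) for text in texts]
# Assumed: takes the documents and a chunk size, returns chunks from all docs
chunks = clean_and_split_docs(docs, 1000)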
This:
- Cleans and splits in one pass
- Processes documents in parallel
- Returns all chunks from all documents
Consider Your Content Type¶
Adjust cleaning based on source:
PDFs from books/papers:
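# Typeset PDFs: ligatures and bullets are common (a reasonable default)
doc.clean_extra_whitespace()
doc.clean_ligatures()
doc.clean_bullets()
doc.group_broken_paragraphs()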
OCR output:
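# OCR output: whitespace issues and broken paragraphs dominate (a reasonable default)
doc.clean_extra_whitespace()
doc.group_broken_paragraphs()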
Web content:
doc.clean_extra_whitespace()
doc.clean_bullets()
# Skip ligatures (rare in HTML) and non-ASCII (preserve content)
Clean text (already processed):
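# Already-clean text: little cleaning needed (assumption; a cheap safety pass)
doc.clean_extra_whitespace()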
Testing Cleaning Impact¶
Evaluate cleaning on sample documents:
from rs_document import Document
# embed_model and similarity() are placeholders for your embedding
# model and similarity metric (e.g., cosine similarity)
# Before cleaning
doc1 = Document(original_text, {})
embedding1 = embed_model.embed(doc1.page_content)
# After cleaning
doc2 = Document(original_text, {})
doc2.clean()
embedding2 = embed_model.embed(doc2.page_content)
# Compare
print(f"Original: {len(doc1.page_content)} chars")
print(f"Cleaned: {len(doc2.page_content)} chars")
print(f"Reduction: {len(doc1.page_content) - len(doc2.page_content)} chars")
# Test retrieval
query = "your test query"
sim1 = similarity(embed_model.embed(query), embedding1)
sim2 = similarity(embed_model.embed(query), embedding2)
print(f"Similarity improvement: {sim2 - sim1:.3f}")
Limitations¶
Understanding what cleaners don't do:
No Spelling Correction: OCR errors like "recieve" → "receive" aren't fixed
No Grammar Fix: Broken sentences aren't reconstructed
No Language Translation: Non-English text isn't translated
No Semantic Cleaning: Meaningless content (lorem ipsum) isn't detected
No HTML Removal: HTML tags aren't removed (use an HTML parser first)
Cleaners focus on formatting artifacts, not content quality.
Comparison with Alternatives¶
vs Unstructured.io¶
Similarities:
- rs_document implements same core cleaners
- Logic matches Unstructured.io's behavior
Differences:
- rs_document: Faster (15-75x per cleaner)
- Unstructured.io: More cleaners available (dashes, ordered bullets, etc.)
- Unstructured.io: More configuration options
When to use each:
- rs_document: Speed matters, core cleaners sufficient
- Unstructured.io: Need specialized cleaners or fine-grained control
vs LangChain¶
LangChain doesn't provide comprehensive text cleaners. Use rs_document for cleaning, LangChain for other pipeline steps.
vs Custom Regex¶
Writing custom regex cleaners:
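For reference, a hand-rolled whitespace cleaner might look like this (a rough sketch; it does not reproduce rs_document's exact behavior):
import re

def clean_extra_whitespace(text: str) -> str:
    # Collapse runs of spaces/tabs and trim each line
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    return "\n".join(lines)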
Advantages of custom:
- Tailored to your specific artifacts
- Full control over behavior
Advantages of rs_document:
- Pre-tested, reliable implementations
- Significantly faster
- Less code to maintain
Use rs_document cleaners as a baseline, add custom cleaning only for domain-specific artifacts.
Summary¶
Text cleaning is essential for high-quality RAG:
- Why: Removes artifacts that hurt embedding quality and waste tokens
- What: Five cleaners target different artifact types
- How: Run .clean() or selective cleaners based on content type
- When: Always clean before splitting for best results
Proper cleaning improves retrieval accuracy and reduces token costs—making it a high-impact, low-effort optimization.
Next: Performance - Understanding what makes rs_document fast