Cleaning Text¶
Learn how to clean messy text from PDFs, OCR, and web scraping.
Why Clean Documents?¶
Documents from various sources contain artifacts that hurt embedding quality:
- PDFs: Ligatures (æ, ﬁ), bullet symbols, extra spaces
- OCR: Non-ASCII artifacts, broken paragraphs
- Web: HTML entities, inconsistent whitespace
Cleaning normalizes text so embeddings focus on semantic content.
Clean Everything at Once¶
The easiest way to clean a document:
from rs_document import Document
doc = Document(
    page_content="● Text with bullets, æ ligatures, and extra  spaces!",
    metadata={"source": "messy.pdf"}
)
# Run all cleaners
doc.clean()
print(doc.page_content)
# Output: "Text with bullets, ae ligatures, and extra spaces!"
The .clean() method runs all cleaners in the right order.
Individual Cleaners¶
You can also use specific cleaners:
Remove Extra Whitespace¶
Normalize spaces and remove trailing whitespace:
doc = Document(
    page_content="ITEM 1.     BUSINESS   ",
    metadata={}
)
doc.clean_extra_whitespace()
print(doc.page_content) # "ITEM 1. BUSINESS"
Convert Ligatures¶
Turn typographic ligatures into regular letters:
doc = Document(
    page_content="The encyclopædia has œnology and ﬁsh sections",
    metadata={}
)
doc.clean_ligatures()
print(doc.page_content)
# "The encyclopaedia has oenology and fish sections"
Common conversions:
- æ → ae
- œ → oe
- ﬁ → fi
- ﬂ → fl
Remove Bullets¶
Clean up bullet symbols from lists:
doc = Document(
    page_content="● First item\n● Second item\n● Third item",
    metadata={}
)
doc.clean_bullets()
print(doc.page_content)
# "First item\nSecond item\nThird item"
Removes: ●, ○, ■, □, •, and other bullet symbols.
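As a quick check, the other symbols listed above are stripped the same way (the output shown assumes each marker is removed as described):
doc = Document(
    page_content="○ Circle\n■ Square\n□ Hollow square\n• Dot",
    metadata={}
)
doc.clean_bullets()
print(doc.page_content)
# "Circle\nSquare\nHollow square\nDot"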
Remove Non-ASCII Characters¶
Strip out non-ASCII characters:
doc = Document(
    page_content="Hello\x88World\x89",
    metadata={}
)
doc.clean_non_ascii_chars()
print(doc.page_content) # "HelloWorld"
Use carefully: This also removes accented characters and emoji!
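For example (a small sketch of the caveat; the exact spacing of the result may differ):
doc = Document(
    page_content="café and naïve 😀",
    metadata={}
)
doc.clean_non_ascii_chars()
print(doc.page_content)
# Accents and emoji are stripped, not transliterated:
# roughly "caf and nave"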
Group Broken Paragraphs¶
Fix paragraphs incorrectly split across lines:
doc = Document(
    page_content="This is a sentence\nthat was split\nacross lines.\n\nNew paragraph here.",
    metadata={}
)
doc.group_broken_paragraphs()
# Joins the broken paragraph while preserving the paragraph break
print(doc.page_content)
# "This is a sentence that was split across lines.\n\nNew paragraph here."
Real-World Example¶
Cleaning a PDF document:
from rs_document import Document
# Load PDF text (using your PDF library)
pdf_text = extract_text_from_pdf("document.pdf")
# Create document
doc = Document(
    page_content=pdf_text,
    metadata={"source": "document.pdf"}
)
# Clean all issues
doc.clean()
# Now the text is ready for embedding
print(f"Cleaned {len(doc.page_content)} characters")
Cleaning Multiple Documents¶
Process a batch of documents:
from rs_document import Document
documents = [
    Document(page_content="messy text 1", metadata={"id": "1"}),
    Document(page_content="messy text 2", metadata={"id": "2"}),
    # ... more documents
]
# Clean each document
for doc in documents:
    doc.clean()
print(f"Cleaned {len(documents)} documents")
Understanding Cleaner Order¶
When you call .clean(), cleaners run in this order:
1. clean_extra_whitespace - Normalize spacing first
2. clean_ligatures - Convert to standard characters
3. clean_bullets - Remove list markers
4. clean_non_ascii_chars - Remove remaining non-ASCII
5. group_broken_paragraphs - Fix structure last
This order ensures each cleaner works with normalized input.
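If you only want a subset of the cleaners, you can call them yourself in the same order. This sketch assumes each method behaves as described above, and skips the non-ASCII step to keep accented characters:
doc = Document(
    page_content="●  First  point\n● The encyclopædia entry",
    metadata={}
)

# Same order .clean() uses, minus the non-ASCII step
doc.clean_extra_whitespace()
doc.clean_ligatures()
doc.clean_bullets()
doc.group_broken_paragraphs()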
Important Notes¶
Cleaning Modifies In-Place¶
Cleaning changes the document directly:
doc = Document(page_content="original text", metadata={})
# This modifies doc
doc.clean()
# Original content is gone
If you need the original, create a copy first:
from rs_document import Document
original = Document(page_content="text", metadata={"id": "1"})
# Make a copy
cleaned = Document(
    page_content=original.page_content,
    metadata=original.metadata.copy()
)
# Clean the copy
cleaned.clean()
# Now you have both versions
Metadata is Preserved¶
Cleaning never touches metadata:
doc = Document(
    page_content="● messy text",
    metadata={"source": "file.txt", "page": "5"}
)
doc.clean()
# Metadata unchanged
assert doc.metadata == {"source": "file.txt", "page": "5"}
Next Steps¶
Now that your text is clean, let's learn how to split documents into chunks for embeddings!