Your First Document¶
Learn how to create and work with Document objects in rs_document.
Creating a Document¶
A Document has two parts:
- page_content: The text content
- metadata: String key-value pairs for tracking information
Let's create one:
from rs_document import Document
doc = Document(
page_content="This is my first document with some text content.",
metadata={"source": "tutorial.txt", "page": "1"}
)
print(doc)
Output:
Document(page_content="This is my first document...", metadata={"source": "tutorial.txt", "page": "1"})
Understanding Document Components¶
Page Content¶
The text content of your document:
doc = Document(
page_content="Hello, world!",
metadata={}
)
# Access the content
print(doc.page_content) # "Hello, world!"
# Modify it
doc.page_content = "Goodbye, world!"
print(doc.page_content) # "Goodbye, world!"
Metadata¶
Metadata stores information about the document:
doc = Document(
page_content="Document text",
metadata={
"source": "article.txt",
"author": "Jane Doe",
"date": "2024-01-01",
"category": "tutorial"
}
)
# Access metadata
print(doc.metadata["source"]) # "article.txt"
# Add more metadata
doc.metadata["page"] = "5"
# View all metadata
print(doc.metadata)
Important: Metadata values must be strings. Convert other types:
# Wrong - will cause errors
metadata = {"page": 5, "score": 0.95}
# Correct - convert to strings
metadata = {"page": "5", "score": "0.95"}
doc = Document(page_content="text", metadata=metadata)
Creating Documents from Files¶
Load content from a text file:
from rs_document import Document
# Read file
with open("document.txt", "r", encoding="utf-8") as f:
content = f.read()
# Create document
doc = Document(
page_content=content,
metadata={
"source": "document.txt",
"path": "/path/to/document.txt"
}
)
Working with Multiple Documents¶
Create a list of documents:
from rs_document import Document
documents = []
for i in range(5):
doc = Document(
page_content=f"Content of document {i}",
metadata={"doc_id": str(i)}
)
documents.append(doc)
print(f"Created {len(documents)} documents")
Viewing Documents¶
Documents have a helpful string representation:
doc = Document(
page_content="Short content",
metadata={"id": "123"}
)
# Print the document
print(doc)
# Convert to string
doc_string = str(doc)
Common Patterns¶
Document from User Input¶
user_text = input("Enter your text: ")
doc = Document(
page_content=user_text,
metadata={"source": "user_input"}
)
Document with Timestamps¶
from datetime import datetime
doc = Document(
page_content="Current content",
metadata={
"created_at": datetime.now().isoformat(),
"source": "app"
}
)
Empty Document¶
# Empty document (useful as placeholder)
doc = Document(page_content="", metadata={})
# Check if empty
if not doc.page_content:
print("Document is empty")
Next Steps¶
Now that you know how to create documents, let's learn how to clean text to remove artifacts and normalize formatting!