Document Formats¶
Vajra supports multiple document formats for flexible data ingestion.
JSONL Format¶
The primary format for structured documents.
File Structure¶
Each line is a JSON object with required fields:
{"id": "doc1", "title": "Document Title", "content": "Full text content here"}
{"id": "doc2", "title": "Another Doc", "content": "More content", "metadata": {"author": "Jane"}}
Fields¶
| Field | Required | Description |
|---|---|---|
| id | Yes | Unique document identifier |
| title | Yes | Document title |
| content | Yes | Full text content (searchable) |
| metadata | No | Optional dictionary of metadata |
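As a minimal sketch of the field rules in the table, the following standard-library snippet checks JSONL lines before they are loaded. The helper name validate_jsonl_lines is hypothetical and not part of Vajra, and skipping blank lines is an assumption made here for convenience.

```python
import json

REQUIRED_FIELDS = ("id", "title", "content")  # required per the table above


def validate_jsonl_lines(lines):
    """Return (line_number, problem) tuples; an empty list means every line passed.

    Hypothetical helper for illustration only, not part of Vajra.
    """
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # assumption: blank lines are ignored
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "line is not a JSON object"))
            continue
        for field in REQUIRED_FIELDS:
            if field not in record:
                problems.append((number, f"missing required field: {field}"))
        # metadata is optional, but should be a dictionary when present
        if "metadata" in record and not isinstance(record["metadata"], dict):
            problems.append((number, "metadata is not a dictionary"))
    return problems


sample = [
    '{"id": "doc1", "title": "Document Title", "content": "Full text content here"}',
    '{"id": "doc2", "title": "Another Doc"}',
]
print(validate_jsonl_lines(sample))  # [(2, 'missing required field: content')]
```

Running a check like this first makes it easier to locate a malformed line by number than to debug a failed load.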
Loading JSONL¶
from vajra_bm25 import DocumentCorpus
# Load from file
corpus = DocumentCorpus.load_jsonl("documents.jsonl")
# Save corpus to file
corpus.save_jsonl("output.jsonl")
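If the documents start out as Python dictionaries, a JSONL file in the structure described above can be produced with the standard json module. This is a sketch that uses only the standard library; the temporary directory and variable names are illustrative.

```python
import json
import tempfile
from pathlib import Path

records = [
    {"id": "doc1", "title": "Document Title", "content": "Full text content here"},
    {"id": "doc2", "title": "Another Doc", "content": "More content", "metadata": {"author": "Jane"}},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "documents.jsonl"
    # Write one JSON object per line; ensure_ascii=False keeps non-ASCII text readable
    with path.open("w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
    # Read the file back to confirm each line parses to the original record
    with path.open(encoding="utf-8") as handle:
        round_tripped = [json.loads(line) for line in handle]

print(round_tripped == records)  # True
```

A file written this way is the kind of input that would be passed to DocumentCorpus.load_jsonl().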
PDF Format¶
Index and search PDF documents directly.
Installation
PDF support requires: pip install vajra-bm25[pdf]
Single PDF¶
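Load one PDF with DocumentCorpus.load_pdf(), for example DocumentCorpus.load_pdf("research_paper.pdf"). The resulting document will have: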
- id: Filename without extension (e.g., "research_paper")
- title: PDF title metadata or filename
- content: Extracted text from all pages
- metadata: {"source": "path", "format": "pdf", "pages": N, "author": "..."}
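The default id rule described above can be reproduced with pathlib, which helps predict the id a given PDF would receive. This is a sketch of the described behavior, not Vajra's own code, and default_pdf_id is a hypothetical helper name.

```python
from pathlib import Path


def default_pdf_id(pdf_path):
    """Filename without its extension, mirroring the id rule described above."""
    return Path(pdf_path).stem


print(default_pdf_id("papers/research_paper.pdf"))  # research_paper
```

If ids really are derived this way, two files with the same name in different subdirectories would end up with the same id, which is one situation where a custom id is useful.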
Custom Document ID¶
Directory of PDFs¶
# All PDFs in directory
corpus = DocumentCorpus.load_pdf_directory("./papers/")
# Recursive (include subdirectories)
corpus = DocumentCorpus.load_pdf_directory("./papers/", recursive=True)
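To preview which files each mode might pick up, the two calls can be approximated with pathlib. This sketch assumes files are selected by a .pdf extension (matched case-insensitively here), which this page does not state; list_pdfs is a hypothetical helper, not a Vajra function.

```python
import tempfile
from pathlib import Path


def list_pdfs(directory, recursive=False):
    """Return sorted PDF paths in directory; descend into subdirectories if recursive."""
    root = Path(directory)
    candidates = root.rglob("*") if recursive else root.glob("*")
    return sorted(p for p in candidates if p.is_file() and p.suffix.lower() == ".pdf")


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    (root / "a.pdf").touch()
    (root / "notes.txt").touch()
    (root / "sub" / "b.pdf").touch()
    top_level = [p.name for p in list_pdfs(root)]
    everything = [p.name for p in list_pdfs(root, recursive=True)]

print(top_level)   # ['a.pdf']
print(everything)  # ['a.pdf', 'b.pdf']
```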
PDF Metadata¶
Vajra automatically extracts PDF metadata:
corpus = DocumentCorpus.load_pdf("paper.pdf")
doc = corpus.documents[0]
print(doc.metadata)
# {
# "source": "/path/to/paper.pdf",
# "format": "pdf",
# "pages": 12,
# "author": "John Smith"
# }
Auto-Detection¶
The load() method automatically detects the format from the path:
# Auto-detects based on path
corpus = DocumentCorpus.load("data.jsonl") # JSONL file
corpus = DocumentCorpus.load("paper.pdf") # PDF file
corpus = DocumentCorpus.load("./papers/") # Directory of PDFs
Explicit Format¶
Override auto-detection with the format parameter:
# Force JSONL parsing on non-standard extension
corpus = DocumentCorpus.load("data.txt", format="jsonl")
# Force PDF directory mode
corpus = DocumentCorpus.load("./mixed_folder/", format="pdf_dir")
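One plausible way that detection and the explicit override could fit together is sketched below. The rules (an explicit format wins; otherwise a directory means a PDF directory and files are decided by extension) are inferred from the examples above, not taken from Vajra's source. resolve_format is a hypothetical name, and the "pdf" return value is assumed by analogy with "jsonl" and "pdf_dir".

```python
import tempfile
from pathlib import Path


def resolve_format(path, format=None):
    """Pick a format name for path (illustrative rules, not Vajra's actual logic)."""
    if format is not None:
        return format  # an explicit format overrides detection
    p = Path(path)
    if p.is_dir():
        return "pdf_dir"
    suffix = p.suffix.lower()
    if suffix == ".jsonl":
        return "jsonl"
    if suffix == ".pdf":
        return "pdf"  # assumed name, by analogy with the values above
    raise ValueError(f"cannot detect format for {path!r}; pass format explicitly")


with tempfile.TemporaryDirectory() as tmp:
    detected = [
        resolve_format(tmp),
        resolve_format("data.jsonl"),
        resolve_format("paper.pdf"),
        resolve_format("data.txt", format="jsonl"),
    ]

print(detected)  # ['pdf_dir', 'jsonl', 'pdf', 'jsonl']
```

In this sketch an unrecognized extension raises an error instead of guessing, which is why a file such as data.txt needs the explicit format.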
Creating Documents Programmatically¶
from vajra_bm25 import Document, DocumentCorpus
# Create individual documents
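doc1 = Document(
    id="unique_id",
    title="My Document",
    content="This is the searchable content...",
    metadata={"category": "tutorial", "year": 2024},
)
# Further documents, built the same way as doc1
doc2 = Document(id="doc2", title="Second Document", content="More searchable content...", metadata={"category": "tutorial"})
doc3 = Document(id="doc3", title="Third Document", content="Even more searchable content...", metadata={"category": "tutorial"})
# Build corpus from a list of documents
corpus = DocumentCorpus([doc1, doc2, doc3])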
# Or add incrementally
corpus = DocumentCorpus()
corpus.add(doc1)
corpus.add(doc2)
Best Practices¶
- Use unique IDs: Ensure each document has a unique id
- Content quality: Put searchable text in content, not title
- PDF text extraction: Works best with text-based PDFs (not scanned images)
- Large corpora: Use JSONL for faster loading vs many individual PDFs
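The unique-ID practice is easy to check before indexing. The sketch below uses only the standard library to report ids that occur more than once among the lines of a JSONL file; find_duplicate_ids is a hypothetical helper, not a Vajra function.

```python
import json
from collections import Counter


def find_duplicate_ids(lines):
    """Return the ids that appear on more than one line, in sorted order."""
    counts = Counter(json.loads(line)["id"] for line in lines if line.strip())
    return sorted(doc_id for doc_id, n in counts.items() if n > 1)


lines = [
    '{"id": "doc1", "title": "A", "content": "first"}',
    '{"id": "doc2", "title": "B", "content": "second"}',
    '{"id": "doc1", "title": "C", "content": "third"}',
]
print(find_duplicate_ids(lines))  # ['doc1']
```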