Vajra BM25
Vajra (Sanskrit: वज्र, "thunderbolt") is a high-performance BM25 search engine built with Category Theory abstractions.
- Fast: 1.3-1.6x faster than BM25S, with sub-4ms latency at 1M documents
- Flexible Formats: Index JSONL and PDF documents out of the box
- Clean API: Simple Python API with categorical abstractions for extensibility
- Interactive CLI: Rich command-line interface for exploring search
Quick Example
from vajra_bm25 import DocumentCorpus, VajraSearchOptimized
# Load documents (JSONL, PDF, or directory)
corpus = DocumentCorpus.load("./my_documents/")
# Build search index
engine = VajraSearchOptimized(corpus)
# Search
results = engine.search("machine learning algorithms", top_k=10)
for r in results:
    print(f"{r.rank}. {r.document.title} (score: {r.score:.3f})")
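As a rough sketch of preparing input for the loader, the snippet below writes a small JSONL file (one JSON object per line) using only the standard library. The field names `id` and `content` are assumptions (only `title` appears in the example above), and the file name `docs.jsonl` and the helper functions are hypothetical; check the format documentation for the schema Vajra actually expects.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical records; "id" and "content" are assumed field names.
records = [
    {"id": "1", "title": "Intro to ML", "content": "Machine learning algorithms learn from data."},
    {"id": "2", "title": "Search basics", "content": "BM25 ranks documents by term relevance."},
]


def write_jsonl(path, rows):
    """Write one JSON object per line (the JSONL convention)."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Write to a temporary directory so the sketch leaves no files behind in the project.
corpus_path = Path(tempfile.mkdtemp()) / "docs.jsonl"
write_jsonl(corpus_path, records)
```

Reading the file back with `read_jsonl(corpus_path)` is a quick way to confirm that every line parses before handing the path to the loader.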
Installation
Install the base package, or add optional features as needed:
pip install vajra-bm25 # Basic (zero dependencies)
pip install "vajra-bm25[optimized]" # NumPy/SciPy optimizations
pip install "vajra-bm25[pdf]" # PDF support
pip install "vajra-bm25[cli]" # Interactive CLI
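To see whether the packages behind the optimized extra are importable in the current environment, a small standard-library check like the one below can help. It inspects only NumPy and SciPy, the two libraries named above; the helper `available` is hypothetical and not part of Vajra.

```python
from importlib.util import find_spec


def available(*modules):
    """Map each top-level module name to whether it can be imported here."""
    return {name: find_spec(name) is not None for name in modules}


# NumPy and SciPy are the optimizations named in the install options above.
status = available("numpy", "scipy")
for name, ok in status.items():
    print(f"{name}: {'installed' if ok else 'missing'}")
```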
Performance
Benchmarked on Wikipedia (1M documents, 500 queries):
| Engine | Build Time | Latency | QPS |
|---|---|---|---|
| Vajra | 17.0 min | 3.40ms | 294 |
| BM25S | 11.3 min | 5.44ms | 184 |
See Benchmarks for detailed results.
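The table's columns relate to each other by simple arithmetic. The sketch below assumes QPS is measured sequentially, so it equals 1000 divided by the latency in milliseconds; the source does not state this, but the published numbers are consistent with it.

```python
# Figures copied from the table above.
vajra_ms, bm25s_ms = 3.40, 5.44
vajra_build_min, bm25s_build_min = 17.0, 11.3


def qps(latency_ms):
    """Queries per second for sequential execution (assumed: 1000 ms / latency)."""
    return 1000.0 / latency_ms


query_speedup = bm25s_ms / vajra_ms              # how much faster Vajra answers a query
build_slowdown = vajra_build_min / bm25s_build_min  # how much longer Vajra takes to index

print(round(qps(vajra_ms)), round(qps(bm25s_ms)))  # 294 184
print(round(query_speedup, 2))                      # 1.6
print(round(build_slowdown, 2))                     # 1.5
```

The 1.6x query speedup matches the upper end of the headline range, and the same table shows the trade-off: Vajra's index takes about 1.5x longer to build.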
Why Category Theory?
Vajra uses categorical abstractions to organize code:
- Morphisms: Composable scoring functions of type (Query, Doc) → Score
- Coalgebras: Search as state unfolding, State → List[Result]
- Functors: Container transformations for multi-result semantics
These abstractions don't make Vajra fast (NumPy and sparse matrices do), but they provide a clean, extensible structure. Learn more in Category Theory.
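The three ideas listed above can be illustrated with a toy sketch in plain Python. None of this is Vajra's actual code or API: `term_overlap`, `then`, `State`, `step`, `unfold`, `search`, and `fmap` are hypothetical names, and the scorer is a simple term-overlap count rather than BM25.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Query = str
Doc = str
Score = float
Result = Tuple[Score, Doc]
Scorer = Callable[[Query, Doc], Score]  # morphism: (Query, Doc) -> Score


def term_overlap(query: Query, doc: Doc) -> Score:
    """Toy scorer: number of document tokens that appear in the query."""
    q_terms = set(query.lower().split())
    return float(sum(1 for t in doc.lower().split() if t in q_terms))


def then(scorer: Scorer, post: Callable[[Score], Score]) -> Scorer:
    """Compose a scorer with a Score -> Score map, yielding a new scorer."""
    return lambda q, d: post(scorer(q, d))


@dataclass(frozen=True)
class State:
    remaining: Tuple[Result, ...]  # candidates, best first
    budget: int                    # how many results may still be emitted


def step(state: State) -> Optional[Tuple[Result, State]]:
    """Coalgebra step: emit one result and the next state, or stop with None."""
    if state.budget == 0 or not state.remaining:
        return None
    head, *tail = state.remaining
    return head, State(tuple(tail), state.budget - 1)


def unfold(state: State) -> List[Result]:
    """Run the step function until it stops: State -> List[Result]."""
    results: List[Result] = []
    nxt = step(state)
    while nxt is not None:
        result, state = nxt
        results.append(result)
        nxt = step(state)
    return results


def search(query: Query, docs: List[Doc], scorer: Scorer, top_k: int) -> List[Result]:
    """Score every document, rank best first, then unfold the top_k results."""
    ranked = sorted(((scorer(query, d), d) for d in docs), key=lambda p: p[0], reverse=True)
    return unfold(State(tuple(ranked), top_k))


def fmap(f, results):
    """Functor action on the list container: apply f to each result."""
    return [f(r) for r in results]


docs = ["machine learning algorithms", "cooking with cast iron", "learning to cook"]
doubled = then(term_overlap, lambda s: s * 2.0)  # composed scorer
hits = search("machine learning", docs, doubled, top_k=2)
texts = fmap(lambda r: r[1], hits)
```

The point of the structure is that each piece can be swapped independently: a different scorer, a different stopping rule in `step`, or a different transformation passed to `fmap`, without touching the others.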
License
MIT License - see LICENSE for details.