Vajra BM25

Vajra (Sanskrit: वज्र, "thunderbolt") is a high-performance BM25 search engine built with Category Theory abstractions.

  • Fast: 1.3-1.6x faster than BM25S at query time, with sub-4ms latency at 1M documents
  • Flexible Formats: Index JSONL and PDF documents out of the box
  • Clean API: Simple Python API with categorical abstractions for extensibility
  • Interactive CLI: Rich command-line interface for exploring search

Quick Example

from vajra_bm25 import DocumentCorpus, VajraSearchOptimized

# Load documents (JSONL, PDF, or directory)
corpus = DocumentCorpus.load("./my_documents/")

# Build search index
engine = VajraSearchOptimized(corpus)

# Search
results = engine.search("machine learning algorithms", top_k=10)

for r in results:
    print(f"{r.rank}. {r.document.title} (score: {r.score:.3f})")

Installation

pip install "vajra-bm25[all]"    # quotes keep shells such as zsh from expanding the brackets

Or install specific features:

pip install vajra-bm25              # Basic (zero dependencies)
pip install "vajra-bm25[optimized]" # NumPy/SciPy optimizations
pip install "vajra-bm25[pdf]"       # PDF support
pip install "vajra-bm25[cli]"       # Interactive CLI
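
After installing the optimized extra, one way to confirm that NumPy and SciPy are actually importable is a generic standard-library check like the one below. This is a minimal sketch and not part of Vajra's API; the helper name `optional_deps_available` is hypothetical.

```python
from importlib.util import find_spec


def optional_deps_available(modules=("numpy", "scipy")):
    """Return a dict mapping each module name to True if it can be imported.

    Hypothetical helper for illustration; it only asks Python's import
    system whether each module can be found and does not import it.
    """
    return {name: find_spec(name) is not None for name in modules}


# Report which of the optional dependencies are present in this environment.
print(optional_deps_available())
```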

Performance

Benchmarked on Wikipedia (1M documents, 500 queries):

Engine   Build Time   Latency   QPS
Vajra    17.0 min     3.40 ms   294
BM25S    11.3 min     5.44 ms   184

See Benchmarks for detailed results.
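
The figures in the table above can be cross-checked with a little arithmetic. The sketch below assumes that the QPS column is the reciprocal of per-query latency (that is, queries run one at a time); this is an observation from the numbers, not something the benchmark description states. It also shows the trade-off the table implies: lower query latency for Vajra, at the cost of a longer index build.

```python
# Figures copied from the benchmark table above.
latency_ms = {"Vajra": 3.40, "BM25S": 5.44}
build_min = {"Vajra": 17.0, "BM25S": 11.3}


def qps_from_latency(ms: float) -> float:
    """Queries per second, assuming one query is processed at a time."""
    return 1000.0 / ms


# Query-time speedup of Vajra over BM25S (ratio of latencies).
speedup = latency_ms["BM25S"] / latency_ms["Vajra"]

# How much longer Vajra's index build takes compared with BM25S.
build_ratio = build_min["Vajra"] / build_min["BM25S"]

for engine, ms in latency_ms.items():
    print(f"{engine}: {qps_from_latency(ms):.0f} QPS")
print(f"Query speedup: {speedup:.2f}x, build time ratio: {build_ratio:.2f}x")
```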

Why Category Theory?

Vajra uses categorical abstractions to organize code:

  • Morphisms: Composable scoring functions (Query, Doc) → Score
  • Coalgebras: Search as state unfolding State → List[Result]
  • Functors: Container transformations for multi-result semantics

These abstractions don't make Vajra fast (NumPy and sparse matrices do), but they provide a clean, extensible structure. Learn more in Category Theory.
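
To make the three ideas above concrete, here is a minimal, self-contained sketch in plain Python. It is not Vajra's implementation or API: every name (`term_overlap`, `scaled`, `combined`, `State`, `step`, `unfold`, `search`) is hypothetical, and a simple term-overlap count stands in for BM25 scoring. The coalgebra is written as a step function from a state to an optional (result, next state) pair; unfolding it repeatedly gives the State → List[Result] shape described above.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical types for illustration only.
Query = str
Doc = str
Result = Tuple[float, Doc]
Scorer = Callable[[Query, Doc], float]  # morphism: (Query, Doc) -> Score


def term_overlap(query: Query, doc: Doc) -> float:
    """Count how many query terms occur in the document (stand-in for BM25)."""
    doc_terms = set(doc.lower().split())
    return float(sum(1 for t in query.lower().split() if t in doc_terms))


def scaled(scorer: Scorer, weight: float) -> Scorer:
    """Compose a scorer with multiplication by a constant weight."""
    return lambda q, d: weight * scorer(q, d)


def combined(*scorers: Scorer) -> Scorer:
    """Combine several scorers by summing their scores."""
    return lambda q, d: sum(s(q, d) for s in scorers)


@dataclass(frozen=True)
class State:
    """Search state: candidates not yet emitted, best score first."""
    remaining: Tuple[Result, ...]


def step(state: State) -> Optional[Tuple[Result, State]]:
    """Coalgebra step: emit the next result and the successor state, or stop."""
    if not state.remaining:
        return None
    head, *tail = state.remaining
    return head, State(tuple(tail))


def unfold(state: State, limit: int) -> List[Result]:
    """Unfold the state into at most `limit` results."""
    results: List[Result] = []
    while len(results) < limit:
        nxt = step(state)
        if nxt is None:
            break
        result, state = nxt
        results.append(result)
    return results


def search(scorer: Scorer, query: Query, docs: Sequence[Doc], top_k: int) -> List[Result]:
    """Score every document, then unfold the sorted candidates."""
    scored = sorted(((scorer(query, d), d) for d in docs),
                    key=lambda pair: pair[0], reverse=True)
    return unfold(State(tuple(scored)), top_k)


docs = [
    "machine learning basics",
    "cooking pasta",
    "learning algorithms for machine translation",
]
scorer = combined(term_overlap, scaled(term_overlap, 0.5))
results = search(scorer, "machine learning algorithms", docs, top_k=2)

# Functor-style step: map a function over the result container.
titles = [doc for _, doc in results]
print(titles)
```

Because scorers share one signature, new ranking signals can be added by composition (`scaled`, `combined`) without touching the unfolding logic, which is the kind of extensibility the abstractions are meant to provide.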

License

MIT License - see LICENSE for details.