Category Theory in Vajra¶
Vajra uses Category Theory as a principled foundation for its search engine design. This guide explains the key concepts and how they map to search operations.
Why Category Theory?¶
Category theory provides:
- Compositionality: Build complex operations from simple pieces
- Correctness: Mathematical guarantees about behavior
- Optimization: Algebraic laws enable automatic transformations
- Abstraction: Separate "what" from "how"
Core Concepts¶
Objects and Morphisms¶
In category theory, we work with objects (things) and morphisms (transformations between things).
In Vajra:
- Documents are objects
- BM25 scoring is a morphism:
(Query, Document) → Score - Search is a morphism:
Query → List[SearchResult]
Composition¶
Morphisms compose: if f: A → B and g: B → C, then g ∘ f: A → C.
# Text processing pipeline
preprocess = tokenize >> remove_stopwords >> stem
query_tokens = preprocess(query)
This enables:
- Reusable components
- Clear data flow
- Easy testing
The Search Category¶
Vajra defines a category for search:
Objects:
- Query (search query string)
- TokenizedQuery (list of tokens)
- Document (searchable document)
- Score (relevance score in ℝ)
- SearchResult (document + score + rank)
Morphisms:
- tokenize: Query → TokenizedQuery
- score: (TokenizedQuery, Document) → Score
- rank: List[Score] → List[SearchResult]
- search: Query → List[SearchResult] (composite)
Identity and Composition Laws¶
Every category must satisfy:
- Identity:
id ∘ f = f = f ∘ id - Associativity:
(h ∘ g) ∘ f = h ∘ (g ∘ f)
These laws ensure that pipelines behave predictably.
Functors: Structure Preservation¶
A functor maps between categories while preserving structure.
List Functor¶
The List functor lifts single-element operations to work on lists:
# Single document scoring
score_doc: Document → Score
# Lifted to lists (via functor)
score_docs: List[Document] → List[Score]
This is why we can score multiple documents efficiently:
SearchResult Functor¶
Transforms results while preserving ranking structure:
# Add snippets to results without changing ranks
results_with_snippets = result_functor.fmap(add_snippet, results)
Coalgebras: Search as Unfolding¶
A coalgebra models processes that unfold step by step.
Search Unfolding¶
Search is naturally coalgebraic:
State₀ (query)
↓ step
State₁ (tokens + candidates)
↓ step
State₂ (scored documents)
↓ step
State₃ (ranked results) ← terminal
class SearchCoalgebra:
def step(self, state):
if is_terminal(state):
return state.results
new_state = process(state)
return step(new_state)
MaxScore Algorithm¶
The MaxScore early termination algorithm is coalgebraic:
- Start with all documents as candidates
- Score documents by term
- Check upper bound vs current threshold
- Terminate early if no candidates can beat threshold
def maxscore_step(state):
if state.upper_bound < state.min_threshold:
return state.results, TERMINAL # Early exit!
# Continue scoring...
This coalgebraic structure enables principled early termination.
Comonads: Caching with Context¶
A comonad provides context-aware computation.
LRU Cache as Comonad¶
Query caching follows comonadic structure:
class Cache:
def extract(self, key):
"""Get value from cache context"""
return self.cache.get(key)
def extend(self, f, key):
"""Compute in caching context"""
cached = self.extract(key)
if cached:
return cached
result = f(key)
self.store(key, result)
return result
The comonadic laws ensure:
- Consistent extraction: Same key → same value
- Compositional caching: Nested caches work correctly
Monoid Homomorphisms¶
BM25 scoring is a monoid homomorphism.
Monoids¶
A monoid is a set with:
- Binary operation ⊕
- Identity element ε
- Associativity: (a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)
BM25 Scoring Monoid¶
BM25 preserves this structure:
Practical Implications¶
This algebraic property enables:
-
Compositional Scoring
-
Upper Bound Computation
-
Parallel Scoring
The Big Picture¶
┌─────────────────────────────────────────────────────────────┐
│ Search Category │
├─────────────────────────────────────────────────────────────┤
│ │
│ Query ──tokenize──▶ Tokens ──score──▶ Scores ──rank──▶ Results │
│ │
│ Functors: Lift operations to work on lists │
│ Coalgebra: Unfold search step by step, enable early exit │
│ Comonad: Cache with context awareness │
│ Monoid: Enable compositional scoring and bounds │
│ │
└─────────────────────────────────────────────────────────────┘
Performance Impact¶
These abstractions aren't just theoretical—they enable real optimizations:
| Abstraction | Optimization | Impact |
|---|---|---|
| Morphism composition | Fused pipelines | Reduced allocations |
| Functor | Vectorized scoring | 10-50x speedup |
| Coalgebra | MaxScore early exit | 20-40% faster queries |
| Comonad | LRU caching | Near-zero repeated query time |
| Monoid homomorphism | Upper bounds | Early termination |
Further Reading¶
- API Reference: Categorical - Implementation details
- Performance Tips - Practical optimizations
- Benchmarks - Measured impact
External Resources¶
- Category Theory for Programmers by Bartosz Milewski
- Categories for the Working Mathematician by Saunders Mac Lane
- Coalgebraic semantics for streaming/incremental computation