# BM25 Parameters
Vajra implements the standard BM25 ranking algorithm with configurable parameters.
## The BM25 Formula
\[
\text{score}(D, Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot (1 - b + b \cdot \frac{|D|}{\text{avgdl}})}
\]
Where:

- \(\text{IDF}(q_i)\) = inverse document frequency of term \(q_i\)
- \(f(q_i, D)\) = frequency of term \(q_i\) in document \(D\)
- \(|D|\) = length of document \(D\) in tokens
- \(\text{avgdl}\) = average document length in the corpus
- \(k_1\) = term frequency saturation parameter
- \(b\) = length normalization parameter
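As a point of reference, here is a minimal, self-contained sketch of the formula above. It assumes the common +0.5-smoothed IDF variant; Vajra's internal scorer may differ in such details.

```python
import math

def bm25_score(query_terms, doc_terms, doc_freqs, n_docs, avgdl, k1=1.5, b=0.75):
    """Score one document against a query with the BM25 formula above.

    doc_freqs maps each term to the number of corpus documents containing it.
    """
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        tf = doc_terms.count(term)
        if tf == 0:
            continue
        df = doc_freqs.get(term, 0)
        # +0.5-smoothed IDF (an assumption; other variants exist)
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
        # k1 controls saturation, b controls length normalization
        denom = tf + k1 * (1 - b + b * doc_len / avgdl)
        score += idf * tf * (k1 + 1) / denom
    return score
```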
## Parameters
### k1 (Term Frequency Saturation)
Controls how quickly term frequency saturates.
- Default: 1.5
- Range: 0.0 - 3.0 typical
- Effect: Higher values give more weight to term frequency
```python
from vajra_bm25 import VajraSearchOptimized

# Lower k1: term frequency saturates quickly
engine = VajraSearchOptimized(corpus, k1=0.5)

# Higher k1: more weight for repeated terms
engine = VajraSearchOptimized(corpus, k1=2.5)
```
| k1 Value | Behavior |
|---|---|
| 0.0 | Binary term presence (ignore frequency) |
| 1.2 | Standard (works well for most corpora) |
| 1.5 | Default (balanced) |
| 2.0+ | Favor documents with many term occurrences |
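To see what saturation means concretely, isolate the term frequency component \(\frac{f \cdot (k_1 + 1)}{f + k_1}\) with length normalization set aside. This is plain arithmetic, independent of any Vajra API:

```python
# Contribution of term frequency alone, length normalization factored out
def tf_component(tf, k1):
    return tf * (k1 + 1) / (tf + k1)

for k1 in (0.5, 1.5, 2.5):
    print(f"k1={k1}:", [round(tf_component(tf, k1), 2) for tf in (1, 2, 5, 10, 50)])
# k1=0.5 flattens out almost immediately after tf=2;
# k1=2.5 keeps rewarding repetitions well past tf=10.
```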
### b (Length Normalization)
Controls document length normalization.
- Default: 0.75
- Range: 0.0 - 1.0
- Effect: Higher values penalize longer documents more
```python
# No length normalization
engine = VajraSearchOptimized(corpus, b=0.0)

# Strong length normalization
engine = VajraSearchOptimized(corpus, b=1.0)
```
| b Value | Behavior |
|---|---|
| 0.0 | No length normalization (favor long docs) |
| 0.5 | Mild normalization |
| 0.75 | Default (balanced) |
| 1.0 | Strong normalization (favor short docs) |
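Isolating the length factor \(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\) shows why: at b = 0 it is constant, while at b = 1 it grows linearly with document length, dividing down the scores of long documents:

```python
# Denominator length factor for documents of varying relative length
def length_factor(doc_len, avgdl, b):
    return 1 - b + b * doc_len / avgdl

avgdl = 100
for b in (0.0, 0.5, 0.75, 1.0):
    print(f"b={b}:", [round(length_factor(n, avgdl, b), 2) for n in (50, 100, 200, 400)])
# b=0.0 -> [1.0, 1.0, 1.0, 1.0]  (length ignored)
# b=1.0 -> [0.5, 1.0, 2.0, 4.0]  (long docs penalized proportionally)
```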
## Using BM25Parameters
For the base VajraSearch engine:
```python
from vajra_bm25 import VajraSearch, BM25Parameters

params = BM25Parameters(k1=1.2, b=0.75)
engine = VajraSearch(corpus, params=params)
For optimized engines, pass directly:
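```python
engine = VajraSearchOptimized(corpus, k1=1.2, b=0.75)
```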
## Tuning Guidelines
### Short Documents (tweets, titles)
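When documents are uniformly short, lengths vary little and term repetition is rare, so lower values for both parameters are a common starting point. The numbers below are illustrative, not Vajra recommendations:

```python
# Short, uniform documents: mild length normalization, quick saturation
engine = VajraSearchOptimized(corpus, k1=1.2, b=0.3)
```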
### Long Documents (articles, papers)
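Long documents accumulate raw term frequency simply by being long, so stronger length normalization is the usual countermeasure (again, illustrative values):

```python
# Long-form text: default saturation, aggressive length normalization
engine = VajraSearchOptimized(corpus, k1=1.5, b=0.9)
```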
### Technical/Scientific Corpus
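In technical writing, a repeated term usually signals genuine topical focus rather than padding, so letting frequency saturate more slowly (a higher k1) is a reasonable choice:

```python
# Technical corpora: reward repeated domain terminology
engine = VajraSearchOptimized(corpus, k1=2.0, b=0.75)
```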
### Web Search (mixed lengths)
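For heterogeneous collections with widely varying lengths, the classic k1 = 1.2, b = 0.75 pairing is the standard baseline to tune from:

```python
# Mixed-length corpora: start from the standard baseline, then grid search
engine = VajraSearchOptimized(corpus, k1=1.2, b=0.75)
```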
## Experimentation
The best parameters depend on your corpus. Use evaluation metrics to tune:
```python
def evaluate_parameters(corpus, queries, relevant_docs, k1, b):
    """Mean Recall@10 over queries; relevant_docs is a list of sets of doc IDs."""
    engine = VajraSearchOptimized(corpus, k1=k1, b=b)
    total_recall = 0.0
    for query, relevant in zip(queries, relevant_docs):
        results = engine.search(query, top_k=10)
        retrieved = {r.document.id for r in results}
        recall = len(retrieved & relevant) / len(relevant)
        total_recall += recall
    return total_recall / len(queries)

# Grid search over common parameter ranges
for k1 in [1.0, 1.2, 1.5, 2.0]:
    for b in [0.5, 0.75, 1.0]:
        score = evaluate_parameters(corpus, queries, relevant_docs, k1, b)
        print(f"k1={k1}, b={b}: Recall@10 = {score:.3f}")
```