Modern information retrieval systems increasingly combine sparse and dense representations to balance lexical precision with semantic generalization. Traditionally, hybrid retrieval pipelines fetch partial results from multiple indices (e.g., dense embeddings and keyword-based models) and merge them using rank aggregation methods such as Reciprocal Rank Fusion (RRF). While effective, these pipelines often fail to fully leverage the scoring signals of individual retrievers and introduce ranking inconsistencies.
This case study explores using TopK's true hybrid retrieval capabilities to improve result quality over rank-fusion approaches. We benchmark four retrieval configurations across several datasets from the BEIR benchmark suite and observe that TopK hybrid retrieval consistently improves nDCG@10, by up to 7.84% over traditional rank-fusion methods.
Background
Retrieval quality is a critical determinant of downstream application performance in search, recommendation, and question answering systems. Sparse retrieval, powered by models like SPLADE, excels at matching exact terms and handling structured queries, while dense retrieval models capture semantic similarity even when query and document vocabularies diverge.
However, neither paradigm is universally dominant:
- Sparse retrievers falter on paraphrased or conceptually rich queries.
- Dense retrievers often overlook rare terms or domain-specific keywords.
To address this, hybrid systems aggregate results from both models. The most common method is Reciprocal Rank Fusion (RRF), which normalizes ranks from each retriever and combines them into a unified ranking. While RRF is simple and effective, it ignores raw score magnitudes, applies uniform fusion weights, and often limits candidate lists to partial top-k results from each retriever. This can suppress relevant documents ranked moderately by both systems but overlooked by either individually.
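For reference, here is a minimal sketch of RRF as described above. It assumes each retriever returns a ranked list of document ids; the function name and the smoothing constant of 60 (a commonly used default) are illustrative, not a specific library's API:

from collections import defaultdict

def rrf_fuse(*ranked_lists, k=10, c=60):
    # Reciprocal Rank Fusion: each document receives sum(1 / (c + rank)) over
    # the lists it appears in, so raw score magnitudes are discarded entirely,
    # which is exactly the limitation discussed above.
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (c + rank)
    # Keep only the top-k documents by fused score
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Example: fuse partial top-k lists from a dense and a sparse retriever
fused = rrf_fuse(["d3", "d1", "d7"], ["d1", "d9", "d3"], k=2)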
TopK Hybrid Search
Our approach leverages TopK's hybrid retrieval capabilities to provide direct, score-aware ranking across multiple retrieval methods. Instead of truncating to partial result sets and applying rank-only fusion, TopK:
- Scores and normalizes results directly from each retriever (dense and sparse), respecting the magnitude of relevance scores rather than solely their ranks.
- Applies a tunable custom scoring function that weights dense vs. sparse contributions dynamically (e.g., emphasizing sparse scores when exact term matches are present, and dense scores otherwise).
- Merges candidates globally rather than pre-truncating, ensuring that documents moderately ranked by both retrievers are surfaced if their combined score is competitive.
- Selects the final top-k results (e.g., top 10 or 100) directly, minimizing recall loss from early-stage truncation.
This approach effectively removes a key bottleneck in hybrid search pipelines: the disconnect between partial recall from individual retrievers and the final relevance ordering.
Here is what the query looks like in the TopK SDK:
from topk_sdk.data import f32_vector, f32_sparse_vector
from topk_sdk.query import select, field, fn

collection.query(
    select(
        # Dense vector score
        dense_score=fn.vector_distance("dense", f32_vector([...])),
        # Sparse vector score
        sparse_score=fn.vector_distance("sparse", f32_sparse_vector({...})),
    )
    .topk(
        # Merge dense and sparse scores
        0.7 * field("dense_score") + 0.3 * (field("sparse_score") / 100.0),
        # Select top-10 results
        10,
    )
)
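Note that the 0.7/0.3 weights and the division of the sparse score by 100.0 above are illustrative normalization choices; because the scoring expression is just part of the query, the relative weighting of dense and sparse contributions can be tuned per dataset, or even per query.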
Experiments
We evaluated four configurations across multiple datasets from the BEIR benchmark:
- Dense-only retrieval using ModernBERT-base.
- Sparse-only retrieval using SPLADE-v3.
- Traditional hybrid retrieval using Reciprocal Rank Fusion (RRF).
- Hybrid retrieval with TopK, employing a custom scoring function: alpha * dense_score + (1 - alpha) * sparse_score (see the parameterized sketch after this list).
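For illustration, this alpha-weighted scoring function maps directly onto the TopK query shown earlier. The helper below is a hypothetical wrapper, not part of the SDK; the alpha default and the assumption that both scores are on comparable scales are ours:

from topk_sdk.data import f32_vector, f32_sparse_vector
from topk_sdk.query import select, field, fn

def hybrid_query(collection, dense_vec, sparse_vec, alpha=0.7, k=10):
    # Hypothetical helper: alpha-weighted dense/sparse fusion in a single TopK query
    return collection.query(
        select(
            dense_score=fn.vector_distance("dense", f32_vector(dense_vec)),
            sparse_score=fn.vector_distance("sparse", f32_sparse_vector(sparse_vec)),
        )
        .topk(
            # alpha * dense_score + (1 - alpha) * sparse_score
            # (assumes both scores are on comparable scales; otherwise rescale
            # the sparse score as in the earlier example)
            alpha * field("dense_score") + (1 - alpha) * field("sparse_score"),
            k,
        )
    )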
We used nDCG@10 as the primary metric, reflecting both relevance and ranking position, with top_k = 10 results per query. By incorporating scores from both dense and sparse models inside a single query, we achieved an average improvement of 4.58% over RRF-based hybrid systems.
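As a refresher, nDCG@10 discounts the gain of each relevant document by its rank and normalizes by the best achievable ordering. The sketch below uses one common exponential-gain formulation; evaluation toolkits may differ in details, and the variable names are illustrative:

import math

def ndcg_at_k(ranked_doc_ids, relevance, k=10):
    # DCG@k with exponential gain: (2^rel - 1) / log2(rank + 1)
    dcg = sum(
        (2 ** relevance.get(doc_id, 0) - 1) / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_doc_ids[:k], start=1)
    )
    # Ideal DCG: the same positions filled with the best possible relevance grades
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(
        (2 ** rel - 1) / math.log2(rank + 1)
        for rank, rel in enumerate(ideal, start=1)
    )
    return dcg / idcg if idcg > 0 else 0.0

# Example: graded relevance judgments for a single query
score = ndcg_at_k(["d3", "d1", "d7"], {"d1": 2, "d3": 1}, k=10)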
| Dataset | Dense-only | Sparse-only | RRF | TopK Hybrid | Improvement |
|---|---|---|---|---|---|
| FiQA | 0.40661 | 0.38023 | 0.4123 | 0.42853 | 3.94% |
| TREC-COVID | 0.81431 | 0.66741 | 0.76779 | 0.82798 | 7.84% |
| NQ | 0.52029 | 0.51405 | 0.53885 | 0.55271 | 2.57% |
| NFCorpus | 0.32458 | 0.35837 | 0.34593 | 0.36803 | 6.39% |
| FEVER | 0.85213 | 0.79154 | 0.84643 | 0.86464 | 2.15% |
| Average | 0.583584 | 0.54232 | 0.58226 | 0.608378 | 4.58% |
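The Improvement column is the relative gain of TopK Hybrid over RRF; for FiQA, for example, (0.42853 - 0.4123) / 0.4123 ≈ 3.94%.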
In practice, people often overfetch k' > k results from individual retrievers and then apply RRF. While this is a valid way to improve result quality, it often leads to slower queries and higher resource usage. For completeness, we also evaluated RRF with 100 candidates per retriever to produce the final top-10 results.
| Dataset | RRF (10 candidates) | RRF (100 candidates) | TopK Hybrid | Improvement |
|---|---|---|---|---|
| FiQA | 0.4123 | 0.41458 | 0.42853 | 3.36% |
| TREC-COVID | 0.76779 | 0.80907 | 0.82798 | 2.34% |
| NQ | 0.53885 | 0.54093 | 0.55271 | 2.18% |
| NFCorpus | 0.34593 | 0.35027 | 0.36803 | 5.07% |
| FEVER | 0.84643 | 0.84316 | 0.86464 | 2.55% |
| Average | 0.58226 | 0.591602 | 0.608378 | 3.10% |
As the table above shows, RRF with 100 candidates per retriever improves overall result quality, but TopK's hybrid retrieval still outperforms it by 3.10% on average while also being more efficient.
Summary
Our evaluation demonstrates that TopK hybrid retrieval consistently improves result relevance across multiple datasets, achieving a 4.58% average increase in nDCG@10 over reciprocal rank fusion. By directly integrating normalized scores from dense and sparse retrievers, applying tunable weightings, and selecting the final top-k results without intermediate truncation, TopK mitigates the recall loss and ranking inconsistencies inherent to partial list aggregation. These results underscore TopK's value as a more principled and efficient alternative to conventional hybrid search pipelines. If you want to learn more about TopK's hybrid search capabilities, check out our documentation.
If you are interested in building high-quality search infrastructure, shoot me an email at marek@topk.io. We’re hiring!