DeepCT-IRBERT: Hybrid Retrieval Pipeline for MS-MARCO
Introduction
In the field of Information Retrieval (IR), the challenge of effectively ranking documents remains a central problem. Traditional methods like BM25 have served as the backbone of retrieval systems for decades, while modern transformer-based models have shown remarkable semantic understanding capabilities. My research explores the synergy between these two approaches through a hybrid pipeline called DeepCT-IRBERT.
The Problem
The MS-MARCO dataset presents a significant challenge for retrieval systems:
- 6,980 evaluation queries (natural-language questions)
- 1,000 candidate passages per query
- The task: surface the relevant passages at the top of the ranking
The goal is to maximize MRR@10 (Mean Reciprocal Rank at 10): for each query, take the reciprocal of the rank at which the first relevant passage appears in the top 10 results (or 0 if none appears), then average over all queries.
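As a concrete reference, MRR@10 fits in a few lines. This is a minimal sketch; in practice the official MS-MARCO evaluation script would typically be used:

```python
def mrr_at_10(ranked_lists, relevant):
    """Mean Reciprocal Rank at cutoff 10.

    ranked_lists: dict query_id -> list of passage ids, best first.
    relevant:     dict query_id -> set of relevant passage ids.
    """
    total = 0.0
    for qid, ranking in ranked_lists.items():
        for rank, pid in enumerate(ranking[:10], start=1):
            if pid in relevant.get(qid, set()):
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_lists)

# Toy example: the relevant passage "p3" sits at rank 2, so RR = 1/2
print(mrr_at_10({"q1": ["p9", "p3", "p7"]}, {"q1": {"p3"}}))  # 0.5
```

A query whose relevant passage falls outside the top 10 contributes 0, which is why improving early precision matters so much for this metric.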
Methodology
DeepCT: Context-Aware Term Weighting
DeepCT, introduced by Dai & Callan (2019, 2020), represents a paradigm shift in term weighting. Instead of using traditional TF-IDF or BM25 term frequencies, DeepCT uses a deep learning model to learn context-aware term importance.
Key insights:
- Terms that are important in one context may be less important in another
- DeepCT can identify semantic term importance that traditional methods miss
- When used to reweight BM25 indexing, it significantly improves retrieval performance
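A common way to feed learned weights into a standard BM25 index is to quantize each predicted importance into an integer pseudo term frequency. The sketch below illustrates that idea; the function name, the `term_weights` input (which would come from a trained DeepCT model), and the scale factor are all illustrative, not taken from the paper's code:

```python
def reweight_passage(tokens, term_weights, scale=100):
    """Turn learned term importances into a pseudo-document whose
    term frequencies a standard BM25 index can consume.

    tokens:       the passage's tokens
    term_weights: token -> predicted importance in [0, 1]
                  (in practice, the output of a DeepCT model)
    scale:        quantization factor for integer frequencies
    """
    pseudo = []
    for tok in sorted(set(tokens)):          # each unique term once
        tf = round(term_weights.get(tok, 0.0) * scale)
        pseudo.extend([tok] * tf)            # repeat the term tf times
    return pseudo

# "cat" (importance 0.5) receives tf = 50; "dog" (0.02) receives tf = 2
pseudo_doc = reweight_passage(["cat", "cat", "dog"], {"cat": 0.5, "dog": 0.02})
```

Indexing these pseudo-documents lets an unmodified BM25 engine score passages by learned importance instead of raw term counts.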
IR-BERT: Semantic Search with Transformers
IR-BERT, proposed by Deshmukh & Sethi (2020), leverages the power of BERT (and SBERT by Reimers & Gurevych, 2019) to capture semantic relationships between queries and documents.
Benefits:
- Understands contextual meaning beyond keyword matching
- Captures semantic similarity between query and passage
- Provides dense semantic representations
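At their core, these dense representations are compared with cosine similarity. A minimal NumPy sketch follows; in the real pipeline the vectors would come from an SBERT-style encoder (e.g. `model.encode(texts)`), not the toy vectors used here:

```python
import numpy as np

def cosine_rank(query_vec, passage_vecs):
    """Rank passages by cosine similarity to the query embedding.
    Returns passage indices, best first, plus the similarity scores."""
    q = query_vec / np.linalg.norm(query_vec)
    P = passage_vecs / np.linalg.norm(passage_vecs, axis=1, keepdims=True)
    scores = P @ q                      # cosine similarity per passage
    return np.argsort(-scores), scores  # descending order of similarity

# Toy 2-d vectors stand in for real sentence embeddings
q = np.array([1.0, 0.0])
P = np.array([[0.9, 0.1],   # points in nearly the query's direction
              [0.0, 1.0]])  # orthogonal to the query
order, scores = cosine_rank(q, P)
# order[0] == 0: the semantically closest passage ranks first
```

Because similarity is computed in embedding space, a passage can rank highly without sharing any keywords with the query, which is exactly where dense retrieval complements BM25.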
The Hybrid Approach
My implementation combines:
- BM25 indexed with DeepCT term weighting - leveraging learned term importance
- IR-BERT for first-stage retrieval - capturing semantic understanding
This hybrid approach aims to get the best of both worlds:
- The efficiency and recall of sparse retrieval (BM25 + DeepCT)
- The semantic precision of dense retrieval (IR-BERT)
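The fusion step can be as simple as min-max normalizing each system's scores and interpolating between them. This is a common baseline sketch, not necessarily the exact scheme used in the notebook; `alpha` is a hypothetical mixing weight to be tuned on held-out queries:

```python
def hybrid_scores(sparse, dense, alpha=0.5):
    """Fuse sparse (BM25 + DeepCT) and dense (IR-BERT) score lists by
    min-max normalizing each one and taking a weighted sum.
    alpha is the weight placed on the sparse side."""
    def minmax(xs):
        lo, hi = min(xs), max(xs)
        return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]
    s, d = minmax(sparse), minmax(dense)
    return [alpha * a + (1 - alpha) * b for a, b in zip(s, d)]

# A passage scored highly by only one system still surfaces after fusion
fused = hybrid_scores([12.3, 0.0, 6.0], [0.1, 0.9, 0.5], alpha=0.5)
```

Normalization matters here because BM25 scores and cosine similarities live on very different scales; without it, one system would silently dominate the sum.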
Implementation Details
The project is implemented as a Jupyter notebook and includes:
- Data Loading: MS-MARCO passage ranking dataset
- DeepCT Indexing: Reweighting document terms using learned weights
- IR-BERT Encoding: Generating dense embeddings for queries and passages
- Hybrid Fusion: Combining scores from both approaches
- Evaluation: Computing MRR@10 on the test set
Key Findings
Through experimentation, I observed:
- DeepCT reweighting significantly improves over raw BM25
- IR-BERT provides strong semantic matching but can miss exact keyword matches
- Hybrid approaches that combine both methods tend to perform best
References
- Dai, Z. & Callan, J. (2019). Deeper Text Understanding for IR with Contextual Neural Language Modeling. SIGIR 2019.
- Dai, Z. & Callan, J. (2020). Context-Aware Document Term Weighting for Ad-Hoc Search. WWW 2020.
- Deshmukh, A. & Sethi, U. (2020). IR-BERT: Leveraging BERT for Semantic Search in Background Linking for News Articles. arXiv preprint.
- Reimers, N. & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP-IJCNLP 2019.
Conclusion
The DeepCT-IRBERT hybrid pipeline demonstrates that combining traditional IR techniques with modern deep learning can yield powerful results. The key insight is that different retrieval methods capture different aspects of relevance, and thoughtfully combining them can outperform any single approach.
This research was conducted as part of my Master's degree in Computer Science at the University of Indonesia, focusing on Generative AI and Information Retrieval.
Would you like to explore the implementation? Check out the GitHub repository for the full code.