DeepCT-IRBERT: Hybrid Retrieval Pipeline for MS-MARCO
Introduction
In the field of Information Retrieval (IR), the challenge of effectively ranking documents remains a central problem. Traditional methods like BM25 have served as the backbone of retrieval systems for decades, while modern transformer-based models have shown remarkable semantic understanding capabilities. My research explores the synergy between these two approaches through a hybrid pipeline called DeepCT-IRBERT.
The Problem
The MS-MARCO dataset presents a significant challenge for retrieval systems:
- 6,980 evaluation queries (natural-language questions)
- 1,000 candidate passages per query
- The task: surface the relevant passages at the top of the ranking
The goal is to maximize MRR@10 (Mean Reciprocal Rank at 10): for each query, take the reciprocal of the rank at which the first relevant passage appears in the top 10 results (or 0 if none appears), then average over all queries.
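As a concrete reference, MRR@10 fits in a few lines. This is a minimal sketch; in practice the official MS-MARCO evaluation script would typically be used:

```python
def mrr_at_10(ranked_lists, relevant):
    """Mean Reciprocal Rank at cutoff 10.

    ranked_lists: dict query_id -> list of passage ids, best first.
    relevant:     dict query_id -> set of relevant passage ids.
    """
    total = 0.0
    for qid, ranking in ranked_lists.items():
        for rank, pid in enumerate(ranking[:10], start=1):
            if pid in relevant.get(qid, set()):
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_lists)

# Toy example: the relevant passage "p3" sits at rank 2, so RR = 1/2
print(mrr_at_10({"q1": ["p9", "p3", "p7"]}, {"q1": {"p3"}}))  # 0.5
```

A query whose relevant passage falls outside the top 10 contributes 0, which is why improving early precision matters so much for this metric.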
Methodology
DeepCT: Context-Aware Term Weighting
DeepCT, introduced by Dai & Callan (2019, 2020), represents a paradigm shift in term weighting. Instead of using traditional TF-IDF or BM25 term frequencies, DeepCT uses a deep learning model to learn context-aware term importance.
Key insights:
- Terms that are important in one context may be less important in another
- DeepCT can identify semantic term importance that traditional methods miss
- When used to reweight BM25 indexing, it significantly improves retrieval performance
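A common way to feed learned weights into a standard BM25 index is to quantize each predicted importance into an integer pseudo term frequency. The sketch below illustrates that idea; the function name, the `term_weights` input (which would come from a trained DeepCT model), and the scale factor are all illustrative, not taken from the paper's code:

```python
def reweight_passage(tokens, term_weights, scale=100):
    """Turn learned term importances into a pseudo-document whose
    term frequencies a standard BM25 index can consume.

    tokens:       the passage's tokens
    term_weights: token -> predicted importance in [0, 1]
                  (in practice, the output of a DeepCT model)
    scale:        quantization factor for integer frequencies
    """
    pseudo = []
    for tok in sorted(set(tokens)):          # each unique term once
        tf = round(term_weights.get(tok, 0.0) * scale)
        pseudo.extend([tok] * tf)            # repeat the term tf times
    return pseudo

# "cat" (importance 0.5) receives tf = 50; "dog" (0.02) receives tf = 2
pseudo_doc = reweight_passage(["cat", "cat", "dog"], {"cat": 0.5, "dog": 0.02})
```

Indexing these pseudo-documents lets an unmodified BM25 engine score passages by learned importance instead of raw term counts.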
IR-BERT: Semantic Search with Transformers
IR-BERT, proposed by Deshmukh & Sethi (2020), leverages the power of BERT (and SBERT by Reimers & Gurevych, 2019) to capture semantic relationships between queries and documents.
Benefits:
- Understands contextual meaning beyond keyword matching
- Captures semantic similarity between query and passage
- Provides dense semantic representations
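At their core, these dense representations are compared with cosine similarity. A minimal NumPy sketch follows; in the real pipeline the vectors would come from an SBERT-style encoder (e.g. `model.encode(texts)`), not the toy vectors used here:

```python
import numpy as np

def cosine_rank(query_vec, passage_vecs):
    """Rank passages by cosine similarity to the query embedding.
    Returns passage indices, best first, plus the similarity scores."""
    q = query_vec / np.linalg.norm(query_vec)
    P = passage_vecs / np.linalg.norm(passage_vecs, axis=1, keepdims=True)
    scores = P @ q                      # cosine similarity per passage
    return np.argsort(-scores), scores  # descending order of similarity

# Toy 2-d vectors stand in for real sentence embeddings
q = np.array([1.0, 0.0])
P = np.array([[0.9, 0.1],   # points in nearly the query's direction
              [0.0, 1.0]])  # orthogonal to the query
order, scores = cosine_rank(q, P)
# order[0] == 0: the semantically closest passage ranks first
```

Because similarity is computed in embedding space, a passage can rank highly without sharing any keywords with the query, which is exactly where dense retrieval complements BM25.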
The Hybrid Approach
My implementation combines:
- BM25 indexed with DeepCT term weighting - leveraging learned term importance
- IR-BERT for first-stage retrieval - capturing semantic understanding
This hybrid approach aims to get the best of both worlds:
- The efficiency and recall of sparse retrieval (BM25 + DeepCT)
- The semantic precision of dense retrieval (IR-BERT)
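The fusion step can be as simple as min-max normalizing each system's scores and interpolating between them. This is a common baseline sketch, not necessarily the exact scheme used in the notebook; `alpha` is a hypothetical mixing weight to be tuned on held-out queries:

```python
def hybrid_scores(sparse, dense, alpha=0.5):
    """Fuse sparse (BM25 + DeepCT) and dense (IR-BERT) score lists by
    min-max normalizing each one and taking a weighted sum.
    alpha is the weight placed on the sparse side."""
    def minmax(xs):
        lo, hi = min(xs), max(xs)
        return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]
    s, d = minmax(sparse), minmax(dense)
    return [alpha * a + (1 - alpha) * b for a, b in zip(s, d)]

# A passage scored highly by only one system still surfaces after fusion
fused = hybrid_scores([12.3, 0.0, 6.0], [0.1, 0.9, 0.5], alpha=0.5)
```

Normalization matters here because BM25 scores and cosine similarities live on very different scales; without it, one system would silently dominate the sum.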
Implementation Details
The project is implemented as a Jupyter notebook and includes:
- Data Loading: MS-MARCO passage ranking dataset
- DeepCT Indexing: Reweighting document terms using learned weights
- IR-BERT Encoding: Generating dense embeddings for queries and passages
- Hybrid Fusion: Combining scores from both approaches
- Evaluation: Computing MRR@10 on the test set
Key Findings
Through experimentation, I observed:
- DeepCT reweighting significantly improves over raw BM25
- IR-BERT provides strong semantic matching but can miss exact keyword matches
- Hybrid approaches that combine both methods tend to perform best
References
- Dai, Z. & Callan, J. (2019). Deeper Text Understanding for IR with Contextual Neural Language Modeling. SIGIR 2019.
- Dai, Z. & Callan, J. (2020). Context-Aware Document Term Weighting for Ad-Hoc Search. WWW 2020.
- Deshmukh, A. & Sethi, U. (2020). IR-BERT: Leveraging BERT for Semantic Search in Background Linking for News Articles. arXiv preprint.
- Reimers, N. & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP-IJCNLP 2019.
Conclusion
The DeepCT-IRBERT hybrid pipeline demonstrates that combining traditional IR techniques with modern deep learning can yield powerful results. The key insight is that different retrieval methods capture different aspects of relevance, and thoughtfully combining them can outperform any single approach.
This research was conducted as part of my Master's degree in Computer Science at the University of Indonesia, focusing on Generative AI and Information Retrieval.
Would you like to explore the implementation? Check out the GitHub repository for the full code.