This code snippet demonstrates how to configure and use the jina-colbert-v1-en model to index a set of documents, leveraging its ability to handle long contexts efficiently.
Implementing Two-Stage Retrieval with Rerankers
Now that we have an understanding of the principles behind two-stage retrieval and rerankers, let’s explore their practical implementation in the context of a RAG system. We’ll leverage popular libraries and frameworks to demonstrate the integration of these techniques.
Setting Up the Environment
Before we dive into the code, let’s set up our development environment. We’ll be using Python and several popular NLP libraries, including Hugging Face Transformers, Sentence Transformers, and LanceDB.
# Install required libraries
!pip install datasets huggingface_hub sentence_transformers lancedb
Data Preparation
For demonstration purposes, we’ll use the “ai-arxiv-chunked” dataset from Hugging Face Datasets, which comprises over 400 ArXiv papers on machine learning, natural language processing, and large language models.
from datasets import load_dataset
dataset = load_dataset("jamescalam/ai-arxiv-chunked", split="train")
Next, we’ll preprocess the data and split it into smaller chunks to facilitate efficient retrieval and processing.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def chunk_text(text, chunk_size=512, overlap=64):
    # Tokenize without special tokens so each chunk can be decoded cleanly
    tokens = tokenizer.encode(text, add_special_tokens=False)
    # Slide a window of chunk_size tokens, stepping by chunk_size - overlap
    stride = chunk_size - overlap
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), stride)]
    return [tokenizer.decode(chunk) for chunk in chunks]

chunked_data = []
for doc in dataset:
    text = doc["chunk"]
    chunked_texts = chunk_text(text)
    chunked_data.extend(chunked_texts)
For the initial retrieval stage, we'll use a Sentence Transformer model to encode our documents and queries into dense vector representations, and then perform approximate nearest neighbor search using a vector database like LanceDB.
from sentence_transformers import SentenceTransformer
import lancedb

# Load Sentence Transformer model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Connect to a LanceDB vector store
db = lancedb.connect('/path/to/store')

# Index documents: store each chunk alongside its embedding
data = [
    {"vector": model.encode(text).tolist(), "text": text}
    for text in chunked_data
]
table = db.create_table('docs', data=data)
With our documents indexed, we can perform the initial retrieval by finding the nearest neighbors to a given query vector.
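The sketch below shows one way to run that query step against the table we just built. The example question is a placeholder, and the text field name simply mirrors the indexing code above; adjust both for your own data.

# Example query (placeholder: substitute your own question)
query = "How do rerankers improve retrieval-augmented generation?"

# Encode the query with the same Sentence Transformer model used for indexing
query_vector = model.encode(query).tolist()

# Retrieve the top 10 approximate nearest neighbors from LanceDB
results = table.search(query_vector).limit(10).to_pandas()
initial_docs = results["text"].tolist()

These are the raw candidates from the vector index; in the reranking step below we attach a reranker to the same search so that the results come back reordered by relevance.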
Reranking
After the initial retrieval, we’ll employ a reranking model to reorder the retrieved documents based on their relevance to the query. In this example, we’ll use the ColBERT reranker, a fast and accurate transformer-based model specifically designed for document ranking, which LanceDB exposes through its rerankers module and query builder.
from lancedb.rerankers import ColbertReranker

reranker = ColbertReranker()
# Rerank the vector-search results with ColBERT, passing the raw query text
# (the exact rerank signature can vary between LanceDB versions)
reranked = table.search(query_vector).rerank(reranker=reranker, query_string=query).limit(10).to_pandas()
reranked_docs = reranked["text"].tolist()
The reranked_docs list now contains the documents reordered by their relevance to the query, as determined by the ColBERT reranker.
Augmentation and Generation
With the reranked, relevant documents in hand, we can proceed to the augmentation and generation stages of the RAG pipeline. We’ll use a language model from the Hugging Face Transformers library to generate the final response.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

# Augment query with reranked documents
augmented_query = query + " " + " ".join(reranked_docs[:3])

# Generate response from language model
input_ids = tokenizer.encode(augmented_query, return_tensors="pt")
output_ids = model.generate(input_ids, max_length=500)
response = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(response)