Polish Reranker Large RankNet

This is a Polish text ranking model trained with RankNet loss on a large dataset of text pairs consisting of 1.4 million queries and 10 million documents. The training data includes:

  • The Polish MS MARCO training split (800k queries)
  • The ELI5 dataset translated to Polish (over 500k queries)
  • A collection of Polish medical questions and answers (approximately 100k queries)

As a teacher model, we employed unicamp-dl/mt5-13b-mmarco-100k, a large multilingual reranker based on the MT5-XXL architecture. For the student model, we chose Polish RoBERTa. Unlike the more commonly used pointwise losses, which score each query-document pair independently, RankNet computes its loss over a query and a pair of documents. More specifically, the loss is computed from the relative order of documents sorted by their relevance to the query.
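
To make the pairwise idea concrete, here is a minimal sketch of a RankNet-style loss for a single query, assuming the documents are already sorted from most to least relevant by the teacher. The `ranknet_loss` helper is illustrative only, not code from the actual training pipeline:

```python
import torch
import torch.nn.functional as F

def ranknet_loss(scores: torch.Tensor) -> torch.Tensor:
    """RankNet loss for one query, given student scores for documents
    the teacher has already sorted from most to least relevant."""
    # Enumerate all pairs (i, j) with i < j; document i should outrank j.
    i, j = torch.triu_indices(len(scores), len(scores), offset=1)
    # softplus(s_j - s_i) = -log(sigmoid(s_i - s_j)) penalizes wrongly ordered pairs.
    return F.softplus(scores[j] - scores[i]).mean()

# Toy check: a student that already ranks three documents correctly
# incurs only a small loss.
print(ranknet_loss(torch.tensor([2.0, 0.5, -1.0])))
```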

To train the reranker, we used the teacher model to assess the relevance of the documents extracted in the retrieval stage for each query. We then sorted these documents by the relevance score, obtaining a dataset consisting of queries and ordered lists of 20 documents per query.
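
In code, that distillation step might look like the following sketch, assuming a teacher exposed through the sentence-transformers `CrossEncoder.predict` API. The model name and `build_training_example` helper are placeholders; the actual teacher is an MT5-based seq2seq reranker with its own scoring pipeline:

```python
from sentence_transformers import CrossEncoder

teacher = CrossEncoder("a-cross-encoder-teacher")  # placeholder name

def build_training_example(query, retrieved_docs, top_k=20):
    # Score every retrieved document against the query with the teacher...
    scores = teacher.predict([(query, doc) for doc in retrieved_docs])
    # ...then keep the top_k documents ordered from most to least relevant.
    ranked = sorted(zip(retrieved_docs, scores), key=lambda p: p[1], reverse=True)
    return {"query": query, "docs": [doc for doc, _ in ranked[:top_k]]}
```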

How It Works: Understanding the Analogy

Think of the ranking model as a librarian organizing a huge library filled with millions of books. Imagine that a person asks a question: “How can I live to be 100 years old?” The librarian needs to find the best answers among all those books (or documents).

The librarian has two main tools:

  • A vast database (teacher model) of what is known to be good information.
  • His own judgment (student model) about which of the found books best answer the question.

First, the librarian looks at all the related books and evaluates how well each one answers the question. Instead of judging each book as good or bad in isolation, the librarian compares them: “Book A is better than Book B, but not as good as Book C.”

After organizing the books based on their relevance, he hands over the top 20 books that he thinks will help. This is how the Polish Reranker works—it ranks answers relative to one another rather than in isolation, thereby providing the most contextually appropriate response.

Usage with Sentence-Transformers

You can use the model like this with sentence-transformers:

```python
from sentence_transformers import CrossEncoder
import torch.nn

query = "Jak dożyć 100 lat?"
answers = [
    "Trzeba zdrowo się odżywiać i uprawiać sport.",
    "Trzeba pić alkohol, imprezować i jeździć szybkimi autami.",
    "Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."
]
model = CrossEncoder(
    "sdadas/polish-reranker-large-ranknet",
    # Identity activation returns raw relevance scores instead of sigmoid probabilities.
    default_activation_function=torch.nn.Identity(),
    max_length=512,
    device="cuda" if torch.cuda.is_available() else "cpu"
)
pairs = [[query, answer] for answer in answers]
results = model.predict(pairs)
print(results.tolist())
```
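
Higher scores mean a better match between the query and the answer. As a quick follow-up, the candidates can be sorted by the returned scores (reusing `results` and `answers` from above):

```python
# Rank answers from most to least relevant according to the reranker.
for score, answer in sorted(zip(results, answers), reverse=True):
    print(f"{score:.2f}  {answer}")
```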

Usage with Hugging Face Transformers

The model can also be used with Hugging Face Transformers:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import numpy as np
import torch

query = "Jak dożyć 100 lat?"
answers = [
    "Trzeba zdrowo się odżywiać i uprawiać sport.",
    "Trzeba pić alkohol, imprezować i jeździć szybkimi autami.",
    "Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."
]

model_name = "sdadas/polish-reranker-large-ranknet"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

texts = [f"{query} {answer}" for answer in answers]
tokens = tokenizer(texts, padding="longest", max_length=512, truncation=True, return_tensors="pt")
output = model(**tokens)
results = output.logits.detach().numpy()
results = np.squeeze(results)
print(results.tolist())
```
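
The outputs are unnormalized logits; higher means more relevant. If a relative distribution over the candidate set is more convenient, a softmax can be applied over the scores (a presentation choice on top of the model, not part of it):

```python
# Softmax over the candidate scores, computed in a numerically stable way.
probs = np.exp(results - results.max())
probs /= probs.sum()
print(probs.round(3).tolist())
```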

Evaluation Results

The model achieves NDCG@10 of 62.65 in the Rerankers category of the Polish Information Retrieval Benchmark (PIRB). See the PIRB leaderboard for detailed results.

Troubleshooting

If you encounter any issues while using the Polish Reranker, here are some troubleshooting tips:

  • Ensure that all dependencies are correctly installed. You may want to use a virtual environment to manage your packages.
  • Double-check your model and tokenizer names to avoid any misspellings.
  • Verify that your input data format matches what the model expects.
  • If scores look unexpected, check that inputs are passed as (query, document) pairs in that order, and that max_length=512 is not truncating long documents.
  • If you still have questions or need further assistance, remember that helpful resources and communities are available. For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
