How to Implement the German Uncased Electra Bi-Encoder for Passage Retrieval

Sep 12, 2024 | Educational

Welcome to your comprehensive guide to using the German uncased Electra bi-encoder for passage retrieval. Built on the German NLP Group's uncased Electra model, this bi-encoder is a powerful tool for semantic search, letting you retrieve relevant German passages efficiently.

Understanding the Model

This model was fine-tuned for passage retrieval using the sentence-transformers package. To obtain German training data, the MSMARCO passage-ranking dataset was machine-translated with the fairseq-wmt19-en-de translation model. Think of the Electra model as a student who has thoroughly studied a textbook (the German NLP Group's pretrained model) and is then placed in a real-world scenario (the passage retrieval task) to apply that knowledge. Fine-tuning is like giving the student targeted coaching on how to ace the exam.
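The source does not show the training objective itself, but bi-encoders for retrieval are commonly fine-tuned with an in-batch-negatives loss (MultipleNegativesRankingLoss in sentence-transformers): each query is scored against every passage in the batch, and only its own passage counts as the positive. A minimal numpy sketch of that objective, with random vectors standing in for the actual encoder outputs (the scale value and batch are illustrative assumptions):

```python
import numpy as np

def in_batch_negatives_loss(q_emb, p_emb, scale=20.0):
    """Cross-entropy over in-batch negatives: for each query, the i-th
    passage is the positive and all other passages in the batch act as
    negatives. Embeddings are L2-normalized, so scores are scaled
    cosine similarities."""
    q = q_emb / np.linalg.norm(q_emb, axis=1, keepdims=True)
    p = p_emb / np.linalg.norm(p_emb, axis=1, keepdims=True)
    scores = scale * (q @ p.T)  # (batch, batch) similarity matrix
    # row-wise log-softmax; the target class is the diagonal entry
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))  # toy query embeddings
loss_random = in_batch_negatives_loss(q, rng.normal(size=(4, 8)))
loss_matched = in_batch_negatives_loss(q, q)  # positives identical to queries
print(f"matched: {loss_matched:.4f}  random: {loss_random:.4f}")
```

As expected, the loss is far lower when each query's positive passage actually matches it than when passages are unrelated, which is exactly the signal the fine-tuning exploits.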

Performance Evaluation

The model's performance was assessed on the GermanDPR test set using the BEIR benchmark framework. The table below compares NDCG@k and Recall@k for our model against a BM25 baseline:

Model    NDCG@1    NDCG@5    NDCG@10    Recall@1    Recall@5    Recall@10
BM25     0.1463    0.3451    0.4097     0.1463      0.5424      0.7415
Ours     0.4624    0.6218    0.6425     0.4624      0.7581      0.8205
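If you want to reproduce metrics like these yourself, NDCG@k and Recall@k can be computed directly from a ranked result list. A small illustrative sketch with binary relevance and made-up document IDs (not the actual GermanDPR evaluation):

```python
import numpy as np

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents found in the top-k ranks."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance NDCG: DCG of the ranking divided by the DCG of
    an ideal ranking that places all relevant documents first."""
    gains = [1.0 if d in relevant_ids else 0.0 for d in ranked_ids[:k]]
    dcg = sum(g / np.log2(i + 2) for i, g in enumerate(gains))
    ideal = sum(1.0 / np.log2(i + 2) for i in range(min(k, len(relevant_ids))))
    return dcg / ideal

# hypothetical ranking: relevant docs 7 and 2 sit at ranks 1 and 4
ranked = [7, 0, 5, 2, 9]
relevant = {7, 2}
print(recall_at_k(ranked, relevant, 5))  # both relevant docs in top 5 -> 1.0
print(round(ndcg_at_k(ranked, relevant, 5), 4))
```

NDCG discounts hits at lower ranks logarithmically, which is why our model's large NDCG@1 gain over BM25 matters even though both reach similar Recall@10.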

How to Use the Model

Follow these easy steps to implement the model using the sentence-transformers package:

  1. Install the required libraries if you haven’t done so (the example below also uses scikit-learn and numpy):

    pip install sentence-transformers scikit-learn

  2. Load the bi-encoder:

    from sentence_transformers import SentenceTransformer
    bi_model = SentenceTransformer('svalabs/bi-electra-ms-marco-german-uncased')

  3. Conduct semantic searches using the example code below:

    import numpy as np
    from sklearn.metrics.pairwise import cosine_similarity
    
    K = 3 # number of top ranks to retrieve
    # specify documents and queries
    docs = [
        "Auf Netflix gibt es endlich die neue Staffel meiner Lieblingsserie.",
        "Der Gepard jagt seine Beute.",
        "Wir haben in der Agentur ein neues System für Zeiterfassung.",
        "Mein Arzt sagt, dass mir dabei eher ein Orthopäde helfen könnte.",
        "Einen Impftermin kann mir der Arzt momentan noch nicht anbieten.",
        "Auf Kreta hat meine Tochter mit Muscheln eine schöne Sandburg gebaut.",
        "Das historische Zentrum (centro storico) liegt auf mehr als 100 Inseln in der Lagune von Venedig.",
        "Um in Zukunft sein Vermögen zu schützen, sollte man andere Investmentstrategien in Betracht ziehen.",
        "Die Ära der Dinosaurier wurde vermutlich durch den Einschlag eines gigantischen Meteoriten auf der Erde beendet.",
        "Bei ALDI sind die Bananen gerade im Angebot.",
        "Die Entstehung der Erde ist 4,5 milliarden jahre her.",
        "Finanzwerte treiben DAX um mehr als sechs Prozent nach oben FrankfurtMain gegeben.",
        "DAX dreht ins Minus. Konjunkturdaten und Gewinnmitnahmen belasten FrankfurtMain.",
        
    ]
    queries = [
        "dax steigt",
        "dax sinkt",
        "probleme mit knieschmerzen",
        "software für urlaubsstunden",
        "raubtier auf der jagd",
        "alter der erde",
        "wie alt ist unser planet?",
        "wie kapital sichern",
        "supermarkt lebensmittel reduziert",
        "wodurch ist der tyrannosaurus ausgestorben",
        "serien streamen",
    ]
    
    # encode documents and queries
    features_docs = bi_model.encode(docs)
    features_queries = bi_model.encode(queries)
    
    # compute pairwise cosine similarity scores
    sim = cosine_similarity(features_queries, features_docs)
    
    # print results
    for i, query in enumerate(queries):
        ranks = np.argsort(-sim[i])
        print(f"Query: {query}")
        for j, r in enumerate(ranks[:K]):
            print(f"[{j}: {sim[i, r]:.3f}] {docs[r]}")
        print("-"*96)
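For the small toy corpus above, a full np.argsort per query is fine, but over a large document collection it is cheaper to preselect the top K scores with np.argpartition and sort only those K. A sketch of that pattern, using random scores in place of real similarity values:

```python
import numpy as np

K = 3
rng = np.random.default_rng(42)
sims = rng.normal(size=10_000)  # similarity scores for one query

# full sort: O(n log n)
full = np.argsort(-sims)[:K]

# partial selection: O(n) to isolate the top K, then sort just those K
top_unsorted = np.argpartition(-sims, K)[:K]
partial = top_unsorted[np.argsort(-sims[top_unsorted])]

print(np.array_equal(full, partial))  # True: same indices either way
```

For corpora large enough that even this is slow, an approximate nearest-neighbor index (e.g. Faiss) is the usual next step.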

Troubleshooting

If you encounter issues while implementing the model, consider the following troubleshooting tips:

  • Ensure all required libraries are installed and up to date.
  • Double-check the model and dataset URLs for accuracy.
  • Validate your queries and documents to make sure they are formatted correctly.

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

Stay Informed with the Newest F(x) Insights and Blogs

Tech News and Blog Highlights, Straight to Your Inbox