How to Use MiniCPM-2B for Text Embedding

In the ever-evolving world of Natural Language Processing, text embedding is a cornerstone: it transforms human language into a numeric format that machines can work with. One powerful tool for the job is MiniCPM-2B-Text-Embedding, a model fine-tuned from MiniCPM-2B for text embedding tasks. This guide will walk you through setting it up and using it in your projects.

What is MiniCPM-2B?

MiniCPM-2B-Text-Embedding is a refined version of the MiniCPM-2B-dpo-bf16 model. It has been fine-tuned with contrastive fine-tuning using the LoRA technique on Natural Language Inference (NLI) datasets. This fine-tuning gives the model a more nuanced understanding of sentence meaning, which can significantly improve performance on a range of NLP tasks.
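
The exact fine-tuning recipe is beyond the scope of this guide, but the core idea of contrastive fine-tuning is easy to sketch: pull the embeddings of entailed (positive) sentence pairs together while pushing apart the other pairs in the batch. The snippet below is a generic InfoNCE-style loss for illustration, not the actual training code used for this model:

import torch
import torch.nn.functional as F

def info_nce_loss(anchors, positives, temperature=0.05):
    # anchors, positives: (batch, dim) embeddings of paired sentences
    a = F.normalize(anchors, dim=-1)
    p = F.normalize(positives, dim=-1)
    # Cosine similarity between every anchor and every positive in the batch
    logits = a @ p.T / temperature
    # The matching pair for each anchor sits on the diagonal
    labels = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, labels)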

Getting Started

To leverage the full potential of MiniCPM-2B, follow these steps:

  1. Clone the MiniCPM-2B-dpo-bf16 repository by executing the following command:

     git clone https://huggingface.co/openbmb/MiniCPM-2B-dpo-bf16

  2. Change the tokenizer setting in the tokenizer_config.json file so that an end-of-sequence token is appended to every input:

     "add_eos_token": true

  3. Use the model by following the Python code example below (the required packages are listed in the snippet after this list).
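
Before running the example in the next section, make sure the required Python packages are installed. This assumes a standard pip environment; exact versions may vary on your system:

pip install torch transformers numpy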

Implementing the Model

Let’s get into the nitty-gritty of how to implement and use the MiniCPM sentence embedding model.

Imagine you are a chef preparing a recipe. You gather all your ingredients (data) and tools (functions) in the kitchen (your code). MiniCPM acts as your magical oven, which takes your ingredients and transforms them into a delicious dish (the embeddings). Here’s how you set it up:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import numpy as np

class MiniCPMSentenceEmbedding:
    def __init__(self, model_path='openbmb/MiniCPM-2B-dpo-bf16', adapter_path=None):
        # Load the tokenizer and the base model in bfloat16 on the GPU
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.model = AutoModelForCausalLM.from_pretrained(model_path,
                                                          torch_dtype=torch.bfloat16,
                                                          device_map='cuda',
                                                          trust_remote_code=True)
        # Optionally attach a LoRA adapter, such as the text-embedding fine-tune
        if adapter_path is not None:
            self.model.load_adapter(adapter_path)

    def get_last_hidden_state(self, text):
        # Tokenize the input and run a single forward pass without gradients
        inputs = self.tokenizer(text, return_tensors="pt").to('cuda')
        with torch.no_grad():
            # Take the final layer's hidden state of the last token as the embedding
            out = self.model(**inputs, output_hidden_states=True).hidden_states[-1][0, -1, :]
        return out.squeeze().float().cpu().numpy()

    def encode(self, sentences):
        # Embed each sentence independently and collect the results in a list
        return [self.get_last_hidden_state(s) for s in sentences]

minicpm_sentence_embedding = MiniCPMSentenceEmbedding(
    adapter_path='trapoom555/MiniCPM-2B-Text-Embedding-cft')
example_sentences = ["I don't like apples", "I like apples"]
encoded_sentences = minicpm_sentence_embedding.encode(example_sentences)
print(encoded_sentences)

Understanding the Code

This code is designed to create sentence embeddings using the MiniCPM-2B model. Here’s a breakdown of the key components:

  • Initialization: The class MiniCPMSentenceEmbedding initializes the model and tokenizer using the provided model path, similar to preparing your cooking environment.
  • get_last_hidden_state: This function can be compared to pressing the bake button on your oven. It takes a sentence and outputs its corresponding embedding as if it were turning the raw ingredients into a finished dish.
  • encode: This method calls get_last_hidden_state for each sentence and collects the embeddings into a list, akin to gathering all your finished dishes for serving. A quick way to put these embeddings to work is shown in the similarity check after this list.
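
To see the embeddings in action, you can compare the two example sentences with cosine similarity; semantically opposite sentences should score lower than near-paraphrases. The helper below is defined here for illustration and is not part of the model's API (it assumes the encoded_sentences list and numpy import from the example above):

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

emb_a, emb_b = encoded_sentences
print(cosine_similarity(emb_a, emb_b))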

Troubleshooting Tips

Here are some common issues you might encounter when using MiniCPM-2B and how to address them:

  • CUDA Errors: Ensure that your GPU drivers are up to date and that you have the necessary library dependencies installed. Sometimes CUDA compatibility issues can arise.
  • Memory Errors: If you encounter out-of-memory errors, consider reducing the batch size, truncating long inputs (see the sketch after this list), or using a GPU with more memory.
  • Model Not Found: Double-check that you have cloned the model repository correctly and pointed to the correct path.
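
If long inputs are triggering out-of-memory errors, one workable sketch is to truncate at tokenization time so the model never sees more than a fixed number of tokens. The helper below reuses the class from the example above; max_length=512 is an illustrative cap, not a requirement of MiniCPM-2B:

def get_truncated_embedding(embedder, text, max_length=512):
    # Tokenize with truncation so very long inputs cannot exhaust GPU memory
    inputs = embedder.tokenizer(text,
                                return_tensors="pt",
                                truncation=True,
                                max_length=max_length).to('cuda')
    with torch.no_grad():
        out = embedder.model(**inputs, output_hidden_states=True).hidden_states[-1][0, -1, :]
    return out.squeeze().float().cpu().numpy()

embedding = get_truncated_embedding(minicpm_sentence_embedding, "A very long document " * 500)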

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

Text embedding using MiniCPM-2B brings a powerful capability to your NLP projects. Whether you’re working on sentiment analysis, text similarity, or other language-related tasks, this model can help you achieve remarkable results.

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
