How to Use MiniCPM-2B for Text Embedding Tasks

Aug 7, 2024 | Educational

In the realm of natural language processing, turning text into a format that machines can interpret is crucial. Enter the MiniCPM-2B-Text-Embedding model, which is a fine-tuned version tailored specifically for text embedding tasks. With its strong foundational model and advanced fine-tuning techniques, you can achieve impressive results in sentence similarity and feature extraction!

What is MiniCPM-2B-Text-Embedding?

MiniCPM-2B-Text-Embedding is a LoRA adapter for the MiniCPM-2B-dpo-bf16 base model, trained with contrastive fine-tuning on Natural Language Inference (NLI) datasets. To use it, follow the steps below.

Getting Started with MiniCPM-2B-Text-Embedding

Clone the MiniCPM-2B-dpo-bf16 repository:
Open your terminal and run:
```bash
git clone https://huggingface.co/openbmb/MiniCPM-2B-dpo-bf16
```
Update the tokenizer settings:
Locate the tokenizer_config.json file in the cloned repository and add the following key to its top-level JSON object:
```json
"add_eos_token": true
```
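If you would rather not edit the file by hand, the same change can be scripted. The following is a minimal sketch using only the Python standard library; the helper name `enable_eos_token` and the example path are assumptions for illustration, not part of the model repository.

```python
import json
from pathlib import Path

def enable_eos_token(config_path):
    """Set "add_eos_token": true in a tokenizer_config.json file and return the updated config."""
    path = Path(config_path)
    # Read the existing settings so every other key is preserved
    config = json.loads(path.read_text(encoding="utf-8"))
    config["add_eos_token"] = True
    # Write the file back as valid JSON
    path.write_text(json.dumps(config, indent=2, ensure_ascii=False) + "\n", encoding="utf-8")
    return config

# Hypothetical usage; adjust the path to wherever you cloned the repository:
# enable_eos_token("MiniCPM-2B-dpo-bf16/tokenizer_config.json")
```

Running the helper twice is harmless, because it only sets the key to the same value again.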

Load and use the model:

Here’s a sample code snippet to demonstrate how to use the model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import numpy as np

class MiniCPMSentenceEmbedding:
    def __init__(self, model_path="openbmb/MiniCPM-2B-dpo-bf16", adapter_path=None):
        # Load the tokenizer from the copy whose tokenizer_config.json sets add_eos_token
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.model = AutoModelForCausalLM.from_pretrained(model_path,
            torch_dtype=torch.bfloat16,
            device_map="cuda",
            trust_remote_code=True)
        # Optionally attach the fine-tuned LoRA adapter on top of the base model
        if adapter_path:
            self.model.load_adapter(adapter_path)

    def get_last_hidden_state(self, text):
        inputs = self.tokenizer(text, return_tensors="pt").to("cuda")
        with torch.no_grad():
            # Final layer, first (only) sequence in the batch, last token position
            out = self.model(**inputs, output_hidden_states=True).hidden_states[-1][0, -1, :]
        # bfloat16 has no NumPy equivalent, so cast to float32 before converting
        return out.squeeze().float().cpu().numpy()

    def encode(self, sentences: list[str]) -> list[np.ndarray]:
        # Embed one sentence at a time and collect the resulting vectors
        out = []
        for s in sentences:
            out.append(self.get_last_hidden_state(s))
        return out

minicpm_sentence_embedding = MiniCPMSentenceEmbedding("your-cloned-base-model-path", "trapoom555/MiniCPM-2B-Text-Embedding-cft")
example_sentences = ["I don't like apples", "I like apples"]
encoded_sentences = minicpm_sentence_embedding.encode(example_sentences)
print(encoded_sentences)
```

Understanding the Code: The Art of Sentence Embedding

Imagine your sentences are popsicles melting in the sun. Each sentence starts as a unique shape but, with the right heat (in this case, the model), it melts down into a liquid that still carries its flavor: the embedding, a fixed-size vector. This process involves:

Initialization: Just like preparing the popsicle molds, the MiniCPMSentenceEmbedding class loads the tokenizer and the base model, then attaches the adapter if a path is given.
Getting Last Hidden State: The model reads a sentence and returns the final layer's hidden state at the last token, which is the EOS token once add_eos_token is enabled. That single vector retains the essence of the sentence and serves as its embedding.
Encoding: The sentence popsicles are lined up and processed one at a time, resulting in a list of embeddings that you can use as inputs for further AI tasks.
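Once you have embeddings, a common way to score sentence similarity is cosine similarity. The sketch below is a minimal, dependency-free version; the function name `cosine_similarity` and the toy vectors are assumptions for illustration, and in practice you would pass two of the vectors returned by encode().

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length 1-D vectors (lists or NumPy arrays)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real sentence embeddings
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```

Values close to 1 indicate vectors pointing in nearly the same direction, values near 0 indicate unrelated directions, and negative values indicate opposing directions.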

Training Details

If you want to understand how the adapter was trained, these are the key hyperparameters:

Loss: InfoNCE
Batch Size: 60
Learning Rate: 5e-05
Max Epochs: 1
And more…
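To make the InfoNCE loss listed above more concrete, here is a minimal sketch of the loss with in-batch negatives, written in plain Python for readability. It is not the training code for this model: the function names, the use of cosine similarity, and the temperature of 0.05 are all assumptions chosen for illustration.

```python
import math

def _normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def info_nce(anchors, positives, temperature=0.05):
    """Mean InfoNCE loss over a batch.

    positives[i] is the positive example for anchors[i]; every other entry
    of positives acts as an in-batch negative for that anchor.
    The temperature value is an assumed default, not taken from the model card.
    """
    anchors = [_normalize(a) for a in anchors]
    positives = [_normalize(p) for p in positives]
    losses = []
    for i, a in enumerate(anchors):
        logits = [sum(x * y for x, y in zip(a, p)) / temperature for p in positives]
        # Log-sum-exp with the maximum subtracted for numerical stability
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching positive as the correct class
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

The loss is small when each anchor is most similar to its own positive and grows when an in-batch negative scores higher, which is what pushes matching sentence pairs together during contrastive fine-tuning.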

Troubleshooting Tips

Sometimes, things might not go as planned while working with the MiniCPM-2B model. Here are some troubleshooting ideas:

Model Loading Issues: Ensure your paths are correct and the pre-trained model is fully downloaded.
CUDA Errors: If you encounter device-related issues, verify that your GPU drivers are up to date.
Memory Errors: If the model crashes due to memory limits, consider reducing the batch size.
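Before digging into driver versions, it can help to confirm whether PyTorch sees a GPU at all. The following is a small sketch; the helper name `pick_device` is hypothetical, and falling back to the CPU is an assumption about what you want when no GPU is found (a 2B-parameter model will run slowly there).

```python
def pick_device():
    """Return "cuda" if PyTorch is installed and reports a usable GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch itself is missing, so no GPU can be used from Python
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"

device = pick_device()
print(f"Selected device: {device}")
```

If this prints cpu on a machine that has a GPU, the problem usually lies with the driver or with a CPU-only PyTorch build rather than with the model code.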

If problems persist, reach out for support and consult resources from fxis.ai. For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Model Evaluation

Evaluation results on benchmarks such as STS and SICK-R show a significant improvement after fine-tuning compared with the base model, indicating that the contrastive fine-tuning is effective.

Conclusion

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
