In natural language processing, text has to be converted into numerical vectors before a machine can compare or search it. MiniCPM-2B-Text-Embedding is a version of MiniCPM-2B fine-tuned specifically for text embedding tasks such as sentence similarity and feature extraction.
What is MiniCPM-2B-Text-Embedding?
The model applies contrastive fine-tuning with LoRA (Low-Rank Adaptation) to the base model, using Natural Language Inference (NLI) datasets as training data. To try it yourself, follow the steps below.
Getting Started with MiniCPM-2B-Text-Embedding
- Clone the MiniCPM-2B-dpo-bf16 repository. Open your terminal and run the command below (Git LFS must be installed for the weight files to download):

      git clone https://huggingface.co/openbmb/MiniCPM-2B-dpo-bf16
- Update the tokenizer settings. Locate the tokenizer_config.json file in the cloned repository and add a new entry:

      "add_eos_token": true
- Load and use the model. Here’s a sample code snippet to demonstrate how to use the model:

      from transformers import AutoModelForCausalLM, AutoTokenizer
      import torch
      import numpy as np


      class MiniCPMSentenceEmbedding:
          def __init__(self, model_path="openbmb/MiniCPM-2B-dpo-bf16", adapter_path=None):
              self.tokenizer = AutoTokenizer.from_pretrained(model_path)
              self.model = AutoModelForCausalLM.from_pretrained(
                  model_path,
                  torch_dtype=torch.bfloat16,
                  device_map="cuda",
                  trust_remote_code=True,
              )
              # Optionally load the fine-tuned adapter on top of the base model
              if adapter_path:
                  self.model.load_adapter(adapter_path)

          def get_last_hidden_state(self, text):
              inputs = self.tokenizer(text, return_tensors="pt").to("cuda")
              with torch.no_grad():
                  # Final layer, first (only) sequence in the batch, last token
                  out = self.model(**inputs, output_hidden_states=True).hidden_states[-1][0, -1, :]
              return out.squeeze().float().cpu().numpy()

          def encode(self, sentences: list[str]) -> list[np.ndarray]:
              # Sentences are embedded one at a time
              out = []
              for s in sentences:
                  out.append(self.get_last_hidden_state(s))
              return out


      minicpm_sentence_embedding = MiniCPMSentenceEmbedding(
          "your-cloned-base-model-path", "trapoom555/MiniCPM-2B-Text-Embedding-cft"
      )
      example_sentences = ["I don't like apples", "I like apples"]
      encoded_sentences = minicpm_sentence_embedding.encode(example_sentences)
      print(encoded_sentences)
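The tokenizer change in the second step can also be made from a script instead of by hand. The following is a minimal sketch, assuming the file is ordinary JSON; the helper name `enable_add_eos_token` and the example path are hypothetical, not part of the model's tooling.

```python
import json
from pathlib import Path


def enable_add_eos_token(config_path):
    """Set "add_eos_token": true in a tokenizer_config.json file and return the updated dict."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    config["add_eos_token"] = True  # json.dumps writes this as the JSON literal true
    path.write_text(json.dumps(config, indent=2, ensure_ascii=False) + "\n", encoding="utf-8")
    return config


# Example (hypothetical path; point it at your own clone):
# enable_add_eos_token("MiniCPM-2B-dpo-bf16/tokenizer_config.json")
```

All existing keys in the file are kept; only the `add_eos_token` entry is added or overwritten.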
Understanding the Code: The Art of Sentence Embedding
Imagine your sentences are popsicles melting in the sun. Each sentence starts as a unique shape but, over time, with the right heat (in this case, the model), they transform into a delicious, uniform liquid – the embeddings. This process involves:
- Initialization: Just like preparing the popsicle molds, the MiniCPMSentenceEmbedding class initializes the tokenizer and model, and loads the adapter if one is given.
- Getting Last Hidden State: The model analyzes the shapes of our popsicles (sentences) and returns the final-layer hidden state of the last token, which retains the essence of the sentence (the embedding). With add_eos_token enabled, that last token is the end-of-sequence token.
- Encoding: The sentence popsicles are lined up and processed one at a time, resulting in a list of embeddings that you can use as inputs for further AI tasks.
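One common use of the resulting embeddings is sentence similarity. The sketch below compares two vectors with cosine similarity; the function name `cosine_similarity` and the toy vectors are illustrative assumptions, standing in for the real embeddings so the snippet runs without a GPU.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two 1-D vectors (1.0 means same direction)."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy vectors for illustration only; real embeddings are much longer
u = np.array([1.0, 0.0, 1.0])
v = np.array([2.0, 0.0, 2.0])   # same direction as u
w = np.array([0.0, 1.0, 0.0])   # orthogonal to u

print(cosine_similarity(u, v))
print(cosine_similarity(u, w))
```

With the sample code from the article, the same function could be called as `cosine_similarity(encoded_sentences[0], encoded_sentences[1])`.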
Training Details
The key training settings are:
- Loss: InfoNCE
- Batch Size: 60
- Learning Rate: 5e-05
- Max Epochs: 1
- And more…
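To make the loss in the list above more concrete, here is a minimal NumPy sketch of InfoNCE with in-batch negatives. It is an illustration, not the training code used for this model: the temperature value of 0.05 and the use of in-batch negatives are assumptions, and the function name `info_nce_loss` is hypothetical.

```python
import numpy as np


def info_nce_loss(anchors, positives, temperature=0.05):
    """InfoNCE with in-batch negatives.

    Row i of `positives` is the positive for row i of `anchors`;
    every other row in the batch acts as a negative.
    The temperature default is an assumed value.
    """
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = (a @ p.T) / temperature                      # (batch, batch) scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # subtract row max for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))            # cross-entropy against the diagonal


# Random toy batch: each positive is a slightly perturbed copy of its anchor
rng = np.random.default_rng(0)
toy_anchors = rng.normal(size=(4, 8))
toy_positives = toy_anchors + 0.01 * rng.normal(size=(4, 8))
print(info_nce_loss(toy_anchors, toy_positives))
```

The loss is small when each anchor is closest to its own positive and grows when the pairs are mismatched, which is what pushes related sentences together during contrastive fine-tuning.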
Troubleshooting Tips
Sometimes, things might not go as planned while working with the MiniCPM-2B model. Here are some troubleshooting ideas:
- Model Loading Issues: Ensure your paths are correct and the pre-trained model is fully downloaded.
- CUDA Errors: If you encounter device-related issues, verify that your GPU drivers are up to date.
- Memory Errors: If the model crashes due to memory limits, shorten the input text; when fine-tuning, consider reducing the batch size.
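For device-related problems, a quick check of whether PyTorch can see a GPU often narrows things down. The helper below is a small sketch; the name `pick_device` is hypothetical, and falling back to the CPU is an assumption about what you want when no GPU is found.

```python
def pick_device():
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is not installed, so no GPU path is available
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Using device: {device}")
```

If this prints `cpu` on a machine that has a GPU, the driver or the installed PyTorch build is the likely culprit. The returned value could replace the hard-coded `"cuda"` strings in the sample code, although a 2B-parameter model will run much more slowly on a CPU.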
If problems persist, reach out for support and consult resources from fxis.ai. For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Model Evaluation
Evaluation on benchmarks such as STS and SICK-R shows significant improvements after fine-tuning compared with the base model, indicating that the fine-tuning is effective.
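Benchmarks of this kind are typically scored with the Spearman rank correlation between the model's similarity scores and human similarity ratings. The sketch below shows that computation under the assumption that there are no tied values; the function name `spearman_correlation` is hypothetical, and the numbers are made-up placeholders, not results for this model.

```python
import numpy as np


def spearman_correlation(x, y):
    """Spearman rank correlation for 1-D inputs without tied values."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    rank_x = np.argsort(np.argsort(x))  # position of each value in sorted order
    rank_y = np.argsort(np.argsort(y))
    return float(np.corrcoef(rank_x, rank_y)[0, 1])


# Made-up placeholder values for illustration only
predicted_similarity = [0.91, 0.15, 0.60, 0.32]  # e.g. cosine similarities from the model
human_rating = [4.8, 0.5, 3.9, 2.0]              # e.g. annotator scores for the same pairs

print(spearman_correlation(predicted_similarity, human_rating))
```

Because only the ranking matters, the model is rewarded for ordering sentence pairs the way human annotators do, regardless of the absolute scale of its similarity scores.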
Conclusion
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.