How to Use the Piccolo Model for Chinese Text Embedding

Sep 11, 2023 | Educational

Piccolo is a robust general embedding model designed for the Chinese language, developed by the General Model Group at SenseTime Research. This guide will walk you through the process of using the Piccolo model, its training methodology, and provide troubleshooting tips to ensure a seamless experience.

Understanding the Piccolo Model

Teaching a computer to understand human language is a bit like teaching a child to recognize shapes from a box of blocks. Piccolo uses a two-stage approach to learn the intricacies of the Chinese language:

  • First Stage: Just as a child is first exposed to many different shapes, Piccolo was pretrained on 400 million weakly supervised Chinese text pairs sourced from the internet, learning from these pairs with a softmax contrastive loss.
  • Second Stage: To refine its understanding, Piccolo was then fine-tuned on roughly 20 million human-labeled text pairs, learning from both correct examples and hard negatives, much like guiding a child through puzzles that include both right and wrong answers.
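The first-stage objective above can be sketched as an in-batch softmax contrastive (InfoNCE-style) loss. The following is a minimal, illustrative NumPy implementation, not SenseTime's actual training code; the function name and temperature value are assumptions for the sketch:

```python
import numpy as np

def softmax_contrastive_loss(q, p, temperature=0.05):
    """In-batch softmax contrastive loss over a batch of unit-normalized
    query/positive embedding pairs, both of shape (batch, dim)."""
    # Similarity of every query against every passage in the batch.
    logits = (q @ p.T) / temperature              # (batch, batch)
    # The matching pair sits on the diagonal; other columns act as negatives.
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

# Toy batch of 3 unit-normalized embedding pairs.
rng = np.random.default_rng(0)
q = rng.normal(size=(3, 8))
q /= np.linalg.norm(q, axis=1, keepdims=True)
loss = softmax_contrastive_loss(q, q)  # identical pairs -> loss near zero
```

Pulling every other pair in the batch in as a negative is what lets weakly supervised web data at this scale work without explicit negative labels.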

Model Specifications

Piccolo comes in two variants:

  • piccolo-base-zh: Size: 0.2 GB, Dimensions: 768, Sequence Length: 512
  • piccolo-large-zh: Size: 0.65 GB, Dimensions: 1024, Sequence Length: 512

Using Piccolo with Sentence-Transformer

To implement the Piccolo model using the sentence-transformers package, follow these steps:

# For short-to-short dataset
from sentence_transformers import SentenceTransformer

sentences = ["数据1", "数据2"]
model = SentenceTransformer('sensenova/piccolo-base-zh')
embeddings = model.encode(sentences, normalize_embeddings=True)
# With unit-normalized vectors, the dot product is cosine similarity.
similarity = embeddings @ embeddings.T
print(similarity)
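Because the embeddings are L2-normalized (normalize_embeddings=True), the matrix product above computes cosine similarity directly. A quick model-free check with made-up vectors standing in for model output:

```python
import numpy as np

# Two made-up, unnormalized vectors standing in for raw embeddings.
a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

# Normalize to unit length, as normalize_embeddings=True does.
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)

cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
dot_of_normalized = a_n @ b_n
# Both equal 24/25 = 0.96: normalizing first lets you skip the division.
```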

For short-to-long datasets (e.g., retrieval), prepend instruction prefixes to help the model distinguish queries from passages:

# For short-to-long dataset
from sentence_transformers import SentenceTransformer

queries = ["query_1", "query_2"]
passages = ["doc_1", "doc_2"]
model = SentenceTransformer('sensenova/piccolo-base-zh')
# "查询" (query) and "结果" (result) are Piccolo's retrieval instruction prefixes.
q_embeddings = model.encode([f"查询: {q}" for q in queries], normalize_embeddings=True)
p_embeddings = model.encode([f"结果: {p}" for p in passages], normalize_embeddings=True)
scores = q_embeddings @ p_embeddings.T
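Once you have the query–passage score matrix, retrieval is just a per-query sort. A minimal sketch with dummy scores standing in for the real matrix:

```python
import numpy as np

# Dummy (2 queries x 3 passages) score matrix standing in for
# the query-passage cosine similarities computed above.
scores = np.array([[0.83, 0.12, 0.47],
                   [0.05, 0.91, 0.33]])

# For each query, passage indices ordered from best to worst match.
ranking = np.argsort(-scores, axis=1)
best = ranking[:, 0]  # top-1 passage index per query
```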

Training Details

For training Piccolo, there are specific configurations to consider:

  • Pretraining: A maximum sequence length of 128 is recommended, which allows larger batch sizes and reduces memory usage, using binary contrastive loss.
  • Fine-tuning: The maximum length extends to 512 to accommodate longer text inputs, utilizing a triplet contrastive loss that incorporates hard negative samples.
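The fine-tuning objective over (query, positive, hard-negative) triples can be sketched as a triplet margin loss. This is an illustrative stand-in, not the exact Piccolo loss; the margin value is an assumption:

```python
import numpy as np

def triplet_margin_loss(q, pos, neg, margin=0.2):
    """Push each query closer to its positive than to its hard negative
    by at least `margin`. All inputs unit-normalized, shape (batch, dim)."""
    pos_sim = np.sum(q * pos, axis=1)   # cosine similarity to positives
    neg_sim = np.sum(q * neg, axis=1)   # cosine similarity to hard negatives
    return np.mean(np.maximum(0.0, margin - pos_sim + neg_sim))

q   = np.array([[1.0, 0.0]])
pos = np.array([[1.0, 0.0]])   # identical to the query -> similarity 1.0
neg = np.array([[0.0, 1.0]])   # orthogonal -> similarity 0.0
loss = triplet_margin_loss(q, pos, neg)  # margin satisfied -> loss 0.0
```

Hard negatives, passages that look relevant but are not, make this loss informative; random negatives are usually already far from the query and contribute nothing.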

Troubleshooting

If you encounter issues while using the Piccolo model, consider the following troubleshooting tips:

  • Ensure your Python environment has the necessary packages installed, such as sentence-transformers.
  • If you face memory issues, experiment with reducing the batch size or leveraging mixed precision training techniques.
  • If you’re not getting accurate embeddings, examine the input data to ensure it follows the expected formatting.
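One quick sanity check for the last point: with normalize_embeddings=True, every returned vector should have unit L2 norm. A model-free sketch of that check, with random vectors standing in for real model output (the helper name is hypothetical):

```python
import numpy as np

def check_unit_norm(embeddings, tol=1e-5):
    """Return True if every row of `embeddings` has L2 norm close to 1."""
    norms = np.linalg.norm(embeddings, axis=1)
    return bool(np.all(np.abs(norms - 1.0) < tol))

rng = np.random.default_rng(1)
raw = rng.normal(size=(4, 768))   # stand-in for unnormalized embeddings
unit = raw / np.linalg.norm(raw, axis=1, keepdims=True)
```

If this check fails on your real embeddings, the normalize_embeddings flag was likely omitted, and raw dot products will no longer behave like cosine similarities.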

For deeper insights, updates, or collaboration on AI development projects, stay connected with fxis.ai.

Conclusion

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
