Piccolo is a robust general embedding model designed for the Chinese language, developed by the General Model Group at SenseTime Research. This guide will walk you through the process of using the Piccolo model, its training methodology, and provide troubleshooting tips to ensure a seamless experience.
Understanding the Piccolo Model
Imagine teaching a computer to understand human language the way you might teach a child to recognize shapes from a box of blocks. Piccolo uses a two-stage approach to learn the intricacies of the Chinese language:
- First Stage: Just as you would expose a child to many different shapes, Piccolo was first trained on 400 million weakly supervised Chinese text pairs collected from the internet, using a softmax contrastive loss to learn from these pairs.
- Second Stage: To refine its understanding, Piccolo was then fine-tuned on about 20 million human-labeled text pairs, learning from both correct examples and hard negatives, much like guiding a child through puzzles that include both right and wrong answers.
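The softmax contrastive loss used in the first stage can be sketched as an InfoNCE-style objective over in-batch negatives: each text scores its true pair against every other pair in the batch. This is a minimal NumPy sketch of the idea, not Piccolo's actual training code; the `temperature` value is an assumption.

```python
import numpy as np

def info_nce_loss(q, p, temperature=0.05):
    """Softmax (InfoNCE-style) contrastive loss over in-batch negatives.

    q, p: (batch, dim) embeddings of paired texts; row i of q pairs with row i of p.
    """
    # L2-normalize so the dot product is cosine similarity
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    logits = q @ p.T / temperature                     # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # the i-th query's positive is the i-th passage: take the diagonal
    return -np.mean(np.diag(log_probs))
```

With perfectly matched, mutually orthogonal pairs the loss approaches zero; with random embeddings it stays positive, which is what drives paired texts together during training.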
Model Specifications
Piccolo offers two variations:
- piccolo-base-zh: Size: 0.2 GB, Dimensions: 768, Sequence Length: 512
- piccolo-large-zh: Size: 0.65 GB, Dimensions: 1024, Sequence Length: 512
Using Piccolo with Sentence-Transformer
To implement the Piccolo model using the sentence-transformers package, follow these steps:
```python
# For short-to-short datasets
from sentence_transformers import SentenceTransformer

sentences_1 = ["数据1", "数据2"]
sentences_2 = ["数据3", "数据4"]
model = SentenceTransformer('sensenova/piccolo-base-zh')
embeddings_1 = model.encode(sentences_1, normalize_embeddings=True)
embeddings_2 = model.encode(sentences_2, normalize_embeddings=True)
similarity = embeddings_1 @ embeddings_2.T
print(similarity)
```
For short-to-long datasets, you may want to include instructions to help the model with retrieval:
```python
# For short-to-long datasets
from sentence_transformers import SentenceTransformer

queries = ["query_1", "query_2"]
passages = ["doc_1", "doc_2"]
model = SentenceTransformer('sensenova/piccolo-base-zh')
# prepend the retrieval instructions "查询: " (query) and "结果: " (result)
q_embeddings = model.encode([f"查询: {q}" for q in queries], normalize_embeddings=True)
p_embeddings = model.encode([f"结果: {p}" for p in passages], normalize_embeddings=True)
scores = q_embeddings @ p_embeddings.T
```
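Since the embeddings are normalized, each entry of `scores` is a cosine similarity, and you can rank passages per query directly. A small self-contained sketch with made-up score values (the numbers below are illustrative, not real model output):

```python
import numpy as np

# hypothetical (query x passage) cosine-similarity matrix,
# shaped like the `scores` produced above
scores = np.array([[0.82, 0.31, 0.55],
                   [0.12, 0.78, 0.40]])

# for each query, passage indices ordered from most to least similar
ranking = np.argsort(-scores, axis=1)
print(ranking)  # row i lists passage indices for query i, best first
```

For query 0 this yields the order `[0, 2, 1]`: passage 0 is the best match, then passage 2, then passage 1.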
Training Details
For training Piccolo, there are specific configurations to consider:
- Pretraining: A maximum sequence length of 128 is used, which increases the effective batch size and reduces memory usage; training relies on a pair-based contrastive loss.
- Fine-tuning: The maximum sequence length is extended to 512 to accommodate longer texts, and training switches to a triplet contrastive loss that incorporates hard negative samples.
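The fine-tuning objective can be sketched as a softmax over a query's positive and an explicit hard negative. This is a simplified illustration of a triplet contrastive loss, not Piccolo's exact training recipe (the real setup typically also keeps in-batch negatives, and the temperature here is an assumption):

```python
import numpy as np

def triplet_contrastive_loss(q, pos, neg, temperature=0.05):
    """Softmax contrastive loss where each query scores its positive
    against a mined hard negative. q, pos, neg: (batch, dim) arrays."""
    norm = lambda x: x / np.linalg.norm(x, axis=1, keepdims=True)
    q, pos, neg = norm(q), norm(pos), norm(neg)
    s_pos = np.sum(q * pos, axis=1) / temperature   # (batch,) positive scores
    s_neg = np.sum(q * neg, axis=1) / temperature   # (batch,) hard-negative scores
    # -log p(positive) over the {positive, hard negative} pair, stabilized
    m = np.maximum(s_pos, s_neg)
    return np.mean(-(s_pos - m) + np.log(np.exp(s_pos - m) + np.exp(s_neg - m)))
```

When the query already sits on its positive and far from the negative, the loss is near zero; when it sits on the negative instead, the loss is large, pushing the model to separate hard cases.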
Troubleshooting
If you encounter issues while using the Piccolo model, consider the following troubleshooting tips:
- Ensure your Python environment has the necessary packages installed, such as sentence-transformers.
- If you face memory issues, experiment with reducing the batch size or leveraging mixed precision training techniques.
- If you’re not getting accurate embeddings, examine the input data to ensure it follows the expected formatting.
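For the last point, a common formatting slip in short-to-long retrieval is forgetting the "查询: " instruction prefix on queries. A small helper like the following (a hypothetical utility, not part of any library) can normalize inputs before encoding:

```python
def with_query_prefix(texts, prefix="查询: "):
    """Ensure each query string carries the retrieval instruction prefix
    expected for short-to-long retrieval; assumes plain strings and leaves
    already-prefixed entries untouched."""
    return [t if t.startswith(prefix) else prefix + t for t in texts]

print(with_query_prefix(["天气怎么样", "查询: 北京景点"]))
```

Passing the result to `model.encode(..., normalize_embeddings=True)` keeps query formatting consistent across your pipeline.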
For deeper insights, updates, or collaboration on AI development projects, stay connected with fxis.ai.
Conclusion
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

