Welcome to this guide on using the roberta-base-ca-cased-sts model, a powerful tool for assessing semantic textual similarity in the Catalan language. In this article, we’ll walk through its functionalities, usage, potential issues, and how to troubleshoot effectively.
Table of Contents
- Model Description
- Intended Uses and Limitations
- How to Use
- Limitations and Bias
- Training
- Evaluation
- Additional Information
Model Description
The roberta-base-ca-cased-sts is a fine-tuned model for Semantic Textual Similarity (STS) in Catalan. It builds upon the roberta-base-ca base model, which was pre-trained on a medium-sized Catalan corpus collected from publicly available sources.
Intended Uses and Limitations
This model can be employed to determine the similarity between two text snippets. However, it is important to note that its performance is constrained by the dataset it was trained on, which may not be universally applicable for all scenarios.
How to Use
To obtain the model’s prediction scores, which range from 0.0 to 5.0, use the following code:
```python
from transformers import pipeline, AutoTokenizer
from scipy.special import logit

model = "projecte-aina/roberta-base-ca-cased-sts"
tokenizer = AutoTokenizer.from_pretrained(model)
pipe = pipeline("text-classification", model=model, tokenizer=tokenizer)

def prepare(sentence_pairs):
    # Build each input manually: [CLS] sentence1 [SEP][SEP] sentence2 [SEP]
    sentence_pairs_prep = []
    for s1, s2 in sentence_pairs:
        sentence_pairs_prep.append(tokenizer.cls_token + s1 + tokenizer.sep_token + tokenizer.sep_token + s2 + tokenizer.sep_token)
    return sentence_pairs_prep

sentence_pairs = [
    ("El llibre va caure per la finestra.", "El llibre va sortir volant."),
    ("M'agrades.", "T'estimo."),
    ("M'agrada el sol i la calor", "A la Garrotxa plou molt."),
]

predictions = pipe(prepare(sentence_pairs), add_special_tokens=False)

# Convert the pipeline's sigmoid-squashed scores back to the original 0-5 scale.
for prediction in predictions:
    prediction["score"] = logit(prediction["score"])
print(predictions)
```
The code above loads the model and tokenizer, manually assembles each pair of sentences into a single input with the special tokens the model expects, and then applies logit to map the pipeline's output back onto the 0–5 similarity scale. Think of this as a chef assembling ingredients (the text pairs) before cooking them together (computing the scores).
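Why the logit call? The text-classification pipeline passes the model's single regression output through a sigmoid, squashing it into (0, 1); logit is the sigmoid's inverse and recovers the original value. A minimal sketch of that inversion, using only scipy (the raw score 3.2 is a made-up value for illustration):

```python
from scipy.special import expit, logit

raw_similarity = 3.2                # hypothetical raw model output on the 0-5 scale
squashed = expit(raw_similarity)    # what a sigmoid-applying pipeline would return
recovered = logit(squashed)         # undo the sigmoid

print(round(recovered, 4))  # recovers 3.2 up to floating-point error
```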
Limitations and Bias
Currently, no bias assessment measures have been implemented for the model. Be aware that biases may exist, as the training corpus was collected from diverse web sources. Future updates aim to address these bias concerns.
Training
Training was conducted using the STS dataset in Catalan known as STS-ca. The model was trained with a batch size of 16 and a learning rate of 5e-5 over 5 epochs.
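The reported hyperparameters can be sketched as a Hugging Face TrainingArguments configuration. Only the batch size, learning rate, and epoch count come from the text; the output directory and the assumption that the Trainer API was used at all are illustrative:

```python
from transformers import TrainingArguments

# Hypothetical fine-tuning configuration mirroring the reported settings.
training_args = TrainingArguments(
    output_dir="./roberta-base-ca-cased-sts",  # assumed path
    per_device_train_batch_size=16,
    learning_rate=5e-5,
    num_train_epochs=5,
)
```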
Evaluation
The model was evaluated based on its Pearson and Spearman correlation scores against standard multilingual and monolingual baselines, yielding a Pearson score of approximately 0.797.
Additional Information
If you wish to reach out for further information or collaboration, please contact the Text Mining Unit (TeMU) at the Barcelona Supercomputing Center.
Troubleshooting
If you encounter any issues while using the model, consider the following troubleshooting ideas:
- Ensure all necessary libraries are installed and correctly configured, especially transformers and scipy.
- Double-check model names and ensure that they are correctly spelled in the code.
- Experiment with different sentence pairs to ensure the model’s adaptability.
- If predictions don’t seem reasonable, it could be due to the model’s training dataset. Remember, it may not generalize well.
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

