How to Detect Anglicisms in Spanish Text with mBERT

May 20, 2022 | Educational

Are you interested in identifying unassimilated English lexical borrowings in Spanish texts? If so, the anglicisms-spanish-mbert model can help! In this guide, we’ll walk you through the process of using this pretrained model to detect anglicisms in Spanish news articles.

What is the anglicisms-spanish-mbert Model?

The anglicisms-spanish-mbert model detects unassimilated English lexical borrowings (anglicisms) in Spanish text, such as fake news, machine learning, smartwatch, and influencer. It is a version of multilingual BERT (mBERT) fine-tuned on the COALAS corpus.

How to Use the Model

Using the anglicisms-spanish-mbert model is straightforward with just a few setup steps. Follow these instructions to get started:

Step 1: Install Required Libraries

You need the transformers library, plus a deep learning backend such as PyTorch so the pipeline can run the model. You can install both using pip:

pip install transformers torch

Step 2: Load the Model

Here’s how to load the model:

from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("lirondos/anglicisms-spanish-mbert")
model = AutoModelForTokenClassification.from_pretrained("lirondos/anglicisms-spanish-mbert")
nlp = pipeline("ner", model=model, tokenizer=tokenizer)

Step 3: Input Your Text

Now you’re ready to analyze a text for anglicisms! Here’s a quick example:

example = "Buscamos data scientist para proyecto de machine learning."
borrowings = nlp(example)
print(borrowings)

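This code prints a list of dictionaries, one for each token the model tagged as part of a borrowing. Each dictionary includes the predicted label (entity), a confidence score (score), the token text (word), and its character offsets in the input (start and end).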

Understanding the Output

Each entity in the output carries one of two labels: ENG for English lexical borrowings and OTHER for borrowings from other languages. Because the default pipeline reports predictions token by token, a multiword borrowing such as machine learning appears as several consecutive entries rather than as a single span.
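To work with whole borrowings rather than individual tokens, the token-level entries can be merged into spans using their character offsets. The following is a minimal sketch, not part of the model itself. It assumes the tags follow a BIO-style scheme (for example B-ENG and I-ENG), and the helper names group_borrowings and mock_pred are hypothetical. The predictions are hand-written to mimic the shape of the pipeline output; they are not real model output.

```python
from typing import Dict, List


def group_borrowings(text: str, token_preds: List[Dict]) -> List[Dict]:
    """Merge token-level predictions into whole borrowing spans.

    Assumes each prediction has "entity", "start" and "end" keys, and that
    tags look like "B-ENG" / "I-ENG" (BIO-style). This is an assumption.
    """
    spans: List[Dict] = []
    for pred in sorted(token_preds, key=lambda p: p["start"]):
        tag = pred["entity"]
        if tag == "O":
            continue
        # Split "B-ENG" into prefix "B" and label "ENG"; treat a bare tag as "B".
        prefix, sep, label = tag.partition("-")
        if not sep:
            prefix, label = "B", tag
        if spans:
            last = spans[-1]
            gap = text[last["end"]:pred["start"]]
            same_label = label == last["label"]
            adjacent = gap == ""  # subword pieces of the same word
            inside = prefix == "I" and gap.strip() == ""  # next word of the span
            if same_label and (adjacent or inside):
                last["end"] = pred["end"]
                continue
        spans.append({"label": label, "start": pred["start"], "end": pred["end"]})
    for span in spans:
        span["text"] = text[span["start"]:span["end"]]
    return spans


example = "Buscamos data scientist para proyecto de machine learning."


def mock_pred(word: str, tag: str) -> Dict:
    """Build a hand-written prediction in the shape the pipeline returns."""
    start = example.index(word)
    return {"entity": tag, "word": word, "start": start, "end": start + len(word)}


# Illustrative input only: these tags are written by hand, not produced by the model.
mock_output = [
    mock_pred("data", "B-ENG"),
    mock_pred("scientist", "I-ENG"),
    mock_pred("machine", "B-ENG"),
    mock_pred("learning", "I-ENG"),
]

spans = group_borrowings(example, mock_output)
print([(s["label"], s["text"]) for s in spans])
```

With real predictions, you would pass the list returned by nlp(example) in place of mock_output. The transformers pipeline also accepts an aggregation_strategy argument (for instance "simple") that performs a similar grouping for you, which may be the simpler option if its behavior suits your data.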

Performance Metrics

The model achieves the following results on the test set from the COALAS corpus:

LABEL   Precision   Recall   F1
ALL     88.09       79.46    83.55
ENG     88.44       82.16    85.19
OTHER   37.5        6.52     11.11
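F1 is the harmonic mean of precision and recall, so the third column can be recomputed from the first two as a sanity check. The sketch below does this for the figures in the table; the function name f1_score is just an illustrative choice. Because the published precision and recall are already rounded, the recomputed values may differ from the reported ones in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above.
reported = {
    "ALL": (88.09, 79.46, 83.55),
    "ENG": (88.44, 82.16, 85.19),
    "OTHER": (37.5, 6.52, 11.11),
}

for label, (p, r, f1) in reported.items():
    recomputed = f1_score(p, r)
    print(f"{label}: reported {f1:.2f}, recomputed {recomputed:.2f}")
```

The OTHER row shows how the harmonic mean is pulled toward the weaker of the two scores: with recall at 6.52, F1 stays low even though precision is 37.5.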

Troubleshooting & Tips

If you encounter any issues while using this model, here are some helpful troubleshooting ideas:

  • Library Version: Make sure you're using a recent version of the transformers library. Outdated versions may lead to compatibility issues.
  • Internet Connection: The first call to from_pretrained downloads the model files from Hugging Face, so you need a working internet connection at that point. Later calls can reuse the locally cached files.
  • Alternative Model: If you need better performance, consider the Flair-based model, which reports a higher F1 score (85.76).
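When debugging version problems, it helps to confirm what is actually installed in the current environment. The snippet below is a small sketch using only the Python standard library (Python 3.8 or later). The helper name installed_version is hypothetical, and checking torch is an assumption that PyTorch is the backend you are using.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# "torch" is an assumed backend; swap in whichever one you installed.
for package in ("transformers", "torch"):
    version = installed_version(package)
    print(f"{package}: {version or 'not installed'}")
```

If a package shows as not installed, or the version is older than expected, upgrading it with pip is a reasonable first step before looking further.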

Conclusion

The anglicisms-spanish-mbert model offers an effective way to analyze Spanish texts for anglicisms. By following the steps outlined above, you can identify unassimilated lexical borrowings and get a clearer picture of how they are used in contemporary media.
