How to Use the Spanish News Classification Model

September 7, 2021

In natural language processing, classifying news headlines helps organize and retrieve information quickly. This guide shows how to use the Spanish News Classification model developed by M47Labs, which assigns each headline to a topic tag.

Understanding the Model

This model is based on BETO, a Spanish adaptation of BERT, fine-tuned on 1000 examples. It classifies texts into categories such as politics, economics, and sports. Below are some of the tags you can expect:

ciencia_tecnologia
clickbait
cultura
deportes
economia
educacion
medio_ambiente
opinion
politica
sociedad
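
For readers who do not speak Spanish, the tags above can be glossed in English. The mapping below is only an illustrative sketch: the English names and the `describe_label` helper are assumptions added for this article, not something the model provides.

```python
# English glosses for the Spanish tags listed above.
# The translations are ours; the model itself only returns the Spanish tag.
LABEL_GLOSSES = {
    "ciencia_tecnologia": "science and technology",
    "clickbait": "clickbait",
    "cultura": "culture",
    "deportes": "sports",
    "economia": "economy",
    "educacion": "education",
    "medio_ambiente": "environment",
    "opinion": "opinion",
    "politica": "politics",
    "sociedad": "society",
}


def describe_label(label: str) -> str:
    """Return an English gloss for a tag, or the tag itself if it is unknown."""
    return LABEL_GLOSSES.get(label, label)


print(describe_label("medio_ambiente"))  # environment
```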

How to Implement the Model

Now, let’s dive into the implementation process. You can think of running this model like trying to classify fruits at a market. Each fruit (news headline) needs to be evaluated and placed in the correct basket (category).

Example of Use

The following example helps illustrate how to set up your pipeline for text classification:

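from transformers import AutoTokenizer, BertForSequenceClassification, TextClassificationPipeline

# Headline to classify and the model's identifier on the Hugging Face hub.
review_text = 'los vehiculos que esten esperando pasajaeros deberan estar apagados para reducir emisiones'
path = "M47Labs/spanish_news_classification_headlines"

# Download (or load from cache) the tokenizer and the fine-tuned model.
tokenizer = AutoTokenizer.from_pretrained(path)
model = BertForSequenceClassification.from_pretrained(path)

# Wrap both in a pipeline that handles tokenization and label lookup.
nlp = TextClassificationPipeline(task="text-classification", model=model, tokenizer=tokenizer)

# Prints a list with one dict per input, containing 'label' and 'score'.
print(nlp(review_text))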
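
The pipeline returns a list of dictionaries, each with a `label` and a `score`. As a minimal sketch of how that output might be consumed, the hypothetical helper below keeps a prediction only when its score clears a threshold. The helper name, the threshold, and the sample values are assumptions for illustration; the score shown is invented, not a real model result.

```python
def confident_label(predictions, threshold=0.5):
    """Return the highest-scoring label if it meets the threshold, else None.

    `predictions` is expected in the pipeline's output shape:
    a list of {"label": str, "score": float} dictionaries.
    """
    if not predictions:
        return None
    best = max(predictions, key=lambda p: p["score"])
    return best["label"] if best["score"] >= threshold else None


# Made-up output in the same shape as nlp(review_text); the score is invented.
sample = [{"label": "medio_ambiente", "score": 0.93}]

print(confident_label(sample))        # medio_ambiente
print(confident_label(sample, 0.99))  # None
```

With the pipeline from the example, the call would be `confident_label(nlp(review_text))`.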

Using PyTorch for In-depth Classification

If you prefer a more hands-on approach, you can call the model directly with PyTorch, which gives you finer control over how headlines are tokenized and scored. Just like carefully inspecting each fruit before placing it in the right basket, you can monitor every step.

import torch
from transformers import AutoTokenizer, BertForSequenceClassification

model_name = 'M47Labs/spanish_news_classification_headlines'
MAX_LEN = 32  # maximum number of tokens per headline

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name)

texto = "las emisiones estan bajando, debido a las medidas ambientales tomadas por el gobierno"

# Tokenize the headline, pad or truncate it to MAX_LEN, and return PyTorch tensors.
encoded_review = tokenizer(
    texto,
    max_length=MAX_LEN,
    add_special_tokens=True,
    padding='max_length',
    truncation=True,
    return_attention_mask=True,
    return_tensors='pt',
)

input_ids = encoded_review['input_ids']
attention_mask = encoded_review['attention_mask']

# Run the forward pass without tracking gradients, since this is inference only.
with torch.no_grad():
    output = model(input_ids=input_ids, attention_mask=attention_mask)

# The index of the largest logit is the predicted class id.
_, prediction = torch.max(output['logits'], dim=1)

print(f'Headline: {texto}')
print(f'Category: {model.config.id2label[prediction.item()]}')
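
Working with raw logits also makes it possible to look beyond the single best class. The sketch below applies a softmax and returns the top `k` labels with their probabilities. It uses only the standard library so it can run without the model; the `top_k_labels` helper, the three-label mapping, and the logit values are all invented for illustration.

```python
import math


def top_k_labels(logits, id2label, k=3):
    """Convert raw logits to probabilities and return the k most likely labels."""
    # Subtract the maximum before exponentiating for numerical stability.
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Rank class ids from most to least probable.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


# Toy mapping and invented logits, standing in for real model output.
toy_id2label = {0: "deportes", 1: "economia", 2: "politica"}
toy_logits = [0.2, 2.1, -1.0]

for label, prob in top_k_labels(toy_logits, toy_id2label, k=2):
    print(f"{label}: {prob:.3f}")
```

With the model from the example above, the inputs would be something like `output['logits'][0].detach().tolist()` and `model.config.id2label`.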

Training and Validation Results

Once set up, you can see how well the model performs over various epochs, much like monitoring how well your fruit baskets are organized and labeled after several iterations of sorting.

Troubleshooting

If you encounter issues while implementing the Spanish News Classification headlines model, here are some troubleshooting tips:

Ensure that your Python packages are updated to their latest versions, especially the transformers library.
Check your internet connection, as the model and tokenizer need to be downloaded from the Hugging Face model hub.
For memory issues, consider using a machine with more memory or processing your input in smaller batches.
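
To illustrate the last tip, here is a minimal sketch of splitting a list of headlines into smaller batches before classification. The `iter_batches` helper and the batch size are assumptions for illustration, and the third headline is made up.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` containing at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


headlines = [
    "los vehiculos que esten esperando pasajaeros deberan estar apagados para reducir emisiones",
    "las emisiones estan bajando, debido a las medidas ambientales tomadas por el gobierno",
    "el equipo local gana el campeonato",  # made-up headline
]

for batch in iter_batches(headlines, 2):
    # With a pipeline such as `nlp` from the first example,
    # each batch could be classified here with nlp(batch).
    print(len(batch))
```

Smaller batches lower peak memory use at the cost of more forward passes, so the right size depends on the hardware available.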

