How to Use Twitter-roBERTa-base for Offensive Language Identification

Aug 20, 2024 | Educational

In the age of social media, identifying offensive language in tweets is an essential task. The twitter-roBERTa-base model, trained on around 58 million tweets and fine-tuned for offensive language identification with the TweetEval benchmark, makes this much easier, and the official TweetEval repository provides the label mapping used to read its output. Let’s explore how to use this model effectively!

Getting Started

The following are the steps you’ll need to take to set up the model for offensive language identification:

Installing the Required Libraries
Loading the Model and Tokenizer
Preprocessing the Text
Making Predictions

1. Installing the Required Libraries

Ensure that you have the Hugging Face Transformers library installed, along with PyTorch (the code below requests PyTorch tensors), NumPy, and SciPy. You can do this by running:

pip install transformers torch numpy scipy
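
To confirm the installation worked, a small check like the following can help. It is only a sketch: the helper name installed_versions is made up for this example, and torch is included on the assumption that PyTorch is the backend, since the prediction code asks the tokenizer for PyTorch tensors.

```python
from importlib.metadata import PackageNotFoundError, version

# Packages this tutorial relies on (torch is an assumption: PyTorch backend)
REQUIRED = ["transformers", "torch", "numpy", "scipy"]

def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

versions = installed_versions(REQUIRED)
for name, ver in versions.items():
    print(f"{name}: {ver or 'NOT INSTALLED'}")
```

Any package reported as NOT INSTALLED needs to be installed with pip before continuing.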

2. Loading the Model and Tokenizer

First, import the model and tokenizer classes and load the pre-trained tokenizer. The model itself is loaded from the same checkpoint name in step 4. Here’s how you can set this up:

from transformers import AutoModelForSequenceClassification, AutoTokenizer

task = 'offensive'
MODEL = f"cardiffnlp/twitter-roberta-base-{task}"
tokenizer = AutoTokenizer.from_pretrained(MODEL)

3. Preprocessing the Text

Before feeding the text into the model, it needs to be preprocessed: user mentions are replaced with the placeholder @user and links with the placeholder http. Think of this step as prepping the ingredients before cooking a dish. Here’s a function that does this:

def preprocess(text):
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)
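
To see what this function does in practice, here is a short usage sketch. The sample tweet, user name, and URL are invented for illustration, and the function is repeated so the snippet runs on its own.

```python
def preprocess(text):
    # Same logic as the function above, repeated so this snippet is self-contained
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)

# Made-up sample tweet with a mention, a link, and a lone "@"
sample = "@alice check this out https://example.com/post @ 5pm"
cleaned = preprocess(sample)
print(cleaned)  # @user check this out http @ 5pm
```

Note that the text is split on spaces only, so a mention glued to punctuation, such as "(@bob", does not start with "@" and is left unchanged.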

4. Making Predictions

Once you’ve preprocessed the text, it’s time to make predictions. Think of the model as a keen judge evaluating tweets: it returns a score for each of the two classes, offensive and not-offensive, much as a judge might score each participant in a competition. Here’s the complete code for making predictions:

import numpy as np
from scipy.special import softmax
import urllib.request
import csv

# Download label mapping
labels = []
mapping_link = f"https://raw.githubusercontent.com/cardiffnlp/tweeteval/main/datasets/{task}/mapping.txt"
with urllib.request.urlopen(mapping_link) as f:
    html = f.read().decode('utf-8').split("\n")
    csvreader = csv.reader(html, delimiter='\t')
    labels = [row[1] for row in csvreader if len(row) > 1]

# Text input
text = "Good night 😊"
text = preprocess(text)
encoded_input = tokenizer(text, return_tensors='pt')

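# Model predictions
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
output = model(**encoded_input)
# output[0] holds the logits; take the first (and only) item in the batch
scores = output[0][0].detach().numpy()
# Convert the raw logits into probabilities that sum to 1
scores = softmax(scores)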

# Ranking predictions
ranking = np.argsort(scores)[::-1]
for i in range(scores.shape[0]):
    l = labels[ranking[i]]
    s = scores[ranking[i]]
    print(f"{i+1}) {l} {np.round(float(s), 4)}")

Understanding the Output

After running the code, the output might look like this:

1) not-offensive 0.9073
2) offensive 0.0927

The output indicates that the model classifies the text as “not-offensive” with a score of 0.9073 and assigns only 0.0927 to “offensive.” Because the scores come out of a softmax, they are probabilities that sum to 1; the higher the score, the more confident the model is in that label.
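
The following sketch illustrates how raw model scores become the ranked probabilities shown above, and how you might turn them into a yes/no decision. It uses only the standard library so it runs without downloading the model. The logits are made-up numbers rather than real model output, the classify helper and its 0.5 threshold are hypothetical choices for this example, and the label order is assumed to match the mapping file.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(logits, labels, threshold=0.5):
    """Return (label, probability) pairs sorted high to low, plus an offensive flag."""
    probs = softmax(logits)
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    is_offensive = dict(ranked)["offensive"] >= threshold
    return ranked, is_offensive

labels = ["not-offensive", "offensive"]  # order assumed to match mapping.txt
example_logits = [1.2, -1.1]             # made-up values, not real model output
ranked, flagged = classify(example_logits, labels)
for position, (label, prob) in enumerate(ranked, start=1):
    print(f"{position}) {label} {prob:.4f}")
print("flagged as offensive:", flagged)
```

Lowering the threshold flags more tweets at the cost of more false positives, so the right value depends on how the predictions will be used.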

Troubleshooting

If the model doesn’t seem to perform correctly or you encounter errors, here are some troubleshooting tips:

Ensure the libraries are correctly installed and updated to the latest version.
Check your internet connection, as some functions require downloading resources.
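Verify that the text being processed is a plain string and not excessively long; inputs beyond the model’s maximum sequence length can cause errors unless you truncate them, for example by passing truncation=True and max_length=512 to the tokenizer.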
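
For the connectivity tip in particular, one option is to make the label download more forgiving. The sketch below separates parsing from downloading and falls back to a hard-coded list if the request fails. The function names parse_mapping and load_labels, the 10-second timeout, and the fallback list are assumptions made for this example; the fallback labels are taken from the sample output shown earlier.

```python
import csv
import urllib.request

def parse_mapping(raw_text):
    """Extract label names from tab-separated 'index<TAB>label' lines."""
    rows = csv.reader(raw_text.split("\n"), delimiter="\t")
    return [row[1] for row in rows if len(row) > 1]

def load_labels(url, fallback, timeout=10):
    """Download and parse the mapping; return a copy of `fallback` on failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return parse_mapping(response.read().decode("utf-8"))
    except OSError as err:  # URLError and timeouts are both OSError subclasses
        print(f"Could not download labels ({err}); using fallback list.")
        return list(fallback)

# Offline check of the parser on text shaped like the mapping file
sample = "0\tnot-offensive\n1\toffensive\n"
print(parse_mapping(sample))  # ['not-offensive', 'offensive']
```

In the prediction script, this could replace the download block with a call such as load_labels(mapping_link, ["not-offensive", "offensive"]).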

Conclusion

As social media continues to grow, so does the importance of efficient and accurate language identification. By using the twitter-roBERTa-base model effectively, we can tackle offensive language detection head-on. At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
