Multilingual Joint Fine-tuning of Transformer Models for Cyberbullying Detection

Sep 11, 2024 | Educational

In the age of digital communication, identifying negative online behavior such as trolling, aggression, and cyberbullying is crucial. This post walks you through using transformer models for this purpose, based on findings from the TRAC 2020 workshop.

Getting Started

Our research paper, “Multilingual Joint Fine-Tuning of Transformer Models for Identifying Trolling, Aggression and Cyberbullying at TRAC 2020”, outlines our methodology. The code repository and trained models are available via the links published alongside the paper.

Usage of the Models

The following Python code demonstrates how to use our trained models. We’ll walk through it using an analogy: think of the model as a skilled chef preparing a dish (text classification) based on specific ingredients (input sentences).


from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from pathlib import Path
from scipy.special import softmax
import numpy as np

# Labels for each TRAC 2020 sub-task: Sub-task A covers aggression
# (overt OAG, covert CAG, none NAG), Sub-task B covers gendered (GEN)
# vs. non-gendered (NGEN) content, and Sub-task C is their joint space.
TASK_LABEL_IDS = {
    'Sub-task A': ['OAG', 'NAG', 'CAG'],
    'Sub-task B': ['GEN', 'NGEN'],
    'Sub-task C': ['OAG-GEN', 'OAG-NGEN', 'NAG-GEN', 'NAG-NGEN', 'CAG-GEN', 'CAG-NGEN']
}

model_version = 'databank'  # Change to 'huggingface' to load from the Hugging Face Hub
if model_version == 'databank':
    # Assumes the unzipped databank archive has a layout like:
    # databank_model/<lang>/<task>/output/<base_model>/model
    model_path = next(Path('databank_model').glob('*/*/output/*/model'))
    _, lang, task, _, base_model, _ = model_path.parts
    tokenizer = AutoTokenizer.from_pretrained(base_model)
    model = AutoModelForSequenceClassification.from_pretrained(str(model_path))
else:
    lang, task, base_model = 'ALL', 'Sub-task C', 'bert-base-multilingual-uncased'
    tokenizer = AutoTokenizer.from_pretrained(base_model)
    # NOTE: this loads the raw base model with an untrained classification
    # head; substitute the fine-tuned TRAC 2020 checkpoint ID from the Hub
    # to reproduce the paper's results.
    model = AutoModelForSequenceClassification.from_pretrained(base_model)

model.eval()  # Set model to evaluation mode

# Prepare the input sentence
sentence = "This is a good cat and this is a bad dog"
# Prepend the classifier token manually (the tokenizer does not add it in this workflow)
processed_sentence = f"{tokenizer.cls_token} {sentence}"
tokens = tokenizer.tokenize(processed_sentence)
indexed_tokens = tokenizer.convert_tokens_to_ids(tokens)
tokens_tensor = torch.tensor([indexed_tokens])  # Batch of one sentence

# Performing inference
with torch.no_grad():
    # Older transformers versions return a tuple, newer ones a ModelOutput;
    # indexing the first element yields the logits in both cases.
    logits = model(tokens_tensor)[0]

preds = logits.cpu().numpy()
preds_probs = softmax(preds, axis=1)    # Turn logits into probabilities
preds = np.argmax(preds_probs, axis=1)  # Index of the most likely class
preds_labels = np.array(TASK_LABEL_IDS[task])[preds]
# Prints each label's probability alongside the predicted label
print(dict(zip(TASK_LABEL_IDS[task], preds_probs[0])), preds_labels)

In our analogy, the ‘model’ represents a chef trained in various cuisines (transformer architectures), while the ‘tokens’ signify ingredients vital for the dish (input text). Just like any good chef, our model must first practice (train) to understand how to combine these ingredients effectively before producing a delectable dish (classification result).
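
If you are on a recent version of the transformers library, the tokenizer's call interface handles the manual special-token and tensor steps above in one go. The snippet below is a minimal sketch of that alternative; it reuses the sentence, tokenizer, model, task, and TASK_LABEL_IDS objects already defined.

# Minimal alternative sketch: tokenizer() adds [CLS]/[SEP] itself and
# returns input_ids plus an attention_mask as PyTorch tensors.
inputs = tokenizer(sentence, return_tensors='pt')
with torch.no_grad():
    logits = model(**inputs)[0]
probs = softmax(logits.cpu().numpy(), axis=1)
print(dict(zip(TASK_LABEL_IDS[task], probs[0])))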

Troubleshooting Tips

If you encounter issues while using the models, consider the following solutions:

  • Ensure all dependencies are installed; if any are missing, install them with pip (for this script: pip install torch transformers scipy numpy).
  • Double-check that the model path is correctly specified and the model files are present.
  • For errors related to memory, run inference in smaller batches, as in the sketch after this list.
  • Review the sentence formats and make sure they are properly tokenized before inference.
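
As a sketch of the batching tip above, the hypothetical helper below (classify_in_batches is our own name, not part of any library) splits a list of sentences into small padded chunks, which keeps peak memory bounded. It assumes the tokenizer, model, task, and TASK_LABEL_IDS objects from the usage section.

def classify_in_batches(sentences, batch_size=8):
    """Run inference over sentences in small padded batches."""
    labels = np.array(TASK_LABEL_IDS[task])
    results = []
    for start in range(0, len(sentences), batch_size):
        batch = sentences[start:start + batch_size]
        # padding=True pads each chunk to its longest member;
        # truncation=True guards against over-long inputs.
        enc = tokenizer(batch, padding=True, truncation=True, return_tensors='pt')
        with torch.no_grad():
            logits = model(**enc)[0]
        results.extend(labels[logits.argmax(dim=1).cpu().numpy()])
    return results

print(classify_in_batches(["This is a good cat", "This is a bad dog"]))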

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

Following these instructions will equip you to fine-tune models effectively to identify toxic online behavior. At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
