How to Use Multilingual Joint Fine-Tuning of Transformer Models for Identifying Trolling, Aggression, and Cyberbullying

Sep 13, 2024 | Educational

As more communication moves online, robust systems for detecting harmful interactions such as trolling, aggression, and cyberbullying have become essential. This tutorial walks you through loading and running transformer models that were jointly fine-tuned on multilingual data to detect these behaviors, following the techniques described by Mishra et al. (2020) at the TRAC 2020 workshop. Let’s dive into how you can use these models to improve online safety.

Pre-requisites

Before we commence, ensure you have the following:

  • Python installed on your system.
  • Access to a suitable dataset for training if you desire custom fine-tuning.
  • Basic understanding of Python and deep learning concepts.
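
Beyond Python itself, the inference code in this tutorial relies on a handful of third-party packages. The following is a minimal sketch for checking that they are importable before you start. The package list is taken from the imports used in the example code, and the helper name `missing_packages` is a hypothetical one chosen for illustration.

```python
import importlib.util

# Packages the inference example in this tutorial relies on
REQUIRED = ["transformers", "torch", "scipy", "numpy"]

def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```

Anything reported as missing can be installed with your usual package manager before you continue.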

Dataset Availability

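The trained models and evaluation metrics are publicly available. The model files used in the example below can be downloaded from the Illinois Data Bank: https://databank.illinois.edu/datasets/IDB-8882752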

Using the Models

Here’s how you can utilize these transformer models in your project:

```python
from pathlib import Path

import numpy as np
import torch
from scipy.special import softmax
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Label sets for the three TRAC 2020 sub-tasks
TASK_LABEL_IDS = {
    "Sub-task A": ["OAG", "NAG", "CAG"],
    "Sub-task B": ["GEN", "NGEN"],
    "Sub-task C": ["OAG-GEN", "OAG-NGEN", "NAG-GEN", "NAG-NGEN", "CAG-GEN", "CAG-NGEN"]
}

model_version = "databank"  # the other option loads the model from the Hugging Face Hub

if model_version == "databank":
    # Make sure you have downloaded the required model file from https://databank.illinois.edu/datasets/IDB-8882752
    # Unzip the file at some model_path (we are using: databank_model)
    model_path = next(Path("databank_model").glob("*/*/output/*/model"))

    # Assuming you get the following type of structure inside databank_model
    # databank_model/ALL/Sub-task C/output/bert-base-multilingual-uncased/model
    lang, task, _, base_model, _ = model_path.parts[-5:]
    tokenizer = AutoTokenizer.from_pretrained(base_model)
    model = AutoModelForSequenceClassification.from_pretrained(model_path)
else:
    lang, task, base_model = "ALL", "Sub-task C", "bert-base-multilingual-uncased"
    # Assumed Hub naming scheme: socialmediaie/TRAC2020_<lang>_<task letter>_<base model>.
    # Check the exact repository id on the Hub before relying on it.
    model_name = f"socialmediaie/TRAC2020_{lang}_{task.split()[-1]}_{base_model}"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)

# For inference, put the model in eval mode
model.eval()
# To fine-tune the model further, switch back with model.train()

task_labels = TASK_LABEL_IDS[task]
sentence = "This is a good cat and this is a bad dog."

# Prepend the [CLS] token, then tokenize and map the tokens to vocabulary ids
processed_sentence = f"{tokenizer.cls_token} {sentence}"
tokens = tokenizer.tokenize(processed_sentence)
indexed_tokens = tokenizer.convert_tokens_to_ids(tokens)
tokens_tensor = torch.tensor([indexed_tokens])  # batch of size 1

with torch.no_grad():
    # Without labels, the first element of the model output holds the logits
    logits = model(tokens_tensor)[0]

# Convert logits to probabilities and pick the most likely label
preds = logits.detach().cpu().numpy()
preds_probs = softmax(preds, axis=1)
preds = np.argmax(preds_probs, axis=1)
preds_labels = np.array(task_labels)[preds]

print(dict(zip(task_labels, preds_probs[0])), preds_labels)
```
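
The last few lines of the example turn raw logits into label probabilities and pick the most likely label. To make that step easier to follow without downloading a model, here is a dependency-free sketch of the same post-processing. The logits are made-up values for illustration only, and the helper names `softmax_probs` and `decode` are hypothetical.

```python
import math

# Label order for Sub-task C, copied from TASK_LABEL_IDS in the example
SUB_TASK_C_LABELS = ["OAG-GEN", "OAG-NGEN", "NAG-GEN", "NAG-NGEN", "CAG-GEN", "CAG-NGEN"]

def softmax_probs(logits):
    """Convert raw logits to probabilities with a numerically stable softmax."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def decode(logits, labels):
    """Return (best_label, {label: probability}) for one row of logits."""
    if len(logits) != len(labels):
        raise ValueError("expected one logit per label")
    probs = softmax_probs(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], dict(zip(labels, probs))

# Made-up logits for illustration only; real values come from the model
example_logits = [0.2, -1.3, 2.1, 0.4, -0.7, 0.0]
label, probs = decode(example_logits, SUB_TASK_C_LABELS)
print(label, round(probs[label], 3))
```

In the label names, OAG, CAG, and NAG stand for overtly, covertly, and non-aggressive, while GEN and NGEN mark gendered and non-gendered content. Sub-task C predicts both at once, which is why its labels are pairs.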

Understanding the Code: A Simple Analogy

Think of the code as a recipe for baking a cake. Here’s how it unfolds:

  • The ingredients you gather (libraries and datasets) determine the kind of cake (model) you’ll make.
  • The instructions (code block) guide you step-by-step on how to prepare your cake. Just like you would layer your ingredients in a specific order, you load your libraries and set up the model.
  • Once everything is prepped, you bake at the right temperature (running inference) so the cake rises properly, much like how your model processes input to yield predictions.
  • Lastly, tasting the cake (outputs) helps you decide if it’s ready or if you need to tweak your recipe for improvement (fine-tuning the model) to make it more delicious (accurate).

Troubleshooting

If you run into issues while loading models or need additional configurations, here are some troubleshooting tips:

  • Ensure all required libraries are properly installed and up-to-date.
  • Check that the path to the unzipped model directory (databank_model in the example) is correct and matches the folder structure the model loading section expects.
  • Verify that your Python interpreter is compatible with the required libraries.
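
A wrong model path is one of the more likely causes of loading failures. The sketch below lists every model directory found under a root folder and the language, task, and base model parsed from its path. It assumes the layout mentioned in the example's comments (`<root>/<lang>/<task>/output/<base_model>/model`); the helper name `find_models` is hypothetical, and the demonstration runs on a throwaway directory rather than a real download.

```python
from pathlib import Path
import tempfile

def find_models(root):
    """Yield (lang, task, base_model, path) for each model directory under root.

    Assumes the layout <root>/<lang>/<task>/output/<base_model>/model.
    """
    for path in sorted(Path(root).glob("*/*/output/*/model")):
        lang, task, _, base_model, _ = path.parts[-5:]
        yield lang, task, base_model, path

# Demonstration on a temporary directory that mimics the assumed layout
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "ALL" / "Sub-task C" / "output" / "bert-base-multilingual-uncased" / "model"
    demo.mkdir(parents=True)
    found = [(lang, task, base) for lang, task, base, _ in find_models(tmp)]
    print(found)
```

If the helper finds nothing when pointed at your own download, the archive was probably unzipped somewhere else or its folder structure differs from the one assumed here.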

Conclusion

By following this guide, you now have the tools to effectively utilize multilingual transformer models to combat trolling, aggression, and cyberbullying. This contribution is a step towards making online platforms safer and enhancing positive interactions.

