Multilingual Joint Fine-tuning for Identifying Trolling, Aggression, and Cyberbullying

Sep 13, 2024 | Educational

In an age driven by online communication, understanding user behavior has become crucial. This guide walks you through multilingual joint fine-tuning of transformer models that identify forms of online misconduct such as trolling, aggression, and cyberbullying, following the approach presented at the TRAC 2020 workshop.

Overview

The approach is detailed in the paper by Mishra et al. (2020), which describes how transformer models can be jointly fine-tuned across languages to classify different types of aggressive communication. The accompanying code is available in the socialmediaie/TRAC2020 repository on GitHub, and the fine-tuned checkpoints are also published on the Hugging Face Hub.

Getting Started with the Code

To make things easier, we’ll follow a streamlined process to set up and use the models. Here’s an organized breakdown:

  • Install the required libraries, primarily `transformers` from Hugging Face and `torch`; a quick environment check is sketched after this list.
  • Download a trained model from the paper's data bank release, or load one directly from the Hugging Face Hub.
  • Set up your coding environment to run the models for inference.
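
If you are starting from a fresh environment, the short check below confirms the core dependencies are importable; the `pip` command in the comment uses the standard PyPI package names.

```python
# Install the dependencies first, e.g.:
#   pip install transformers torch scipy numpy
import transformers
import torch

# Print the versions so you can match them against the paper's code if loading fails.
print("transformers:", transformers.__version__)
print("torch:", torch.__version__)
```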

Implementing the Model

Now, let’s walk through the implementation, which loads a fine-tuned model and runs inference on a sample sentence:

```python
from pathlib import Path

import numpy as np
import torch
from scipy.special import softmax
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Label sets for each sub-task:
#   Sub-task A: overtly (OAG), covertly (CAG), or non-aggressive (NAG)
#   Sub-task B: gendered (GEN) vs. non-gendered (NGEN) aggression
#   Sub-task C: the joint task combining the A and B label sets
TASK_LABEL_IDS = {
    "Sub-task A": ["OAG", "NAG", "CAG"],
    "Sub-task B": ["GEN", "NGEN"],
    "Sub-task C": ["OAG-GEN", "OAG-NGEN", "NAG-GEN", "NAG-NGEN", "CAG-GEN", "CAG-NGEN"]
}

model_version = "databank"  # set to anything else to load from the Hugging Face Hub

if model_version == "databank":
    # Assumes the model archive from the paper's data bank release has been
    # downloaded and unzipped into a local "databank_model" directory laid out as
    # databank_model/<lang>/<task>/output/<base_model>/model
    model_path = next(Path("databank_model").glob("*/*/output/*/model"))
    _, lang, task, _, base_model, _ = model_path.parts
    tokenizer = AutoTokenizer.from_pretrained(base_model)
    model = AutoModelForSequenceClassification.from_pretrained(model_path)
else:
    lang, task, base_model = "ALL", "Sub-task C", "bert-base-multilingual-uncased"
    hub_model = f"socialmediaie/TRAC2020_{lang}_{task.split()[-1]}_{base_model}"
    tokenizer = AutoTokenizer.from_pretrained(hub_model)
    model = AutoModelForSequenceClassification.from_pretrained(hub_model)

# Set the model to evaluation mode for inference (use model.train() to fine-tune further)
model.eval()
task_labels = TASK_LABEL_IDS[task]

sentence = "This is a good cat and this is a bad dog"
# Prepend the classification token before tokenizing, as BERT-style models expect
processed_sentence = f"{tokenizer.cls_token} {sentence}"
tokens = tokenizer.tokenize(processed_sentence)
indexed_tokens = tokenizer.convert_tokens_to_ids(tokens)
tokens_tensor = torch.tensor([indexed_tokens])

with torch.no_grad():
    # Recent transformers versions return a ModelOutput object; take its logits
    logits = model(tokens_tensor).logits

preds = logits.detach().cpu().numpy()
preds_probs = softmax(preds, axis=1)
preds = np.argmax(preds_probs, axis=1)
preds_labels = np.array(task_labels)[preds]
print(dict(zip(task_labels, preds_probs[0])), preds_labels)
```
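
To classify several sentences at once, the inference steps above can be wrapped in a small helper. This is a minimal sketch that assumes `tokenizer`, `model`, and `task_labels` are already loaded as in the snippet above; the `predict_labels` name and the example input are illustrative, not part of the original code.

```python
def predict_labels(texts):
    """Return the predicted task label for each input text."""
    # tokenizer(...) adds special tokens and handles padding and truncation.
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = softmax(logits.cpu().numpy(), axis=1)
    return [task_labels[i] for i in probs.argmax(axis=1)]

print(predict_labels(["This is a good cat and this is a bad dog"]))
```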

Explaining the Code: An Analogy

Think of fine-tuning these models like training a dog to respond to specific commands. You have a well-trained dog (the base transformer model) that already knows how to perform various tricks (analyze data). However, to make the dog better suited for your specific needs (identifying trolling or aggression), you need to teach it new commands (fine-tuning).

The setup process essentially involves:

  • Deciding which commands (model tasks) you want to teach (identify different forms of online aggression).
  • Preparing the environment (code and libraries) so the dog can learn comfortably.
  • Using vocal commands (input sentences) for training sessions, ensuring the dog (model) understands what you want; a single training step is sketched below.
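
To make the analogy concrete, here is what a single "training session" looks like in code. This is a minimal sketch of one fine-tuning step, not the paper's actual training loop; the example texts, label indices, learning rate, and choice of optimizer are illustrative assumptions.

```python
from torch.optim import AdamW

model.train()  # switch back from eval mode before fine-tuning
optimizer = AdamW(model.parameters(), lr=2e-5)

# Hypothetical labeled batch; each label is an index into task_labels.
texts = ["you people are pathetic", "hope you have a great day"]
labels = torch.tensor([0, 3])

inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
outputs = model(**inputs, labels=labels)  # the model computes the loss for us
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```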

Troubleshooting Ideas

Should you encounter issues during setup or execution, consider these troubleshooting steps:

  • Ensure all necessary packages are installed and up-to-date.
  • Check the file paths specified in your code to ensure the model is being accessed correctly; a quick path check is sketched after this list.
  • Remember that model performance might vary if using different model versions or datasets.
  • If performance seems off, revisit the training dataset quality, as this can significantly affect outcomes.
  • For deeper issues, seek help from the community or the documentation available on Hugging Face.
  • For more insights, updates, or to collaborate on AI development projects, stay connected with [fxis.ai](https://fxis.ai).
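
For the file-path issue in particular, a quick diagnostic can save time. The sketch below assumes the same `databank_model` directory layout used in the loading code above.

```python
from pathlib import Path

# List every candidate model directory under the assumed layout:
#   databank_model/<lang>/<task>/output/<base_model>/model
candidates = list(Path("databank_model").glob("*/*/output/*/model"))
if not candidates:
    print("No model directories found; check the unzip location and layout.")
for path in candidates:
    print("Found:", path)
```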

Concluding Thoughts

With the information provided, you should be well-versed in the basics of multilingual joint fine-tuning of transformer models. The implementation can lead to more effective means of identifying online aggression, making digital spaces safer. At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
