Multilingual Joint Fine-tuning of Transformer Models: A Guide to Identifying Trolling, Aggression, and Cyberbullying

Sep 13, 2024 | Educational


Welcome to our comprehensive guide on utilizing transformer models for identifying trolling, aggression, and cyberbullying. We’ll walk you through how to effectively fine-tune these models and interpret the results, with a specific focus on the work presented at the TRAC 2020 workshop.

Understanding the Transformer Approach

Imagine you’re teaching a multilingual student how to identify different emotional tones in various languages. Each language’s complexities are akin to the differing datasets used to train transformer models. Just as you would adjust your teaching style based on the student’s language background, these models require fine-tuning to discern the nuances of trolling, aggression, and cyberbullying across multiple languages.

The code we’ll be discussing loads these fine-tuned transformers and runs them on your own text. The same models can also be fine-tuned further on a dataset of your choice, making them effective in varying contexts, much like a teacher who adapts their lessons to best suit their students.

Step-by-Step Guide to Using the Models

To effectively use the pre-trained models, follow these steps:

Set Up Your Environment: Ensure you have Python and the required libraries installed, in particular transformers, torch, scipy, numpy, and pandas.
Download Models: The trained models are available from the University of Illinois Databank, and some are also hosted on the Hugging Face Hub.
Use the Provided Code: Run the following snippet to load a model and classify a sentence:

python
from transformers import AutoModel, AutoTokenizer, AutoModelForSequenceClassification
import torch
from pathlib import Path
from scipy.special import softmax
import numpy as np
import pandas as pd

TASK_LABEL_IDS = {
  'Sub-task A': ['OAG', 'NAG', 'CAG'],
  'Sub-task B': ['GEN', 'NGEN'],
  'Sub-task C': ['OAG-GEN', 'OAG-NGEN', 'NAG-GEN', 'NAG-NGEN', 'CAG-GEN', 'CAG-NGEN']
}

model_version = 'databank' # or 'huggingface'

if model_version == 'databank':
    # Download and unzip the model file from https://databank.illinois.edu/datasets/IDB-8882752
    # into a local directory (assumed here to be named "databank_model").
    databank_dir = Path('databank_model')
    # Assumed layout after unzipping: <lang>/<task>/output/<base_model>/model
    model_path = next(databank_dir.glob('*/*/output/*/model'))
    lang, task, _, base_model, _ = model_path.relative_to(databank_dir).parts
    tokenizer = AutoTokenizer.from_pretrained(base_model)
    model = AutoModelForSequenceClassification.from_pretrained(model_path)
else:
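    lang, task, base_model = 'ALL', 'Sub-task C', 'bert-base-multilingual-uncased'
    # Assumed Hub naming scheme: TRAC2020_<lang>_<task letter>_<base model>; verify the ID on the Hugging Face Hub
    base_model = f'socialmediaie/TRAC2020_{lang}_{task.split()[-1]}_{base_model}'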
    tokenizer = AutoTokenizer.from_pretrained(base_model)
    model = AutoModelForSequenceClassification.from_pretrained(base_model)

# Set the model to evaluation mode for inference
model.eval()
# For further fine-tuning, switch back to training mode instead:
# model.train()
sentence = "This is a good cat and this is a bad dog."
processed_sentence = f"{tokenizer.cls_token} {sentence}"
tokens = tokenizer.tokenize(processed_sentence)
indexed_tokens = tokenizer.convert_tokens_to_ids(tokens)
tokens_tensor = torch.tensor([indexed_tokens])

with torch.no_grad():
    # Index 0 holds the logits whether the model returns a tuple or a ModelOutput
    logits = model(tokens_tensor)[0]
preds = logits.detach().cpu().numpy()
preds_probs = softmax(preds, axis=1)
preds = np.argmax(preds_probs, axis=1)
preds_labels = np.array(TASK_LABEL_IDS[task])[preds]
print(dict(zip(TASK_LABEL_IDS[task], preds_probs[0])), preds_labels)
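
To make the last few lines of the snippet concrete, here is a minimal sketch of how raw logits become the printed label-to-probability mapping. The logits are made-up numbers for illustration, not real model output, and `logits_to_probs` is a hypothetical helper name. It uses only the standard library so it runs without downloading a model.

```python
import math

# Sub-task C labels, in the same order as TASK_LABEL_IDS in the snippet above.
LABELS = ['OAG-GEN', 'OAG-NGEN', 'NAG-GEN', 'NAG-NGEN', 'CAG-GEN', 'CAG-NGEN']

def logits_to_probs(logits, labels):
    """Apply a numerically stable softmax and pair each probability with its label."""
    if len(logits) != len(labels):
        raise ValueError("expected one logit per label")
    peak = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return {label: e / total for label, e in zip(labels, exps)}

# Hypothetical logits for one sentence (illustrative values only).
example_logits = [0.2, -1.0, 0.1, 3.0, -0.5, 0.4]
probs = logits_to_probs(example_logits, LABELS)
best_label = max(probs, key=probs.get)
print(best_label)  # NAG-NGEN, because its logit (3.0) is the largest
```

The label order matters: the i-th probability only means something if the label list matches the order the model was trained with.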

Interpreting the Output

Once the code runs, it prints the probability of each class along with the predicted label. For Sub-task C, each label pairs an aggression level (OAG = overtly aggressive, CAG = covertly aggressive, NAG = non-aggressive) with a gender flag (GEN = gendered, NGEN = non-gendered):

CAG-GEN: Covertly aggressive, gendered
NAG-GEN: Non-aggressive, gendered
OAG-GEN: Overtly aggressive, gendered
CAG-NGEN: Covertly aggressive, non-gendered
NAG-NGEN: Non-aggressive, non-gendered
OAG-NGEN: Overtly aggressive, non-gendered
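
Each Sub-task C label joins a Sub-task A label and a Sub-task B label with a hyphen, so the joint probabilities can be summed into one distribution per sub-task. The sketch below shows one way to do that. It is not necessarily how the original authors score Sub-tasks A and B; the probabilities are invented for illustration, and `marginalize` is a hypothetical helper name.

```python
from collections import defaultdict

def marginalize(joint_probs):
    """Sum joint 'A-B' label probabilities into one distribution per sub-task."""
    aggression = defaultdict(float)
    gender = defaultdict(float)
    for label, p in joint_probs.items():
        agg, gen = label.split('-')  # e.g. 'OAG-GEN' -> ('OAG', 'GEN')
        aggression[agg] += p
        gender[gen] += p
    return dict(aggression), dict(gender)

# Hypothetical Sub-task C probabilities (illustrative values that sum to 1).
joint = {
    'OAG-GEN': 0.05, 'OAG-NGEN': 0.10,
    'NAG-GEN': 0.05, 'NAG-NGEN': 0.60,
    'CAG-GEN': 0.05, 'CAG-NGEN': 0.15,
}
aggression_probs, gender_probs = marginalize(joint)
print(aggression_probs)
print(gender_probs)
```

Reading the two marginal distributions side by side makes it easier to see whether the model is unsure about the aggression part, the second part of the label, or both.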

Troubleshooting Common Issues

Here are some common issues you may encounter while working with this framework:

Model Not Found: Ensure that you have downloaded and unzipped the model files correctly.
Import Errors: Make sure all required packages, particularly transformers and torch, are installed in your environment.
Incompatible Tensor Dimensions: Check that the input is a 2-D tensor of shape (batch size, sequence length) and that the tokenized sentence does not exceed the model’s maximum sequence length.
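
The checks above can be sketched with two small helpers. Both names are hypothetical. The 512-token default is an assumption that holds for BERT-base style models, so confirm the limit for the checkpoint you load. The package list simply mirrors the imports in the snippet.

```python
import importlib.util

def missing_packages(names):
    """Return the packages from `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

def to_batch(token_ids, max_len=512):
    """Truncate a flat list of token ids and wrap it as a batch of one sequence.

    max_len=512 is an assumed limit; check the loaded model's configuration.
    """
    if not token_ids:
        raise ValueError("tokenization produced no tokens")
    return [token_ids[:max_len]]

# Report which of the snippet's dependencies are not installed.
print(missing_packages(['transformers', 'torch', 'scipy', 'numpy', 'pandas']))

# A 600-token input is cut to 512 ids and shaped as (1, 512).
batch = to_batch(list(range(600)))
print(len(batch), len(batch[0]))
```

Passing the result of `to_batch` to `torch.tensor` gives a tensor with the batch dimension the model expects.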


Final Thoughts

Understanding the subtleties of trolling, aggression, and cyberbullying through multilingual transformer models opens up new avenues for research and intervention. At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
