Welcome to the world of Natural Language Processing (NLP)! In this article, we will walk you through the process of fine-tuning transformer models for identifying trolling, aggression, and cyberbullying across multiple languages, as demonstrated at the TRAC 2020 workshop. Grab your coding gear, and let’s dive in!
Understanding the Framework
Imagine you are a gardener, nurturing different types of plants in your garden (languages), each requiring specific care (fine-tuning) to flourish. Just like a gardener needs tools (transformer models) to cultivate these plants effectively, we utilize specific transformer models for our NLP tasks. These models help us detect various kinds of harmful online behavior.
Getting Started: Usage Instructions
To begin using the models developed in our project, follow this straightforward code snippet:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from pathlib import Path
from scipy.special import softmax
import numpy as np

# Label sets for each TRAC 2020 sub-task
TASK_LABEL_IDS = {
    'Sub-task A': ['OAG', 'NAG', 'CAG'],
    'Sub-task B': ['GEN', 'NGEN'],
    'Sub-task C': ['OAG-GEN', 'OAG-NGEN', 'NAG-GEN', 'NAG-NGEN', 'CAG-GEN', 'CAG-NGEN']
}

model_version = "databank"  # the other option is the Hugging Face model hub

if model_version == "databank":
    # Make sure you have downloaded the required model file from
    # https://databank.illinois.edu/datasets/IDB-8882752 and unzipped it
    # at some model path (we are using: databank_model), giving the layout
    # databank_model/<lang>/<task>/output/<base_model>/model
    model_path = next(Path("databank_model").glob("*/*/output/*/model"))
    _, lang, task, _, base_model, _ = model_path.parts
    tokenizer = AutoTokenizer.from_pretrained(base_model)
    model = AutoModelForSequenceClassification.from_pretrained(model_path)
else:
    lang, task, base_model = 'ALL', 'Sub-task C', 'bert-base-multilingual-uncased'
    base_model = f'socialmediaie/TRAC2020_{lang}_{task.split()[-1]}_{base_model}'
    tokenizer = AutoTokenizer.from_pretrained(base_model)
    model = AutoModelForSequenceClassification.from_pretrained(base_model)

# Set the model to eval mode for inference.
# If you want to further fine-tune it, switch back with model.train().
model.eval()

task_labels = TASK_LABEL_IDS[task]

sentence = "This is a good cat and this is a bad dog."
# Prepend the [CLS] token so the classification head sees its expected input.
processed_sentence = f"{tokenizer.cls_token} {sentence}"
tokens = tokenizer.tokenize(processed_sentence)
indexed_tokens = tokenizer.convert_tokens_to_ids(tokens)
tokens_tensor = torch.tensor([indexed_tokens])

with torch.no_grad():
    # transformers >= 4.x returns a ModelOutput; grab the raw logits from it
    logits = model(tokens_tensor).logits

preds = logits.detach().cpu().numpy()
preds_probs = softmax(preds, axis=1)
preds = np.argmax(preds_probs, axis=1)
preds_labels = np.array(task_labels)[preds]
print(dict(zip(task_labels, preds_probs[0])), preds_labels)
```
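If you want to score several sentences at once, the transformers tokenizer can batch, pad, and build attention masks in a single call. The following is a minimal sketch of batched inference under that assumption; the sentences list is our own example, and it reuses the tokenizer, model, and task_labels objects loaded above.

```python
# Minimal batched-inference sketch (our addition, not from the original project).
# tokenizer(...) adds special tokens, pads the batch, and builds attention masks.
sentences = [
    "This is a good cat and this is a bad dog.",
    "You are all wonderful people.",
]
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    batch_logits = model(**batch).logits
batch_probs = softmax(batch_logits.cpu().numpy(), axis=1)
batch_labels = np.array(task_labels)[batch_probs.argmax(axis=1)]
for text, label in zip(sentences, batch_labels):
    print(f"{label}\t{text}")
```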
Breaking It Down: An Analogy for Understanding
The above code loads a fine-tuned model and uses it for classification, much like preparing a dish (a reusable sketch of these steps follows the list). Here’s how it works:
- Ingredients Gathering: First, we import necessary libraries, akin to gathering your ingredients before cooking.
- Choosing the Recipe: The model version acts like a choice of recipe—whether to use fresh ingredients (databank) or pre-packaged (Hugging Face).
- Prepping the Ingredients: The tokenizer is like chopping and preparing your ingredients to make sure they fit the dish you’re preparing.
- Cooking: Once the input sentence is processed into indexed tokens, it is akin to mixing your ingredients into a pot. The model processes this input to generate outputs (predictions).
- Tasting: Finally, just as you’d taste your dish to ensure it’s flavorful, you print the predictions and their probabilities to check how well your model performs.
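To make the recipe reusable, here is a minimal sketch that wraps the steps above into a single helper. The name classify_sentence is our own illustration, not part of the original project, and it assumes the tokenizer, model, and task_labels objects from the snippet above are already loaded:

```python
import numpy as np
import torch
from scipy.special import softmax

def classify_sentence(sentence, tokenizer, model, task_labels):
    """Classify one sentence; returns (predicted_label, {label: probability})."""
    # Prepping: prepend [CLS] and convert the text into token ids.
    processed = f"{tokenizer.cls_token} {sentence}"
    ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(processed))
    # Cooking: one forward pass produces a logit per label.
    with torch.no_grad():
        logits = model(torch.tensor([ids])).logits
    # Tasting: softmax turns logits into probabilities we can inspect.
    probs = softmax(logits.cpu().numpy(), axis=1)[0]
    return task_labels[int(np.argmax(probs))], dict(zip(task_labels, probs))

# Example usage, assuming the objects loaded earlier in this article:
# label, probs = classify_sentence(sentence, tokenizer, model, task_labels)
```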
Troubleshooting Tips
If you encounter issues during implementation, here are some troubleshooting steps (a quick diagnostic snippet follows the list):
- Ensure that all necessary libraries are installed and up-to-date.
- Double-check that your model paths are correctly set and that the models are downloaded properly.
- Verify that the input matches the model’s expectations, e.g. that special tokens such as [CLS] are present and that very long sentences are truncated.
- If results seem off, consider experimenting with different sentences to gauge the model’s performance more accurately.
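If you suspect an environment problem, a few quick checks can narrow things down. The diagnostic snippet below is our own suggestion; the databank_model path mirrors the layout assumed earlier:

```python
from pathlib import Path
import torch
import transformers

# Check library versions: the snippets above assume transformers >= 4.x,
# where model outputs expose a .logits attribute.
print("transformers:", transformers.__version__)
print("torch:", torch.__version__)

# Check that the unzipped databank model is where the loader expects it.
model_dirs = list(Path("databank_model").glob("*/*/output/*/model"))
if not model_dirs:
    print("No model directories found - re-check the download/unzip step.")
for p in model_dirs:
    print("Found model at:", p)
```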
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Conclusion
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
