Welcome to the world of Natural Language Processing (NLP)! In this article, we will walk you through the process of fine-tuning transformer models for identifying trolling, aggression, and cyberbullying across multiple languages, as demonstrated at the TRAC 2020 workshop. Grab your coding gear, and let’s dive in!
Understanding the Framework
Imagine you are a gardener, nurturing different types of plants in your garden (languages), each requiring specific care (fine-tuning) to flourish. Just like a gardener needs tools (transformer models) to cultivate these plants effectively, we utilize specific transformer models for our NLP tasks. These models help us detect various kinds of harmful online behavior.
Getting Started: Usage Instructions
To begin using the models developed in our project, follow this straightforward code snippet:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from pathlib import Path
from scipy.special import softmax
import numpy as np

# Label sets for each TRAC 2020 sub-task
TASK_LABEL_IDS = {
    'Sub-task A': ['OAG', 'NAG', 'CAG'],
    'Sub-task B': ['GEN', 'NGEN'],
    'Sub-task C': ['OAG-GEN', 'OAG-NGEN', 'NAG-GEN', 'NAG-NGEN', 'CAG-GEN', 'CAG-NGEN']
}

model_version = "databank"  # the other option is the Hugging Face model hub

if model_version == "databank":
    # Make sure you have downloaded the required model file from
    # https://databank.illinois.edu/datasets/IDB-8882752 and unzipped it
    # at some model path (we are using: databank_model), giving the layout
    # databank_model/<lang>/<task>/output/<base_model>/model
    model_path = next(Path("databank_model").glob("*/*/output/*/model"))
    _, lang, task, _, base_model, _ = model_path.parts
    tokenizer = AutoTokenizer.from_pretrained(base_model)
    model = AutoModelForSequenceClassification.from_pretrained(model_path)
else:
    lang, task, base_model = 'ALL', 'Sub-task C', 'bert-base-multilingual-uncased'
    base_model = f'socialmediaie/TRAC2020_{lang}_{task.split()[-1]}_{base_model}'
    tokenizer = AutoTokenizer.from_pretrained(base_model)
    model = AutoModelForSequenceClassification.from_pretrained(base_model)

# Set the model to eval mode for inference.
# If you want to further fine-tune it, switch back with model.train().
model.eval()

task_labels = TASK_LABEL_IDS[task]

sentence = "This is a good cat and this is a bad dog."
# Prepend the [CLS] token so the classification head sees its expected input.
processed_sentence = f"{tokenizer.cls_token} {sentence}"
tokens = tokenizer.tokenize(processed_sentence)
indexed_tokens = tokenizer.convert_tokens_to_ids(tokens)
tokens_tensor = torch.tensor([indexed_tokens])

with torch.no_grad():
    # transformers >= 4.x returns a ModelOutput; grab the raw logits from it
    logits = model(tokens_tensor).logits

preds = logits.detach().cpu().numpy()
preds_probs = softmax(preds, axis=1)
preds = np.argmax(preds_probs, axis=1)
preds_labels = np.array(task_labels)[preds]
print(dict(zip(task_labels, preds_probs[0])), preds_labels)
```
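If you want to score several sentences at once, the transformers tokenizer can batch, pad, and build attention masks in a single call. The following is a minimal sketch of batched inference under that assumption; the sentences list is our own example, and it reuses the tokenizer, model, and task_labels objects loaded above.

```python
# Minimal batched-inference sketch (our addition, not from the original project).
# tokenizer(...) adds special tokens, pads the batch, and builds attention masks.
sentences = [
    "This is a good cat and this is a bad dog.",
    "You are all wonderful people.",
]
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    batch_logits = model(**batch).logits
batch_probs = softmax(batch_logits.cpu().numpy(), axis=1)
batch_labels = np.array(task_labels)[batch_probs.argmax(axis=1)]
for text, label in zip(sentences, batch_labels):
    print(f"{label}\t{text}")
```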
Breaking It Down: An Analogy for Understanding
The above code loads a fine-tuned model and uses it for classification, much like preparing a dish (a reusable sketch of these steps follows the list). Here’s how it works:
- Ingredients Gathering: First, we import necessary libraries, akin to gathering your ingredients before cooking.
- Choosing the Recipe: The model version acts like a choice of recipe—whether to use fresh ingredients (databank) or pre-packaged (Hugging Face).
- Prepping the Ingredients: The tokenizer is like chopping and preparing your ingredients to make sure they fit the dish you’re preparing.
- Cooking: Once the input sentence is processed into indexed tokens, it is akin to mixing your ingredients into a pot. The model processes this input to generate outputs (predictions).
- Tasting: Finally, just as you’d taste your dish to ensure it’s flavorful, you print the predictions and their probabilities to check how well your model performs.
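To make the recipe reusable, here is a minimal sketch that wraps the steps above into a single helper. The name classify_sentence is our own illustration, not part of the original project, and it assumes the tokenizer, model, and task_labels objects from the snippet above are already loaded:

```python
import numpy as np
import torch
from scipy.special import softmax

def classify_sentence(sentence, tokenizer, model, task_labels):
    """Classify one sentence; returns (predicted_label, {label: probability})."""
    # Prepping: prepend [CLS] and convert the text into token ids.
    processed = f"{tokenizer.cls_token} {sentence}"
    ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(processed))
    # Cooking: one forward pass produces a logit per label.
    with torch.no_grad():
        logits = model(torch.tensor([ids])).logits
    # Tasting: softmax turns logits into probabilities we can inspect.
    probs = softmax(logits.cpu().numpy(), axis=1)[0]
    return task_labels[int(np.argmax(probs))], dict(zip(task_labels, probs))

# Example usage, assuming the objects loaded earlier in this article:
# label, probs = classify_sentence(sentence, tokenizer, model, task_labels)
```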
Troubleshooting Tips
If you encounter issues during implementation, here are some troubleshooting steps (a quick diagnostic snippet follows the list):
- Ensure that all necessary libraries are installed and up-to-date.
- Double-check that your model paths are correctly set and that the models are downloaded properly.
- Verify that the input matches the model’s expectations, e.g. that special tokens such as [CLS] are present and that very long sentences are truncated.
- If results seem off, consider experimenting with different sentences to gauge the model’s performance more accurately.
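If you suspect an environment problem, a few quick checks can narrow things down. The diagnostic snippet below is our own suggestion; the databank_model path mirrors the layout assumed earlier:

```python
from pathlib import Path
import torch
import transformers

# Check library versions: the snippets above assume transformers >= 4.x,
# where model outputs expose a .logits attribute.
print("transformers:", transformers.__version__)
print("torch:", torch.__version__)

# Check that the unzipped databank model is where the loader expects it.
model_dirs = list(Path("databank_model").glob("*/*/output/*/model"))
if not model_dirs:
    print("No model directories found - re-check the download/unzip step.")
for p in model_dirs:
    print("Found model at:", p)
```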
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Conclusion
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
