This article will walk you through how to use the HateBERTimbau-YouTube model, a state-of-the-art transformer-based model designed to identify hate speech in Portuguese social media text.
What is HateBERTimbau-YouTube?
HateBERTimbau-YouTube is a fine-tuned variant of the HateBERTimbau model, retrained on a dataset of 23,912 YouTube comments with the goal of recognizing and classifying hate speech in Portuguese.
Features of the Model
- Developed by: kNOwHATE: kNOwing online HATE speech
- Funded by: European Union
- Language: Portuguese
Getting Started with HateBERTimbau-YouTube
To use the model, you can easily integrate it into your Python environment. Two common approaches are demonstrated below: using the model directly through a text-classification pipeline, or fine-tuning it for your own task.
Using the Model Directly
To classify hate speech with the pre-trained model, you only need a few lines of code. Think of this as checking the temperature of a dish before serving it to determine if it’s too hot to handle:
python
from transformers import pipeline

# Load the fine-tuned model into a ready-to-use text-classification pipeline
classifier = pipeline("text-classification", model="knowhate/HateBERTimbau-youtube")

result = classifier("as pessoas tem que perceber que ser panasca não é deixar de ser homem, é deixar de ser humano")
print(result)
The classifier returns a label and a confidence score indicating whether the text contains hate speech.
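As a quick illustration, here is a minimal sketch of acting on that output when classifying several comments at once. It reuses the classifier from the snippet above; the 0.7 threshold is an illustrative assumption, not a value from the model card, and the exact label strings depend on the model's configuration, so inspect one prediction first to see which labels your copy of the model emits:
python
# Reuses `classifier` from the previous snippet.
comments = [
    "bom dia a todos",                      # presumably benign example
    "um comentário qualquer para testar",   # replace with your own text
]
predictions = classifier(comments)

for comment, prediction in zip(comments, predictions):
    # Each prediction is a dict with a "label" and a "score" in [0, 1].
    # NOTE: 0.7 is an arbitrary example threshold, not part of the model.
    flag = "review" if prediction["score"] < 0.7 else prediction["label"]
    print(f"{flag:>10} ({prediction['score']:.3f}) {comment[:60]}")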
Fine-tuning for Specific Tasks
If you require more tailored performance, you can fine-tune the model on your own dataset. Think of this as bringing a chef into your kitchen to adapt their recipe to better suit your taste:
python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
from datasets import load_dataset

# Load the tokenizer and model weights from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("knowhate/HateBERTimbau-youtube")
model = AutoModelForSequenceClassification.from_pretrained("knowhate/HateBERTimbau-youtube")

dataset = load_dataset("knowhate/youtube-train")

# Adjust the column names ("sentence1", "sentence2") to match your own dataset
def tokenize_function(examples):
    return tokenizer(examples["sentence1"], examples["sentence2"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

training_args = TrainingArguments(
    output_dir="hatebertimbau",    # where checkpoints are written
    evaluation_strategy="epoch"    # run evaluation at the end of every epoch
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],  # requires a "validation" split
)

trainer.train()
This code fine-tunes the model on your own dataset by defining how the data is tokenized and which training parameters to use.
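If you also want the precision, recall, and F1 figures reported at each evaluation (the same metrics quoted at the end of this article), you can pass a compute_metrics function to the Trainer. Below is a minimal sketch using scikit-learn; it assumes a binary labeling scheme, so adjust the averaging (e.g., average="macro") if your dataset uses more classes:
python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

def compute_metrics(eval_pred):
    # The Trainer hands us raw logits and the gold labels for the eval split
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, predictions, average="binary"
    )
    return {"precision": precision, "recall": recall, "f1": f1}

# Pass it in alongside the arguments shown above:
# trainer = Trainer(..., compute_metrics=compute_metrics)
After training finishes, trainer.save_model("hatebertimbau-finetuned") writes the fine-tuned weights and configuration to disk so they can be reloaded later with from_pretrained.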
Troubleshooting Tips
- If you encounter issues loading the model, ensure you have the latest version of the Transformers library installed.
- If the model gives an unexpected classification, check the format of your input data to ensure it matches the required structure; a quick sanity check is sketched after this list.
- For more robust results, consider increasing the size of your training dataset.
- Need further assistance or looking to collaborate? For more insights, updates, and opportunities to work together on AI development projects, stay connected with fxis.ai.
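On the input-format tip above: the pipeline expects plain strings, and BERT-style models work within a 512-token limit, so it pays to clean and truncate your batch first. A minimal sanity check, again reusing the classifier from earlier:
python
comments = [
    "  primeiro comentário de exemplo  ",
    "segundo comentário de exemplo " * 200,  # deliberately too long
    "",                                      # empty strings should be dropped
]
# Keep only non-empty strings, stripped of surrounding whitespace
cleaned = [c.strip() for c in comments if isinstance(c, str) and c.strip()]

# truncation=True tells the underlying tokenizer to cut inputs at the
# model's maximum sequence length instead of failing on long comments
results = classifier(cleaned, truncation=True)
for comment, prediction in zip(cleaned, results):
    print(prediction["label"], f"{prediction['score']:.3f}", comment[:50])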
Results of Model Testing
The model showed promising performance on the test dataset, achieving:
- Precision: 0.856
- Recall: 0.892
- F1-Score: 0.874
Conclusion
By utilizing HateBERTimbau-YouTube, you can effectively contribute to online safety and awareness, combating hate speech in an increasingly digital world.
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

