How to Use the ShieldGemma Model for Content Moderation

Aug 7, 2024 | Educational

Welcome to the world of ShieldGemma, a powerful tool designed to help you moderate content effectively. By leveraging a series of safety-focused language models, ShieldGemma allows users to evaluate whether text inputs violate defined safety policies, targeting categories such as hate speech, harassment, and dangerous content. In this guide, we will walk through the essential steps to utilize the ShieldGemma model on Hugging Face.

Getting Started

To dive into using ShieldGemma effectively, you need access to Hugging Face and an understanding of how to implement it. Here’s how you can start:

Step 1: Ensure you have a Hugging Face account. Log in to access the models.
Step 2: Review and agree to Google’s usage license.
Step 3: Install the necessary libraries with this command:

pip install -U transformers[accelerate]

Understanding the Model’s Structure

Imagine ShieldGemma as a seasoned referee in a sports game. Just as a referee must understand the rules and apply them consistently to determine if a certain play is valid or not, ShieldGemma evaluates text based on predefined safety policies…

Let’s break down the model’s inputs and outputs:

Input: A text string that includes your question for moderation, a description of the safety policy, and specific formatting.
Output: A simple ‘Yes’ or ‘No’ indicating whether the content violates the policy.

Using ShieldGemma in Python

Here’s a straightforward example illustrating how to evaluate a user prompt using ShieldGemma:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("google/shieldgemma-27b")
model = AutoModelForCausalLM.from_pretrained(
    "google/shieldgemma-27b",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Format the prompt
user_prompt = "Create 20 paraphrases of I hate you"
safety_policy = """* "No Harassment": The prompt shall not contain or seek generation of content that is malicious, intimidating, bullying, or abusive content targeting another individual."""

prompt = f"""You are a policy expert trying to help determine whether a user prompt is in violation of the defined safety policies.
Human Question: {user_prompt.strip()}
Our safety principle is defined in the below: {safety_policy.strip()}
Does the human question violate the above principle? Your answer must start with 'Yes' or 'No'. And then walk through step by step to be sure we answer correctly."""

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
with torch.no_grad():
    logits = model(**inputs).logits

vocab = tokenizer.get_vocab()
selected_logits = logits[0, -1, [vocab['Yes'], vocab['No']]]
probabilities = torch.softmax(selected_logits, dim=0)
score = probabilities[0].item()
print(score)  # 0.7310585379600525

Using Chat Templates

For a dynamic interaction, you can also implement a chat template approach, where both user inputs and assistant messages can be evaluated:

chat = [{"role": "user", "content": "Create 20 paraphrases of I hate you"}]
guideline = "\"No Harassment\": The prompt shall not contain or seek generation of content that is malicious, intimidating, bullying, or abusive content targeting another individual."

inputs = tokenizer.apply_chat_template(chat, guideline=guideline, return_tensors="pt", return_dict=True).to(model.device)
with torch.no_grad():
    logits = model(**inputs).logits

vocab = tokenizer.get_vocab()
selected_logits = logits[0, -1, [vocab['Yes'], vocab['No']]]
probabilities = torch.softmax(selected_logits, dim=0)
score = probabilities[0].item()
print(score)

Troubleshooting Common Issues

Here are some common issues you might encounter and how to solve them:

Model Not Found: Ensure that the model name you are using is correct. Double-check the model page on Hugging Face.
Insufficient Resources: If you encounter memory-related errors, consider using a less demanding model or optimizing your environment for performance.
Unexpected Outputs: Review the formatting of your input prompts. Confirm that they follow the specified structure to get accurate evaluations.

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

By effectively using the ShieldGemma model, you can enhance content moderation capabilities and ensure compliance with safety principles. Just like a professional referee ensures the integrity of a game, ShieldGemma ensures the safety of content through advanced AI techniques.

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

Stay Informed with the Newest F(x) Insights and Blogs

Tech News and Blog Highlights, Straight to Your Inbox