Need to sort documents by topic without building a classifier from scratch? The Domain Classifier is a text classification model that assigns each input text to one of 26 predefined domain classes. This guide covers what the model is, how it was trained, and how you can run it on your own text.
Model Overview
The Domain Classifier is designed to classify documents into the following 26 categories:
- Adult
- Arts and Entertainment
- Autos and Vehicles
- Beauty and Fitness
- Books and Literature
- Business and Industrial
- Computers and Electronics
- Finance
- Food and Drink
- Games
- Health
- Hobbies and Leisure
- Home and Garden
- Internet and Telecom
- Jobs and Education
- Law and Government
- News
- Online Communities
- People and Society
- Pets and Animals
- Real Estate
- Science
- Sensitive Subjects
- Shopping
- Sports
- Travel and Transportation
Understanding the Model Architecture
This model is built on the DeBERTa V3 Base architecture, which has a maximum context length of 512 tokens. Think of it as a highly organized librarian who can only hold a limited number of pages in mind at a time, but can quickly use them to tell you which shelf a document belongs on!
Training Details
To provide accurate classifications, the model was trained on:
- 1 million Common Crawl samples labeled with the help of Google Cloud’s Natural Language API
- 500k curated Wikipedia articles
Training involved multiple rounds using both of these sources, applying a combination of pseudolabels and the Google Cloud API for better accuracy.
How to Use This Model
Input Specifications
The model is ready to take one or many paragraphs of text as input. Here’s an example:
Directions
1. Mix 2 flours and baking powder together
2. Mix water and egg in a separate bowl. Add dry to wet little by little
3. Heat frying pan on medium
4. Pour batter into pan and then put blueberries on top before flipping
5. Top with desired toppings!
Expected Output
The model will output a predicted domain class for each input text. For example:
Food_and_Drink
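Because the forward pass in the Transformers example below returns softmax probabilities, you are not limited to a single label. As a minimal sketch, the helper below ranks the classes by probability and returns the top few. The function name `top_k_domains`, the three-class mapping, and the scores are all hypothetical and only illustrate the idea; the real model has 26 classes.

```python
from typing import Dict, List, Tuple


def top_k_domains(
    probs: List[float], id2label: Dict[int, str], k: int = 3
) -> List[Tuple[str, float]]:
    """Return the k most probable (label, probability) pairs, highest first."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return [(id2label[idx], p) for idx, p in ranked[:k]]


# Hypothetical three-class mapping and scores, for illustration only.
example_id2label = {0: "Food_and_Drink", 1: "Health", 2: "Sports"}
example_probs = [0.91, 0.07, 0.02]

print(top_k_domains(example_probs, example_id2label, k=2))
# [('Food_and_Drink', 0.91), ('Health', 0.07)]
```

Looking at the runner-up classes can help you spot documents that sit between two domains.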
Using the Model in NeMo Curator
To get started with this model in NeMo Curator, first check the inference code available on NeMo Curator’s GitHub repository. You can download the model.pth file and explore the example notebook for detailed instructions.
Using the Model in Transformers
Here’s a sample code snippet demonstrating how to utilize the Domain Classifier model:
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer, AutoConfig
from huggingface_hub import PyTorchModelHubMixin


class CustomModel(nn.Module, PyTorchModelHubMixin):
    def __init__(self, config):
        super(CustomModel, self).__init__()
        self.model = AutoModel.from_pretrained(config['base_model'])
        self.dropout = nn.Dropout(config['fc_dropout'])
        self.fc = nn.Linear(self.model.config.hidden_size, len(config['id2label']))

    def forward(self, input_ids, attention_mask):
        features = self.model(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        dropped = self.dropout(features)
        outputs = self.fc(dropped)
        # Classify from the first token's representation and return class probabilities
        return torch.softmax(outputs[:, 0, :], dim=1)


# Setup configuration and model
config = AutoConfig.from_pretrained("nvidia/domain-classifier")
tokenizer = AutoTokenizer.from_pretrained("nvidia/domain-classifier")
model = CustomModel.from_pretrained("nvidia/domain-classifier")
model.eval()  # disable dropout so predictions are deterministic

# Prepare and process inputs
text_samples = ["Sports is a popular domain", "Politics is a popular domain"]
inputs = tokenizer(text_samples, return_tensors="pt", padding="longest", truncation=True)
with torch.no_grad():  # gradients are not needed for inference
    outputs = model(inputs['input_ids'], inputs['attention_mask'])

# Predict and display results
predicted_classes = torch.argmax(outputs, dim=1)
predicted_domains = [config.id2label[class_idx.item()] for class_idx in predicted_classes.cpu().numpy()]
print(predicted_domains)  # ['Sports', 'News']
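For larger collections, inputs are usually fed to a model in fixed-size batches rather than all at once. The sketch below shows one way to structure that loop. `iter_batches`, `classify_all`, and the `predict_batch` callable are hypothetical names, not part of the model's API, and the toy predictor merely stands in for a tokenizer-plus-model call like the one above.

```python
from typing import Callable, Iterator, List, Sequence


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def classify_all(
    texts: Sequence[str],
    predict_batch: Callable[[List[str]], List[str]],
    batch_size: int = 32,
) -> List[str]:
    """Run `predict_batch` over `texts` batch by batch, preserving input order."""
    labels: List[str] = []
    for batch in iter_batches(texts, batch_size):
        labels.extend(predict_batch(batch))
    return labels


def toy_predict_batch(batch: List[str]) -> List[str]:
    """Stand-in predictor (assumption): a real one would tokenize and call the model."""
    return ["Sports" if "sport" in text.lower() else "News" for text in batch]


docs = ["Sports is a popular domain", "Politics is a popular domain", "Winter sports"]
print(classify_all(docs, toy_predict_batch, batch_size=2))
```

A smaller batch size lowers peak memory use at the cost of more forward passes; the right value depends on your hardware.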
Evaluation Benchmarks
The model achieves a PR-AUC score of 0.9873 on an evaluation set of 105k samples. Here’s a glimpse of the per-domain scores:
| Domain | PR-AUC |
|---|---|
| Adult | 0.999 |
| Arts and Entertainment | 0.997 |
| Autos and Vehicles | 0.997 |
| Beauty and Fitness | 0.997 |
| Books and Literature | 0.995 |
| Business and Industrial | 0.982 |
| … (and more) | … |
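If you want a comparable per-domain number on your own labeled data, one common way to estimate PR-AUC is average precision, computed one-vs-rest for each class. The sketch below is a plain-Python illustration of that idea and is not necessarily how the scores above were produced. The function name, labels, and scores are made up for the example, and tied scores are not treated specially.

```python
from typing import List, Sequence


def average_precision(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Average precision for one class: mean of precision at each positive's rank.

    y_true holds 1 for documents of the class and 0 otherwise; scores holds the
    model's probability for that class. Ties in scores are broken arbitrarily.
    """
    total_positives = sum(y_true)
    if total_positives == 0:
        raise ValueError("at least one positive example is required")
    # Sort documents from most to least confident for this class.
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank  # precision at this recall step
    return precision_sum / total_positives


# Made-up labels and scores for a single class, for illustration only.
labels: List[int] = [1, 0, 1, 0]
class_scores: List[float] = [0.9, 0.8, 0.7, 0.1]
print(round(average_precision(labels, class_scores), 4))  # 0.8333
```

A score of 1.0 means every document of the class is ranked above every other document.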
Troubleshooting Tips
If you encounter any issues while using the Domain Classifier model, here are some helpful suggestions:
- Make sure the required dependencies are installed, in particular PyTorch, Transformers, and huggingface_hub, since the example code imports all three.
- If you run into model compatibility errors, verify that you’re using the correct version of the model with its associated libraries.
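- When working with long texts, keep the 512-token limit in mind. With truncation=True the tokenizer cuts the input to its configured maximum length, so only the beginning of a long document influences the prediction.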
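One possible workaround for long documents is to split the token IDs into overlapping windows, classify each window, and combine the per-window probabilities. The sketch below shows only the splitting and averaging steps in plain Python. The function names are hypothetical, averaging is an assumed aggregation strategy and not something the model's authors prescribe, and the default of 510 assumes two positions are reserved for special tokens such as [CLS] and [SEP].

```python
from typing import List, Sequence


def split_into_windows(
    token_ids: Sequence[int], max_len: int = 510, overlap: int = 64
) -> List[List[int]]:
    """Split token IDs into windows of at most max_len, sharing `overlap` tokens."""
    if max_len <= 0 or not 0 <= overlap < max_len:
        raise ValueError("require max_len > 0 and 0 <= overlap < max_len")
    step = max_len - overlap
    windows: List[List[int]] = []
    start = 0
    while True:
        windows.append(list(token_ids[start:start + max_len]))
        if start + max_len >= len(token_ids):
            break  # this window reached the end of the document
        start += step
    return windows


def average_probabilities(per_window_probs: Sequence[Sequence[float]]) -> List[float]:
    """Average class probabilities across windows (assumed aggregation strategy)."""
    if not per_window_probs:
        raise ValueError("need at least one window")
    count = len(per_window_probs)
    return [sum(column) / count for column in zip(*per_window_probs)]


# A fake 1000-token document: the IDs are just a range, for illustration only.
windows = split_into_windows(list(range(1000)), max_len=510, overlap=64)
print([len(window) for window in windows])
print(average_probabilities([[0.75, 0.25], [0.25, 0.75]]))
```

In practice each window would be decoded or passed to the model with its special tokens added before the per-window probabilities are averaged.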
Conclusion
By following the steps outlined above, you can set up the Domain Classifier model and use it to classify text documents efficiently.

