How to Use the Domain Classifier Model for Text Classification

Jun 24, 2024 | Educational

Are you looking to categorize documents into specific domains without breaking a sweat? Welcome to the world of the Domain Classifier model. This text classification model sorts each document into one of 26 predefined domain classes. Let’s dive in and see how you can use it for your own text classification tasks.

Model Overview

The Domain Classifier is designed to classify documents into 26 different categories, including:

Adult
Arts and Entertainment
Autos and Vehicles
Beauty and Fitness
Books and Literature
Business and Industrial
Computers and Electronics
Finance
Food and Drink
Games
Health
Hobbies and Leisure
Home and Garden
Internet and Telecom
Jobs and Education
Law and Government
News
Online Communities
People and Society
Pets and Animals
Real Estate
Science
Sensitive Subjects
Shopping
Sports
Travel and Transportation

Understanding the Model Architecture

This model is built on the DeBERTa V3 Base architecture, which supports a maximum context length of 512 tokens. Think of it as a highly organized librarian who can only remember a limited number of books at a time, but can quickly reference them to help you find the information you need!

Training Details

To provide accurate classifications, the model was trained on:

1 million Common Crawl samples labeled with the help of Google Cloud’s Natural Language API
500k curated Wikipedia articles

Training ran in multiple rounds over both of these sources, with labels drawn from a combination of pseudolabels and the Google Cloud API.
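
The source does not say how the pseudolabels were produced or filtered. Purely as a generic illustration, one common form of pseudolabeling keeps a model's own prediction as a training label only when its confidence clears a threshold. The function name, the 0.9 threshold, and the sample data below are all hypothetical, and this is not the documented training recipe for the Domain Classifier.

```python
# Hypothetical sketch of confidence-thresholded pseudolabeling.
# NOT the documented training procedure; names and values are assumptions.

def select_pseudolabels(predictions, threshold=0.9):
    """Keep (text, label) pairs whose top probability is at least `threshold`.

    `predictions` is a list of (text, {label: probability}) tuples.
    """
    selected = []
    for text, probs in predictions:
        # Pick the most probable label for this text.
        label, confidence = max(probs.items(), key=lambda item: item[1])
        if confidence >= threshold:
            selected.append((text, label))
    return selected


# Made-up model outputs for three unlabeled documents.
predictions = [
    ("Stocks rallied after the earnings call", {"Finance": 0.96, "News": 0.04}),
    ("A quiet weekend of puzzles and tea", {"Games": 0.55, "Food_and_Drink": 0.45}),
    ("The striker scored twice", {"Sports": 0.99, "News": 0.01}),
]

pseudolabeled = select_pseudolabels(predictions)
print(pseudolabeled)
```

The low-confidence second document is dropped, so only predictions the model is sure about would be fed back into a later training round.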

How to Use This Model

Input Specifications

The model takes one or more paragraphs of text as input. Here’s an example:

Directions
1. Mix 2 flours and baking powder together
2. Mix water and egg in a separate bowl. Add dry to wet little by little
3. Heat frying pan on medium
4. Pour batter into pan and then put blueberries on top before flipping
5. Top with desired toppings!

Expected Output

The model will output a predicted domain class for each input text. For example:

Food_and_Drink
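
Because the model returns one label per input text, a batch of predictions lines up one-to-one with the batch of inputs. A minimal sketch of how that pairing might be used to sort documents into per-domain buckets follows; the helper name and the sample texts and predictions are made up for illustration.

```python
from collections import defaultdict


def group_by_domain(texts, predicted_domains):
    """Pair each input text with its predicted label and group texts by label."""
    if len(texts) != len(predicted_domains):
        raise ValueError("expected exactly one predicted domain per text")
    groups = defaultdict(list)
    for text, domain in zip(texts, predicted_domains):
        groups[domain].append(text)
    return dict(groups)


# Hypothetical inputs and predictions, not real model output.
texts = [
    "Pancake recipe with blueberries",
    "Final score of last night's match",
    "How to proof sourdough",
]
predicted_domains = ["Food_and_Drink", "Sports", "Food_and_Drink"]

grouped = group_by_domain(texts, predicted_domains)
print(grouped)
```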

Using the Model in NeMo Curator

To get started with this model in NeMo Curator, first check the inference code available on NeMo Curator’s GitHub repository. You can download the model.pth file and explore the example notebook for detailed instructions.

Using the Model in Transformers

Here’s a sample code snippet demonstrating how to utilize the Domain Classifier model:

import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer, AutoConfig
from huggingface_hub import PyTorchModelHubMixin

class CustomModel(nn.Module, PyTorchModelHubMixin):
    def __init__(self, config):
        super(CustomModel, self).__init__()
        self.model = AutoModel.from_pretrained(config['base_model'])
        self.dropout = nn.Dropout(config['fc_dropout'])
        self.fc = nn.Linear(self.model.config.hidden_size, len(config['id2label']))

    def forward(self, input_ids, attention_mask):
        features = self.model(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        dropped = self.dropout(features)
        outputs = self.fc(dropped)
        return torch.softmax(outputs[:, 0, :], dim=1)

# Setup configuration and model
config = AutoConfig.from_pretrained("nvidia/domain-classifier")
tokenizer = AutoTokenizer.from_pretrained("nvidia/domain-classifier")
model = CustomModel.from_pretrained("nvidia/domain-classifier")
model.eval()  # switch off dropout so predictions are deterministic

# Prepare and process inputs (truncate anything beyond the 512-token context)
text_samples = ["Sports is a popular domain", "Politics is a popular domain"]
inputs = tokenizer(text_samples, return_tensors="pt", padding="longest", truncation=True, max_length=512)

# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(inputs['input_ids'], inputs['attention_mask'])

# Predict and display results
predicted_classes = torch.argmax(outputs, dim=1)
predicted_domains = [config.id2label[class_idx] for class_idx in predicted_classes.tolist()]
print(predicted_domains)  # ['Sports', 'News']
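
The snippet above keeps only the single most likely class. Since forward returns softmax probabilities, it can also be useful to look at the runners-up. Here is a minimal sketch in plain Python that ranks one row of probabilities (for example, outputs[0].tolist()). The helper name is hypothetical, and the four-entry id2label below is a stand-in for the real 26-entry mapping in config.id2label, assumed to be keyed by integer class index.

```python
def top_k_domains(probabilities, id2label, k=3):
    """Return the k most probable (label, probability) pairs, best first."""
    ranked = sorted(enumerate(probabilities), key=lambda pair: pair[1], reverse=True)
    return [(id2label[idx], prob) for idx, prob in ranked[:k]]


# Hypothetical stand-ins: the real mapping comes from config.id2label.
id2label = {0: "Finance", 1: "News", 2: "Sports", 3: "Games"}
probabilities = [0.05, 0.25, 0.6, 0.1]

top = top_k_domains(probabilities, id2label, k=2)
print(top)  # [('Sports', 0.6), ('News', 0.25)]
```

A small gap between the first and second probability can be a hint that the text straddles two domains.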

Evaluation Benchmarks

The model achieves a PR-AUC score of 0.9873 on an evaluation set of 105k samples. Here’s a glimpse of the per-domain scores:

Domain	PR-AUC
Adult	0.999
Arts and Entertainment	0.997
Autos and Vehicles	0.997
Beauty and Fitness	0.997
Books and Literature	0.995
Business and Industrial	0.982
… (and more)	…

Troubleshooting Tips

If you encounter any issues while using the Domain Classifier model, here are some helpful suggestions:

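Ensure that you’ve correctly installed the necessary dependencies, especially the PyTorch, Transformers, and huggingface_hub libraries that the sample code imports.
If you run into model compatibility errors, verify that the installed library versions are compatible with the model.
When working with long texts, keep your input within the 512-token limit. With truncation enabled, as in the sample code, the tokenizer cuts off the excess tokens, so the model classifies only the beginning of the text.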
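
One possible workaround for long documents, not described in the source, is to split the token ids into overlapping windows, classify each window, and then combine the per-window predictions. The sketch below shows only the splitting step on a plain list of integers. The function name, the stride of 384, and the window size of 510 (which assumes the tokenizer adds two special tokens to each window) are all assumptions.

```python
def split_into_windows(token_ids, window_size=510, stride=384):
    """Split token_ids into overlapping windows of at most window_size tokens."""
    if window_size <= 0 or stride <= 0:
        raise ValueError("window_size and stride must be positive")
    if stride > window_size:
        raise ValueError("a stride larger than window_size would skip tokens")
    if not token_ids:
        return []
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + window_size])
        if start + window_size >= len(token_ids):
            break  # this window already reaches the end of the text
        start += stride
    return windows


# Stand-in for the token ids of a 1000-token document.
token_ids = list(range(1000))
windows = split_into_windows(token_ids)
print([len(w) for w in windows])  # [510, 510, 232]
```

How to merge the per-window results (for instance, averaging the probability vectors or taking a majority vote) is a separate design choice that depends on your data.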


Conclusion

By applying the techniques outlined above, you can easily set up and utilize the Domain Classifier model to classify text documents efficiently.
