How to Use the Domain Classifier Model for Text Classification

Jun 24, 2024 | Educational

Need to sort documents by topic? The Domain Classifier is a text classification model that assigns each document to one of 26 predefined domain classes. This guide covers what the model is, how it was trained, and how to run it on your own text.

Model Overview

The Domain Classifier assigns each document to one of the following 26 categories:

  • Adult
  • Arts and Entertainment
  • Autos and Vehicles
  • Beauty and Fitness
  • Books and Literature
  • Business and Industrial
  • Computers and Electronics
  • Finance
  • Food and Drink
  • Games
  • Health
  • Hobbies and Leisure
  • Home and Garden
  • Internet and Telecom
  • Jobs and Education
  • Law and Government
  • News
  • Online Communities
  • People and Society
  • Pets and Animals
  • Real Estate
  • Science
  • Sensitive Subjects
  • Shopping
  • Sports
  • Travel and Transportation

Understanding the Model Architecture

This model is built on the DeBERTa V3 Base architecture and has a maximum context length of 512 tokens. Think of it as a highly organized librarian who can only hold a limited number of books at a time, but can quickly reference them to help you find the information you need!

Training Details

To provide accurate classifications, the model was trained on:

  • 1 million Common Crawl samples labeled with the help of Google Cloud’s Natural Language API
  • 500k curated Wikipedia articles

The model was trained in multiple rounds on both of these sources, with labels drawn from a combination of pseudolabels and the Google Cloud API.
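The post does not describe how the pseudolabels were produced, so the following is not the authors' procedure. It is a minimal sketch of one common approach, confidence-threshold pseudolabeling, in which an earlier model's predictions are kept as labels only when the model is confident. The function name `select_pseudolabels`, the 0.9 threshold, and the sample predictions are all assumptions made for illustration.

```python
def select_pseudolabels(predictions, threshold=0.9):
    """Keep (text, label) pairs whose top probability meets the threshold.

    predictions: iterable of (text, {label: probability}) pairs.
    The threshold value is an illustrative assumption, not a documented setting.
    """
    selected = []
    for text, probs in predictions:
        label, confidence = max(probs.items(), key=lambda item: item[1])
        if confidence >= threshold:
            selected.append((text, label))
    return selected

# Made-up predictions from a hypothetical earlier-round model.
preds = [
    ("Quarterly earnings beat forecasts", {"Finance": 0.96, "News": 0.04}),
    ("A popular topic online", {"News": 0.55, "Online_Communities": 0.45}),
]
print(select_pseudolabels(preds))  # [('Quarterly earnings beat forecasts', 'Finance')]
```

Only the first text is kept, because the second prediction's top probability (0.55) falls below the threshold.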

How to Use This Model

Input Specifications

The model accepts one or more paragraphs of text as input. Here’s an example:

Directions
1. Mix 2 flours and baking powder together
2. Mix water and egg in a separate bowl. Add dry to wet little by little
3. Heat frying pan on medium
4. Pour batter into pan and then put blueberries on top before flipping
5. Top with desired toppings!

Expected Output

The model will output a predicted domain class for each input text. For example:

Food_and_Drink
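To make the step from model scores to a single label concrete, here is a minimal sketch in plain Python. The three-entry `id2label` mapping and the logit values are made up for illustration; the real model has 26 classes and ships its own mapping in its configuration.

```python
import math

# Hypothetical 3-class mapping for illustration only; the real model has 26 classes.
id2label = {0: "Finance", 1: "Food_and_Drink", 2: "Sports"}

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_label(logits, id2label):
    """Return the highest-probability label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]

# Made-up logits standing in for the model's scores on the recipe above.
label, confidence = predict_label([0.2, 4.1, -1.3], id2label)
print(label, round(confidence, 3))
```

The printed label is the class with the largest probability, which is the same argmax step the Transformers example later in this post performs on the model's output.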

Using the Model in NeMo Curator

To get started with this model in NeMo Curator, first check the inference code available on NeMo Curator’s GitHub repository. You can download the model.pth file and explore the example notebook for detailed instructions.

Using the Model in Transformers

Here’s a sample code snippet that loads the Domain Classifier from the Hugging Face Hub and classifies two short texts:

import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer, AutoConfig
from huggingface_hub import PyTorchModelHubMixin

class CustomModel(nn.Module, PyTorchModelHubMixin):
    """Pretrained backbone followed by dropout and a linear classification head."""

    def __init__(self, config):
        super().__init__()
        # config is indexed like a dict: base model name, dropout rate, label map
        self.model = AutoModel.from_pretrained(config['base_model'])
        self.dropout = nn.Dropout(config['fc_dropout'])
        self.fc = nn.Linear(self.model.config.hidden_size, len(config['id2label']))

    def forward(self, input_ids, attention_mask):
        # Per-token hidden states, shape (batch, sequence, hidden)
        features = self.model(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        dropped = self.dropout(features)
        outputs = self.fc(dropped)
        # Classify from the first token's position and normalize to probabilities
        return torch.softmax(outputs[:, 0, :], dim=1)

# Load the configuration, tokenizer, and model weights
config = AutoConfig.from_pretrained("nvidia/domain-classifier")
tokenizer = AutoTokenizer.from_pretrained("nvidia/domain-classifier")
model = CustomModel.from_pretrained("nvidia/domain-classifier")
model.eval()  # switch off dropout so predictions are deterministic

# Tokenize the inputs; truncation=True cuts inputs that exceed the tokenizer's maximum length
text_samples = ["Sports is a popular domain", "Politics is a popular domain"]
inputs = tokenizer(text_samples, return_tensors="pt", padding="longest", truncation=True)
with torch.no_grad():  # inference only, so skip gradient tracking
    outputs = model(inputs['input_ids'], inputs['attention_mask'])

# Map each row's highest-probability class index to its domain label
predicted_classes = torch.argmax(outputs, dim=1)
predicted_domains = [config.id2label[class_idx] for class_idx in predicted_classes.tolist()]
print(predicted_domains)  # ['Sports', 'News']
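The snippet above keeps only the single best class per text. If you also want the runner-up domains and their scores, a small helper along these lines may be useful. This is a sketch in plain Python: `top_k_domains` is a hypothetical helper name, and the four-class mapping and probabilities are made up so the example runs without downloading the model.

```python
def top_k_domains(prob_row, id2label, k=3):
    """Return the k most likely (label, probability) pairs, best first."""
    ranked = sorted(enumerate(prob_row), key=lambda pair: pair[1], reverse=True)
    return [(id2label[idx], prob) for idx, prob in ranked[:k]]

# Hypothetical 4-class example; the real model outputs 26 probabilities per text.
demo_id2label = {0: "News", 1: "Sports", 2: "Health", 3: "Games"}
demo_probs = [0.10, 0.70, 0.05, 0.15]
print(top_k_domains(demo_probs, demo_id2label, k=2))  # [('Sports', 0.7), ('Games', 0.15)]
```

With the real model, `outputs.detach().tolist()` should give one such probability row per input text, and `config.id2label` would take the place of the demo mapping.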

Evaluation Benchmarks

The model achieves a PR-AUC score of 0.9873 on an evaluation set of 105k samples. Here are the scores for some of the domains:

Domain                     PR-AUC
Adult                      0.999
Arts and Entertainment     0.997
Autos and Vehicles         0.997
Beauty and Fitness         0.997
Books and Literature       0.995
Business and Industrial    0.982
… (and more)
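PR-AUC is the area under the precision-recall curve, computed per domain by treating that domain as the positive class and every other domain as negative. The post does not say which estimator was used, so the sketch below shows average precision, one common way to approximate that area. The labels and scores are made up for illustration.

```python
def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve for one class.

    y_true: 1 if the sample belongs to the class, else 0.
    scores: the model's probability for that class.
    Tied scores are not given any special treatment in this sketch.
    """
    total_pos = sum(y_true)
    if total_pos == 0:
        raise ValueError("need at least one positive sample")
    # Rank samples from the highest score to the lowest.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    ap = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i]:
            hits += 1
            ap += hits / rank  # precision at the rank where recall increases
    return ap / total_pos

# Made-up one-vs-rest labels and scores for a single domain.
y_true = [1, 0, 1, 1, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.2]
print(round(average_precision(y_true, scores), 4))  # 0.8056
```

A score of 1.0 means every positive sample was ranked above every negative one, which is why per-domain values such as 0.999 indicate a nearly perfect ranking.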

Troubleshooting Tips

If you encounter any issues while using the Domain Classifier model, here are some helpful suggestions:

  • Make sure the required dependencies are installed, especially PyTorch, Transformers, and huggingface_hub, all of which the sample code imports.
  • If you run into model compatibility errors, check that your installed library versions are compatible with the model.
  • When working with long texts, remember that the model reads at most 512 tokens. In the sample code, truncation=True tells the tokenizer to cut longer inputs down to its configured maximum length, so text past that point is ignored.
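If truncating a long document would discard too much, one possible workaround is to split it into overlapping windows, classify each window, and average the resulting probabilities. The post does not recommend this, so treat it as a sketch under stated assumptions: `chunk_tokens` and `average_rows` are hypothetical helpers, the stride of 256 is arbitrary, and whitespace-split words stand in for the real subword tokens.

```python
def chunk_tokens(tokens, max_len=512, stride=256):
    """Split a token list into overlapping windows of at most max_len tokens."""
    if max_len <= 0 or not 0 < stride <= max_len:
        raise ValueError("require max_len > 0 and 0 < stride <= max_len")
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # this window already reaches the end of the document
        start += stride
    return chunks

def average_rows(rows):
    """Average per-chunk probability rows into one document-level row."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

# Stand-in tokens: whitespace-split words, not real subword tokens.
words = ("pancake " * 1200).split()
chunks = chunk_tokens(words)
print(len(chunks), [len(c) for c in chunks])  # 4 [512, 512, 512, 432]
```

With a real tokenizer, leave a few tokens of headroom below 512, since special tokens added during encoding also count toward the limit.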

Conclusion

By applying the techniques outlined above, you can easily set up and utilize the Domain Classifier model to classify text documents efficiently. At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
