How to Use sahajBERT for News Article Classification

Jun 18, 2021 | Educational

Natural Language Processing (NLP) moves quickly, and models like sahajBERT make it practical to classify news articles by topic. sahajBERT is built specifically for Bengali-language text, which makes it a useful resource for the Bengali-speaking community.

Model Overview

The sahajBERT-NCC checkpoint is sahajBERT fine-tuned for news article classification on the sna.bn split of the IndicGlue dataset; its reported results appear under Performance Evaluation. It assigns each article one of six class IDs:

  • 0 – Kolkata
  • 1 – State
  • 2 – National
  • 3 – Sports
  • 4 – Entertainment
  • 5 – International
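The class IDs above can be turned into readable names with a small lookup. This is a minimal sketch: `ID2LABEL` and `readable_label` are hypothetical names, and the `LABEL_<id>` format is the Transformers default when a checkpoint's config defines no custom label names, which is assumed (not confirmed) to be the case for this model.

```python
# Class IDs and names as listed in the model overview.
ID2LABEL = {
    0: "Kolkata",
    1: "State",
    2: "National",
    3: "Sports",
    4: "Entertainment",
    5: "International",
}


def readable_label(raw_label: str) -> str:
    """Map a label such as 'LABEL_3' or '3' to its category name.

    Assumption: the pipeline reports labels as 'LABEL_<id>'. Anything that
    does not end in a known class ID is returned unchanged.
    """
    suffix = raw_label.rsplit("_", 1)[-1]
    if suffix.isdigit() and int(suffix) in ID2LABEL:
        return ID2LABEL[int(suffix)]
    return raw_label
```

If the checkpoint already reports category names, the function passes them through untouched.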

How to Use sahajBERT

Using sahajBERT for sequence classification is straightforward. Below is a step-by-step guide to get you started:

Step 1: Installation

Make sure the transformers library is installed, along with PyTorch, which AlbertForSequenceClassification runs on. If not, you can install both via pip:

pip install transformers torch
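Before loading the model, it can help to confirm that the required packages are importable. The sketch below assumes this guide needs `transformers` and `torch` (the model class used in Step 2 is the PyTorch implementation); `missing_packages` is a hypothetical helper name.

```python
import importlib.util
from typing import List


def missing_packages(names: List[str]) -> List[str]:
    """Return the top-level package names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed requirements for this guide.
missing = missing_packages(["transformers", "torch"])
if missing:
    print("Install these packages first:", ", ".join(missing))
```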

Step 2: Initialize the Model

Here’s how to initialize the tokenizer, model, and pipeline, then classify a sample sentence:

from transformers import AlbertForSequenceClassification, TextClassificationPipeline, PreTrainedTokenizerFast

# Initialize tokenizer
tokenizer = PreTrainedTokenizerFast.from_pretrained("neuropark/sahajBERT-NCC")

# Initialize model
model = AlbertForSequenceClassification.from_pretrained("neuropark/sahajBERT-NCC")

# Initialize pipeline
pipeline = TextClassificationPipeline(tokenizer=tokenizer, model=model)

# Input text for classification
raw_text = "এই ইউনিয়নে ৩ টি মৌজা ও ১০ টি গ্রাম আছে ।"  # Change me

# Get output: a list with one dict per input, each holding a 'label' and a 'score'
output = pipeline(raw_text)
print(output)
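To classify several articles at once, the pipeline can be called with a list of strings, and it returns one prediction per input. The helper below is a minimal sketch that pairs each text with its top prediction; `classify_batch` is a hypothetical name, and the function only assumes that the classifier returns one dict with `label` and `score` keys per text.

```python
from typing import Callable, Dict, List


def classify_batch(classifier: Callable, texts: List[str]) -> List[Dict]:
    """Run a text classifier on several texts and pair each text with its prediction.

    `classifier` is expected to behave like the pipeline built above: called
    with a list of strings, it returns one {'label', 'score'} dict per string.
    """
    predictions = classifier(texts)
    return [
        {"text": text, "label": pred["label"], "score": float(pred["score"])}
        for text, pred in zip(texts, predictions)
    ]


# Usage with the pipeline object from the code above:
# rows = classify_batch(pipeline, [raw_text, "another article ..."])
# for row in rows:
#     print(row["label"], round(row["score"], 4))
```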

Analogy: Unpacking the Process

Think of sahajBERT as a well-trained librarian in a vast library of news articles. When you hand over a piece of text, the librarian quickly skims through their knowledge and classifies the article based on previously established categories. In this case, you’re feeding a piece of raw text—much like a book title—and the librarian (sahajBERT) sorts it into “Kolkata,” “National,” or “Sports,” among others, based on context.

Troubleshooting Tips

If you encounter issues while using sahajBERT, consider the following:

  • Error in loading model: Ensure that the model name is spelled correctly and internet access is available to download the model.
  • Unexpected output: Check your input text for any unusual characters or phrases that may confuse the model.
  • Performance issues: If the model appears slow, consider testing on a machine with better computing resources or ensure that your code isn’t running in an overly taxing environment.
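For the "unexpected output" case, one way to check the input is to scan it for characters that rarely belong in article text. The sketch below flags control characters, private-use and unassigned code points, and the Unicode replacement character, which often signals a decoding problem. `find_unusual_chars` is a hypothetical helper, and the choice of what counts as "unusual" is an assumption; zero-width joiners are deliberately left alone because they can be meaningful in Bengali script.

```python
import unicodedata
from typing import List, Tuple

# Unicode general categories treated as suspicious (an assumption for this sketch):
# Cc = control, Co = private use, Cn = unassigned.
SUSPICIOUS_CATEGORIES = {"Cc", "Co", "Cn"}


def find_unusual_chars(text: str) -> List[Tuple[int, str]]:
    """Return (position, code point) pairs for characters worth inspecting."""
    findings = []
    for index, char in enumerate(text):
        if char in ("\n", "\t"):
            continue  # ordinary line breaks and tabs are fine
        if char == "\ufffd" or unicodedata.category(char) in SUSPICIOUS_CATEGORIES:
            findings.append((index, f"U+{ord(char):04X}"))
    return findings
```

An empty result means no flagged characters were found; it does not guarantee the model will handle the text well.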


Performance Evaluation

Once you have classified your articles, you may want to know how well sahajBERT performs. Here’s a quick look at its evaluation results:

  • Loss: 0.2477
  • Accuracy: 0.9263
  • Macro F1: 0.9080
  • Recall: 0.9263
  • Macro Precision: 0.9110
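To make these metric names concrete, here is a minimal pure-Python sketch of how accuracy and macro-averaged precision, recall, and F1 are computed from true and predicted class IDs. It is not the evaluation code behind the numbers above, and the source does not say which averaging was used for the reported recall; `macro_metrics` is a hypothetical name.

```python
from typing import Dict, List


def macro_metrics(y_true: List[int], y_pred: List[int]) -> Dict[str, float]:
    """Compute accuracy plus macro-averaged precision, recall, and F1.

    Macro averaging computes each metric per class, then takes the plain
    mean over classes, so every class counts equally regardless of size.
    """
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    return {
        "accuracy": sum(1 for t, p in pairs if t == p) / len(pairs),
        "macro_precision": sum(precisions) / len(labels),
        "macro_recall": sum(recalls) / len(labels),
        "macro_f1": sum(f1s) / len(labels),
    }


# Toy labels for illustration only; these are not the model's predictions.
print(macro_metrics([0, 0, 1, 1], [0, 1, 1, 1]))
```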

