In this guide, we will explore how to use the Interpress News Classification dataset to build a robust model for categorizing Turkish news articles. This dataset offers a wealth of real-world data well suited to machine learning tasks. So, strap in as we take a journey through data preprocessing, model implementation, and troubleshooting!
1. Getting Started with the Interpress Dataset
The dataset initially contained 273K entries and has been filtered down to 108K for our model. For more information, you can visit the dataset page on the Hugging Face Hub.
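If you would like to inspect the data yourself, here is a minimal sketch using the Hugging Face datasets library. It assumes the filtered corpus is published on the Hub under the id interpress_news_category_tr_lite; verify the identifier on the Hub before relying on it.
from datasets import load_dataset

# Assumed Hub id for the 108K filtered corpus -- verify before use
dataset = load_dataset("interpress_news_category_tr_lite")
print(dataset)              # available splits and their sizes
print(dataset["train"][0])  # one sample record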
2. Model Performance
The model reaches 97% accuracy on both the training and validation data, using an 80% train / 20% validation split. Two standard diagnostics help to verify this:
- Classification Report: evaluates model performance for each category.
- Confusion Matrix: visualizes correct versus incorrect predictions (see the sketch after this list).
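To reproduce these diagnostics on your own 20% validation split, here is a minimal sketch using scikit-learn; y_true and y_pred are hypothetical placeholders for your validation labels and the model's predictions.
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical labels -- replace with your validation labels and model predictions
y_true = [2, 4, 5, 2, 9]   # ground-truth class ids
y_pred = [2, 4, 2, 2, 9]   # predicted class ids

print(classification_report(y_true, y_pred))  # per-class precision/recall/F1
print(confusion_matrix(y_true, y_pred))       # rows: true labels, columns: predicted labels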
3. Installation of Necessary Libraries
You’ll want to install the Hugging Face Transformers library to leverage its models and tokenizers. The code below also relies on PyTorch and NumPy, so install them alongside it:
pip install transformers torch numpy
4. How to Implement the Model
Now we’ll dive into the code that will help us classify news articles. Think of the model implementation as setting up a high-tech coffee machine:
- You first need the right ingredients (libraries)
- Next, you set it up (initialize the tokenizer and model)
- Finally, you brew (perform the predictions)
4.1 Using PyTorch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the pre-trained tokenizer and classification model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("serdarakyol/interpress-turkish-news-classification")
model = AutoModelForSequenceClassification.from_pretrained("serdarakyol/interpress-turkish-news-classification")
Note how both the tokenizer and the model are initialized from the same pre-trained checkpoint.
4.2 GPU Support
If you have a GPU, we will take advantage of it to speed up our computations. The following code checks for CUDA availability:
import torch

# Use a GPU if one is available; otherwise fall back to the CPU
if torch.cuda.is_available():
    device = torch.device('cuda')
    model = model.cuda()  # move the model's weights onto the GPU
    print(f"There are {torch.cuda.device_count()} GPU(s) available.")
    print(f"GPU name is: {torch.cuda.get_device_name(0)}")
else:
    print("No GPU available, using the CPU instead.")
    device = torch.device('cpu')
4.3 Making Predictions
Now, let’s make our predictions using the following function:
import numpy as np

def prediction(news):
    # Wrap the single article in a list and tokenize it:
    # pad/truncate to 512 tokens and return PyTorch tensors
    news = [news]
    indices = tokenizer.batch_encode_plus(
        news,
        max_length=512,
        add_special_tokens=True,
        return_attention_mask=True,
        padding='max_length',
        truncation=True,
        return_tensors='pt'
    )
    inputs = indices['input_ids'].to(device)
    masks = indices['attention_mask'].to(device)
    # Run the model without tracking gradients (inference only)
    with torch.no_grad():
        output = model(inputs, token_type_ids=None, attention_mask=masks)
    logits = output[0].detach().cpu().numpy()
    # Return the index of the highest-scoring class
    pred = np.argmax(logits, axis=1)[0]
    return pred
5. Example of Using the Model
Here’s a sample news headline you might want to classify:
news = "Beyaz Saray Sözcüsü Psaki, Muhammed bin Selmana yaptırım uygulamamanın doğru karar olduğunu savundu."
labels = {
0: "Culture-Art",
1: "Economy",
2: "Politics",
3: "Education",
4: "World",
5: "Sport",
6: "Technology",
7: "Magazine",
8: "Health",
9: "Agenda"
}
pred = prediction(news)
print(labels[pred]) # Should output "World"
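If you also want a confidence score rather than just a label, one option is to apply a softmax to the logits. The helper below is a minimal sketch built on top of the tokenizer, model, and labels defined above; it is an addition for illustration, not part of the original model card, and the printed output is only indicative.
import torch.nn.functional as F

def prediction_with_confidence(text):
    # Same preprocessing as prediction(), using the tokenizer's __call__ shorthand
    encoded = tokenizer(text, max_length=512, padding='max_length',
                        truncation=True, return_tensors='pt')
    with torch.no_grad():
        logits = model(encoded['input_ids'].to(device),
                       attention_mask=encoded['attention_mask'].to(device)).logits
    probs = F.softmax(logits, dim=1)[0]  # convert logits to probabilities
    pred = int(torch.argmax(probs))
    return labels[pred], float(probs[pred])

print(prediction_with_confidence(news))  # e.g. ('World', 0.98) -- illustrative values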
6. Troubleshooting Tips
- If you encounter any installation issues, ensure that your Python environment is up-to-date and the required packages are compatible.
- For unexpected model behaviors or accuracy issues, reconsider the data split or check for imbalanced classes in your dataset.
- If you’re facing GPU compatibility issues, ensure that you have the appropriate drivers installed for your hardware; the quick check after this list can help.
- For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
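As a quick sanity check for the GPU setup, the snippet below prints the installed PyTorch version, the CUDA version it was built against, and whether a GPU is visible. If torch.cuda.is_available() returns False even though you have a GPU, the driver and the PyTorch CUDA build likely don't match.
import torch

print("PyTorch version:", torch.__version__)
print("Built against CUDA:", torch.version.cuda)  # None for CPU-only builds
print("GPU visible:", torch.cuda.is_available())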
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
7. Acknowledgments
Special thanks to @yavuzkomecoglu for their contributions that helped enhance this project.