In this guide, we will explore how to use the Interpress News Classification dataset to build a robust model for categorizing Turkish news articles. This dataset offers a wealth of real-world data well suited to machine learning tasks. So, strap in as we take a journey through data preprocessing, model implementation, and troubleshooting!
1. Getting Started with the Interpress Dataset
The dataset initially contained 273K entries and has been filtered down to 108K for our model. For more information, you can visit the dataset page on the Hugging Face Hub.
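If you would like to inspect the data yourself, here is a minimal sketch using the Hugging Face datasets library. It assumes the filtered corpus is published on the Hub under the id interpress_news_category_tr_lite; verify the identifier on the Hub before relying on it.
from datasets import load_dataset

# Assumed Hub id for the 108K filtered corpus -- verify before use
dataset = load_dataset("interpress_news_category_tr_lite")
print(dataset)              # available splits and their sizes
print(dataset["train"][0])  # one sample record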
2. Model Performance
The model reaches 97% accuracy on both the training and validation data, using an 80% train / 20% validation split. Two standard diagnostics help to verify this:
- Classification Report: evaluates model performance for each category.
- Confusion Matrix: visualizes correct versus incorrect predictions (see the sketch after this list).
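To reproduce these diagnostics on your own 20% validation split, here is a minimal sketch using scikit-learn; y_true and y_pred are hypothetical placeholders for your validation labels and the model's predictions.
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical labels -- replace with your validation labels and model predictions
y_true = [2, 4, 5, 2, 9]   # ground-truth class ids
y_pred = [2, 4, 2, 2, 9]   # predicted class ids

print(classification_report(y_true, y_pred))  # per-class precision/recall/F1
print(confusion_matrix(y_true, y_pred))       # rows: true labels, columns: predicted labels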
3. Installation of Necessary Libraries
You’ll want to install the Hugging Face Transformers library to leverage its models and tokenizers. The code below also relies on PyTorch and NumPy, so install them alongside it:
pip install transformers torch numpy
4. How to Implement the Model
Now we’ll dive into the code that will help us classify news articles. Think of the model implementation as setting up a high-tech coffee machine:
- You first need the right ingredients (libraries)
- Next, you set it up (initialize the tokenizer and model)
- Finally, you brew (perform the predictions)
4.1 Using PyTorch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the pre-trained tokenizer and classification model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("serdarakyol/interpress-turkish-news-classification")
model = AutoModelForSequenceClassification.from_pretrained("serdarakyol/interpress-turkish-news-classification")
Note how both the tokenizer and the model are initialized from the same pre-trained checkpoint.
4.2 GPU Support
If you have a GPU, we will take advantage of it to speed up our computations. The following code checks for CUDA availability:
import torch

# Use a GPU if one is available; otherwise fall back to the CPU
if torch.cuda.is_available():
    device = torch.device('cuda')
    model = model.cuda()  # move the model's weights onto the GPU
    print(f"There are {torch.cuda.device_count()} GPU(s) available.")
    print(f"GPU name is: {torch.cuda.get_device_name(0)}")
else:
    print("No GPU available, using the CPU instead.")
    device = torch.device('cpu')
4.3 Making Predictions
Now, let’s make our predictions using the following function:
import numpy as np

def prediction(news):
    # Wrap the single article in a list and tokenize it:
    # pad/truncate to 512 tokens and return PyTorch tensors
    news = [news]
    indices = tokenizer.batch_encode_plus(
        news,
        max_length=512,
        add_special_tokens=True,
        return_attention_mask=True,
        padding='max_length',
        truncation=True,
        return_tensors='pt'
    )
    inputs = indices['input_ids'].to(device)
    masks = indices['attention_mask'].to(device)
    # Run the model without tracking gradients (inference only)
    with torch.no_grad():
        output = model(inputs, token_type_ids=None, attention_mask=masks)
    logits = output[0].detach().cpu().numpy()
    # Return the index of the highest-scoring class
    pred = np.argmax(logits, axis=1)[0]
    return pred
5. Example of Using the Model
Here’s a sample news headline you might want to classify:
news = "Beyaz Saray Sözcüsü Psaki, Muhammed bin Selmana yaptırım uygulamamanın doğru karar olduğunu savundu."
labels = {
0: "Culture-Art",
1: "Economy",
2: "Politics",
3: "Education",
4: "World",
5: "Sport",
6: "Technology",
7: "Magazine",
8: "Health",
9: "Agenda"
}
pred = prediction(news)
print(labels[pred]) # Should output "World"
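If you also want a confidence score rather than just a label, one option is to apply a softmax to the logits. The helper below is a minimal sketch built on top of the tokenizer, model, and labels defined above; it is an addition for illustration, not part of the original model card, and the printed output is only indicative.
import torch.nn.functional as F

def prediction_with_confidence(text):
    # Same preprocessing as prediction(), using the tokenizer's __call__ shorthand
    encoded = tokenizer(text, max_length=512, padding='max_length',
                        truncation=True, return_tensors='pt')
    with torch.no_grad():
        logits = model(encoded['input_ids'].to(device),
                       attention_mask=encoded['attention_mask'].to(device)).logits
    probs = F.softmax(logits, dim=1)[0]  # convert logits to probabilities
    pred = int(torch.argmax(probs))
    return labels[pred], float(probs[pred])

print(prediction_with_confidence(news))  # e.g. ('World', 0.98) -- illustrative values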
6. Troubleshooting Tips
- If you encounter any installation issues, ensure that your Python environment is up-to-date and the required packages are compatible.
- For unexpected model behaviors or accuracy issues, reconsider the data split or check for imbalanced classes in your dataset.
- If you’re facing GPU compatibility issues, ensure that you have the appropriate drivers installed for your hardware; the quick check after this list can help.
- For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
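As a quick sanity check for the GPU setup, the snippet below prints the installed PyTorch version, the CUDA version it was built against, and whether a GPU is visible. If torch.cuda.is_available() returns False even though you have a GPU, the driver and the PyTorch CUDA build likely don't match.
import torch

print("PyTorch version:", torch.__version__)
print("Built against CUDA:", torch.version.cuda)  # None for CPU-only builds
print("GPU visible:", torch.cuda.is_available())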
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
7. Acknowledgments
Special thanks to @yavuzkomecoglu for their contributions that helped enhance this project.