In today’s digital age, the ability to effectively summarize information is crucial. Enter the RuBERT Tiny model, a cutting-edge tool designed for sentence compression, also known as extractive sentence summarization. In this blog, we’ll explore how to utilize this model for your own text compression needs, complete with troubleshooting tips for a smooth experience.
What is Sentence Compression?
Sentence compression aims to shorten a sentence while retaining its core meaning. For example, "The committee has, after much debate, finally approved the new budget" might compress to "committee approved budget". This is particularly valuable in applications like content summarization, chatbots, or simplifying complex texts. Keep in mind, however, that the output may be ungrammatical yet still useful.
Getting Started with RuBERT Tiny
To get started, you’ll need to install the required libraries and load the appropriate model. Here’s how the process works:
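If you don't already have the dependencies, the snippet below relies only on PyTorch and Hugging Face Transformers, which you can install with pip:

```shell
pip install torch transformers
```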
```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

model_name = 'cointegrated/rubert-tiny2-sentence-compression'
model = AutoModelForTokenClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

def compress(text, threshold=0.5, keep_ratio=None):
    """Compress a sentence by removing the least important words.

    Parameters:
    - threshold: cutoff for the predicted probability of word removal;
      words scoring above it are dropped (default 0.5)
    - keep_ratio: proportion of words to preserve; if set, it overrides
      the threshold
    """
    with torch.inference_mode():
        tok = tokenizer(text, return_tensors='pt').to(model.device)
        # proba[i] is the predicted probability that token i should be removed
        proba = torch.softmax(model(**tok).logits, -1).cpu().numpy()[0, :, 1]
    if keep_ratio is not None:
        # choose the cutoff so that roughly keep_ratio of the words fall below it
        threshold = sorted(proba)[int(len(proba) * keep_ratio)]
    kept_toks = []
    keep = False
    prev_word_id = None
    for word_id, score, token in zip(tok.word_ids(), proba, tok.input_ids[0]):
        if word_id is None:
            # special tokens ([CLS], [SEP]) are always kept
            keep = True
        elif word_id != prev_word_id:
            # decide once per word, based on its first subtoken's removal score
            keep = score < threshold
        if keep:
            kept_toks.append(token)
        prev_word_id = word_id
    return tokenizer.decode(kept_toks, skip_special_tokens=True)

# "Moreover, you can take an idea born from the heart and express it within
# one of these structures, without losing the idea's sincerity or the song's meaning."
text = "Кроме того, можно взять идею, рожденную из сердца, и выразить ее в рамках одной из этих структур, без потери искренности идеи и смысла песни."
print(compress(text))
print(compress(text, threshold=0.3))
print(compress(text, threshold=0.1))
print(compress(text, keep_ratio=0.5))
```
So, let's explain this code with an analogy: imagine you are a chef preparing a delicious dish. You have a recipe (the input sentence) with numerous ingredients (words), and your task is to serve it up as appetizingly as possible while retaining the essential flavors (meaning). The RuBERT Tiny model acts like a picky eater, scoring each ingredient by how safely it can be removed without ruining the dish. The threshold dictates how choosy you are: since words are dropped when their predicted removal probability exceeds it, a higher value keeps more ingredients, while a lower value trims the dish further. Alternatively, keep_ratio sidesteps the threshold entirely and preserves a fixed fraction of the recipe.
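To make the threshold and keep_ratio mechanics concrete, here is a minimal pure-Python sketch of the same selection logic, using made-up removal probabilities in place of real model scores (the words and numbers are hypothetical, chosen only for illustration):

```python
# Toy sentence and hypothetical per-word removal probabilities,
# standing in for the model's softmax scores.
words = ["so", "anyway", "we", "approve", "budget"]
removal_proba = [0.9, 0.8, 0.3, 0.2, 0.1]

def select(words, proba, threshold=0.5, keep_ratio=None):
    if keep_ratio is not None:
        # keep_ratio overrides threshold: pick the cutoff so that
        # roughly keep_ratio of the words fall below it and are kept
        threshold = sorted(proba)[int(len(proba) * keep_ratio)]
    # keep every word whose removal probability is below the cutoff
    return [w for w, p in zip(words, proba) if p < threshold]

print(select(words, removal_proba))                  # ['we', 'approve', 'budget']
print(select(words, removal_proba, keep_ratio=0.4))  # ['approve', 'budget']
```

Note how keep_ratio=0.4 is translated into a concrete threshold (here 0.3, the score at the 40th percentile), so exactly the two lowest-scoring words survive regardless of their absolute probabilities.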
Example Uses
Below are some practical examples of how to use this model in your code:
- Compressing a sentence with default settings.
- Tweaking the threshold for different levels of compression.
- Using the keep_ratio parameter to keep a specific fraction of the words.
Common Troubleshooting Tips
If you encounter any issues while using the RuBERT Tiny model, consider the following troubleshooting ideas:
- Problem: The model doesn’t load properly.
- Solution: Ensure you have the necessary libraries installed and updated to their latest versions.
- Problem: The output seems illogical or ungrammatical.
- Solution: Remember that the model prioritizes meaning retention over grammatical accuracy. Adjust the threshold or keep_ratio for better results.
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
Conclusion
With the RuBERT Tiny model, you can make your text succinct without losing its essence. By following the guidelines outlined in this article, you will be well-equipped to implement this powerful model into your text processing tasks.

