How to Use the Russian Texts Statistics Library (ruTS)

The Russian Texts Statistics library, or ruTS, is a powerful tool for extracting various statistics from Russian-language texts. Whether you’re a developer or a researcher, this library offers a structured way to analyze text data. This guide will walk you through the installation and basic usage of the library, complete with examples.

Installation

To get started with ruTS, you’ll need to install it. Open your terminal or command prompt and run the following command:

$ pip install ruts

Dependencies

To use ruTS, make sure the following dependencies are installed:

  • Python versions 3.8-3.10
  • nltk
  • pymorphy2
  • razdel
  • scipy
  • spaCy
  • numpy
  • pandas
  • matplotlib
  • graphviz
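
Some of the examples further down also rely on data that is not bundled with the packages themselves: the word extractor uses the NLTK Russian stopword list, and the spaCy integration needs a Russian language model. Both can be fetched once from the command line (the model name ru_core_news_sm is one common choice, not a ruTS requirement):

$ python -m nltk.downloader stopwords
$ python -m spacy download ru_core_news_sm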

Basic Functionality

The core functionality of ruTS builds on statistics adapted for the Russian language from the textacy library. You can work either directly with raw text or with prepared Doc objects from spaCy. The full set of available functions and classes is described in the project's API documentation.
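
The spaCy route attaches ruTS statistics to a Doc through pipeline components. The snippet below is only a minimal sketch of that workflow: the component name "basic" and the doc._.basic extension attribute are assumptions, as is the ru_core_news_sm model, and the exact names vary between ruTS and spaCy versions, so check the API documentation before relying on them.

import spacy
import ruts  # importing ruts registers its spaCy pipeline components

nlp = spacy.load("ru_core_news_sm")   # assumes a Russian model is installed
nlp.add_pipe("basic", last=True)      # assumed component name; see the API docs
doc = nlp("Существуют три вида лжи: ложь, наглая ложь и статистика")
print(doc._.basic.get_stats())        # assumed extension attribute with the basic counts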

Extracting Objects from Text

ruTS lets you configure extractors that pull sentences and words out of a text, which can then feed the statistical calculations. Here's an analogy for how extraction works:

Imagine you’re a librarian searching through a giant library filled with books. Each book is a text, and you want to find particular sentences (like titles) and specific words (like authors). With the extraction tools in ruTS, you’re effectively organizing the chaotic shelves into manageable sections that you can analyze.

Here’s an example of how to use the extractors:

import re

from nltk.corpus import stopwords
from ruts import SentsExtractor, WordsExtractor

text = "Не имей 100 рублей, а имей 100 друзей"

# Sentence extractor: split the text on ", " with a compiled regex
se = SentsExtractor(tokenizer=re.compile(r', '))
extracted_sentences = se.extract(text)
print(extracted_sentences)  # ('Не имей 100 рублей', 'а имей 100 друзей')

# Word extractor: lemmatize, drop Russian stopwords and numbers, build 1- and 2-grams
we = WordsExtractor(
    use_lexemes=True,
    stopwords=stopwords.words('russian'),  # requires the NLTK stopwords corpus
    filter_nums=True,
    ngram_range=(1, 2),
)
extracted_words = we.extract(text)
print(extracted_words)  # ('иметь', 'рубль', 'иметь', 'друг', ...)
print(we.get_most_common(3))  # [('иметь', 2), ('рубль', 1), ('друг', 1)]

Generating Basic Statistics

The library can extract various statistical indicators from the text, such as:

  • Number of sentences
  • Number of words
  • Count of unique words
  • And many more…

Here’s how to generate basic statistics:

from ruts import BasicStats

text = "Существуют три вида лжи: ложь, наглая ложь и статистика"
bs = BasicStats(text)
bs.get_stats()    # returns the computed counters
bs.print_stats()  # prints them as a formatted table
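
When the numbers are needed programmatically rather than as a printed table, the result of get_stats() can be captured and inspected. A small sketch; the exact key names in the returned mapping differ between ruTS versions, so they are printed here rather than hard-coded:

stats = bs.get_stats()
for name, value in stats.items():  # e.g. sentence, word and unique-word counts
    print(f"{name}: {value}")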

Readability Metrics

The library can compute metrics to evaluate the readability of a text. These include:

  • Flesch-Kincaid Test
  • Flesch Reading Ease Index
  • SMOG Index
  • And more…

Example usage for readability metrics:

from ruts import ReadabilityStats

text = "Ног нет, а хожу, рта нет, а скажу: когда спать, когда вставать, когда работу начинать"
rs = ReadabilityStats(text)
rs.get_stats()    # returns the computed readability metrics
rs.print_stats()  # prints them as a formatted table
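
Because get_stats() hands the metrics back as plain values, it is straightforward to compare several texts side by side. A brief sketch, with the same caveat that the metric names inside the returned mapping vary by version:

from ruts import ReadabilityStats

texts = {
    "riddle": "Ног нет, а хожу, рта нет, а скажу",
    "proverb": "Не имей 100 рублей, а имей 100 друзей",
}
for label, sample in texts.items():
    # One ReadabilityStats object per text; print its metrics for comparison
    print(label, ReadabilityStats(sample).get_stats())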

Available Datasets

ruTS allows you to work with preprocessed datasets, including:

  • sov_chrest_lit – Soviet literature textbooks
  • stalin_works – Complete works of I.V. Stalin

For instance, to access the Soviet literature dataset:

from ruts.datasets import SovChLit
sc = SovChLit()
print(sc.info)
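
Beyond the metadata in sc.info, the dataset object can also yield the texts themselves through get_texts(), the same method the visualization example below relies on:

# Print the beginning of the first two texts in the dataset
for sample in sc.get_texts(limit=2):
    print(sample[:80], "...")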

Visualization

The library supports text visualization through:

  • Zipf’s Law
  • Literary Fingerprinting
  • Word Trees

Example for Zipf’s Law visualization:

from collections import Counter

from ruts import WordsExtractor
from ruts.datasets import SovChLit
from ruts.visualizers import zipf

sc = SovChLit()
# Join the first 100 dataset texts into a single corpus string
text = "\n".join(sc.get_texts(limit=100))
we = WordsExtractor(use_lexemes=True)
# Count lemma frequencies and plot them against Zipf's law
tokens_with_count = Counter(we.extract(text))
zipf(tokens_with_count, num_words=100, num_labels=10)
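
If zipf draws with matplotlib, as the dependency list above suggests, the figure may need to be displayed explicitly when the code runs as a plain script rather than in a notebook. A hedged sketch under that assumption:

import matplotlib.pyplot as plt

plt.show()  # assumes zipf drew onto the current matplotlib figure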

Troubleshooting

If you encounter issues while using the ruTS library, consider the following troubleshooting tips:

  • Ensure all dependencies are installed and compatible with your Python version.
  • Check if you are using correctly formatted text inputs.
  • Refer to the API documentation for any missing functions or classes.
  • For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
