How to Use the Russian Texts Statistics Library (ruTS)

The Russian Texts Statistics library, or ruTS, is a powerful tool for extracting various statistics from Russian-language texts. Whether you’re a developer or a researcher, this library offers a structured way to analyze text data. This guide will walk you through the installation and basic usage of the library, complete with examples.

Installation

To get started with ruTS, you’ll need to install it. Open your terminal or command prompt and run the following command:

$ pip install ruts

Dependencies

To use ruTS, make sure the following dependencies are installed:

  • Python versions 3.8-3.10
  • nltk
  • pymorphy2
  • razdel
  • scipy
  • spaCy
  • numpy
  • pandas
  • matplotlib
  • graphviz
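
Some of the examples further down also rely on data that is not bundled with the packages themselves: the word extractor uses the NLTK Russian stopword list, and the spaCy integration needs a Russian language model. Both can be fetched once from the command line (the model name ru_core_news_sm is one common choice, not a ruTS requirement):

$ python -m nltk.downloader stopwords
$ python -m spacy download ru_core_news_sm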

Basic Functionality

The core functionality of ruTS builds on statistics adapted for the Russian language from the textacy library. You can work either directly with raw text or with prepared Doc objects from spaCy. The full set of available functions and classes is described in the project's API documentation.
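
The spaCy route attaches ruTS statistics to a Doc through pipeline components. The snippet below is only a minimal sketch of that workflow: the component name "basic" and the doc._.basic extension attribute are assumptions, as is the ru_core_news_sm model, and the exact names vary between ruTS and spaCy versions, so check the API documentation before relying on them.

import spacy
import ruts  # importing ruts registers its spaCy pipeline components

nlp = spacy.load("ru_core_news_sm")   # assumes a Russian model is installed
nlp.add_pipe("basic", last=True)      # assumed component name; see the API docs
doc = nlp("Существуют три вида лжи: ложь, наглая ложь и статистика")
print(doc._.basic.get_stats())        # assumed extension attribute with the basic counts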

Extracting Objects from Text

ruTS lets you configure extractors that pull sentences and words out of a text, which can then feed the statistical calculations. Here's an analogy for how extraction works:

Imagine you’re a librarian searching through a giant library filled with books. Each book is a text, and you want to find particular sentences (like titles) and specific words (like authors). With the extraction tools in ruTS, you’re effectively organizing the chaotic shelves into manageable sections that you can analyze.

Here’s an example of how to use the extractors:

import re

from nltk.corpus import stopwords
from ruts import SentsExtractor, WordsExtractor

text = "Не имей 100 рублей, а имей 100 друзей"

# Sentence extractor: split the text on ", " with a compiled regex
se = SentsExtractor(tokenizer=re.compile(r', '))
extracted_sentences = se.extract(text)
print(extracted_sentences)  # ('Не имей 100 рублей', 'а имей 100 друзей')

# Word extractor: lemmatize, drop Russian stopwords and numbers, build 1- and 2-grams
we = WordsExtractor(
    use_lexemes=True,
    stopwords=stopwords.words('russian'),  # requires the NLTK stopwords corpus
    filter_nums=True,
    ngram_range=(1, 2),
)
extracted_words = we.extract(text)
print(extracted_words)  # ('иметь', 'рубль', 'иметь', 'друг', ...)
print(we.get_most_common(3))  # [('иметь', 2), ('рубль', 1), ('друг', 1)]

Generating Basic Statistics

The library can extract various statistical indicators from the text, such as:

  • Number of sentences
  • Number of words
  • Count of unique words
  • And many more…

Here’s how to generate basic statistics:

from ruts import BasicStats

text = "Существуют три вида лжи: ложь, наглая ложь и статистика"
bs = BasicStats(text)
bs.get_stats()    # returns the computed counters
bs.print_stats()  # prints them as a formatted table
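
When the numbers are needed programmatically rather than as a printed table, the result of get_stats() can be captured and inspected. A small sketch; the exact key names in the returned mapping differ between ruTS versions, so they are printed here rather than hard-coded:

stats = bs.get_stats()
for name, value in stats.items():  # e.g. sentence, word and unique-word counts
    print(f"{name}: {value}")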

Readability Metrics

The library can compute metrics to evaluate the readability of a text. These include:

  • Flesch-Kincaid Test
  • Flesch Reading Ease Index
  • SMOG Index
  • And more…

Example usage for readability metrics:

from ruts import ReadabilityStats

text = "Ног нет, а хожу, рта нет, а скажу: когда спать, когда вставать, когда работу начинать"
rs = ReadabilityStats(text)
rs.get_stats()    # returns the computed readability metrics
rs.print_stats()  # prints them as a formatted table
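
Because get_stats() hands the metrics back as plain values, it is straightforward to compare several texts side by side. A brief sketch, with the same caveat that the metric names inside the returned mapping vary by version:

from ruts import ReadabilityStats

texts = {
    "riddle": "Ног нет, а хожу, рта нет, а скажу",
    "proverb": "Не имей 100 рублей, а имей 100 друзей",
}
for label, sample in texts.items():
    # One ReadabilityStats object per text; print its metrics for comparison
    print(label, ReadabilityStats(sample).get_stats())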

Available Datasets

ruTS allows you to work with preprocessed datasets, including:

  • sov_chrest_lit – Soviet literature textbooks
  • stalin_works – Complete works of I.V. Stalin

For instance, to access the Soviet literature dataset:

from ruts.datasets import SovChLit
sc = SovChLit()
print(sc.info)
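
Beyond the metadata in sc.info, the dataset object can also yield the texts themselves through get_texts(), the same method the visualization example below relies on:

# Print the beginning of the first two texts in the dataset
for sample in sc.get_texts(limit=2):
    print(sample[:80], "...")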

Visualization

The library supports text visualization through:

  • Zipf’s Law
  • Literary Fingerprinting
  • Word Trees

Example for Zipf’s Law visualization:

from collections import Counter

from ruts import WordsExtractor
from ruts.datasets import SovChLit
from ruts.visualizers import zipf

sc = SovChLit()
# Join the first 100 dataset texts into a single corpus string
text = "\n".join(sc.get_texts(limit=100))
we = WordsExtractor(use_lexemes=True)
# Count lemma frequencies and plot them against Zipf's law
tokens_with_count = Counter(we.extract(text))
zipf(tokens_with_count, num_words=100, num_labels=10)
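
If zipf draws with matplotlib, as the dependency list above suggests, the figure may need to be displayed explicitly when the code runs as a plain script rather than in a notebook. A hedged sketch under that assumption:

import matplotlib.pyplot as plt

plt.show()  # assumes zipf drew onto the current matplotlib figure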

Troubleshooting

If you encounter issues while using the ruTS library, consider the following troubleshooting tips:

  • Ensure all dependencies are installed and compatible with your Python version.
  • Check if you are using correctly formatted text inputs.
  • Refer to the API documentation for any missing functions or classes.
  • For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
