The Russian Texts Statistics library, or ruTS, is a Python library for extracting statistics from Russian-language texts, from basic counts to readability metrics. It also ships with prepared datasets and visualization helpers. This guide walks you through installation and basic usage, with examples.
Installation
To get started with ruTS, you’ll need to install it. Open your terminal or command prompt and run the following command:
$ pip install ruts
Dependencies
You must ensure you have the following dependencies:
- Python versions 3.8-3.10
- nltk
- pymorphy2
- razdel
- scipy
- spaCy
- numpy
- pandas
- matplotlib
- graphviz
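Before installing, it can help to confirm that your environment matches the list above. The snippet below is a minimal sketch, not part of ruTS: the helper names are hypothetical, and the import names are assumed to match the package names listed (spaCy is imported as spacy).

```python
import sys
from importlib.util import find_spec

# Import names assumed for the packages listed above (hypothetical helper, not part of ruTS).
REQUIRED_MODULES = [
    "nltk", "pymorphy2", "razdel", "scipy", "spacy",
    "numpy", "pandas", "matplotlib", "graphviz",
]


def python_version_supported(version=None):
    """Return True if the (major, minor) version is within 3.8-3.10."""
    major, minor = (version or sys.version_info)[:2]
    return (3, 8) <= (major, minor) <= (3, 10)


def find_missing(modules=REQUIRED_MODULES):
    """Return the module names that cannot be located in this environment."""
    return [name for name in modules if find_spec(name) is None]


if __name__ == "__main__":
    print("Python version supported:", python_version_supported())
    print("Missing modules:", find_missing() or "none")
```

An empty list of missing modules only means the packages can be located; it does not check their versions.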
Basic Functionality
The core functionality of ruTS is built on statistics from the textacy library, adapted for the Russian language. The library lets you work both directly with raw texts and with prepared Doc objects from spaCy. The API documentation lists all available functions.
Extracting Objects from Text
ruTS enables you to create custom tools for extracting sentences and words from the text, which can then be used for statistical calculations. Here’s an analogy to understand how extraction works:
Imagine you’re a librarian searching through a giant library filled with books. Each book is a text, and you want to find particular sentences (like titles) and specific words (like authors). With the extraction tools in ruTS, you’re effectively organizing the chaotic shelves into manageable sections that you can analyze.
Here’s an example of how to use the extractors:
import re
from nltk.corpus import stopwords
from ruts import SentsExtractor, WordsExtractor
# The NLTK stop-word list must be downloaded once: nltk.download('stopwords')
text = "Не имей 100 рублей, а имей 100 друзей"
# Sentence extractor: split the text on a comma followed by a space
se = SentsExtractor(tokenizer=re.compile(r', '))
extracted_sentences = se.extract(text)
print(extracted_sentences) # ('Не имей 100 рублей', 'а имей 100 друзей')
# Words extractor: lemmatize, drop stop words and numbers, build unigrams and bigrams
we = WordsExtractor(use_lexemes=True, stopwords=stopwords.words('russian'), filter_nums=True, ngram_range=(1, 2))
extracted_words = we.extract(text)
print(extracted_words) # ('иметь', 'рубль', 'иметь', 'друг', ...)
print(we.get_most_common(3)) # [('иметь', 2), ('рубль', 1), ('друг', 1)]
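To see what a setting like ngram_range=(1, 2) means in practice, here is a small standalone sketch that builds unigrams and bigrams from a list of lemmas. It does not use ruTS: build_ngrams is a hypothetical helper, the lemma list is written by hand to mirror the example above, and joining the parts of a bigram with an underscore is an assumption made for illustration.

```python
from collections import Counter


def build_ngrams(tokens, ngram_range=(1, 2), sep="_"):
    """Return all n-grams for every n in ngram_range, shortest first.

    Hypothetical helper: the separator and the ordering are illustrative
    choices, not a description of how ruTS works internally.
    """
    n_min, n_max = ngram_range
    result = []
    for n in range(n_min, n_max + 1):
        # Slide a window of width n over the token list
        for i in range(len(tokens) - n + 1):
            result.append(sep.join(tokens[i:i + n]))
    return tuple(result)


# Lemmas left after removing stop words and numbers (written by hand)
lemmas = ["иметь", "рубль", "иметь", "друг"]
ngrams = build_ngrams(lemmas)
print(ngrams)
print(Counter(ngrams).most_common(1))
```

Counting the resulting tuple with Counter is enough to get the most frequent items, which is the idea behind a "most common" query.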
Generating Basic Statistics
The library can extract various statistical indicators from the text, such as:
- Number of sentences
- Number of words
- Count of unique words
- And many more…
Here’s how to generate basic statistics:
from ruts import BasicStats
text = "Существуют три вида лжи: ложь, наглая ложь и статистика"
bs = BasicStats(text)
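stats = bs.get_stats() # the computed statistics as a dictionary
bs.print_stats() # the same statistics printed as a formatted table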
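To make the indicators listed above concrete, the following sketch counts sentences, words, and unique words with plain regular expressions. It is a simplified illustration and not the algorithm ruTS uses: simple_stats, its dictionary keys, and both patterns are assumptions chosen for this example.

```python
import re

# A word is a run of letters (digits and underscores are excluded)
WORD_RE = re.compile(r"[^\W\d_]+")
# A sentence ends at one or more terminal punctuation marks
SENT_SPLIT_RE = re.compile(r"[.!?…]+")


def simple_stats(text):
    """Count sentences, words and unique words (hypothetical helper)."""
    sentences = [s for s in SENT_SPLIT_RE.split(text) if s.strip()]
    words = [w.lower() for w in WORD_RE.findall(text)]
    return {
        "sentences": len(sentences),
        "words": len(words),
        "unique_words": len(set(words)),
    }


text = "Существуют три вида лжи: ложь, наглая ложь и статистика"
print(simple_stats(text))
# {'sentences': 1, 'words': 9, 'unique_words': 8}
```

Note that this sketch compares surface forms only, so "лжи" and "ложь" count as two different words; a lemmatizing extractor would treat them as one.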
Readability Metrics
The library can compute metrics to evaluate the readability of a text. These include:
- Flesch-Kincaid Test
- Flesch Reading Ease Index
- SMOG Index
- And more…
Example usage for readability metrics:
from ruts import ReadabilityStats
text = "Ног нет, а хожу, рта нет, а скажу: когда спать, когда вставать, когда работу начинать"
rs = ReadabilityStats(text)
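stats = rs.get_stats() # the computed readability metrics as a dictionary
rs.print_stats() # the same metrics printed as a formatted table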
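For intuition about what a readability metric measures, here is a rough sketch of the Flesch Reading Ease formula. It uses the coefficients of the original English formula, which is an assumption made for illustration; versions adapted for Russian use different coefficients, so the numbers will not match ruTS output. The helper names are hypothetical, and syllables are approximated by counting vowel letters.

```python
import re

VOWELS = set("аеёиоуыэюя")
WORD_RE = re.compile(r"[^\W\d_]+")
SENT_SPLIT_RE = re.compile(r"[.!?…]+")


def count_syllables(word):
    """Approximate the syllable count of a Russian word by its vowel letters."""
    return sum(ch in VOWELS for ch in word.lower())


def flesch_reading_ease(text):
    """Flesch Reading Ease with the original English coefficients (illustrative only)."""
    sentences = [s for s in SENT_SPLIT_RE.split(text) if s.strip()]
    words = WORD_RE.findall(text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    avg_sentence_length = len(words) / len(sentences)
    avg_syllables_per_word = sum(count_syllables(w) for w in words) / len(words)
    return 206.835 - 1.015 * avg_sentence_length - 84.6 * avg_syllables_per_word


print(count_syllables("статистика"))
print(flesch_reading_ease("Кот спит."))
```

Higher scores mean easier text: short sentences and short words both push the score up.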
Data Sets Availability
ruTS allows you to work with preprocessed datasets, including:
- sov_chrest_lit – Soviet literature textbooks
- stalin_works – Complete works of I.V. Stalin
For instance, to access the Soviet literature dataset:
from ruts.datasets import SovChLit
sc = SovChLit()
print(sc.info)
Visualization
The library supports text visualization through:
- Zipf’s Law
- Literary Fingerprinting
- Word Trees
Example for Zipf’s Law visualization:
from collections import Counter
from ruts import WordsExtractor
from ruts.datasets import SovChLit
from ruts.visualizers import zipf
sc = SovChLit()
text = "\n".join(sc.get_texts(limit=100))
we = WordsExtractor(use_lexemes=True)
tokens_with_count = Counter(we.extract(text))
zipf(tokens_with_count, num_words=100, num_labels=10)
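Zipf's law says that a word's frequency is roughly inversely proportional to its rank, so the product of rank and frequency stays roughly constant. The sketch below checks this on a toy token list without any plotting. It does not use ruTS, and rank_frequency is a hypothetical helper.

```python
from collections import Counter


def rank_frequency(counter, num_words=10):
    """Return (rank, token, frequency, rank * frequency) rows, most frequent first."""
    rows = []
    for rank, (token, freq) in enumerate(counter.most_common(num_words), start=1):
        rows.append((rank, token, freq, rank * freq))
    return rows


# Toy data: a real check needs a large corpus
tokens = "и в и не и в на и не в и".split()
for row in rank_frequency(Counter(tokens)):
    print(row)
```

On a large corpus, plotting rank against frequency on log-log axes should give an approximately straight line.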
Troubleshooting
If you encounter issues while using the ruTS library, consider the following troubleshooting tips:
- Ensure all dependencies are installed and compatible with your Python version.
- Check that your input is a Python string or a spaCy Doc object containing Russian text.
- Refer to the API documentation for any missing functions or classes.