A Python NLP Library for Persian
Features
Named Entity Recognition | Part of Speech Tagging | Dependency Parsing | Informal To Formal | Constituency Parsing | Chunking | Kasreh Ezafe Detection
Spellchecker | Normalizer | Tokenizer | Lemmatizer | Sentiment Analysis
What is DadmaTools?
DadmaTools is a comprehensive toolkit that makes Natural Language Processing (NLP) in Persian more accessible for practitioners. The library provides tools for key NLP tasks, with code examples that work seamlessly with popular frameworks such as spaCy and Hugging Face Transformers, and with deep learning frameworks such as PyTorch. It also ships loaders for common Persian datasets and word embeddings, making it a one-stop shop for Persian NLP.
Contents
- Installation
- NLP Models
- Normalizer
- Pipeline
- Loading Persian NLP Datasets
- Loading Persian Word Embeddings
- Evaluation
- How to use in Colab
Installation
To get started using DadmaTools in your Python project, you’ll first need to install it via pip. You can opt for the minimal installation to suit your needs.
Install with pip
Simply run the following command:
pip install dadmatools
By default, this installs useful NLP dependencies such as spaCy and supar. You can check the requirements.txt file to see all the package versions that have been tested.
Install from GitHub
If you want the latest version from GitHub, use:
pip install git+https://github.com/Dadmatech/dadmatools.git
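Either way, a quick way to confirm the install is to import the package from Python:

# if this runs without an ImportError, DadmaTools is installed correctly
import dadmatools
print(dadmatools.__name__)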
NLP Models
DadmaTools covers a range of NLP tasks, each of which can be run through a pipeline. Think of each task as a tour guide: if you only want to explore one part of the city (one NLP task), you can hire that guide alone without being burdened by the others (a minimal example follows the list below):
- Named Entity Recognition: ner
- Part of Speech Tagging: pos
- Dependency Parsing: dep
- Constituency Parsing: cons
- Chunking: chunk
- Kasreh Ezafe Detection: kasreh
- Spellchecker: spellchecker
- Lemmatizing: lem
- Tokenizing: tok
- Informal to Formal: itf
- Sentiment Analysis: sent
Note: The normalizer can also be used on its own, outside of the pipeline.
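For example, to run only named-entity recognition you can pass just that task name when building the pipeline. This is a minimal sketch using the same Pipeline API shown later in this post; it assumes tasks are given as a comma-separated string and that the tokenizer is loaded by default:

import dadmatools.pipeline.language as language

# request only the NER task; the tokenizer is loaded by default
nlp = language.Pipeline('ner')

doc = nlp('تهران پایتخت ایران است.')
print(doc)  # the annotated document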
Normalizer
The Normalizer is essential for cleaning and unifying text characters. The following is an example of how to use it:
from dadmatools.normalizer import Normalizer

normalizer = Normalizer(
    full_cleaning=False,
    unify_chars=True,
    refine_punc_spacing=True,
    remove_extra_space=True,
    remove_puncs=False,
    remove_html=False,
    remove_stop_word=False,
)

text = "دادماتولز اولین نسخش سال ۱۴۰۰ منتشر شده."
normalized_text = normalizer.normalize(text)
print(normalized_text)  # prints the cleaned, unified text
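If you simply want the most aggressive cleanup, the full_cleaning flag from the constructor above can be switched on by itself; a minimal sketch, assuming the flag enables the individual cleaning options in one go:

from dadmatools.normalizer import Normalizer

# full_cleaning=True applies the full set of cleaning steps at once (assumption based on the flag name)
full_normalizer = Normalizer(full_cleaning=True)
print(full_normalizer.normalize("دادماتولز اولین نسخش سال ۱۴۰۰ منتشر شده."))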
Pipeline
A pipeline helps to process multiple NLP tasks together. For example, if you want to tokenize, lemmatize, and perform POS tagging, you can do it in a single go:
import dadmatools.pipeline.language as language

# tasks are passed as a comma-separated string
pips = 'tok,lem,pos'
nlp = language.Pipeline(pips)

# doc is a spaCy Doc object
doc = nlp("کشور بزرگ ایران توانسته در طی سالها اغشار مختلفی از قومیتهای گوناگون رو به خوبی تو خودش جا بده.")
print(doc)
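Because doc behaves like a spaCy Doc, you can iterate over its tokens and read the standard spaCy attributes that the requested tasks fill in. A short sketch under that assumption (DadmaTools-specific annotations may live under different attribute names):

# print the token text, lemma and POS tag produced by the tok, lem and pos tasks
for token in doc:
    print(token.text, token.lemma_, token.pos_)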
Loading Persian NLP Datasets
DadmaTools facilitates easy loading of popular Persian NLP datasets. You can utilize these datasets like so:
from dadmatools.datasets import FarsTail
farstail = FarsTail()
print(len(farstail.train)) # Length of the training dataset
print(next(farstail.train)) # Example entry from the dataset
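To see which datasets are bundled with the library, the datasets package also exposes a listing helper. This sketch uses the get_all_datasets_info function named in the upstream README; treat the exact name as an assumption to verify against the docs:

from dadmatools.datasets import get_all_datasets_info

# print the names of all Persian datasets that DadmaTools can load
print(get_all_datasets_info().keys())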
Loading Persian Word Embeddings
DadmaTools supports various embeddings like GloVe and FastText. To start using them:
from dadmatools.embeddings import get_embedding
word_embedding = get_embedding("glove-wiki")
print(word_embedding['سلام']) # Get vector of the word "سلام"
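The embedding object also exposes gensim-style helper methods such as similarity and doesnt_match (as listed in the upstream README); a short sketch, with the method names taken on that assumption:

# cosine similarity between two words
print(word_embedding.similarity('ایران', 'آلمان'))

# find the word that does not belong with the others
print(word_embedding.doesnt_match("پاریس تهران لندن سیب".split()))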
Evaluation
DadmaTools has been benchmarked against comparable Persian NLP tools such as Stanza; the detailed evaluation tables are available in the project repository.
How to use in Colab
You can run the examples above and inspect their output in Google Colab.
Troubleshooting
If you encounter any issues, here are a few troubleshooting steps:
- Ensure that you have a compatible version of Python (preferably Python 3.6+).
- Check that all required libraries are properly installed by reviewing your requirements.txt.
- Make sure your Internet connection is stable when installing via pip.
- If using custom settings in Normalizer or pipelines, verify that the parameters provided are correct.
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Conclusion
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.