A Python NLP Library for Persian
Features
Named Entity Recognition | Part of Speech Tagging | Dependency Parsing | Informal To Formal | Constituency Parsing | Chunking | Kasreh Ezafe Detection
Spellchecker | Normalizer | Tokenizer | Lemmatizer | Sentiment Analysis
What is DadmaTools?
DadmaTools is a comprehensive toolkit that makes Natural Language Processing (NLP) in Persian more accessible for practitioners. The library provides tools for key NLP tasks, with code examples that work seamlessly with popular frameworks such as spaCy and Hugging Face Transformers, and with deep learning frameworks such as PyTorch. It also ships loaders for common Persian datasets and word embeddings, making it a one-stop shop for Persian NLP.
Contents
- Installation
- NLP Models
- Normalizer
- Pipeline
- Loading Persian NLP Datasets
- Loading Persian Word Embeddings
- Evaluation
- How to use in Colab
Installation
To get started using DadmaTools in your Python project, you’ll first need to install it via pip. You can opt for the minimal installation to suit your needs.
Install with pip
Simply run the following command:
pip install dadmatools
By default, this installs useful NLP dependencies such as spaCy and supar. You can check the requirements.txt file to see all the package versions that have been tested.
Install from GitHub
If you want the latest version from GitHub, use:
pip install git+https://github.com/Dadmatech/dadmatools.git
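Either way, a quick way to confirm the install is to import the package from Python:

# if this runs without an ImportError, DadmaTools is installed correctly
import dadmatools
print(dadmatools.__name__)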
NLP Models
DadmaTools covers a range of NLP tasks, each of which can be run through a pipeline. Think of each task as a tour guide: if you only want to explore one part of the city (one NLP task), you can hire that guide alone without being burdened by the others (a minimal example follows the list below):
- Named Entity Recognition: ner
- Part of Speech Tagging: pos
- Dependency Parsing: dep
- Constituency Parsing: cons
- Chunking: chunk
- Kasreh Ezafe Detection: kasreh
- Spellchecker: spellchecker
- Lemmatizing: lem
- Tokenizing: tok
- Informal to Formal: itf
- Sentiment Analysis: sent
Note: The normalizer can also be used on its own, outside of the pipeline.
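For example, to run only named-entity recognition you can pass just that task name when building the pipeline. This is a minimal sketch using the same Pipeline API shown later in this post; it assumes tasks are given as a comma-separated string and that the tokenizer is loaded by default:

import dadmatools.pipeline.language as language

# request only the NER task; the tokenizer is loaded by default
nlp = language.Pipeline('ner')

doc = nlp('تهران پایتخت ایران است.')
print(doc)  # the annotated document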
Normalizer
The Normalizer is essential for cleaning and unifying text characters. The following is an example of how to use it:
from dadmatools.normalizer import Normalizer

normalizer = Normalizer(
    full_cleaning=False,
    unify_chars=True,
    refine_punc_spacing=True,
    remove_extra_space=True,
    remove_puncs=False,
    remove_html=False,
    remove_stop_word=False,
)

text = "دادماتولز اولین نسخش سال ۱۴۰۰ منتشر شده."
normalized_text = normalizer.normalize(text)
print(normalized_text)  # prints the cleaned, unified text
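If you simply want the most aggressive cleanup, the full_cleaning flag from the constructor above can be switched on by itself; a minimal sketch, assuming the flag enables the individual cleaning options in one go:

from dadmatools.normalizer import Normalizer

# full_cleaning=True applies the full set of cleaning steps at once (assumption based on the flag name)
full_normalizer = Normalizer(full_cleaning=True)
print(full_normalizer.normalize("دادماتولز اولین نسخش سال ۱۴۰۰ منتشر شده."))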
Pipeline
A pipeline helps to process multiple NLP tasks together. For example, if you want to tokenize, lemmatize, and perform POS tagging, you can do it in a single go:
import dadmatools.pipeline.language as language

# tasks are passed as a comma-separated string
pips = 'tok,lem,pos'
nlp = language.Pipeline(pips)

# doc is a spaCy Doc object
doc = nlp("کشور بزرگ ایران توانسته در طی سالها اغشار مختلفی از قومیتهای گوناگون رو به خوبی تو خودش جا بده.")
print(doc)
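Because doc behaves like a spaCy Doc, you can iterate over its tokens and read the standard spaCy attributes that the requested tasks fill in. A short sketch under that assumption (DadmaTools-specific annotations may live under different attribute names):

# print the token text, lemma and POS tag produced by the tok, lem and pos tasks
for token in doc:
    print(token.text, token.lemma_, token.pos_)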
Loading Persian NLP Datasets
DadmaTools facilitates easy loading of popular Persian NLP datasets. You can utilize these datasets like so:
from dadmatools.datasets import FarsTail
farstail = FarsTail()
print(len(farstail.train)) # Length of the training dataset
print(next(farstail.train)) # Example entry from the dataset
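To see which datasets are bundled with the library, the datasets package also exposes a listing helper. This sketch uses the get_all_datasets_info function named in the upstream README; treat the exact name as an assumption to verify against the docs:

from dadmatools.datasets import get_all_datasets_info

# print the names of all Persian datasets that DadmaTools can load
print(get_all_datasets_info().keys())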
Loading Persian Word Embeddings
DadmaTools supports various embeddings like GloVe and FastText. To start using them:
from dadmatools.embeddings import get_embedding
word_embedding = get_embedding("glove-wiki")
print(word_embedding['سلام']) # Get vector of the word "سلام"
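The embedding object also exposes gensim-style helper methods such as similarity and doesnt_match (as listed in the upstream README); a short sketch, with the method names taken on that assumption:

# cosine similarity between two words
print(word_embedding.similarity('ایران', 'آلمان'))

# find the word that does not belong with the others
print(word_embedding.doesnt_match("پاریس تهران لندن سیب".split()))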
Evaluation
DadmaTools has been benchmarked against comparable Persian NLP tools such as Stanza; the detailed evaluation tables are available in the project repository.
How to use in Colab
You can run the examples above and inspect their output in Google Colab.
Troubleshooting
If you encounter any issues, here are a few troubleshooting steps:
- Ensure that you have a compatible version of Python (preferably Python 3.6+).
- Check that all required libraries are properly installed by reviewing your requirements.txt.
- Make sure your Internet connection is stable when installing via pip.
- If using custom settings in Normalizer or pipelines, verify that the parameters provided are correct.
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Conclusion
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.