Desbordante is a high-performance data profiler that discovers and validates complex patterns in data. Whether you work in science, business, or machine learning, it can change how you explore and check your datasets. This post walks you through installing, using, and troubleshooting the tool.
What is Desbordante?
Desbordante is designed to help you define and discover different patterns in your datasets through two primary tasks:
- Discovery: This task identifies instances of specific patterns within a dataset.
- Validation: Unlike discovery, this task checks whether a specified pattern instance holds on a dataset and, when it does not, gives detailed feedback on the records that conflict with it.
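To make the validation task concrete, here is a minimal pure-Python sketch of checking one kind of pattern, a functional dependency, and reporting the conflicting rows. It does not use Desbordante; the function name `check_fd` and the sample rows are hypothetical and only illustrate the idea.

```python
from collections import defaultdict


def check_fd(rows, lhs, rhs):
    """Return (holds, conflicts) for the candidate dependency lhs -> rhs.

    rows: list of dicts; lhs: tuple of column names; rhs: a column name.
    conflicts maps each offending lhs value to the indices of the rows involved.
    """
    # Group row indices by their values in the lhs columns.
    groups = defaultdict(list)
    for index, row in enumerate(rows):
        key = tuple(row[col] for col in lhs)
        groups[key].append(index)

    # A group conflicts when it contains more than one distinct rhs value.
    conflicts = {}
    for key, indices in groups.items():
        rhs_values = {rows[i][rhs] for i in indices}
        if len(rhs_values) > 1:
            conflicts[key] = indices
    return (not conflicts, conflicts)


# Hypothetical sample data.
rows = [
    {"course": "Math", "teacher": "Smith", "room": "101"},
    {"course": "Math", "teacher": "Smith", "room": "102"},
    {"course": "Art", "teacher": "Jones", "room": "201"},
]

print(check_fd(rows, ("course",), "teacher"))
print(check_fd(rows, ("course",), "room"))
```

In this sketch, `course -> teacher` holds, while `course -> room` fails and the conflicting row indices are returned, which mirrors the kind of feedback a validation task is meant to provide.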
Additionally, the tool features dynamic algorithms: after a result has been found, the table can change and the algorithm updates the result from those changes alone instead of reprocessing the whole table. This can be considerably faster than rerunning a static algorithm, saving time and computational resources.
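The idea behind updating a result from changes alone can be sketched with a toy, insert-only checker for a single functional dependency. This is an assumption-laden illustration, not Desbordante's dynamic algorithm; the class name `IncrementalFDChecker` and the sample rows are hypothetical.

```python
class IncrementalFDChecker:
    """Tracks whether lhs -> rhs holds as rows are appended (insert-only toy)."""

    def __init__(self, lhs, rhs):
        self.lhs = lhs
        self.rhs = rhs
        self.seen = {}        # lhs value -> first rhs value observed for it
        self.violations = 0   # rows that contradicted an earlier row

    def add_row(self, row):
        """Process one new row without rescanning earlier rows."""
        key = tuple(row[col] for col in self.lhs)
        value = row[self.rhs]
        # setdefault stores the value for a new key, otherwise returns the old one.
        if self.seen.setdefault(key, value) != value:
            self.violations += 1
        return self.holds

    @property
    def holds(self):
        return self.violations == 0


checker = IncrementalFDChecker(("zip",), "city")
checker.add_row({"zip": "10001", "city": "New York"})
checker.add_row({"zip": "94105", "city": "San Francisco"})
print(checker.holds)
checker.add_row({"zip": "10001", "city": "Newark"})
print(checker.holds)
```

Each call to `add_row` costs one dictionary lookup regardless of table size, whereas a static check would rescan every row after each change.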
Installation Guide
To start using Desbordante, follow these installation steps.
Step 1: Prerequisites
- Python version 3.7 or higher.
- GNU GCC (version 10 or above), CMake (version 3.13 or above), and the Boost library (version 1.81.0 or above). These are typically needed only if you build Desbordante from source.
Step 2: Installation
Run the following command to install Desbordante:
pip install desbordante
Note: Desbordante has a C++ core. If the pip installation fails on your platform, consider building the package from source by following the build instructions in the project's repository.
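After installing, a quick sanity check can confirm that the interpreter meets the Python requirement listed above and that the package is importable. This is a minimal sketch using only the standard library; the helper name `environment_report` is hypothetical.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 7)  # minimum version stated in the prerequisites


def environment_report():
    """Return a dict describing whether the basic requirements are met."""
    return {
        "python_ok": sys.version_info >= MIN_PYTHON,
        # find_spec returns None when the top-level package cannot be found.
        "desbordante_installed": importlib.util.find_spec("desbordante") is not None,
    }


report = environment_report()
print(report)
```

If `desbordante_installed` is reported as `False`, the pip install most likely ran against a different interpreter or virtual environment than the one running this check.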
Using Desbordante
Desbordante can be accessed via three main interfaces:
- Console application: Here you can run command-line queries for simple pattern discovery and validation.
- Python bindings: Directly run Desbordante within Python programs, which allows for preprocessing data using popular libraries like pandas.
- Web application: An interactive web interface designed for data profiling with a focus on discovery and validation tasks.
Getting Started with Code
Let’s imagine you are a librarian, and each book in your library represents a pattern in a dataset. The Discovery feature is akin to helping you find specific types of books (say mystery novels or history books) quickly. Validation, on the other hand, focuses on verifying if a particular book (or pattern instance) can be found and what the reasons might be if it’s not present (like it’s checked out or misplaced).
Here’s how you can discover exact functional dependencies in your dataset using Python:
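import desbordante

# Path to your CSV file.
TABLE = 'path_to_your_table.csv'

# Default algorithm for exact functional dependency discovery.
algo = desbordante.fd.algorithms.Default()
# The tuple is (path, separator, has_header).
algo.load_data(table=(TABLE, ',', True))
algo.execute()
result = algo.get_fds()
print("FDs:")
for fd in result:
    print(fd)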
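To show what an exact functional dependency means in the output, the following pure-Python sketch brute-forces dependencies between single columns on a tiny in-memory table. It is only an illustration under simplifying assumptions: real discovery algorithms, including those in Desbordante, also handle multi-column left-hand sides and scale far better. The function names and sample rows are hypothetical.

```python
from itertools import permutations


def fd_holds(rows, lhs, rhs):
    """True if every value in column lhs maps to exactly one value in rhs."""
    mapping = {}
    for row in rows:
        if mapping.setdefault(row[lhs], row[rhs]) != row[rhs]:
            return False
    return True


def discover_single_column_fds(rows):
    """Return all (lhs, rhs) column pairs for which lhs -> rhs holds exactly."""
    columns = list(rows[0].keys())
    return [(a, b) for a, b in permutations(columns, 2) if fd_holds(rows, a, b)]


# Hypothetical sample table.
rows = [
    {"id": 1, "dept": "CS", "building": "A"},
    {"id": 2, "dept": "CS", "building": "A"},
    {"id": 3, "dept": "Math", "building": "B"},
]

fds = discover_single_column_fds(rows)
for lhs, rhs in fds:
    print(f"[{lhs}] -> {rhs}")
```

Here `dept -> building` is reported because each department appears with only one building, while `dept -> id` is not, since the "CS" department appears with two different ids.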
Troubleshooting Common Issues
Even with great tools, challenges can arise. Here are some common issues and solutions:
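- Error downloading datasets: If you encounter a “Smudge error” while cloning the repo, set the following environment variable, which tells Git LFS to skip downloading large files during checkout, and then retry:
export GIT_LFS_SKIP_SMUDGE=1
- No type hints in your IDE: If your editor does not show type hints for Desbordante, install the separately distributed stubs package:
pip install desbordante-stubs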
Conclusion
Understanding Desbordante might take some initial effort, especially when navigating through its various patterns and features. But once you grasp the concepts, it becomes a powerful ally in data profiling across different domains.

