DataQA is a powerful tool for labeling and exploring unstructured documents effectively using rules-based weak supervision.
DataQA covers several document processing needs: searching your documents with Elasticsearch, classifying your data, and extracting entities. Its rules-based approach is intended to reduce the number of labels you need to create by hand. It can be installed with a single pip command.
Installation
Pre-requisites:
- Python 3.6, 3.7, 3.8, or 3.9
- Recommended to start in a new Python virtual environment
- Update your pip using
pip install -U pip
- Tested on: macOS and Ubuntu, with the Chrome and Firefox browsers
Installing from PyPI:
pip install dataqa
To run with Docker:
docker run -d -p 5000:5000 dataqa/dataqa
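The docker run command creates a new container each time it is used. To keep data between runs, stop and restart the same container with docker stop [container-id] and docker start [container-id] instead.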
Usage
To initiate the tool, open your terminal and type the following command:
dataqa run
This launches a local server and opens a browser window at localhost:5000. The server may take a few moments to start. If the browser does not open automatically, navigate to localhost:5000 manually.
Keep the terminal open while you work. To quit the application, press Ctrl-C. To resume it later, type dataqa run again. Running the application creates a folder at $HOME/.dataqa_data.
Uploading Data:
Your data must be a UTF-8 encoded CSV file no larger than 30MB, with a column named text containing the primary content. Other columns are ignored. Uploading the file triggers an analysis step that can take up to 5 minutes.
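The upload requirements can be checked locally before uploading. The following is a minimal sketch and not part of DataQA: the function name check_upload_file and the constant MAX_BYTES are hypothetical, and reading "30MB" as 30 * 1024 * 1024 bytes is an assumption.

```python
import csv
import os

MAX_BYTES = 30 * 1024 * 1024  # assumption: "30MB" interpreted as 30 MiB


def check_upload_file(path):
    """Return a list of problems; an empty list means the file looks uploadable."""
    problems = []
    if os.path.getsize(path) > MAX_BYTES:
        problems.append("file is larger than 30MB")
    try:
        # Strict UTF-8 decoding raises UnicodeDecodeError on invalid bytes.
        with open(path, newline="", encoding="utf-8") as handle:
            reader = csv.reader(handle)
            header = next(reader, [])
            for _ in reader:  # read every row so bad bytes anywhere are detected
                pass
    except UnicodeDecodeError:
        problems.append("file is not valid UTF-8")
        return problems
    if "text" not in header:
        problems.append("no column named 'text'")
    return problems
```

Calling check_upload_file("my_data.csv") returns an empty list when none of the three checks fails. The sketch does not handle a UTF-8 byte order mark, which would become part of the first column name.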
Uninstallation:
To uninstall DataQA, simply type:
dataqa uninstall
This command removes the local application data in the $HOME/.dataqa_data directory and prompts for confirmation before deleting anything. To also remove the package, use:
pip uninstall dataqa
What is weak supervision and why does it work?
Weak supervision is a collection of strategies for creating noisy labels from large amounts of data. It has grown in popularity because machine learning systems often require massive labeled datasets. Think of it like preparing a meal from various ingredients: some might not taste perfect on their own, but a chef knows how to balance them to achieve the best outcome. Similarly, the rules encoded by the annotator (the chef in this analogy) produce noisy signals, and the algorithm learns how to weigh those signals to extract meaningful patterns.
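To make the idea concrete, here is a minimal sketch of rules producing noisy labels that are then combined. It does not show how DataQA itself weighs rules: a plain majority vote stands in for that step, and the rule functions, label names, and weak_label helper are all hypothetical.

```python
from collections import Counter

ABSTAIN = None  # a rule returns this when it has no opinion


def rule_refund(text):
    return "billing" if "refund" in text.lower() else ABSTAIN


def rule_invoice(text):
    return "billing" if "invoice" in text.lower() else ABSTAIN


def rule_crash(text):
    return "bug" if "crash" in text.lower() else ABSTAIN


RULES = [rule_refund, rule_invoice, rule_crash]


def weak_label(text, rules=RULES):
    """Combine rule votes by simple majority; return ABSTAIN if no rule fires."""
    votes = [label for label in (rule(text) for rule in rules) if label is not ABSTAIN]
    if not votes:
        return ABSTAIN
    # Ties are broken arbitrarily in this sketch.
    return Counter(votes).most_common(1)[0][0]
```

Each rule is cheap to write and individually unreliable. A real weak supervision system replaces the majority vote with a model that estimates how much each rule should be trusted.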
Documentation
For detailed documentation on how to use DataQA for different tasks, see the DataQA Documentation.
It covers multi-class classification, named entity recognition, and named entity linking.
Troubleshooting
If your project data doesn't load, go to the homepage at localhost:5000 and navigate to your project from there. Running dataqa test can also provide more details about any errors you encounter. Feedback and bug reports are very welcome!
To test the application, you can upload a file that includes a column named LABEL containing the ground-truth labels. These labels are then shown during the labeling process, and the real performance metrics appear in brackets in the performance table.
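A test file of this kind can be produced with the standard csv module. This is a sketch under assumptions: the helper write_labelled_csv is hypothetical, and the label column name is taken from the paragraph above and left configurable in case your DataQA version expects a different spelling.

```python
import csv


def write_labelled_csv(path, rows, label_column="LABEL"):
    """Write (text, label) pairs to a UTF-8 CSV with 'text' and label columns."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["text", label_column])  # header row
        writer.writerows(rows)  # one row per (text, label) pair


# Hypothetical example data, for illustration only.
write_labelled_csv(
    "labelled_sample.csv",
    [("I want a refund", "billing"), ("The app crashed", "bug")],
)
```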
