Closed Domain Question Answering (cdQA) is a system for answering questions within a specific context or domain. Built on the HuggingFace transformers library, cdQA lets users train models to extract relevant information from their own documents. This guide covers installation, data preparation, model training, prediction, evaluation, deployment, and troubleshooting for cdQA.
Installation
With pip
Install cdQA with pip:
pip install cdqa
From source
To install from source, clone the repository and install it in editable mode:
git clone https://github.com/cdqa-suite/cdQA.git
cd cdQA
pip install -e .
Hardware Requirements
You can run cdQA on:
- CPU: AWS EC2 t2.medium Deep Learning AMI (Ubuntu) Version 22.0
- GPU: AWS EC2 p3.2xlarge Deep Learning AMI (Ubuntu) Version 22.0 + a single Tesla V100 16GB
Getting Started
Preparing Your Data
Manual
To use cdQA, create a pandas DataFrame with two columns, title and paragraphs. Each row holds one article, and paragraphs is a list of that article's paragraphs:
title              paragraphs
-----------------  ------------------------------------------------------
The Article Title  [Paragraph 1 of Article, ... , Paragraph N of Article]
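As a minimal sketch of the layout above, assuming pandas is installed: the article titles and paragraph text below are made-up placeholders, and the only names taken from this guide are the column names title and paragraphs.

```python
import pandas as pd

# Hypothetical corpus: one row per article, with the article's
# paragraphs stored as a Python list of strings.
articles = [
    {
        "title": "The Article Title",
        "paragraphs": ["Paragraph 1 of Article", "Paragraph 2 of Article"],
    },
    {
        "title": "Another Article Title",
        "paragraphs": ["Only paragraph of this article"],
    },
]

df = pd.DataFrame(articles, columns=["title", "paragraphs"])

# Each cell in the paragraphs column holds a whole list, not a single string.
print(df.shape)
print(type(df.loc[0, "paragraphs"]).__name__)
```

Keeping one row per article, rather than one row per paragraph, matches the table shown above.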
With converters
Use converters to create the DataFrame automatically from your documents. For example, with the pdf_converter:
from cdqa.utils.converters import pdf_converter
df = pdf_converter(directory_path='path_to_pdf_folder')
Ensure you have Java OpenJDK installed to use converters.
Downloading Pre-trained Models and Data
Download models and datasets using:
from cdqa.utils.download import download_squad, download_model, download_bnpp_data
directory = 'path-to-directory'
download_squad(dir=directory)
download_bnpp_data(dir=directory)
download_model('bert-squad_1.1', dir=directory)
Training Models
Load your corpus and a pre-trained reader, then fit the retriever on the corpus:
import pandas as pd
from ast import literal_eval
from cdqa.pipeline import QAPipeline
df = pd.read_csv('your-custom-corpus-here.csv', converters={'paragraphs': literal_eval})
cdqa_pipeline = QAPipeline(reader='bert_qa.joblib')
cdqa_pipeline.fit_retriever(df=df)
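The converters={'paragraphs': literal_eval} argument in the read_csv call is needed because a CSV file can only store each paragraphs list as text. The standard-library sketch below illustrates that round trip without pandas. The sample row is a made-up placeholder, and an in-memory buffer stands in for your-custom-corpus-here.csv.

```python
import csv
import io
from ast import literal_eval

# Hypothetical row in the title/paragraphs layout.
row = {
    "title": "The Article Title",
    "paragraphs": ["Paragraph 1 of Article", "Paragraph 2 of Article"],
}

# Writing to CSV turns the list into its string representation.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "paragraphs"])
writer.writeheader()
writer.writerow(row)

# Reading it back yields plain strings for every column.
buffer.seek(0)
raw = next(csv.DictReader(buffer))
print(type(raw["paragraphs"]).__name__)

# literal_eval parses that string back into a real list.
restored = literal_eval(raw["paragraphs"])
print(restored == row["paragraphs"])
```

Without the converter, the paragraphs column would hold strings that merely look like lists.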
Making Predictions
To make a prediction:
cdqa_pipeline.predict(query='your question')
Evaluating Models
Evaluate your models with the following steps:
- Convert your DataFrame to a SQuAD-format JSON file:
from cdqa.utils.converters import df2squad
json_data = df2squad(df=df, squad_version='v1.1', output_dir='.', filename='dataset-name')
- Annotate the file to add ground-truth question-answer pairs.
- Evaluate the pipeline on the annotated file:
from cdqa.utils.evaluation import evaluate_pipeline
evaluate_pipeline(cdqa_pipeline, 'path-to-annotated-dataset.json')
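For the annotation step, the sketch below shows what one annotated entry can look like, assuming the file follows the public SQuAD v1.1 layout (data, paragraphs, context, qas, answers, answer_start). It does not reproduce the exact output of df2squad. The context, question, id, and the helper name add_answer are hypothetical.

```python
import json


def add_answer(paragraph, question_id, question, answer_text):
    """Attach one question-answer pair to a SQuAD-style paragraph dict.

    answer_start is the character offset of the answer inside the context,
    so the answer text must appear verbatim in the context.
    """
    start = paragraph["context"].find(answer_text)
    if start == -1:
        raise ValueError("answer text not found in context")
    paragraph.setdefault("qas", []).append({
        "id": question_id,
        "question": question,
        "answers": [{"text": answer_text, "answer_start": start}],
    })
    return paragraph


# Hypothetical, un-annotated entry in the SQuAD v1.1 layout.
dataset = {
    "version": "v1.1",
    "data": [{
        "title": "The Article Title",
        "paragraphs": [{
            "context": "cdQA answers questions within a specific domain.",
            "qas": [],
        }],
    }],
}

paragraph = dataset["data"][0]["paragraphs"][0]
add_answer(paragraph, "q1", "What does cdQA answer?", "questions")

print(json.dumps(dataset, indent=2))
```

Computing answer_start from the context, instead of typing it by hand, avoids offsets that point at the wrong span.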
Deployment
Deploy a cdQA REST API by executing:
export dataset_path='path-to-dataset.csv'
export reader_path='path-to-reader-model'
FLASK_APP=api.py flask run -h 0.0.0.0
You can then make queries using HTTPie:
http localhost:5000/api query=='your question here'
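The same request can be sent from Python. In HTTPie, == marks a URL query-string parameter, so the command above is a GET request with query in the query string. The sketch below uses only the standard library. The helper names build_url and ask are hypothetical, the host and port are copied from the command above, and the response is assumed to be JSON, since this guide does not specify its fields.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_url(question, base="http://localhost:5000/api"):
    """Return the API URL with the question encoded as the `query` parameter."""
    return f"{base}?{urlencode({'query': question})}"


def ask(question, base="http://localhost:5000/api", timeout=30):
    """Send the question to a running API and return the decoded JSON body.

    Assumes the server responds with JSON; adjust if yours does not.
    """
    with urlopen(build_url(question, base), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the URL needs no running server.
print(build_url("your question here"))
# → http://localhost:5000/api?query=your+question+here
```

With the Flask server from the previous step running, ask('your question here') sends the request and returns the decoded response.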
Troubleshooting
If you encounter issues during installation or usage, here are some pointers:
- Check if all paths are correctly set, especially for data and models.
- Ensure all required dependencies are installed, including Java OpenJDK for converters.
- Verify that the DataFrame for your data is correctly formatted as specified.
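The checks above can be sketched as a few small helpers. This is an illustrative sketch, not part of cdQA: all function names are hypothetical, and the format check works on plain row dictionaries, which a DataFrame can provide through df.to_dict("records").

```python
import shutil
from pathlib import Path


def missing_paths(paths):
    """Return the entries of `paths` that do not exist on disk."""
    return [str(p) for p in paths if not Path(p).exists()]


def java_available():
    """Return True if a `java` executable is found on PATH."""
    return shutil.which("java") is not None


def corpus_problems(rows):
    """Return a list of human-readable problems found in corpus rows.

    Each row is expected to be a mapping with a string `title` and a
    `paragraphs` list of strings.
    """
    problems = []
    for i, row in enumerate(rows):
        if not isinstance(row.get("title"), str):
            problems.append(f"row {i}: title is missing or not a string")
        paragraphs = row.get("paragraphs")
        if not isinstance(paragraphs, list):
            problems.append(f"row {i}: paragraphs is not a list")
        elif not all(isinstance(p, str) for p in paragraphs):
            problems.append(f"row {i}: paragraphs contains non-string items")
    return problems


# Hypothetical rows: the second one stores paragraphs as a single string,
# which is what happens when a CSV is read without literal_eval.
rows = [
    {"title": "The Article Title", "paragraphs": ["Paragraph 1", "Paragraph 2"]},
    {"title": "Broken Article", "paragraphs": "['Paragraph 1']"},
]
print(corpus_problems(rows))
print(missing_paths(["path-to-dataset.csv"]))
print("java found:", java_available())
```

Running the checks before fitting the pipeline surfaces a bad path or a malformed paragraphs column early.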