How to Utilize cdQA: A Guide to Closed Domain Question Answering

Aug 20, 2020 | Data Science


Closed Domain Question Answering (cdQA) is an end-to-end system for answering questions over a specific corpus or domain. It is built on top of the HuggingFace transformers library and pairs a document retriever with a reader model that extracts answers from the retrieved paragraphs. This guide covers installation, data preparation, fitting and evaluating the pipeline, deployment, and troubleshooting.

Installation
Getting Started
Deployment
Troubleshooting

Installation

With pip

Install cdQA from PyPI with pip:

pip install cdqa

From source

To install from source, clone the repository and install it in editable mode:

git clone https://github.com/cdqa-suite/cdQA.git
cd cdQA
pip install -e .

Hardware Requirements

cdQA runs on both CPU and GPU. The project reports experiments on the following AWS configurations:

CPU: AWS EC2 t2.medium Deep Learning AMI (Ubuntu) Version 22.0
GPU: AWS EC2 p3.2xlarge Deep Learning AMI (Ubuntu) Version 22.0 + a single Tesla V100 16GB

Getting Started

Preparing Your Data

Manual

To utilize cdQA, create a pandas DataFrame with the required columns:

title              paragraphs
-----------------  ------------------------------------------------------
The Article Title  [Paragraph 1 of Article, ... , Paragraph N of Article]
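
As a minimal sketch of the manual route, the snippet below splits raw article text into paragraphs and collects rows with the two expected keys. The helper names (split_paragraphs, build_rows) and the blank-line paragraph rule are assumptions made for illustration; they are not part of cdQA.

```python
import re

def split_paragraphs(text):
    """Split raw text into paragraphs on blank lines, dropping empty pieces."""
    parts = re.split(r"\n\s*\n", text)
    return [p.strip() for p in parts if p.strip()]

def build_rows(documents):
    """Turn a {title: raw_text} mapping into rows with 'title' and 'paragraphs'."""
    return [
        {"title": title, "paragraphs": split_paragraphs(text)}
        for title, text in documents.items()
    ]

# Hypothetical sample corpus, for illustration only.
documents = {
    "The Article Title": "Paragraph 1 of Article.\n\nParagraph 2 of Article.",
}
rows = build_rows(documents)
print(rows)
```

Passing rows to pandas.DataFrame(rows) should then give a DataFrame with the title and paragraphs columns described above.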

With converters

Converters build the DataFrame automatically from your documents. For example, pdf_converter reads every PDF in a folder:

from cdqa.utils.converters import pdf_converter
# Replace the placeholder with the folder that holds your PDF files.
df = pdf_converter(directory_path='path_to_pdf_folder')

The PDF converter requires Java OpenJDK, so install it before running this step.

Downloading Pre-trained Models and Data

Download models and datasets using:

from cdqa.utils.download import download_squad, download_model, download_bnpp_data
# Replace the placeholder with the folder where files should be saved.
directory = 'path-to-directory'
download_squad(dir=directory)
download_bnpp_data(dir=directory)
download_model(model='bert-squad_1.1', dir=directory)

Training Models

Load a pre-trained reader into the pipeline and fit the retriever on your corpus:

import pandas as pd
from ast import literal_eval
from cdqa.pipeline import QAPipeline

df = pd.read_csv('your-custom-corpus-here.csv', converters={'paragraphs': literal_eval})
cdqa_pipeline = QAPipeline(reader='bert_qa.joblib')
cdqa_pipeline.fit_retriever(df=df)
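
The literal_eval converter is needed because a CSV cell can only hold text, so a list of paragraphs is stored as its string representation. The standard-library sketch below illustrates that round trip without pandas; the in-memory file and the sample row are placeholders.

```python
import csv
import io
from ast import literal_eval

# Hypothetical row in the title/paragraphs layout.
row = {"title": "The Article Title", "paragraphs": ["Paragraph 1", "Paragraph 2"]}

# Write the row to CSV: the list is serialized as its Python repr.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "paragraphs"])
writer.writeheader()
writer.writerow({"title": row["title"], "paragraphs": repr(row["paragraphs"])})

# Read it back: the cell comes back as a plain string ...
buffer.seek(0)
raw = next(csv.DictReader(buffer))
# ... and literal_eval safely parses that string back into a list.
paragraphs = literal_eval(raw["paragraphs"])

print(type(raw["paragraphs"]).__name__, type(paragraphs).__name__)
```

Without the converter, each paragraphs cell would stay a single string rather than a list of paragraphs.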

Making Predictions

To make a prediction:

cdqa_pipeline.predict(query='your question')
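
The project's examples unpack the prediction as an answer, a document title, and the source paragraph. Assuming that tuple layout (worth confirming against your installed version), a small helper can make the output easier to read. The format_prediction name and the sample tuple are illustrative stand-ins, not real cdQA output.

```python
def format_prediction(query, prediction):
    """Render an (answer, title, paragraph) tuple as readable text.

    Assumes the first three items are answer, title, and paragraph;
    any extra items (such as a score) are ignored.
    """
    answer, title, paragraph = prediction[:3]
    return "\n".join([
        f"query: {query}",
        f"answer: {answer}",
        f"title: {title}",
        f"paragraph: {paragraph}",
    ])

# Stand-in tuple; with a fitted pipeline this would come from
# cdqa_pipeline.predict(query=...).
sample = ("an answer span", "The Article Title", "Paragraph containing an answer span.")
print(format_prediction("your question", sample))
```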

Evaluating Models

Evaluate your models with the following steps:

Convert your DataFrame to a SQuAD-format json file:

from cdqa.utils.converters import df2squad
json_data = df2squad(df=df, squad_version='v1.1', output_dir='.', filename='dataset-name')

Annotate it to add ground truth question-answer pairs.
Evaluate the pipeline:

from cdqa.utils.evaluation import evaluate_pipeline
# Replace the placeholder with the path to your annotated SQuAD-format file.
evaluate_pipeline(cdqa_pipeline, 'path-to-annotated-dataset.json')
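
When annotating by hand, a common slip is an answer_start offset that does not match the answer text. The sketch below checks this for a SQuAD v1.1-style structure before evaluation. The find_misaligned_answers helper and the sample record are illustrative assumptions, not part of cdQA.

```python
def find_misaligned_answers(squad):
    """Return ids of questions whose answer text is not found at answer_start."""
    bad = []
    for article in squad["data"]:
        for para in article["paragraphs"]:
            context = para["context"]
            for qa in para["qas"]:
                for ans in qa["answers"]:
                    start = ans["answer_start"]
                    end = start + len(ans["text"])
                    if context[start:end] != ans["text"]:
                        bad.append(qa["id"])
    return bad

# Hypothetical annotated dataset: q1 has a correct offset, q2 does not.
context = "cdQA answers questions over a closed corpus."
sample = {
    "version": "v1.1",
    "data": [{
        "title": "The Article Title",
        "paragraphs": [{
            "context": context,
            "qas": [
                {"id": "q1", "question": "Over what does cdQA answer questions?",
                 "answers": [{"text": "a closed corpus",
                              "answer_start": context.index("a closed corpus")}]},
                {"id": "q2", "question": "What answers questions?",
                 "answers": [{"text": "cdQA", "answer_start": 5}]},
            ],
        }],
    }],
}
print(find_misaligned_answers(sample))
```

To check a file on disk, load it with json.load and pass the resulting dictionary to the helper.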

Deployment

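Deploy a cdQA REST API by setting the dataset and reader paths, then starting the Flask app from the directory that contains api.py (the root of the cloned repository):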

export dataset_path='path-to-dataset.csv'
export reader_path='path-to-reader-model'
FLASK_APP=api.py flask run -h 0.0.0.0

You can then make queries using HTTPie:

http localhost:5000/api query=='your question here'
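
The same request can be sent from Python with only the standard library. This is a sketch that assumes the API is reachable at localhost:5000 and replies with a JSON body; the build_query_url and ask helpers are hypothetical names introduced here.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def build_query_url(question, base_url="http://localhost:5000/api"):
    """Build the GET URL, URL-encoding the question as the 'query' parameter."""
    return f"{base_url}?{urlencode({'query': question})}"

def ask(question, base_url="http://localhost:5000/api", timeout=30):
    """Send the question to a running cdQA API and return the parsed JSON body."""
    with urlopen(build_query_url(question, base_url), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))

# Only the URL is built here; call ask(...) once the Flask server is running.
print(build_query_url("your question here"))
```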

Troubleshooting

If you encounter issues during installation or usage, here are some pointers:

Check if all paths are correctly set, especially for data and models.
Ensure all required dependencies are installed, including Java OpenJDK for converters.
Verify that the DataFrame for your data is correctly formatted as specified.
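
The first two pointers can be automated with a short pre-flight check. The sketch below uses only the standard library; the preflight helper is a hypothetical name, and the file names passed to it are the placeholders used earlier in this guide.

```python
import importlib.util
import os
import shutil

def preflight(paths, modules=("cdqa",), executables=("java",)):
    """Collect readable problems for missing paths, modules, and executables."""
    problems = []
    for path in paths:
        if not os.path.exists(path):
            problems.append(f"missing path: {path}")
    for name in modules:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is None:
            problems.append(f"module not importable: {name}")
    for exe in executables:
        # shutil.which returns None when the executable is not on PATH.
        if shutil.which(exe) is None:
            problems.append(f"executable not on PATH: {exe}")
    return problems

for problem in preflight(["your-custom-corpus-here.csv", "bert_qa.joblib"]):
    print(problem)
```

An empty result means the paths exist, cdqa is importable, and a java executable was found; it does not validate the contents of the files.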
