Welcome to our guide on using EMDR2 (End-to-end training of Multi-Document Reader and Retriever), an end-to-end training algorithm for open-domain question answering. Follow this step-by-step article to set up your environment, download the required data and checkpoints, and run your first training session!
Setup
To get started with EMDR2, we recommend using one of the recent PyTorch containers from NGC. The image version used for the paper can be pulled with:
docker pull nvcr.io/nvidia/pytorch:20.10-py3
Before proceeding, ensure that the Nvidia container toolkit is installed on your machine.
Additional dependencies can be found in the provided Dockerfile located in the docker directory. You can build a new Docker image using the command:
cd docker
sudo docker build -t nvcr.io/nvidia/pytorch:20.10-py3-faiss-compiled .
To run the built image interactively, use the following command:
sudo docker run --ipc=host --gpus all -it --rm -v /mnt/disks:/mnt/disks nvcr.io/nvidia/pytorch:20.10-py3-faiss-compiled bash
Here, /mnt/disks is the host directory you want to mount inside the container; replace it with the directory holding your data and checkpoints.
Downloading Data and Checkpoints
You will need pretrained checkpoints and datasets to train your models effectively. You can download these files using the wget command along with the links provided below:
Required Data Files for Training
- Wikipedia evidence passages from DPR paper
- Pre-tokenized evidence passages and their titles
- Dataset-specific question-answer pairs
- BERT-large vocabulary file
Required Checkpoints and Embeddings
- Masked Salient Span (MSS) pre-trained retriever
- Masked Salient Span (MSS) pre-trained reader
- Precomputed evidence embedding using MSS retriever (32 GB file!)
Additionally, for training Masked Salient Spans (MSS), optional data can be downloaded here: MSS training data.
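The downloads above can be scripted. The sketch below is a hypothetical helper only: the real URLs are those linked from the items in the list, and the example.com addresses here are placeholders.

```shell
# Hypothetical download helper: replace the placeholder URLs with the real
# links from the list above before using it.
DATA_DIR=${DATA_DIR:-./emdr2-data}
mkdir -p "$DATA_DIR"

FILE_URLS="
https://example.com/wikipedia-evidence-passages.tsv
https://example.com/mss-retriever-checkpoint.tar
https://example.com/mss-evidence-embeddings.pkl
"

for url in $FILE_URLS; do
  # 'wget -c' resumes interrupted downloads, which matters for the 32 GB
  # precomputed embedding file. Remove 'echo' to actually download.
  echo wget -c -P "$DATA_DIR" "$url"
done
```

Resumable downloads (`-c`) are worth the small extra typing here, since a dropped connection midway through the embedding file would otherwise mean starting over.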
Usage
Several scripts are available for training models for both dense retrieval and open-domain QA tasks in the examples directory. Remember to change the data and checkpoint paths in these scripts before running them.
To replicate the answer generation results on the Natural Questions (NQ) dataset, run:
bash examples/openqa/emdr2_nq.sh
Similar scripts are available for TriviaQA, WebQuestions, and for training the dense retriever.
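Before launching a script, point it at your local copies of the data and checkpoints. A minimal sketch of that setup follows; only CHECKPOINT_PATH and EMBEDDING_PATH are variable names confirmed by this guide, and the other names and all paths are hypothetical stand-ins for the values you must edit inside each script.

```shell
# Illustrative path setup (all paths are examples; edit the actual
# variables at the top of the script you plan to run).
CHECKPOINT_PATH=/mnt/disks/checkpoints/mss-init
EMBEDDING_PATH=/mnt/disks/embeddings/mss-evidence-embeddings.pkl
QA_DATA_DIR=/mnt/disks/data/nq               # hypothetical: question-answer pairs
EVIDENCE_PATH=/mnt/disks/data/psgs_w100.tsv  # hypothetical: Wikipedia passages

# then, for example:
# bash examples/openqa/emdr2_nq.sh
```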
Training
End-to-end training is configured for a single node with 16 A100 GPUs (40 GB memory each). Within the codebase:
- The first 8 GPUs handle model training.
- The other 8 GPUs compute evidence embeddings asynchronously.
- All 16 GPUs perform online retrieval at every step.
If you only have 8 GPUs available, you can still run the code by disabling asynchronous evidence embedding, but note that this may affect end-task performance.
Pre-trained Checkpoints
To use the pre-trained checkpoints, set the CHECKPOINT_PATH and EMBEDDING_PATH variables accordingly. Also add the --no-load-optim option and remove the --emdr2-training, --async-indexer, and --index-reload-interval 500 options to enable inference mode. Inference requires less memory, so you can evaluate the models on 4-8 GPUs.
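As a rough illustration of these flag changes, here is a minimal sketch that rewrites an options string. The three training flags and --no-load-optim come from the description above; the OPTIONS string itself (including --batch-size 8) is invented for the example, and in practice you would edit the flags directly in the training script.

```shell
# Hypothetical training options; '--batch-size 8' stands in for whatever
# other flags the script passes.
OPTIONS="--emdr2-training --async-indexer --index-reload-interval 500 --batch-size 8"

# Remove the end-to-end training flags...
INFER_OPTIONS=$(echo "$OPTIONS" \
  | sed -e 's/--emdr2-training//' \
        -e 's/--async-indexer//' \
        -e 's/--index-reload-interval [0-9]*//')

# ...and add --no-load-optim so the optimizer state is not loaded.
INFER_OPTIONS="$INFER_OPTIONS --no-load-optim"
echo "$INFER_OPTIONS"
```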
Helper Scripts
Some essential scripts include:
- To save the retriever model for tasks such as top-K recall evaluation, use:
python tools/save_emdr2_models.py --submodel-name retriever --load e2eqatrivia --save e2eqatrivia/retriever
- To build the Wikipedia evidence indexes with a saved retriever and run the evaluation, use:
bash examples/helper-scripts/create_wiki_indexes_and_evaluate.sh
Issues
If you encounter any errors or bugs within the codebase, feel free to open a new issue or contact Devendra Singh Sachan at sachan.devendra@gmail.com.
For troubleshooting:
- Ensure all dependencies and the Docker image are installed correctly.
- Check your GPU configuration to avoid performance issues.
- Refer to the code documentation for scripts and adjustments.
- If issues persist, reach out to the team for expert guidance.
Happy coding and good luck with your open-domain question answering endeavors using EMDR2!

