Welcome to our guide on using EMDR2 (End-to-end training of Multi-Document Reader and Retriever), an end-to-end training algorithm for open-domain question answering. Follow this step-by-step article to set up your environment, download the required data and checkpoints, and run your first training session!
Setup
To get started with EMDR2, we recommend using one of the recent PyTorch containers from NGC. The image version used for the paper can be pulled with:
docker pull nvcr.io/nvidia/pytorch:20.10-py3
Before proceeding, ensure that the Nvidia container toolkit is installed on your machine.
Additional dependencies can be found in the provided Dockerfile located in the docker directory. You can build a new Docker image using the command:
cd docker
sudo docker build -t nvcr.io/nvidia/pytorch:20.10-py3-faiss-compiled .
To run the built image interactively, use the following command:
sudo docker run --ipc=host --gpus all -it --rm -v /mnt/disks:/mnt/disks nvcr.io/nvidia/pytorch:20.10-py3-faiss-compiled bash
Here, /mnt/disks is the host directory you want to mount inside the container; replace it with the directory holding your data and checkpoints.
Downloading Data and Checkpoints
You will need pretrained checkpoints and datasets to train your models effectively. You can download these files using the wget command along with the links provided below:
Required Data Files for Training
- Wikipedia evidence passages from DPR paper
- Pre-tokenized evidence passages and their titles
- Dataset-specific question-answer pairs
- BERT-large vocabulary file
Required Checkpoints and Embeddings
- Masked Salient Span (MSS) pre-trained retriever
- Masked Salient Span (MSS) pre-trained reader
- Precomputed evidence embedding using MSS retriever (32 GB file!)
Additionally, for training Masked Salient Spans (MSS), optional data can be downloaded here: MSS training data.
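The downloads above can be scripted. The sketch below is a hypothetical helper only: the real URLs are those linked from the items in the list, and the example.com addresses here are placeholders.

```shell
# Hypothetical download helper: replace the placeholder URLs with the real
# links from the list above before using it.
DATA_DIR=${DATA_DIR:-./emdr2-data}
mkdir -p "$DATA_DIR"

FILE_URLS="
https://example.com/wikipedia-evidence-passages.tsv
https://example.com/mss-retriever-checkpoint.tar
https://example.com/mss-evidence-embeddings.pkl
"

for url in $FILE_URLS; do
  # 'wget -c' resumes interrupted downloads, which matters for the 32 GB
  # precomputed embedding file. Remove 'echo' to actually download.
  echo wget -c -P "$DATA_DIR" "$url"
done
```

Resumable downloads (`-c`) are worth the small extra typing here, since a dropped connection midway through the embedding file would otherwise mean starting over.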
Usage
Several scripts are available for training models for both dense retrieval and open-domain QA tasks in the examples directory. Remember to change the data and checkpoint paths in these scripts before running them.
To replicate the answer generation results on the Natural Questions (NQ) dataset, run:
bash examples/openqa/emdr2_nq.sh
Similar scripts are available for TriviaQA, WebQuestions, and for training the dense retriever.
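Before launching a script, point it at your local copies of the data and checkpoints. A minimal sketch of that setup follows; only CHECKPOINT_PATH and EMBEDDING_PATH are variable names confirmed by this guide, and the other names and all paths are hypothetical stand-ins for the values you must edit inside each script.

```shell
# Illustrative path setup (all paths are examples; edit the actual
# variables at the top of the script you plan to run).
CHECKPOINT_PATH=/mnt/disks/checkpoints/mss-init
EMBEDDING_PATH=/mnt/disks/embeddings/mss-evidence-embeddings.pkl
QA_DATA_DIR=/mnt/disks/data/nq               # hypothetical: question-answer pairs
EVIDENCE_PATH=/mnt/disks/data/psgs_w100.tsv  # hypothetical: Wikipedia passages

# then, for example:
# bash examples/openqa/emdr2_nq.sh
```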
Training
End-to-end training is configured for a single node with 16 A100 GPUs (40 GB memory each). Within the codebase:
- The first 8 GPUs handle model training.
- The other 8 GPUs compute evidence embeddings asynchronously.
- All 16 GPUs perform online retrieval at every step.
If you only have 8 GPUs available, you can still run the code by disabling asynchronous evidence embedding, but note that this may affect end-task performance.
Pre-trained Checkpoints
To use the pre-trained checkpoints, set the CHECKPOINT_PATH and EMBEDDING_PATH variables accordingly. Also add the --no-load-optim option and remove the --emdr2-training, --async-indexer, and --index-reload-interval 500 options to enable inference mode. Inference requires less memory, so you can evaluate the models on 4-8 GPUs.
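As a rough illustration of these flag changes, here is a minimal sketch that rewrites an options string. The three training flags and --no-load-optim come from the description above; the OPTIONS string itself (including --batch-size 8) is invented for the example, and in practice you would edit the flags directly in the training script.

```shell
# Hypothetical training options; '--batch-size 8' stands in for whatever
# other flags the script passes.
OPTIONS="--emdr2-training --async-indexer --index-reload-interval 500 --batch-size 8"

# Remove the end-to-end training flags...
INFER_OPTIONS=$(echo "$OPTIONS" \
  | sed -e 's/--emdr2-training//' \
        -e 's/--async-indexer//' \
        -e 's/--index-reload-interval [0-9]*//')

# ...and add --no-load-optim so the optimizer state is not loaded.
INFER_OPTIONS="$INFER_OPTIONS --no-load-optim"
echo "$INFER_OPTIONS"
```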
Helper Scripts
Some essential scripts include:
- To save the retriever model for tasks such as top-K recall evaluation, use:
python tools/save_emdr2_models.py --submodel-name retriever --load e2eqatrivia --save e2eqatrivia/retriever
- To build the Wikipedia evidence indexes with a saved retriever and run the evaluation, use:
bash examples/helper-scripts/create_wiki_indexes_and_evaluate.sh
Issues
If you encounter any errors or bugs within the codebase, feel free to open a new issue or contact Devendra Singh Sachan at sachan.devendra@gmail.com.
For troubleshooting:
- Ensure all dependencies and the Docker image are installed correctly.
- Check your GPU configuration to avoid performance issues.
- Refer to the code documentation for scripts and adjustments.
- If issues persist, reach out to the team for expert guidance.
Happy coding and good luck with your open-domain question answering endeavors using EMDR2!

