Automate Question Extraction with a Language Model

Jun 16, 2023 | Data Science

Large language models (LLMs) have changed how we work with textual data, particularly through instruction tuning, which relies on sets of questions and answers. However, to fine-tune these models on your own data, you need a large number of question-answer pairs about that specific data, and writing them by hand is cumbersome and time-consuming. Fortunately, this task can be automated by using ChatGPT to extract question-answer pairs from existing textual data.

Installation Guide

To get started, follow these simple steps to install the necessary tools:

  • Clone the repository holding the code.
  • Install the required Python packages:
    • tiktoken: The OpenAI tokenizer.
    • openai: Official OpenAI API client.
    • langchain: The glue code that connects models and utilities.
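If you manage packages with pip, the installation might look like the following (a sketch based on the package list above; you may want to pin versions for reproducibility):

pip install tiktoken openai langchain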

How to Use the Script

This script efficiently transforms a folder of markdown (.md) documents into a .json file that contains a neatly organized list of questions and answers derived from the source documents.
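The exact schema depends on the script, but the output is a list of question-answer records roughly along these lines (an illustrative sketch; the field names here, including source, are assumptions rather than the script's guaranteed format):

[
  {
    "source": "docs/getting_started.md",
    "question": "How do I submit a job?",
    "answer": "..."
  }
]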

To run the code, follow these straightforward steps:

  1. Set the appropriate file paths in the question_extractor.py file, specifying both the input folder and the output path.
  2. Ensure your OpenAI API key is available in the environment.
  3. Execute the script using Python:
python3 question_extractor.py
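For step 2, on a Unix-like shell you can export the key before running the script; OPENAI_API_KEY is the environment variable the official openai client reads:

export OPENAI_API_KEY="sk-..."
python3 question_extractor.py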

Once execution finishes, you will find all extracted questions and answers in a .json file at your designated output path.

Understanding the Inner Workings

Think of the script as a librarian who works through a stack of books and produces a tidy list of questions and answers for each one.

Here’s how it operates:

  • The code begins by looping through all the files in your designated folder.
  • For each file, it extracts a series of questions using a prompt that asks the model to formulate a list of questions answerable solely from the provided text.
  • Next, it answers each of those questions using a second prompt that instructs the model to respond based exclusively on the corresponding text chunk (a sketch of this two-prompt flow appears after this list).
  • Files are processed concurrently for speed, while each request is kept within the model's token limit.
  • If a document is too long to fit in a single request, the text is split at the highest markdown heading level present, recursively, until the chunks are of manageable size.
  • With this approach, the entire NERSC documentation can be processed in about 6 minutes, generating over 8,000 questions.
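To make the flow concrete, here is a minimal sketch of the two-prompt loop, written against the 1.x-style openai Python client. The prompts, model choice, and helper names are illustrative assumptions rather than the script's exact code, and the real script's concurrency and retry logic are omitted for brevity:

from pathlib import Path
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative prompts; the script's actual wording may differ.
QUESTION_PROMPT = "Write a list of questions that can be answered using only the following text."
ANSWER_PROMPT = "Answer the question using only the provided text."

def chat(system_prompt, user_content):
    # One chat completion call; the model name is an assumption.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "system", "content": system_prompt},
                  {"role": "user", "content": user_content}],
    )
    return response.choices[0].message.content

def extract_qa(text):
    # First prompt: ask the model for questions grounded in the text.
    raw = chat(QUESTION_PROMPT, text)
    questions = [line.lstrip("-*0123456789. ").strip()
                 for line in raw.splitlines() if line.strip()]
    # Second prompt: answer each question using only the same chunk.
    return [{"question": q,
             "answer": chat(ANSWER_PROMPT, f"Text:\n{text}\n\nQuestion: {q}")}
            for q in questions]

pairs = []
for path in Path("input_folder").glob("**/*.md"):
    pairs.extend(extract_qa(path.read_text()))

And here is an illustrative recursive splitter for the heading-based chunking described above, using tiktoken to measure length. The real script also falls back to smaller units such as paragraphs when no headings remain, which this sketch omits:

import re
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

def split_markdown(text, max_tokens=3000, level=1):
    # Short enough (or no deeper heading levels to try): keep as one chunk.
    if len(encoding.encode(text)) <= max_tokens or level > 6:
        return [text]
    # Split just before each heading of the current level ("# ", then "## ", ...).
    sections = re.split(rf"(?m)^(?={'#' * level} )", text)
    chunks = []
    for section in sections:
        if section.strip():
            chunks.extend(split_markdown(section, max_tokens, level + 1))
    return chunks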

Troubleshooting Tips

If you face any issues while executing this script, consider the following suggestions:

  • Ensure all file paths are correctly set in the question_extractor.py file.
  • Double-check your OpenAI API key and make sure it is exported in your environment (for example, as the OPENAI_API_KEY variable).
  • If the script is running too slowly, consider limiting the number of files initially processed to identify any bottlenecks.


Potential Improvements

While this script is quite efficient, there’s always room for growth. Here are some potential enhancements:

  • Enable GPT-4 integration for enhanced answer quality, albeit at the cost of increased runtime and expenses.
  • Add functionality to save intermediate outputs, allowing users to resume interrupted processes conveniently.
  • Streamline dependencies by calling the OpenAI client directly instead of going through langchain, simplifying the code and potentially improving performance.

