Welcome to the world of generating very long, textbook-quality pre-training data! In this article, we’ll walk through setting up and using textbook_quality, a project that uses an LLM to generate long, detailed textbooks on the topics you choose. Think of it as having a highly knowledgeable assistant who can churn out quality textbooks on various topics, tailored to your needs. Let’s dive in!
Prerequisites
- Make sure you have Python 3.9+ (ideally 3.11) installed on your system.
- Have PostgreSQL installed. If you’re a Mac user, you can install it with Homebrew:
brew install postgres
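Before moving on, it can help to confirm both prerequisites with a short script. The one below is only a sketch: check_prerequisites is a helper name made up for this article, and it assumes that a working PostgreSQL install puts the psql client on your PATH.

```python
import shutil
import sys

MIN_PYTHON = (3, 9)  # minimum version named in the prerequisites


def check_prerequisites(version_info=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means all good.

    Hypothetical helper for this article, not part of the project.
    """
    version_info = sys.version_info if version_info is None else version_info
    problems = []
    if tuple(version_info[:2]) < MIN_PYTHON:
        problems.append(
            f"Python {version_info[0]}.{version_info[1]} found, 3.9+ required"
        )
    # Assumption: a working PostgreSQL install exposes `psql` on the PATH.
    if which("psql") is None:
        problems.append("psql not found on PATH; is PostgreSQL installed?")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("WARNING:", problem)
```

If the script prints nothing, both checks passed. The version_info and which parameters exist only so the function can be exercised without touching the real system.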
Setting Up the Project
Let’s get this project rolling! Follow these steps:
- Open your command line and create a new database:
psql postgres -c "create database textbook;"
- Clone the repository:
git clone https://github.com/VikParuchuri/textbook_quality.git
- Navigate into the cloned folder:
cd textbook_quality
- Install the required dependencies:
poetry install
- Run the migration command:
invoke migrate-dev
Configuration Options
Before you can start generating content, you need to configure your environment. Here’s how:
- Create a file named local.env in the root directory to keep your secret keys.
- For quality generation, set up your keys and choose your backend:
  - OpenAI key:
  OPENAI_KEY=sk-xxxxxx
  - Choose a retrieval backend:
    - For Serply:
    SERPLY_KEY=...
    - For SerpAPI:
    SERPAPI_KEY=...
    - To disable retrieval:
    SEARCH_BACKEND=none
- By default, the generator uses GPT-3.5. To use GPT-4, set the following variables:
LLM_TYPE=gpt-4
LLM_INSTRUCT_TYPE=gpt-4
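A typo in local.env is an easy mistake to make, so a small sanity check can save a failed run. The sketch below parses simple KEY=VALUE lines and flags missing settings. The helper names parse_env and check_settings are made up for this article, and the rule that you need either a search key or SEARCH_BACKEND=none is our reading of the options above, not a check the project is known to perform.

```python
from pathlib import Path


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def check_settings(settings):
    """Return a list of problems found in the parsed settings."""
    problems = []
    if not settings.get("OPENAI_KEY"):
        problems.append("OPENAI_KEY is missing")
    # Assumption: one search key is needed unless retrieval is disabled.
    has_search_key = settings.get("SERPLY_KEY") or settings.get("SERPAPI_KEY")
    if not has_search_key and settings.get("SEARCH_BACKEND") != "none":
        problems.append("set SERPLY_KEY or SERPAPI_KEY, or SEARCH_BACKEND=none")
    return problems


if __name__ == "__main__":
    env_path = Path("local.env")
    if env_path.exists():
        for problem in check_settings(parse_env(env_path.read_text())):
            print("WARNING:", problem)
    else:
        print("local.env not found in the current directory")
```

Run it from the project root. It only reads the file and never prints the key values themselves.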
Generating Content
Now that everything is set up, you can start generating topics, augmenting them, and creating entire textbooks!
Generating Topics from Scratch
To create new topics, pass a subject, an output file, and the number of iterations:
python topic_generator.py <subject> <output_file> --iterations <n>
For instance:
python topic_generator.py "computer science" "python_cs_titles.json" --iterations 50
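If you want topic lists for several subjects, you could wrap the command above in a small driver script. This is a sketch only: it uses nothing beyond the arguments shown in the example, the helper names topic_command and run_all are made up, and the second subject and file name are hypothetical. By default it just prints the commands (a dry run).

```python
import shlex
import subprocess
import sys


def topic_command(subject, output_file, iterations):
    """Build the argument list for one topic_generator.py run."""
    return [
        sys.executable, "topic_generator.py",
        subject, output_file,
        "--iterations", str(iterations),
    ]


def run_all(jobs, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for subject, output_file, iterations in jobs:
        cmd = topic_command(subject, output_file, iterations)
        print(shlex.join(cmd))
        if not dry_run:
            # check=True stops the batch if one run fails.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_all([
        ("computer science", "python_cs_titles.json", 50),
        ("linear algebra", "linear_algebra_titles.json", 50),  # hypothetical
    ])
```

Passing the arguments as a list avoids shell quoting problems with subjects that contain spaces. Call run_all(jobs, dry_run=False) from the project root once the printed commands look right.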
Augmenting Topics from Seeds
If you have existing topics, you can augment them:
python topic_augmentor.py --domain
Generating Textbooks
To generate textbooks from your topics, pass the topics file, an output file, and the number of workers:
python book_generator.py <topics_file> <output_file> --workers <n>
Example:
python book_generator.py topics.json books.jsonl --workers 5
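Once a run finishes, you may want a quick look at what landed in the output file. The sketch below assumes that books.jsonl follows the usual JSON Lines convention (one JSON object per line), which the .jsonl extension suggests but this article does not confirm. Since the field names are not documented here, it simply reports whichever top-level keys it finds. summarize_jsonl is a made-up helper name.

```python
import json
import os
from collections import Counter


def summarize_jsonl(path):
    """Return (record_count, Counter of top-level keys) for a JSON Lines file."""
    count = 0
    keys = Counter()
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_number}: invalid JSON") from exc
            count += 1
            if isinstance(record, dict):
                keys.update(record.keys())
    return count, keys


if __name__ == "__main__":
    if os.path.exists("books.jsonl"):
        total, seen_keys = summarize_jsonl("books.jsonl")
        print(f"{total} records")
        for key, occurrences in seen_keys.most_common():
            print(f"  {key}: present in {occurrences} records")
```

Reading line by line keeps memory use flat even when the generated books are very long.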
An Analogy for Better Understanding
Imagine this process as baking a cake. The prerequisites (Python and PostgreSQL) are like gathering your ingredients and baking tools. Configuring your environment is akin to setting your oven temperature and preparing your baking pan. Finally, generating the content is like mixing your ingredients and putting the cake in the oven—it’s where the magic happens! Just as a cake takes time to bake, your content will be generated based on your input and the system’s capabilities.
Troubleshooting Tips
If you encounter any issues while setting up or using the project, consider these tips:
- Check your Python version to ensure it meets the prerequisite:
python --version
- Ensure that PostgreSQL is properly installed and running. If you installed it with Homebrew, you can check its status with:
brew services list
- If you face issues with the generation scripts, verify that your API keys are correctly entered in local.env.
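If you are not on a Mac, or just want a check that does not depend on Homebrew, you can test whether anything is accepting connections on PostgreSQL's default port. This is a rough sketch: it assumes the server listens on localhost at the default TCP port 5432, and a successful connection only shows that something is listening there, not that it is PostgreSQL. port_open is a made-up helper name.

```python
import socket


def port_open(host="localhost", port=5432, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False


if __name__ == "__main__":
    if port_open():
        print("Something is listening on localhost:5432")
    else:
        print("Nothing is listening on localhost:5432; is PostgreSQL running?")
```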
Extending the Project
This project is adaptable. If you wish to add new features or retrieval methods, you can explore:
- LLM adapters within app/llm/adaptors
- Retrieval methods in app/services/adaptors
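To give a feel for what adding a retrieval method can look like, here is a generic sketch of the adapter pattern. It is not the project's actual interface: every name below (SearchAdaptor, NullSearch, StaticSearch, ADAPTORS, get_adaptor) is hypothetical, so check the code in the adaptors folders for the real base classes before writing your own.

```python
from typing import Callable, Dict, List, Protocol


class SearchAdaptor(Protocol):
    """Hypothetical interface: anything with a search(query) method."""

    def search(self, query: str) -> List[str]: ...


class NullSearch:
    """Backend that returns nothing, in the spirit of SEARCH_BACKEND=none."""

    def search(self, query: str) -> List[str]:
        return []


class StaticSearch:
    """Toy backend that serves canned results, handy for tests."""

    def __init__(self, results: Dict[str, List[str]]):
        self._results = results

    def search(self, query: str) -> List[str]:
        return self._results.get(query, [])


# Registry mapping a backend name to a factory that builds the adaptor.
ADAPTORS: Dict[str, Callable[[], SearchAdaptor]] = {"none": NullSearch}


def get_adaptor(name: str) -> SearchAdaptor:
    """Look up a backend by name and instantiate it."""
    try:
        return ADAPTORS[name]()
    except KeyError:
        raise ValueError(f"unknown search backend: {name!r}") from None
```

With a registry like this, a new backend is one class plus one entry, for example ADAPTORS["static"] = lambda: StaticSearch({"python": ["some result"]}), and the rest of the code keeps calling get_adaptor(name).search(query) unchanged.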
Final Thoughts
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

