Welcome to our comprehensive guide on WARC-GPT — an experimental Retrieval Augmented Generation pipeline designed specifically for diving into web archive collections. This open-source tool makes it easier to interact with WARC files using advanced AI capabilities. Ready to get started? Let’s embark on this journey together!
Features of WARC-GPT
- Retrieval Augmented Generation pipeline for WARC files.
- Highly customizable with various Large Language Model (LLM) interactions.
- Includes a REST API and a user-friendly Web UI.
- Embeddings visualization capabilities.
Installation
Before diving into functionality, ensure you have the following machine-level dependencies:
- Python 3.11+
- Python Poetry
- Ollama (optional, for local inference)
Clone the project and install its dependencies using the following commands:
git clone https://github.com/harvard-lil/warc-gpt.git
cd warc-gpt
poetry env use 3.11
poetry install
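As a quick sanity check (this step isn't part of the original guide, just a harmless extra), you can ask Flask to list the commands it knows about; if the environment is wired up correctly, the project's custom commands should appear alongside Flask's defaults:
poetry run flask --help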
Configuring the Application
WARC-GPT uses environment variables for its settings. To configure, follow these steps:
- Copy the example environment file:
cp .env.example .env
- Edit the created .env file according to your needs.
A few essential notes:
- WARC-GPT can work with both the OpenAI API and Ollama for local inference.
- Ensure at least one of them is configured for functionality.
- By default, the program communicates with Ollama's API at http://localhost:11434.
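For illustration only, a minimal local-inference configuration might look roughly like the snippet below. The variable names shown here are assumptions rather than the project's confirmed settings, so always copy the real keys from .env.example:
# Hypothetical keys for illustration; use the names provided in .env.example.
OLLAMA_API_URL="http://localhost:11434"
# OPENAI_API_KEY="..."  # only needed if you use the OpenAI API instead of Ollama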
Ingesting WARCs
To start exploring, place the WARC files in the .warc directory and run the following command:
poetry run flask ingest
This command will:
- Extract text from the WARC files.
- Generate text embeddings.
- Store embeddings for later queries.
Note: Running this command clears the .chromadb folder.
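For example, assuming your archives live in ~/archives (an illustrative path, not one from the guide), the full ingestion step could look like this:
cp ~/archives/*.warc .warc/
poetry run flask ingest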
Starting the Server
To start the WARC-GPT server, simply run the following command:
poetry run flask run
You can specify a different port using the --port option if needed.
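For example, to serve the app on port 5001 instead of the default:
poetry run flask run --port 5001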
Interacting with the Web UI
Once the server is running, access the web UI at http://localhost:5000. The interface is designed to retrieve relevant excerpts from the knowledge base and handle chat history seamlessly, enabling intuitive interactions.
Interacting with the API
WARC-GPT offers a robust API for more customized interactions. Here are some key API endpoints.
- [GET] /api/models: Retrieve a list of available models.
- [POST] /api/search: Search against the vector store based on user prompts.
- [POST] /api/complete: Generate text completions using specified models.
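To give you a feel for how these endpoints are called, here is a rough sketch using curl. The JSON field name in the POST body is an assumption made for illustration, so check the project's documentation or source for the exact request schema:
curl http://localhost:5000/api/models
curl -X POST http://localhost:5000/api/search \
  -H "Content-Type: application/json" \
  -d '{"message": "What topics does this web archive cover?"}'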
Visualizing Embeddings
Want to see your embeddings in action? WARC-GPT can produce basic interactive t-SNE scatter plots of the embeddings it has stored. Generate one with:
poetry run flask visualize
You can even add questions to your visualization using:
poetry run flask visualize --questions="Who am I?;Who are you?"
Troubleshooting
If you encounter any issues:
- Ensure that all dependencies are correctly installed.
- Check that your .env file is configured correctly.
- Verify that your WARC files were ingested and that the server is running.
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Disclaimer
The Library Innovation Lab at Harvard Law School provides this experimental tool, focusing on principles of longevity, authenticity, reliability, and privacy. Your feedback is crucial for improving this tool.
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
Happy exploring!

