Welcome to our comprehensive guide on WARC-GPT — an experimental Retrieval Augmented Generation pipeline designed specifically for diving into web archive collections. This open-source tool makes it easier to interact with WARC files using advanced AI capabilities. Ready to get started? Let’s embark on this journey together!
Features of WARC-GPT
- Retrieval Augmented Generation pipeline for WARC files.
- Highly customizable with various Large Language Model (LLM) interactions.
- Includes a REST API and a user-friendly Web UI.
- Embeddings visualization capabilities.
Installation
Before diving into functionality, ensure you have the following machine-level dependencies:
- Python 3.11+
- Python Poetry
- Ollama (optional, for local inference)
Clone the project and install its dependencies using the following commands:
git clone https://github.com/harvard-lil/warc-gpt.git
cd warc-gpt
poetry env use 3.11
poetry install
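As a quick sanity check (this step isn't part of the original guide, just a harmless extra), you can ask Flask to list the commands it knows about; if the environment is wired up correctly, the project's custom commands should appear alongside Flask's defaults:
poetry run flask --help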
Configuring the Application
WARC-GPT uses environment variables for its settings. To configure, follow these steps:
- Copy the example environment file:
cp .env.example .env
- Edit the created .env file according to your needs.
A few essential notes:
- WARC-GPT can work with both the OpenAI API and Ollama for local inference.
- Ensure at least one of them is configured for functionality.
- By default, the program communicates with Ollama's API at http://localhost:11434.
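For illustration only, a minimal local-inference configuration might look roughly like the snippet below. The variable names shown here are assumptions rather than the project's confirmed settings, so always copy the real keys from .env.example:
# Hypothetical keys for illustration; use the names provided in .env.example.
OLLAMA_API_URL="http://localhost:11434"
# OPENAI_API_KEY="..."  # only needed if you use the OpenAI API instead of Ollama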
Ingesting WARCs
To start exploring, place the WARC files in the .warc directory and run the following command:
poetry run flask ingest
This command will:
- Extract text from the WARC files.
- Generate text embeddings.
- Store embeddings for later queries.
Note: Running this command clears the .chromadb folder.
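For example, assuming your archives live in ~/archives (an illustrative path, not one from the guide), the full ingestion step could look like this:
cp ~/archives/*.warc .warc/
poetry run flask ingest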
Starting the Server
To start the WARC-GPT server, simply run the following command:
poetry run flask run
You can specify a different port using the --port option if needed.
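For example, to serve the app on port 5001 instead of the default:
poetry run flask run --port 5001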
Interacting with the Web UI
Once the server is running, access the web UI at http://localhost:5000. The interface is designed to retrieve relevant excerpts from the knowledge base and handle chat history seamlessly, enabling intuitive interactions.
Interacting with the API
WARC-GPT offers a robust API for more customized interactions. Here are some key API endpoints.
- [GET] /api/models: Retrieve a list of available models.
- [POST] /api/search: Search against the vector store based on user prompts.
- [POST] /api/complete: Generate text completions using specified models.
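To give you a feel for how these endpoints are called, here is a rough sketch using curl. The JSON field name in the POST body is an assumption made for illustration, so check the project's documentation or source for the exact request schema:
curl http://localhost:5000/api/models
curl -X POST http://localhost:5000/api/search \
  -H "Content-Type: application/json" \
  -d '{"message": "What topics does this web archive cover?"}'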
Visualizing Embeddings
Want to see your embeddings in action? WARC-GPT can produce basic interactive t-SNE scatter plots of the embeddings it has stored. Generate one with:
poetry run flask visualize
You can even add questions to your visualization using:
poetry run flask visualize --questions="Who am I?;Who are you?"
Troubleshooting
If you encounter any issues:
- Ensure that all dependencies are correctly installed.
- Check that your .env file is configured correctly.
- Verify that your WARC files were ingested and that the server is running.
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Disclaimer
The Library Innovation Lab at Harvard Law School provides this experimental tool, focusing on principles of longevity, authenticity, reliability, and privacy. Your feedback is crucial for improving this tool.
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
Happy exploring!

