How to Get Started with WebLLM: A High-Performance In-Browser LLM Inference Engine

Jan 30, 2023 | Data Science

Welcome to the world of WebLLM, a unique high-performance in-browser language model inference engine. With WebLLM, you can run language model inference directly in your web browser, with no server support required. This guide will walk you through the installation, setup, and usage of WebLLM so you can start building powerful applications today!

Overview of WebLLM

WebLLM runs entirely inside the browser, using hardware acceleration via WebGPU. This allows it to perform LLM operations efficiently while keeping user data on the device. Its API is compatible with the OpenAI API, so you can reuse the same calling conventions you already know with open-source models of your choice.

Key Features of WebLLM

  • In-Browser Inference: Perform complex LLM operations without server-side processing.
  • API Compatibility: Seamlessly integrate WebLLM into your app using familiar OpenAI API calls.
  • Structured JSON Generation: Generate structured outputs in JSON format with ease (see the sketch after this list).
  • Extensive Model Support: Native support for a wide range of models such as Llama and Mistral.
  • Custom Model Integration: Adapt WebLLM for specific needs by deploying custom models.
  • Streaming & Real-Time Interactions: Supports chatflows and interactive applications.
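
As a quick taste of the structured output feature, here is a minimal sketch that requests JSON via the OpenAI-style response_format option. It assumes an engine instance has already been created as shown in the sections below:

// Sketch: ask for a JSON-formatted reply (assumes `engine` already exists).
const jsonReply = await engine.chat.completions.create({
  messages: [
    { role: 'user', content: 'List three colors as JSON under the key "colors".' },
  ],
  response_format: { type: 'json_object' },
});
console.log(jsonReply.choices[0].message.content);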

Getting Started with Installation

Begin by installing WebLLM using npm or yarn:

npm install @mlc-ai/web-llm
yarn add @mlc-ai/web-llm

Alternatively, you can load WebLLM from a CDN. It can be imported directly through a URL, making it simple to use on platforms like JSFiddle and CodePen:

import * as webllm from "https://esm.run/@mlc-ai/web-llm";
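
For example, a plain HTML page can pull WebLLM in through an inline module script. This is a minimal sketch; the log line is only there to confirm the import worked:

<script type="module">
  // Load WebLLM straight from the CDN inside the hosting page.
  import * as webllm from "https://esm.run/@mlc-ai/web-llm";
  console.log("WebLLM loaded:", webllm);
</script>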

Creating the MLCEngine

Your journey with WebLLM begins by creating an instance of the MLCEngine. This is analogous to setting up a team in a game. You need to establish your team (engine) before you can start playing (performing tasks). Below is an example of how to set it up:

const engine = await CreateMLCEngine(selectedModel, { initProgressCallback });

Here, selectedModel names the model you want to load, and initProgressCallback is a callback for monitoring loading progress, passed as part of the engine configuration object.
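
Putting the pieces together, a fuller setup might look like the sketch below. The model ID is just an example drawn from WebLLM's prebuilt model list, and the callback simply logs progress text:

import { CreateMLCEngine } from '@mlc-ai/web-llm';

// Example model ID from WebLLM's prebuilt list (swap in any supported model).
const selectedModel = 'Llama-3.1-8B-Instruct-q4f32_1-MLC';

// Log download/compile progress while the model loads.
const initProgressCallback = (progress) => {
  console.log(progress.text);
};

const engine = await CreateMLCEngine(selectedModel, { initProgressCallback });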

Invoking Chat Completion

After initializing your engine, invoke chat completions using the following format:

const messages = [
  { role: 'system', content: 'You are a helpful AI assistant.' },
  { role: 'user', content: 'Hello!' },
];
const reply = await engine.chat.completions.create({ messages });
console.log(reply.choices[0].message);

Streaming Support

WebLLM can generate real-time outputs, much like a live sports commentary, by passing stream: true to the chat creation call:

const chunks = await engine.chat.completions.create({ messages, stream: true });
let reply = "";
for await (const chunk of chunks) {
  reply += chunk.choices[0]?.delta.content ?? "";
}
console.log(reply);
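
If you would rather not assemble the deltas yourself, WebLLM also provides a convenience method that returns the full message once the stream has finished:

// Retrieve the complete generated message after the stream ends.
const fullReply = await engine.getMessage();
console.log(fullReply);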

Troubleshooting and Tips

If you encounter issues while installing or using WebLLM, consider the following troubleshooting tips:

  • Ensure all dependencies are properly installed.
  • Check whether your browser supports WebGPU (see the check after this list).
  • If loading the model takes too long, remember that model weights are cached after the first download, so make sure your browser’s caching settings are not blocking or clearing that cache.
  • Utilize browser developer tools to debug and monitor network requests.
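
For the WebGPU check mentioned above, a quick feature test in the browser console looks like this (a minimal sketch):

// Feature-check WebGPU before initializing WebLLM.
if (!navigator.gpu) {
  console.error('WebGPU is not supported in this browser.');
} else {
  const adapter = await navigator.gpu.requestAdapter();
  console.log(adapter ? 'WebGPU is available.' : 'No suitable GPU adapter found.');
}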

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Final Thoughts

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

Now that you’re equipped with the knowledge to start using WebLLM, dive in and create your own AI-powered web applications!
