How to Start with DeepSeek-VL: A Comprehensive Guide

Mar 18, 2024 | Educational

Welcome to the world of DeepSeek-VL, an innovative open-source Vision-Language Model that bridges the gap between visual and textual understanding for various real-world applications. In this guide, we will walk you through the process of installing DeepSeek-VL, running inference, and using its command-line interface. Buckle up!

1. Introduction

DeepSeek-VL is a state-of-the-art model designed to interpret complex data such as logical diagrams, web pages, scientific literature, and natural images, as well as to support embodied intelligence in complex scenarios. Its capabilities cover a broad range of vision-language understanding tasks.

For a deeper dive, the official DeepSeek-VL GitHub repository (linked in the installation steps below) and the accompanying technical report are good starting points.

2. Model Summary

The DeepSeek-VL-7b-base model uses a hybrid vision encoder that combines SigLIP-L and SAM-B, supporting image inputs at a resolution of 1024 x 1024. It is built on DeepSeek-LLM-7b-base, which was pretrained on roughly 2 trillion text tokens, and the vision-language model itself is trained on approximately 400 billion vision-language tokens.
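
You normally do not need to resize images yourself; the processor shown in the Quick Start below takes care of preprocessing. If you just want to inspect an input beforehand, a couple of lines of PIL are enough (the file name is the sample image used later in this guide):

from PIL import Image

image = Image.open("./images/training_pipelines.png").convert("RGB")
print(image.size)  # original (width, height); resizing to the model's input resolution is handled by the processor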

3. Quick Start

Installation

Before we dive into using DeepSeek-VL, let’s set up the environment. Follow these steps:

git clone https://github.com/deepseek-ai/DeepSeek-VL
cd DeepSeek-VL
pip install -e .

Make sure your Python environment is running Python 3.8 or higher.
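
To confirm that the installation worked and to see whether a GPU is visible to PyTorch, a short sanity check like this can help (an optional sketch, not part of the official instructions):

import sys
import torch
import deepseek_vl  # should import without errors after `pip install -e .`

print("Python:", sys.version.split()[0])              # should be 3.8 or higher
print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())   # True if a GPU can be used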

Simple Inference Example

Ready to see DeepSeek-VL in action? Here’s how you can run a simple inference example:

import torch
from transformers import AutoModelForCausalLM
from deepseek_vl.models import VLChatProcessor, MultiModalityCausalLM
from deepseek_vl.utils.io import load_pil_images

# specify the path to the model
model_path = "deepseek-ai/deepseek-vl-7b-base"
vl_chat_processor: VLChatProcessor = VLChatProcessor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer
vl_gpt: MultiModalityCausalLM = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval()

conversation = [
    {
        "role": "User",
        "content": "Describe each stage of this image.",
        "images": ["./images/training_pipelines.png"]
    },
    {
        "role": "Assistant",
        "content": ""
    }
]

# load images and prepare for inputs
pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(
    conversations=conversation,
    images=pil_images,
    force_batchify=True
).to(vl_gpt.device)

# run image encoder to get the image embeddings
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)

# run the model to get the response
outputs = vl_gpt.language_model.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=prepare_inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
    do_sample=False,
    use_cache=True
)
answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
print(f"{prepare_inputs['sft_format'][0]}", answer)

This example is akin to assembling a complex puzzle. Each piece — the model, the tokenizer, and the visual inputs — comes together to create a coherent image description, as if the model is piecing together its understanding of the visual elements step by step.
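
The example above assumes a GPU that supports bfloat16. If yours does not, or you want to try things out on CPU first, only the model-loading step needs to change; the rest of the pipeline stays the same. A minimal sketch of that variation:

import torch
from transformers import AutoModelForCausalLM

model_path = "deepseek-ai/deepseek-vl-7b-base"
vl_gpt = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)

if torch.cuda.is_available():
    # fall back to float16 on GPUs without bfloat16 support
    dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
    vl_gpt = vl_gpt.to(dtype).cuda().eval()
else:
    # CPU inference works but is very slow for a 7B model
    vl_gpt = vl_gpt.to(torch.float32).eval()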

CLI Chat

If you prefer the command-line interface, you can start a chat with the model by running one of the following (for interactive chat, the instruction-tuned deepseek-vl-7b-chat checkpoint generally works better than the base model):

python cli_chat.py --model_path "deepseek-ai/deepseek-vl-7b-base"
# or for a local path
python cli_chat.py --model_path "local model path"

4. License

The DeepSeek-VL code repository is released under the MIT License, while the models themselves are subject to the DeepSeek Model License, which permits commercial use. Be sure to review the model license for any usage restrictions.

5. Troubleshooting

If you encounter issues during installation or while running your models, here are some troubleshooting tips:

  • Make sure your Python environment is version 3.8 or higher.
  • Double-check that the paths you provide are accessible and spelled correctly (a quick check is sketched below).
  • Ensure that your GPU drivers and CUDA are properly installed if you plan to run the model on a GPU.
  • Read console errors carefully; they often point to path issues or missing model files.
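
For path-related problems in particular, a quick pre-flight check can save a debugging round. The paths below are placeholders; substitute the ones from your own script:

from pathlib import Path

# placeholder paths; replace with the image and (optional) local model paths you actually use
paths_to_check = [
    Path("./images/training_pipelines.png"),
    Path("./my-local-deepseek-vl-7b-base"),
]

for p in paths_to_check:
    print(f"{p}: {'found' if p.exists() else 'MISSING'}")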

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

At The Forefront of AI

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
