Welcome to your journey of exploring DeepSeek-VL, a powerful open-source Vision-Language Model designed for real-world applications in understanding vision and language. In this guide, we’ll walk you through the installation, basic usage, and troubleshooting tips for DeepSeek-VL, so you can harness its full potential.
1. Introduction
DeepSeek-VL stands as a bridge between vision and language. With capabilities that span logical diagrams, web content, scientific literature, and more, this model is tailored for complex real-world scenarios. For an in-depth look, check out the research paper DeepSeek-VL: Towards Real-World Vision-Language Understanding. You can also find the code and resources in the GitHub repository.

2. Model Summary
DeepSeek-VL-7b-base uses a hybrid vision encoder that supports high-resolution (1024 x 1024) image inputs, built on a base language model trained on a large text corpus. This combination makes it adept at interpreting a wide range of multimodal content. Essentially, think of it as a well-versed librarian who can not only narrate stories but also make sense of complex diagrams and scientific texts.
3. Quick Start
Installation
To get started, ensure that you have a Python >= 3.8 environment set up. Then, run the following commands in your terminal:
git clone https://github.com/deepseek-ai/DeepSeek-VL
cd DeepSeek-VL
pip install -e .
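Optionally, you can run the pip install -e . step inside a virtual environment to keep DeepSeek-VL's dependencies isolated from the rest of your system. Here is a minimal sketch using Python's built-in venv module:

python -m venv .venv
source .venv/bin/activate   # on Windows: .venv\Scripts\activate
pip install -e .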
Simple Inference Example
Here’s a quick example to help you use DeepSeek-VL for inference. This will allow you to input an image and retrieve a descriptive response from the model:
import torch
from transformers import AutoModelForCausalLM
from deepseek_vl.models import VLChatProcessor, MultiModalityCausalLM
from deepseek_vl.utils.io import load_pil_images
# specify the path to the model
model_path = "deepseek-ai/deepseek-vl-7b-chat"
vl_chat_processor: VLChatProcessor = VLChatProcessor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer
vl_gpt: MultiModalityCausalLM = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval()
conversation = [
    {
        "role": "User",
        # <image_placeholder> marks where the image is inserted into the prompt
        "content": "<image_placeholder>Describe each stage of this image.",
        "images": ["./images/training_pipelines.png"]
    },
    {
        "role": "Assistant",
        "content": ""
    }
]
# load images and prepare for inputs
pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(
    conversations=conversation,
    images=pil_images,
    force_batchify=True
).to(vl_gpt.device)
# run image encoder to get the image embeddings
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)
# run the model to get the response
outputs = vl_gpt.language_model.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=prepare_inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
    do_sample=False,
    use_cache=True
)
answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
print(f"{prepare_inputs['sft_format'][0]}", answer)
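If you plan to query several images, you might wrap the steps above in a small helper. The function below is a hypothetical convenience wrapper (not part of the DeepSeek-VL API); it reuses the vl_chat_processor, vl_gpt, and tokenizer objects already loaded above:

def describe_image(image_path: str, question: str) -> str:
    # hypothetical helper: builds a one-turn conversation and
    # runs the same pipeline shown in the example above
    conversation = [
        {
            "role": "User",
            "content": f"<image_placeholder>{question}",
            "images": [image_path],
        },
        {"role": "Assistant", "content": ""},
    ]
    pil_images = load_pil_images(conversation)
    inputs = vl_chat_processor(
        conversations=conversation, images=pil_images, force_batchify=True
    ).to(vl_gpt.device)
    inputs_embeds = vl_gpt.prepare_inputs_embeds(**inputs)
    outputs = vl_gpt.language_model.generate(
        inputs_embeds=inputs_embeds,
        attention_mask=inputs.attention_mask,
        pad_token_id=tokenizer.eos_token_id,
        bos_token_id=tokenizer.bos_token_id,
        eos_token_id=tokenizer.eos_token_id,
        max_new_tokens=512,
        do_sample=False,
        use_cache=True,
    )
    return tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)

print(describe_image("./images/training_pipelines.png", "What does this diagram show?"))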
CLI Chat
If you prefer command-line interface interactions, you can use the following command:
python cli_chat.py --model_path "deepseek-ai/deepseek-vl-7b-chat"
# or a local path
python cli_chat.py --model_path "local model path"
4. License
This code repository is licensed under the MIT License. The use of DeepSeek-VL Base/Chat models is governed by the DeepSeek Model License, which allows for commercial use.
5. Troubleshooting
While using DeepSeek-VL, you may encounter some challenges. Here are a few troubleshooting tips:
- Installation Errors: Ensure your Python environment meets the >= 3.8 requirement and that pip install -e . completed without errors; most import failures come from missing dependencies.
- Model Loading Issues: Verify that you have provided the correct model path. Double-check the cloning and installation steps.
- Memory Errors: If you run into CUDA out-of-memory errors, try reducing the image resolution, lowering max_new_tokens, or switching to a smaller model variant, as shown in the sketch below.
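For example, if the 7B chat model does not fit in your GPU memory, one option is the smaller 1.3B variant published in the same model family. A minimal sketch, assuming the deepseek-ai/deepseek-vl-1.3b-chat checkpoint and half-precision weights:

import torch
from transformers import AutoModelForCausalLM
from deepseek_vl.models import VLChatProcessor

# the 1.3B chat variant needs far less GPU memory than the 7B model
model_path = "deepseek-ai/deepseek-vl-1.3b-chat"
vl_chat_processor = VLChatProcessor.from_pretrained(model_path)
vl_gpt = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
vl_gpt = vl_gpt.to(torch.float16).cuda().eval()  # half precision cuts memory further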
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Conclusion
As you embark on your exploration with DeepSeek-VL, remember that practice makes perfect. Utilize the resources available and don’t hesitate to reach out for community support when needed.
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.