How to Use the MultiModal MultiLingual (3ML) Model

Jun 18, 2024 | Educational

The MultiModal MultiLingual (3ML) model combines language and vision understanding, allowing it to answer questions about documents, images, and charts. Much like a chef who can switch between cuisines, the model has been optimized to outperform counterparts such as GPT-4-turbo and Gemini 1.0 Pro on a range of tasks. This guide will help you get started with the model, troubleshoot common issues, and get the most out of it.

Getting Started with 3ML

To make the most of this powerful model, follow these simple steps to set it up in Google Colab:

  • Open the provided Google Colab link: Open In Colab.
  • You can also find the model’s source code on GitHub: GitHub Source.
  • For optimal performance, use English or Chinese for document and image understanding.
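If you are running outside Colab, the runtime also needs a few Python packages. A minimal install line as a sketch (the exact dependency set for nikravan/glm-4vq may differ; check the model card):

```shell
pip install torch transformers accelerate pillow
```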

Understanding the Core Code

The provided Python script might look daunting at first glance, but think of it as a recipe in a cookbook. Just as the recipe outlines the ingredients and instructions needed to prepare a dish, the script shows you how to instruct the model to execute tasks. Here’s a breakdown:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image

device = "cuda"  # the script assumes a CUDA-capable GPU
model_path = "nikravan/glm-4vq"

# Load the tokenizer and model from the Hugging Face Hub.
# trust_remote_code=True is required because the model ships custom code.
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    device_map="auto"
)

# Prepare the input: a question and the image it refers to.
query = 'explain all the details in this picture'
image = Image.open("a3.png").convert('RGB')

# Build a single-turn chat prompt that pairs the image with the query.
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "image": image, "content": query}],
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True
)
inputs = inputs.to(device)

# With top_k=1, sampling is effectively greedy, so output is deterministic.
gen_kwargs = {"max_length": 2500, "do_sample": True, "top_k": 1}

with torch.no_grad():
    outputs = model.generate(**inputs, **gen_kwargs)
    # Strip the prompt tokens so only the newly generated answer is decoded.
    outputs = outputs[:, inputs['input_ids'].shape[1]:]
    print(tokenizer.decode(outputs[0]))

The script imports necessary libraries, loads the model, prepares the input (an image and a query), and generates a response. Each step is crucial to ensure the process flows smoothly.
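The same message structure works for document and chart questions; only the image and the query change. A small helper makes the single-turn format explicit (the name `build_query` is ours for illustration, not part of the model's API):

```python
def build_query(image, question: str) -> list[dict]:
    """Build the single-turn message list passed to apply_chat_template."""
    return [{"role": "user", "image": image, "content": question}]

# Example: for a chart question, pass a PIL.Image loaded from your chart file.
messages = build_query(None, "What trend does this chart show over time?")
```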

Troubleshooting Tips

If you encounter issues while using the 3ML model, here are a few steps you can take:

  • Check Your Environment: Ensure that you’re running on a compatible GPU environment within Google Colab, especially if you want to leverage CUDA.
  • Verify the Image Path: Make sure the image file (“a3.png”) you are trying to analyze is correctly uploaded to the Colab workspace.
  • Language Support: Remember that for optimal performance in document and image understanding, using English or Chinese is recommended, although you can chat in any supported language.
  • Model Limitations: Some tasks may not produce results if the model is pushed beyond its capabilities. Be patient and try simplifying your queries.

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

The MultiModal MultiLingual (3ML) model offers a robust solution for addressing complex questions involving images, documents, and charts. With its strong performance and multilingual capabilities, it stands out as a versatile tool in the realm of AI.

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
