If you’re looking to generate stunning visual content from Chinese text prompts, this guide will walk you through using the CLIP-Huge-ZH model to guide Stable Diffusion 2 generation. We’ll take a closer look at the training details, usage, and troubleshooting tips to assist you along the way.
Understanding the Model
Imagine you have a bridge that connects two islands: one represents English text and the other represents Chinese text. The CLIP-Huge model serves as a bridge enabling text-based communication between these two islands. This model is specially trained to align Chinese text embeddings with the English text embeddings of the CLIP-ViT-H model. By freezing the vision part of the model, we ensure that the bridge remains sturdy while we focus on improving the pathways for the Chinese text, allowing for smoother generation and understanding.
Training Details
The training process includes:
- Replacing the original English vocabulary with a Chinese vocabulary to ensure that the model understands the nuances of the Chinese language.
- Copying the original weights from the English text encoder.
- Freezing the image encoder parameters while allowing the text embeddings to be trained to align accurately with the English space.
- Finally, unfreezing the entire text encoder after the initial training steps for better performance.
This thorough method was adopted to help the model gradually build a substantial understanding of the Chinese language in the context of text-image generation.
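The staged freezing described above can be sketched in PyTorch. This is a minimal illustration using a toy stand-in module, not the model’s actual training code; the module and layer names here are hypothetical.

```python
import torch.nn as nn

# Toy stand-in for a CLIP-like model (hypothetical structure, for illustration only).
class ToyCLIP(nn.Module):
    def __init__(self):
        super().__init__()
        self.vision_model = nn.Linear(8, 4)       # stands in for the image encoder
        self.text_embeddings = nn.Embedding(100, 4)  # new Chinese vocabulary embeddings
        self.text_encoder = nn.Linear(4, 4)       # stands in for the text transformer

model = ToyCLIP()

# Stage 1: freeze the image encoder and text encoder; train only the new embeddings.
for p in model.vision_model.parameters():
    p.requires_grad = False
for p in model.text_encoder.parameters():
    p.requires_grad = False
for p in model.text_embeddings.parameters():
    p.requires_grad = True

# Stage 2 (after the initial training steps): unfreeze the entire text encoder,
# keeping the image encoder frozen throughout.
for p in model.text_encoder.parameters():
    p.requires_grad = True
```

Only parameters with `requires_grad=True` receive gradient updates, so toggling these flags between stages is all the optimizer needs to respect the schedule.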
Usage of the Model
Now, let’s explore how to utilize this model in real-world applications such as zero-shot classification or guiding Stable Diffusion 2.
Zero-Shot Classification
Here’s how you can classify images with the trained model:
import torch
import numpy as np
import requests
from PIL import Image
from transformers import CLIPModel, CLIPFeatureExtractor, AutoTokenizer
model_id = "lyua1225/clip-huge-zh-75k-steps-bs4096"
model = CLIPModel.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
processor = CLIPFeatureExtractor.from_pretrained(model_id)
# Online example from OFA-Sys
url = "https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/pokemon.jpeg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
texts = ["杰尼龟", "妙蛙种子", "皮卡丘", "小火龙"]
# Compute image features
inputs = torch.from_numpy(processor(image).pixel_values[0]).unsqueeze(0)
image_features = model.get_image_features(pixel_values=inputs)
image_features = image_features / image_features.norm(p=2, dim=-1, keepdim=True)
# Compute text features
inputs = tokenizer(text=texts, padding="max_length", max_length=77, return_tensors="pt")
input_ids, attention_mask = inputs.input_ids, inputs.attention_mask
input_dict = dict(input_ids=input_ids, attention_mask=attention_mask)
text_features = model.get_text_features(**input_dict)
text_features = text_features / text_features.norm(p=2, dim=-1, keepdim=True)
# Compute probabilities for each class
logit_scale = model.logit_scale.exp()
logits_per_image = logit_scale * image_features @ text_features.t()
probs = logits_per_image.softmax(dim=-1).detach().numpy()
print(np.around(probs, 3))
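Once `probs` is computed, reading off the predicted class is a simple argmax over the candidate texts. The probability values below are illustrative stand-ins, not actual model output:

```python
import numpy as np

texts = ["杰尼龟", "妙蛙种子", "皮卡丘", "小火龙"]
# Illustrative values; in practice, use the `probs` array computed above.
probs = np.array([[0.02, 0.01, 0.95, 0.02]])

# The highest-probability text is the model's predicted label for the image.
best = int(probs.argmax(axis=-1)[0])
print(f"Predicted: {texts[best]} (p={probs[0, best]:.3f})")
```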
Guiding Stable Diffusion 2
To guide the Stable Diffusion 2 generation using this model, you can follow the setup below:
import torch
from diffusers import StableDiffusionPipeline
from transformers import AutoTokenizer, CLIPTextModel
clip_id = "lyua1225/clip-huge-zh-75k-steps-bs4096"
sd2_id = "stabilityai/stable-diffusion-2-1"
text_encoder = CLIPTextModel.from_pretrained(clip_id).half()
tokenizer = AutoTokenizer.from_pretrained(clip_id, trust_remote_code=True)
pipe = StableDiffusionPipeline.from_pretrained(sd2_id, torch_dtype=torch.float16, revision="fp16", tokenizer=tokenizer, text_encoder=text_encoder)
pipe.to("cuda")
image = pipe("赛博朋克风格的城市街道", num_inference_steps=20).images[0]
image.save("cyberpunk.jpeg")
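Diffusion sampling is stochastic, so repeated runs of the pipeline produce different images. For reproducible results you can pass a seeded `torch.Generator` via the pipeline’s `generator` argument; the seed value below is arbitrary:

```python
import torch

# A fixed seed makes the pipeline's initial latent noise deterministic:
generator = torch.Generator(device="cpu").manual_seed(42)
# image = pipe("赛博朋克风格的城市街道", num_inference_steps=20, generator=generator).images[0]

# The same seed always reproduces the same noise tensor:
g1 = torch.Generator().manual_seed(42)
g2 = torch.Generator().manual_seed(42)
noise1 = torch.randn(4, generator=g1)
noise2 = torch.randn(4, generator=g2)
print(torch.equal(noise1, noise2))
```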
Troubleshooting
While using this model, you might encounter several common issues:
- Performance Issues: The model was trained on a relatively small dataset, so results may be suboptimal. Further fine-tuning on additional data is recommended for improved results.
- Installation Errors: Ensure that all necessary libraries (like transformers and diffusers) are correctly installed and updated.
- CUDA Errors: Make sure your GPU is compatible and properly configured to handle the computations efficiently.
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Conclusion
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
By following this guide, you’re now equipped to harness the power of the CLIP-Huge-ZH model for extraordinary image generation from Chinese text inputs. Happy coding!

