If you’re looking to generate stunning visual content from Chinese text prompts, this guide will walk you through using the CLIP-Huge-ZH model to guide Stable Diffusion 2 generation. We’ll take a closer look at the training details, usage, and troubleshooting tips to assist you along the way.
Understanding the Model
Imagine you have a bridge that connects two islands: one represents English text and the other represents Chinese text. The CLIP-Huge model serves as a bridge enabling text-based communication between these two islands. This model is specially trained to align Chinese text embeddings with the English text embeddings of the CLIP-ViT-H model. By freezing the vision part of the model, we ensure that the bridge remains sturdy while we focus on improving the pathways for the Chinese text, allowing for smoother generation and understanding.
Training Details
The training process includes:
- Replacing the original English vocabulary with a Chinese vocabulary to ensure that the model understands the nuances of the Chinese language.
- Copying the original weights from the English text encoder.
- Freezing the image encoder parameters while allowing the text embeddings to be trained to align accurately with the English space.
- Finally, unfreezing the entire text encoder after the initial training steps for better performance.
This thorough method was adopted to help the model gradually build a substantial understanding of the Chinese language in the context of text-image generation.
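The staged freezing described above can be sketched in PyTorch. This is a minimal illustration using a toy stand-in module, not the model’s actual training code; the module and layer names here are hypothetical.

```python
import torch.nn as nn

# Toy stand-in for a CLIP-like model (hypothetical structure, for illustration only).
class ToyCLIP(nn.Module):
    def __init__(self):
        super().__init__()
        self.vision_model = nn.Linear(8, 4)       # stands in for the image encoder
        self.text_embeddings = nn.Embedding(100, 4)  # new Chinese vocabulary embeddings
        self.text_encoder = nn.Linear(4, 4)       # stands in for the text transformer

model = ToyCLIP()

# Stage 1: freeze the image encoder and text encoder; train only the new embeddings.
for p in model.vision_model.parameters():
    p.requires_grad = False
for p in model.text_encoder.parameters():
    p.requires_grad = False
for p in model.text_embeddings.parameters():
    p.requires_grad = True

# Stage 2 (after the initial training steps): unfreeze the entire text encoder,
# keeping the image encoder frozen throughout.
for p in model.text_encoder.parameters():
    p.requires_grad = True
```

Only parameters with `requires_grad=True` receive gradient updates, so toggling these flags between stages is all the optimizer needs to respect the schedule.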
Usage of the Model
Now, let’s explore how to utilize this model in real-world applications such as zero-shot classification or guiding Stable Diffusion 2.
Zero-Shot Classification
Here’s how you can classify images with the trained model:
import torch
import numpy as np
import requests
from PIL import Image
from transformers import CLIPModel, CLIPFeatureExtractor, AutoTokenizer
model_id = "lyua1225/clip-huge-zh-75k-steps-bs4096"
model = CLIPModel.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
processor = CLIPFeatureExtractor.from_pretrained(model_id)
# Online example from OFA-Sys
url = "https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/pokemon.jpeg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
texts = ["杰尼龟", "妙蛙种子", "皮卡丘", "小火龙"]
# Compute image features
inputs = torch.from_numpy(processor(image).pixel_values[0]).unsqueeze(0)
image_features = model.get_image_features(pixel_values=inputs)
image_features = image_features / image_features.norm(p=2, dim=-1, keepdim=True)
# Compute text features
inputs = tokenizer(text=texts, padding="max_length", max_length=77, return_tensors="pt")
input_ids, attention_mask = inputs.input_ids, inputs.attention_mask
input_dict = dict(input_ids=input_ids, attention_mask=attention_mask)
text_features = model.get_text_features(**input_dict)
text_features = text_features / text_features.norm(p=2, dim=-1, keepdim=True)
# Compute probabilities for each class
logit_scale = model.logit_scale.exp()
logits_per_image = logit_scale * image_features @ text_features.t()
probs = logits_per_image.softmax(dim=-1).detach().numpy()
print(np.around(probs, 3))
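Once `probs` is computed, reading off the predicted class is a simple argmax over the candidate texts. The probability values below are illustrative stand-ins, not actual model output:

```python
import numpy as np

texts = ["杰尼龟", "妙蛙种子", "皮卡丘", "小火龙"]
# Illustrative values; in practice, use the `probs` array computed above.
probs = np.array([[0.02, 0.01, 0.95, 0.02]])

# The highest-probability text is the model's predicted label for the image.
best = int(probs.argmax(axis=-1)[0])
print(f"Predicted: {texts[best]} (p={probs[0, best]:.3f})")
```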
Guiding Stable Diffusion 2
To guide the Stable Diffusion 2 generation using this model, you can follow the setup below:
import torch
from diffusers import StableDiffusionPipeline
from transformers import AutoTokenizer, CLIPTextModel
clip_id = "lyua1225/clip-huge-zh-75k-steps-bs4096"
sd2_id = "stabilityai/stable-diffusion-2-1"
text_encoder = CLIPTextModel.from_pretrained(clip_id).half()
tokenizer = AutoTokenizer.from_pretrained(clip_id, trust_remote_code=True)
pipe = StableDiffusionPipeline.from_pretrained(sd2_id, torch_dtype=torch.float16, revision="fp16", tokenizer=tokenizer, text_encoder=text_encoder)
pipe.to("cuda")
image = pipe("赛博朋克风格的城市街道", num_inference_steps=20).images[0]
image.save("cyberpunk.jpeg")
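Diffusion sampling is stochastic, so repeated runs of the pipeline produce different images. For reproducible results you can pass a seeded `torch.Generator` via the pipeline’s `generator` argument; the seed value below is arbitrary:

```python
import torch

# A fixed seed makes the pipeline's initial latent noise deterministic:
generator = torch.Generator(device="cpu").manual_seed(42)
# image = pipe("赛博朋克风格的城市街道", num_inference_steps=20, generator=generator).images[0]

# The same seed always reproduces the same noise tensor:
g1 = torch.Generator().manual_seed(42)
g2 = torch.Generator().manual_seed(42)
noise1 = torch.randn(4, generator=g1)
noise2 = torch.randn(4, generator=g2)
print(torch.equal(noise1, noise2))
```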
Troubleshooting
While using this model, you might encounter several common issues:
- Performance Issues: The model was trained on a relatively small dataset, so results may be suboptimal. Further fine-tuning on additional data is recommended for improved results.
- Installation Errors: Ensure that all necessary libraries (like transformers and diffusers) are correctly installed and updated.
- CUDA Errors: Make sure your GPU is compatible and properly configured to handle the computations efficiently.
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Conclusion
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
By following this guide, you’re now equipped to harness the power of the CLIP-Huge-ZH model for extraordinary image generation from Chinese text inputs. Happy coding!

