In this guide, we walk you through using the GOT-OCR2.0 model for Optical Character Recognition (OCR) tasks with the Hugging Face Transformers library. This powerful model integrates seamlessly with the Transformers workflow and helps you extract and manage text from images effectively.
Getting Started
Before we delve into the implementation, here are the requirements you need to satisfy:
- Python 3.10
- torch==2.0.1
- torchvision==0.15.2
- transformers==4.37.2
- tiktoken==0.6.0
- verovio==4.3.1
- accelerate==0.28.0
Ensure you have these packages installed in your environment.
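Assuming a fresh Python 3.10 environment, the pinned versions above can be installed in one step with pip:

```shell
pip install torch==2.0.1 torchvision==0.15.2 transformers==4.37.2 \
    tiktoken==0.6.0 verovio==4.3.1 accelerate==0.28.0
```
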
Implementation Guide
Follow these steps to run GOT-OCR2.0:
from transformers import AutoModel, AutoTokenizer
# Initialize the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("ucaslcl/GOT-OCR2_0", trust_remote_code=True)
model = AutoModel.from_pretrained("ucaslcl/GOT-OCR2_0", trust_remote_code=True, low_cpu_mem_usage=True, device_map="cuda", use_safetensors=True, pad_token_id=tokenizer.eos_token_id)
# Set model to evaluation mode
model = model.eval().cuda()
# Input your test image
image_file = "xxx.jpg"
# Perform OCR
res = model.chat(tokenizer, image_file, ocr_type="ocr")
print(res)
Here’s a simple analogy to understand the code above: Imagine the OCR model as a highly skilled librarian. The tokenizer is like a librarian’s cataloging system that organizes books so that the librarian can quickly find the information needed. The model itself is the skilled librarian, ready to interpret the text found in a book (the input image in this case).
Using Different OCR Types
You can customize the type of OCR you wish to execute based on your requirements:
- Plain Text OCR: res = model.chat(tokenizer, image_file, ocr_type="ocr") for standard text extraction.
- Formatted Text OCR: res = model.chat(tokenizer, image_file, ocr_type="format") to preserve formatting in the output.
- Fine-grained OCR: res = model.chat(tokenizer, image_file, ocr_type="ocr", ocr_box=[]) for detailed extraction restricted to bounding boxes (pass the coordinates in ocr_box).
- Multi-crop OCR: res = model.chat_crop(tokenizer, image_file, ocr_type="ocr") for images containing multiple regions of text.
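To keep these variants straight in application code, the four calls can be wrapped in a small dispatch helper. This is a hypothetical convenience function of our own, not part of the GOT-OCR2_0 API; the name run_ocr and the mode strings are assumptions, while the underlying model.chat and model.chat_crop calls follow the guide above.

```python
def run_ocr(model, tokenizer, image_file, mode="plain", ocr_box=None):
    """Dispatch to a GOT-OCR2_0 chat method by mode.

    Hypothetical helper: the mode names are ours; the underlying
    model.chat / model.chat_crop calls mirror the commands above.
    """
    if mode == "plain":
        return model.chat(tokenizer, image_file, ocr_type="ocr")
    if mode == "format":
        return model.chat(tokenizer, image_file, ocr_type="format")
    if mode == "fine-grained":
        # Default to an empty box list, as in the example above
        return model.chat(tokenizer, image_file, ocr_type="ocr",
                          ocr_box=ocr_box if ocr_box is not None else [])
    if mode == "multi-crop":
        return model.chat_crop(tokenizer, image_file, ocr_type="ocr")
    raise ValueError(f"unknown OCR mode: {mode!r}")
```

For example, run_ocr(model, tokenizer, "page.jpg", mode="format") would return formatted output, while mode="multi-crop" routes to chat_crop.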
Troubleshooting
Should you encounter issues while using the GOT-OCR2.0 model, consider the following troubleshooting tips:
- Ensure all package dependencies are correctly installed and match the specified versions.
- Check your CUDA setup if you experience performance issues or model loading errors.
- Verify that the image path is accurately specified. A common error occurs if the image file cannot be located.
If you need further assistance, or want more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Conclusion
Using the OCR-2.0 model opens up new possibilities for text extraction from images. With this simple guide, you are now equipped to harness its power effectively.
Final Note
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.