How to Use the Donut Model for Receipt Text Extraction

Jun 14, 2024 | Educational


In the world of artificial intelligence, the Donut model (Document Understanding Transformer) stands out as a powerful tool for extracting text from receipts. It reads a receipt image and turns what it sees into structured, readable text, which makes it useful for businesses and developers alike. This guide walks you through using the model efficiently.

Understanding the Donut Model

The Donut model used here, trained on the AdamCodd/donut-receipts dataset, pairs a vision encoder (Swin Transformer) with a text decoder (BART). Think of it as a translator that reads an image of a receipt and writes out its content as text: the image encoder is like an artist who captures every detail on the canvas (the image), while the text decoder is the translator who narrates that scene in written form.

Its reported evaluation metrics are:

Mean Accuracy: 0.895219
Character Error Rate (CER): 0.158358
Word Error Rate (WER): 1.673989
Loss: 0.326069
Edit Distance: 0.145293
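
To make the error-rate metrics above more concrete, here is a minimal sketch of how CER and WER are commonly defined: the Levenshtein edit distance divided by the length of the reference, counted in characters or in words. How the figures above were actually computed is not stated, so treat this as an illustration of the usual definitions; the sample strings are invented.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate; assumes a non-empty reference."""
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate; assumes the reference has at least one word."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# Invented example: the prediction splits one number into many tokens.
reference = "TOTAL 12.50"
hypothesis = "TOTAL 1 2 . 5 0"
print(f"CER: {cer(reference, hypothesis):.3f}")
print(f"WER: {wer(reference, hypothesis):.3f}")
```

In this example the WER is 2.5 while the CER is about 0.36: inserted words count as errors, so a WER above 1 is possible even when most characters are correct.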

Getting Started

To use the Donut model, make sure PyTorch, Transformers, and Pillow are installed, then run the following code:


import torch
import re
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

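# Use the first GPU when available; otherwise fall back to the CPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# Load the processor (image preprocessing + tokenizer) and the model weights.
processor = DonutProcessor.from_pretrained("AdamCodd/donut-receipts-extract")
model = VisionEncoderDecoderModel.from_pretrained("AdamCodd/donut-receipts-extract")
model.to(device)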

def load_and_preprocess_image(image_path: str, processor):
    image = Image.open(image_path).convert("RGB")
    pixel_values = processor(image, return_tensors="pt").pixel_values
    return pixel_values

def generate_text_from_image(model, image_path: str, processor, device):
    pixel_values = load_and_preprocess_image(image_path, processor)
    pixel_values = pixel_values.to(device)
    
    model.eval()
    with torch.no_grad():
        # Donut task prompts are special tokens wrapped in angle brackets.
        task_prompt = "<s_receipt>"
        decoder_input_ids = processor.tokenizer(task_prompt, add_special_tokens=False, return_tensors="pt").input_ids
        decoder_input_ids = decoder_input_ids.to(device)
        
        generated_outputs = model.generate(
            pixel_values,
            decoder_input_ids=decoder_input_ids,
            max_length=model.decoder.config.max_position_embeddings,
            pad_token_id=processor.tokenizer.pad_token_id,
            eos_token_id=processor.tokenizer.eos_token_id,
            early_stopping=True,
            bad_words_ids=[[processor.tokenizer.unk_token_id]],
            return_dict_in_generate=True
        )
    
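    decoded_text = processor.batch_decode(generated_outputs.sequences)[0]
    # Strip end-of-sequence and padding tokens from the decoded string.
    decoded_text = decoded_text.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
    # Remove the first tag, which is the task start token.
    decoded_text = re.sub(r"<.*?>", "", decoded_text, count=1).strip()
    # Convert the remaining tagged sequence into a JSON-like structure.
    decoded_text = processor.token2json(decoded_text)
    return decoded_text

# Example usage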
image_path = "path_to_your_image"  # Replace with your image path
extracted_text = generate_text_from_image(model, image_path, processor, device)
print("Extracted Text:", extracted_text)
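
Because the function returns a parsed structure rather than a plain string, you will often want to store or tabulate it. The following is a minimal sketch of two helpers for that; `save_result`, `flatten`, the file name, and the sample field names are all hypothetical and not part of the model or the Transformers library. The real field names depend on what the model emits.

```python
import json
from pathlib import Path


def save_result(result, output_path: str) -> None:
    """Write the parsed receipt structure to a UTF-8 JSON file."""
    Path(output_path).write_text(
        json.dumps(result, ensure_ascii=False, indent=2), encoding="utf-8"
    )


def flatten(result, prefix=""):
    """Flatten nested dicts/lists into {"a.b.0": value} pairs for tabular export."""
    if isinstance(result, dict):
        items = result.items()
    elif isinstance(result, list):
        items = enumerate(result)
    else:
        return {prefix: result}  # leaf value
    flat = {}
    for key, value in items:
        path = f"{prefix}.{key}" if prefix else str(key)
        flat.update(flatten(value, path))
    return flat


# Hypothetical output shape, used only to demonstrate the helpers.
sample = {"store": "Cafe", "items": [{"name": "Tea", "price": "2.00"}]}
print(flatten(sample))
```

With the snippet above, a call such as `save_result(extracted_text, "receipt.json")` would persist the result, and `flatten(extracted_text)` would give one dotted key per value, which is convenient for writing a CSV row.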

Using the Code

In the code snippet above:

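The model is loaded and moved to the GPU if one is available; otherwise it runs on the CPU.
Images are converted to RGB and preprocessed into the pixel tensor the model expects.
The text is generated from the preprocessed image, starting from the receipt task prompt, and then parsed into a structured result with token2json.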
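
The last step in the snippet, `processor.token2json`, turns the decoder's tagged output into a nested structure. To give a feel for what that means, here is a deliberately simplified sketch of the idea. It is not the library's implementation: the function name `tags_to_dict` and the sample tags are hypothetical, and it ignores cases the real method handles, such as repeated items and identically named nested tags.

```python
import re

# Matches <s_key>body</s_key>, where the closing tag repeats the same key.
TAG = re.compile(r"<s_(?P<key>[^>]+)>(?P<body>.*?)</s_(?P=key)>", re.DOTALL)


def tags_to_dict(sequence: str):
    """Convert <s_key>value</s_key> markup into nested dicts (simplified)."""
    matches = list(TAG.finditer(sequence))
    if not matches:
        return sequence.strip()  # no tags left: this is a plain text value
    parsed = {}
    for match in matches:
        # Recurse so that tags nested inside the body become nested dicts.
        parsed[match.group("key")] = tags_to_dict(match.group("body"))
    return parsed


# Hypothetical tagged sequence, shaped like Donut decoder output.
tagged = "<s_total>12.50</s_total><s_store><s_name>Cafe</s_name></s_store>"
print(tags_to_dict(tagged))
```

In practice, keep using `processor.token2json` as in the snippet above; the sketch only illustrates why the result comes back as a dictionary instead of a flat string.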

Troubleshooting

If you encounter issues during implementation, consider the following troubleshooting tips:

Ensure that the image path is correct and that the image is in a supported format (JPEG, PNG).
Double-check that the necessary packages, like PyTorch and Transformers, are installed and up to date.
If you run into GPU memory or allocation errors, try running the code on the CPU by setting device = torch.device("cpu").
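
The first two checks above can be automated before the model is loaded. The sketch below uses only the Python standard library; the helper names and the list of extensions are assumptions for illustration, and the check inspects only the file name, not the image contents.

```python
from importlib import metadata
from pathlib import Path

# Assumed set of extensions, matching the formats mentioned above.
SUPPORTED_SUFFIXES = {".jpg", ".jpeg", ".png"}


def check_image_path(image_path: str) -> list:
    """Return a list of problems found with the image path (empty if none)."""
    problems = []
    path = Path(image_path)
    if not path.is_file():
        problems.append(f"File not found: {path}")
    if path.suffix.lower() not in SUPPORTED_SUFFIXES:
        problems.append(f"Unexpected extension: {path.suffix or '(none)'}")
    return problems


def installed_versions(packages=("torch", "transformers", "Pillow")) -> dict:
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


print(check_image_path("path_to_your_image"))
print(installed_versions())
```

An empty list from `check_image_path` and no `None` values from `installed_versions` mean the basic prerequisites are in place; anything else points to the tip that applies.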

Conclusion

With the Donut model, text extraction from receipts becomes a breeze, opening doors for automation and efficient processing in various applications. However, the model is tailored for receipts, and its performance on other document types may not be optimal. As you explore its capabilities, remember that continuous updates and improvements are underway.

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
