This is an unofficial implementation of TrOCR, based on the paper TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models and the Hugging Face transformers library. There is also a repository by the authors of the paper (link). The code in this repository is simply a lightweight wrapper to quickly get started with training and deploying this model for character recognition tasks.
Results:
After training on a dataset of 2,000 samples for 8 epochs, we achieved an accuracy of 96.5%. Neither the training nor the validation dataset was completely clean; otherwise, even higher accuracies would have been possible.
Architecture:
(TrOCR architecture. Taken from the original paper.) [TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models], Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei, Preprint 2021.
How to Setup and Use TrOCR
1. Setup
Clone the repository and make sure to have conda or miniconda installed. Then go into the directory of the cloned repository and run:
```bash
conda env create -n trocr --file environment.yml
conda activate trocr
```
This should install all necessary libraries.
Training without GPU:
It is highly recommended to use a CUDA-capable GPU, but everything also works on the CPU. For CPU-only use, create the environment from environment-cpu.yml instead. If the process terminates with the message "Killed", reduce the batch size so the data fits into working memory.
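The CPU-only setup mirrors the GPU setup above, just with the other environment file (the environment name `trocr` is our choice; any name works):

```shell
# Create the environment from the CPU-only dependency file
conda env create -n trocr --file environment-cpu.yml
conda activate trocr
```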
2. Using the Repository
There are three modes: inference, validation, and training. All three modes can either start with a local model in the right path (see src/constants/paths) or with the pretrained model from Hugging Face. Inference and Validation use the local model by default, while training starts with the Hugging Face model by default.
Inference (Prediction):
```bash
python -m src.predict image_files                  # predict image files using the trained local model
python -m src.predict data/img1.png data/img2.png  # list image files explicitly
python -m src.predict data/*                       # also works with shell expansion
python -m src.predict data/* --no-local-model      # uses the pretrained Hugging Face model
```
Validation:
```bash
python -m src.validate                   # uses the trained local model
python -m src.validate --no-local-model  # loads the pretrained model from Hugging Face
```
Training:
```bash
python -m src.train                # starts with the pretrained model from Hugging Face
python -m src.train --local-model  # starts with the pretrained local model
```
For validation and training, input images should be placed in the directories train and val, and the labels should be listed in gt_labels.csv. Each row of the CSV should contain an image name and its corresponding label, for example img1.png,a (quoted, if necessary).
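As a sanity check for the label format, the CSV described above can be parsed with Python's standard csv module. This is only an illustrative sketch, not the project's actual loader (that lives in src/dataset.py):

```python
import csv
import io

# Example gt_labels.csv contents: one "image name, label" pair per row.
# Labels containing commas or quotes must themselves be quoted, per standard CSV rules.
csv_text = 'img1.png,a\nimg2.png,"b,c"\n'

labels = {}
for row in csv.reader(io.StringIO(csv_text)):
    image_name, label = row
    labels[image_name] = label

print(labels)  # {'img1.png': 'a', 'img2.png': 'b,c'}
```

If a row fails to split into exactly two fields here, it would also break the training data loader.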
3. Integrating into Other Projects
If you want to use the predictions as part of a bigger project, you can use the interface provided by TrocrPredictor in main. Make sure to run all code as Python modules. See the following example:
```python
from PIL import Image

from trocr.src.main import TrocrPredictor

# load images
image_names = ["data/img1.png", "data/img2.png"]
images = [Image.open(img_name) for img_name in image_names]

# predict directly on Pillow images or on file names
model = TrocrPredictor()
predictions = model.predict_images(images)
predictions = model.predict_for_file_names(image_names)

# print results
for i, file_name in enumerate(image_names):
    print(f"Prediction for {file_name}: {predictions[i]}")
```
4. Adapting the Code
It should be easy to adapt the code for other input formats or use cases:
- Learning rate, batch size, train epoch count, logging, word length: src/configs/constants.py
- Input paths, model checkpoint path: src/configs/paths.py
- Different label format: src/dataset.py:load_filepaths_and_labels
The word-length constant is important: all labels are padded to the same length to enable batch training. Some experimentation may be needed here; for us, padding to 8 worked well. If you want to change specifics of the model, you can supply a TrOCRConfig object to the transformers interface. See here for more details.
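The padding idea can be sketched as follows. This is a hypothetical illustration, not the repository's code: MAX_WORD_LEN stands in for the word-length constant in src/configs/constants.py, and PAD_ID for the tokenizer's pad token id (both names are assumptions).

```python
# Hypothetical sketch of fixed-length label padding for batch training.
MAX_WORD_LEN = 8  # assumed word-length constant (we used 8)
PAD_ID = 0        # assumed pad token id

def pad_label(token_ids, max_len=MAX_WORD_LEN, pad_id=PAD_ID):
    """Right-pad (or truncate) a list of token ids to a fixed length."""
    return (token_ids + [pad_id] * max_len)[:max_len]

print(pad_label([5, 12, 7]))  # [5, 12, 7, 0, 0, 0, 0, 0]
```

Because every padded label has the same length, labels can be stacked into a single tensor per batch; a too-small constant silently truncates long labels, which is why some experimentation is needed.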
5. Troubleshooting
If the setup fails to work, please let me know in a GitHub issue! Sometimes sub-dependencies update and become incompatible with other dependencies, so the dependency list needs to be updated. Feel free to submit issues with questions about the implementation as well. For questions about the paper or the architecture, please get in touch with the authors.
If you encounter issues with memory, try the following:
- Reduce the batch size.
- Check to ensure the required libraries are correctly installed.
- Make sure your dataset paths are correct in the configurations.

