Welcome to the world of FROMAGe, where language and visuals converge! In this article, we’ll guide you through setting up, training, and evaluating the FROMAGe model, which grounds language models to images to handle multimodal inputs and outputs efficiently.
Setup Instructions
First, let’s ensure you have the right environment for this powerful model. Follow these steps to get started:
1. Environment Setup
- Create a new virtual environment and install the required libraries:
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
export PYTHONPATH=$PYTHONPATH:/home/path/to/fromage
2. Pretrained Checkpoints
The FROMAGe model weights are relatively small (around 11MB) and can be found in the fromage_model folder after cloning the repository. Additionally, we offer a stronger model trained with a more robust visual linear layer, which is useful in dialogue settings. You can find this model in fromage_model/fromage_vis4.
3. Precomputed Embeddings for Image Retrieval
Visual embeddings for Conceptual Captions images can be downloaded from this URL. Place the cc3m_embeddings.pkl file in your fromage_model directory for image retrieval tasks. If you need to precompute embeddings for a different set of images, edit fromage/extract_img_embs.py accordingly.
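Once the embeddings file is in place, retrieval boils down to ranking the stored image embeddings by similarity to a query embedding. The sketch below is purely illustrative: it assumes a hypothetical layout of paths and embedding vectors (the actual pickle structure may differ), but the cosine-similarity ranking itself is standard:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def retrieve_top_k(query_emb, image_embs, image_paths, k=3):
    """Return the k image paths whose embeddings are closest to the query."""
    scored = sorted(
        zip(image_paths, (cosine_similarity(query_emb, e) for e in image_embs)),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return [path for path, _ in scored[:k]]

# Toy data standing in for the precomputed embeddings (hypothetical layout):
paths = ["cat.png", "dog.png", "mountain.png"]
embs = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
print(retrieve_top_k([1.0, 0.1], embs, paths, k=2))  # ['cat.png', 'dog.png']
```

In practice the query embedding would come from the model's text encoder, and the stored embeddings from cc3m_embeddings.pkl.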
Running Inference
To see the FROMAGe model in action, check out the FROMAGe_example_notebook.ipynb for examples of calling the model for inference. This notebook showcases the results presented in the paper using greedy decoding. However, be aware that image outputs may vary slightly over time.
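Greedy decoding, mentioned above, simply picks the highest-scoring token at every step rather than sampling, which is why the text outputs are deterministic. A minimal toy illustration (a made-up vocabulary and score vectors, not the FROMAGe API):

```python
def greedy_decode(step_scores, vocab):
    """At each step, pick the argmax token -- deterministic, unlike sampling."""
    return [vocab[max(range(len(scores)), key=scores.__getitem__)]
            for scores in step_scores]

vocab = ["a", "cat", "dog", "<eos>"]
# One score vector per decoding step (toy values).
steps = [
    [0.1, 0.7, 0.2, 0.0],
    [0.0, 0.2, 0.1, 0.7],
]
print(greedy_decode(steps, vocab))  # ['cat', '<eos>']
```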
Training the FROMAGe Model
Getting your model ready for action? Let’s discuss how to train FROMAGe:
1. Preparing CC3M Dataset
Our model utilizes the Conceptual Captions dataset. After downloading the required images and captions, format them into a tab-separated .tsv file following this structure:
caption image
A picture of a cat cat.png
Mountains mountain.png
Make sure to save these .tsv files in the dataset folder.
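To generate these .tsv files programmatically, Python's csv module with a tab delimiter produces exactly this layout (the column names follow the example above; the output filename here is just a placeholder):

```python
import csv

def write_tsv(path, rows):
    """Write caption/image rows as a tab-separated file with a header line."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["caption", "image"], delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)

rows = [
    {"caption": "A picture of a cat", "image": "cat.png"},
    {"caption": "Mountains", "image": "mountain.png"},
]
write_tsv("cc3m_train.tsv", rows)  # save under your dataset folder
```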
2. Running Training Jobs
Once your data is ready, initiate the training job with the command below:
randport=$(shuf -i8000-9999 -n1) # Generate a random port number
python -u main.py \
--dist-url tcp://127.0.0.1:$randport \
--dist-backend nccl \
--multiprocessing-distributed --world-size 1 --rank 0 \
--dataset=cc3m --val-dataset=cc3m \
--opt-version=facebook/opt-6.7b \
--visual-model=openai/clip-vit-large-patch14 \
--exp_name=fromage_exp --image-dir=data \
--log-base-dir=runs \
--batch-size=180 --val-batch-size=100 \
--learning-rate=0.0003 --precision=bf16 --print-freq=100
Depending on your GPU memory, you might need to lower the batch size or disable certain flags (such as bf16 precision) to fit the model.
Troubleshooting
Should you encounter issues during setup or execution, consider these troubleshooting tips:
- If your model doesn’t train, check your dataset formatting.
- For memory issues, lower the batch size or enable gradient accumulation.
- If data errors occur, inspect the failing samples to pinpoint formatting problems.
- Don’t hesitate to seek advice or insights from the community!
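Gradient accumulation, mentioned above, trades steps for memory: gradients from several small micro-batches are averaged before each optimizer step, mimicking a larger batch. A minimal sketch of the bookkeeping, independent of any training framework (scalar "gradients" stand in for real tensors):

```python
def train_with_accumulation(micro_batch_grads, accum_steps):
    """Average gradients over accum_steps micro-batches before each optimizer
    step. Returns the effective gradient applied at each step."""
    applied, buffer = [], 0.0
    for i, grad in enumerate(micro_batch_grads, start=1):
        buffer += grad
        if i % accum_steps == 0:
            applied.append(buffer / accum_steps)  # averaged gradient
            buffer = 0.0                          # reset for the next group
    return applied

# 4 micro-batches, stepping every 2 -> 2 optimizer steps with averaged grads.
print(train_with_accumulation([1.0, 3.0, 2.0, 4.0], accum_steps=2))  # [2.0, 3.0]
```

With this scheme, halving the batch size while setting accum_steps=2 keeps the effective batch size unchanged at roughly half the peak memory.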
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Advanced Features
Beyond basics, the model allows pruning of weights to save space, unit tests to confirm local execution, and evaluation scripts for contextual image retrieval and text generation.
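Magnitude pruning is one common way to shrink model weights: the smallest-magnitude entries are zeroed out so the result compresses well. The toy sketch below illustrates the idea on a flat list of weights; it is not the repository's pruning script:

```python
def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with smallest magnitude."""
    n_prune = int(len(weights) * sparsity)
    # Indices of the n_prune smallest-magnitude weights.
    smallest = sorted(range(len(weights)), key=lambda i: abs(weights[i]))[:n_prune]
    pruned = list(weights)
    for i in smallest:
        pruned[i] = 0.0
    return pruned

print(magnitude_prune([0.5, -0.01, 0.3, 0.02], sparsity=0.5))  # [0.5, 0.0, 0.3, 0.0]
```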
Gradio Demo
Feel free to run your version of the Gradio demo locally by executing the command:
python demo/app.py
You can also explore the FROMAGe Hugging Face Spaces for more hands-on experience.
Conclusion
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

