If you’re looking to run efficient AI models in your applications, the intfloat/multilingual-e5-large model can be exported to ONNX format with FP16 or INT8 quantization. This blog shows how to do that and serve the result with Vespa’s embedding support. Let’s jump into the step-by-step process and some troubleshooting insights!
Step 1: Model Conversion
Before diving into the implementation, it’s worth knowing that the conversion is done with the Optimum toolkit, and the model can be produced in two reduced-precision formats: FP16 and INT8. Here’s how to do it:
Converting to INT8 Quantized Model
To create an INT8 quantized version, first export the model to ONNX and then quantize it with the following commands:
export_hf_model_from_hf.py --hf_model intfloat/multilingual-e5-large --output_dir me5-large
optimum-cli onnxruntime quantize --onnx_model ./me5-large -o ./me5-large_quantized --avx512_vnni
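Before wiring the INT8 model into Vespa, you may want a quick sanity check that it loads and produces embeddings. The snippet below is a minimal sketch using onnxruntime and transformers; the quantized file name is an assumption, so check what optimum-cli actually wrote into the output directory:
import onnxruntime as ort
from transformers import AutoTokenizer

# The tokenizer comes straight from the original Hugging Face model
tokenizer = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-large")

# Assumed file name; verify the actual name in me5-large_quantized/
session = ort.InferenceSession("me5-large_quantized/model_quantized.onnx")

# E5 models expect a "query: " or "passage: " prefix on the input text
encoded = tokenizer("query: hello world", return_tensors="np")

# Only feed the inputs that the exported graph actually declares
input_names = {i.name for i in session.get_inputs()}
outputs = session.run(None, {k: v for k, v in encoded.items() if k in input_names})
print(outputs[0].shape)  # token embeddings; the last dimension should be 1024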
Converting to FP16 Model
For FP16 conversion, execute:
export_hf_model_from_hf.py --hf_model intfloat/multilingual-e5-large --output_dir me5-large
Then convert the exported weights to FP16 with a short Python script:
import onnx
from onnxruntime.transformers.float16 import convert_float_to_float16

# Load the FP32 ONNX export and rewrite its weights as FP16
onnx_model = onnx.load("me5-large/intfloat-multilingual-e5-large.onnx")
model_fp16 = convert_float_to_float16(onnx_model, disable_shape_infer=True)
onnx.save(model_fp16, "me5-large/intfloat-multilingual-e5-large_fp16.onnx")
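As a quick sanity check, you can compare file sizes; the FP16 file should come out at roughly half the size of the original FP32 export (the paths follow the snippet above):
import os

for path in ("me5-large/intfloat-multilingual-e5-large.onnx",
             "me5-large/intfloat-multilingual-e5-large_fp16.onnx"):
    print(path, round(os.path.getsize(path) / 1e6), "MB")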
Step 2: Update the Vespa services.xml
You’ll need to update your services.xml so the embedder points at either the FP16 or the INT8 model. Below is an example configuration for a hugging-face-embedder component:
<component id="e5" type="hugging-face-embedder">
  <transformer-model url="https://huggingface.co/hotchpotch/vespa-onnx-intfloat-multilingual-e5-large/resolve/main/intfloat-multilingual-e5-large_fp16.onnx"/>
  <tokenizer-model url="https://huggingface.co/hotchpotch/vespa-onnx-intfloat-multilingual-e5-large/resolve/main/tokenizer.json"/>
  <normalize>true</normalize>
  <pooling-strategy>mean</pooling-strategy>
</component>
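For context, here is a rough sketch of how a schema could use this embedder at feed and query time. The schema name, field names, and rank profile below are illustrative rather than part of the original setup; the 1024 dimensions match what multilingual-e5-large outputs:
schema doc {
    document doc {
        field text type string {
            indexing: summary | index
        }
    }
    field embedding type tensor<float>(x[1024]) {
        indexing: input text | embed e5 | attribute | index
        attribute {
            distance-metric: angular
        }
    }
    rank-profile semantic {
        inputs {
            query(q) tensor<float>(x[1024])
        }
        first-phase {
            expression: closeness(field, embedding)
        }
    }
}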
Step 3: Deploy Your Model
After configuring, you’re ready to deploy. Keep in mind that the FP16 model file is noticeably larger than the INT8 one, which can lead to longer deployment times. Use the command below:
vespa deploy --wait 1800 .
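Once deployment completes, here is a sketch of what a query could look like with the Vespa CLI, assuming the illustrative schema and rank profile from the previous step (the query text and targetHits value are placeholders):
vespa query 'yql=select * from doc where {targetHits: 10}nearestNeighbor(embedding, q)' \
  'input.query(q)=embed(e5, "query: how do I convert a model to ONNX?")' \
  'ranking=semantic'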
Understanding the Process with an Analogy
Think of transforming your model like preparing a dish in a kitchen:
- Ingredient Selection: Sourcing your model (the ingredients).
- Cooking Techniques: Choosing between FP16 and INT8 quantization (different cooking styles).
- Recipe Instruction: Updating the services.xml is like following a recipe to bring everything together.
- Final Presentation: Deploying your model is akin to serving the dish on a plate, ready for consumption!
Troubleshooting Tips
If you encounter issues during the conversion or deployment process, here are some troubleshooting ideas:
- Ensure you are using Vespa version 8.325.46 or above for optimal FP16 functionality.
- Double-check the paths and URLs in your services.xml for any typos.
- Monitor your deployment time and make sure you’re not exceeding the wait time specified.
- If the FP16 model is too large, consider switching to INT8 for faster performance.
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Conclusion
Transforming the intfloat-multilingual-e5-large model into ONNX format for Vespa is a seamless process when following these steps. Each stage is important, from the conversion to deployment. Remember, leveraging the capabilities of various formats can boost your model’s efficiency significantly.
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
