Running Large PyTorch Models on Multiple GPUs with Tensor Parallel

Oct 25, 2023 | Data Science

Tired of being limited by the capacity of a single GPU while working with large PyTorch models? Fear not! With tensor parallelism, you can run large models smoothly across multiple GPUs with just one line of code. This article is your user-friendly guide to getting started with tensor parallel, troubleshooting any issues you might face, and much more!

Why Choose Tensor Parallel?

The magic of tensor parallelism lies in splitting your model's weights across multiple GPUs so that they compute in parallel, which can yield close-to-linear speedups in training and inference. Think of it as a pizza cut into slices, where each slice represents a portion of your model's weights, and each GPU acts as a hungry friend enjoying a slice. Instead of trying to fit the whole pizza into one person's hands, everyone enjoys their share at the same time!

Installation of Tensor Parallel

Installing tensor parallel is as easy as pizza delivery! Follow these steps:

  • Latest stable version (recommended): pip install tensor_parallel
  • For the bleeding-edge version, use: pip install https://github.com/BlackSamorez/tensor_parallel/archive/main.zip

Usage

To utilize tensor parallelism, simply wrap your PyTorch model with tp.tensor_parallel(). Here’s a brief use case to guide you:

import transformers
import tensor_parallel as tp

tokenizer = transformers.AutoTokenizer.from_pretrained('facebook/opt-13b')
model = transformers.AutoModelForCausalLM.from_pretrained('facebook/opt-13b')  # use opt-125m for testing
model = tp.tensor_parallel(model, ['cuda:0', 'cuda:1'])  # each GPU has half the weights

inputs = tokenizer("A cat sat", return_tensors='pt')['input_ids'].to('cuda:0')
outputs = model.generate(inputs, num_beams=5)
print(tokenizer.decode(outputs[0]))  # A cat sat on my lap for a few minutes ...
model(input_ids=inputs, labels=inputs).loss.backward()  # training works as usual

Advanced Parameters for Tensor Parallel

For those who want to dive deeper, tensor parallel also supports advanced configuration through a few parameters (illustrated in the sketch after this list):

  • device_ids: Specify the devices to use; defaults to all available GPUs.
  • output_device: Designates where model outputs will be sent.
  • tensor_parallel_config: Allows the use of custom parallelism strategies.
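Here is a minimal sketch of passing these options, assuming each bullet above maps to a keyword argument of tp.tensor_parallel() with the same name:

import transformers
import tensor_parallel as tp

model = transformers.AutoModelForCausalLM.from_pretrained('facebook/opt-125m')

# Shard across two specific GPUs and collect outputs on the first one.
model = tp.tensor_parallel(
    model,
    device_ids=['cuda:0', 'cuda:1'],  # devices to split the weights across
    output_device='cuda:0',           # device on which model outputs are gathered
    tensor_parallel_config=None,      # None = automatic; pass a custom config for bespoke strategies
)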

Saving the Model

To save the model in its original, non-tensor_parallel form, wrap the saving code in a save_tensor_parallel context like this:

import torch
import transformers
import tensor_parallel as tp

model = tp.tensor_parallel(transformers.AutoModelForCausalLM.from_pretrained('facebook/opt-13b'))
# After training...
with tp.save_tensor_parallel(model):
    torch.save(model.state_dict(), 'model.pt')  # or model.save_pretrained('path/to/save')
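Because save_tensor_parallel reassembles the original parameter layout, the resulting checkpoint should load straight back into an unwrapped model. A quick sketch:

import torch
import transformers

# Load the de-parallelized checkpoint into an ordinary, single-device model.
model = transformers.AutoModelForCausalLM.from_pretrained('facebook/opt-13b')
model.load_state_dict(torch.load('model.pt'))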

Memory Efficient Dispatch

If you encounter memory challenges while working with large models, tensor parallel can dispatch a state dictionary shard by shard, so the full set of weights never has to sit in memory at once. This is akin to taking out only the slices of pizza you want when you’re hungry, rather than plating the whole pie at once! A condensed sketch of this workflow follows.
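The sketch below condenses the meta-device workflow from the project README; the helper names (TensorParallelPreTrainedModel, infer_sharded_device_map, convert_state_dict) come from there, and the checkpoint paths are placeholders:

import torch
import transformers
import tensor_parallel as tp
from accelerate import init_empty_weights, load_checkpoint_in_model

# 1. Build the model on the meta device: no weights are materialized yet.
with init_empty_weights():
    model = transformers.AutoModelForCausalLM.from_config(
        transformers.AutoConfig.from_pretrained('facebook/opt-13b')
    )
model = tp.TensorParallelPreTrainedModel(model)

# 2. The model is still weightless, but a per-parameter device map can be deduced from it.
device_map = tp.infer_sharded_device_map(model)

# 3. Convert an ordinary checkpoint shard into tensor_parallel format (placeholder path).
converted = tp.convert_state_dict(
    torch.load('pytorch_model.bin'),  # one shard of a regular checkpoint
    model.tensor_parallel_config,
    world_size=2,                     # number of GPUs the model is split over
    for_pretrained=True,
)
torch.save(converted, 'converted_shard.bin')
del converted  # free host memory before dispatching

# 4. Stream the converted weights directly onto the right GPUs.
load_checkpoint_in_model(model, checkpoint='converted_shard.bin', device_map=device_map)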

FAQ

  • Q: Can I use tensor parallel in Google Colab?
  • A: Colab has a single GPU, so it is not suitable for tensor parallelism. However, Kaggle offers two T4 GPUs for free to all phone-verified accounts.

Troubleshooting

If you encounter challenges such as NCCL errors or unexpected hanging while running tensor parallel, consider these suggestions:

  • Try setting the environment variable TENSOR_PARALLEL_USE_NATIVE=1 (then restart your environment), or run your model on a single device; see the sketch after this list.
  • If you believe you’ve found a bug, please report it to our issue tracker.
  • For minor installation or optimization issues unrelated to tensor_parallel, we recommend seeking assistance from other sources.
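A minimal sketch of the first suggestion, assuming the variable is read when tensor_parallel is imported:

import os

# Must be set before tensor_parallel is imported to take effect.
os.environ['TENSOR_PARALLEL_USE_NATIVE'] = '1'

import tensor_parallel as tp  # should now fall back to the slower but more compatible native code path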

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

Now that you’re equipped with the knowledge of how to implement tensor parallelism, go forth and unleash the full potential of your PyTorch models across multiple GPUs!
