Welcome to the world of Gemma—a cutting-edge, lightweight text generation model created by Google! Whether you’re developing chatbots, generating content, or exploring new research avenues, Gemma can be your go-to tool. In this guide, we’ll walk you through the steps to get started with Gemma on your local machine or your cloud setup. Let’s dive in!
Getting Started
Before you can use Gemma, you need to install a few libraries: transformers, accelerate, and bitsandbytes. Run the following command:
pip install -U transformers accelerate bitsandbytes
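Optionally, you can run a quick sanity check to confirm the install worked and that a GPU is visible to PyTorch (a convenience check, not a required step):
import torch
import transformers
print(transformers.__version__)      # prints the installed transformers version
print(torch.cuda.is_available())     # True if a CUDA-capable GPU is visible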
Running Gemma on a Single or Multiple GPUs
Imagine cooking a complex dish. You want all your ingredients to be accessible and your kitchen well-organized to create magic. Similarly, setting up Gemma requires a few steps to ensure everything is “cooked” perfectly:
- Ingredients (Packages):
  - transformers: for loading and running the Gemma model.
  - accelerate: for device placement, so the model can run on one or more GPUs.
  - bitsandbytes: for 8-bit and 4-bit quantization.
- Recipe (Code):
Let’s break down the code using the same analogy:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
# Think of tokenizer as your knife which chops the ingredients (text) into digestible pieces
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b-it")
# The model is your cooking apparatus that takes these chopped ingredients (tokens) and cooks (processes) them into a dish (output)
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b-it",
    device_map="auto",
    torch_dtype=torch.bfloat16
)
input_text = "Write me a poem about Machine Learning."
# Your chopped (tokenized) input is now ready to go into the cooker (model)
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
outputs = model.generate(**input_ids)
# Your final dish is ready
print(tokenizer.decode(outputs[0]))
This script will load the model, tokenize the input, generate the text, and finally decode the generated tokens into a readable format.
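Since google/gemma-2-9b-it is the instruction-tuned variant, you will often get cleaner results by wrapping your prompt in Gemma’s chat template and giving generate an explicit token budget. The sketch below assumes the tokenizer and model loaded above; the max_new_tokens value of 256 is just an illustrative choice:
# Wrap the prompt in Gemma's chat format before generating
messages = [
    {"role": "user", "content": "Write me a poem about Machine Learning."}
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

# max_new_tokens gives the model room to finish the poem (256 is an arbitrary budget)
outputs = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))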
Running the Model on Different Precisions
Gemma can run on different types of “fuel” to optimize performance: bfloat16 (lighter and faster) or float32 (the default, full precision).
- Using bfloat16 precision: already shown in the example above (pass torch_dtype=torch.bfloat16).
- Using float32 precision: simply omit the torch_dtype argument and the model loads in float32.
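For reference, here is a minimal sketch of both loading styles side by side, using the same google/gemma-2-9b-it checkpoint (in practice you would load only one of these):
import torch
from transformers import AutoModelForCausalLM

# bfloat16: roughly half the memory of float32; well supported on recent GPUs
model_bf16 = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b-it",
    device_map="auto",
    torch_dtype=torch.bfloat16
)

# float32: omit torch_dtype to fall back to full precision
model_fp32 = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b-it",
    device_map="auto"
)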
Quantized Versions for Resource Optimization
When you need to conserve power and run the model efficiently on lower-end hardware, you can use quantization.
1. 8-bit Precision (int8):
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b-it")
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b-it",
    quantization_config=quantization_config,
    device_map="auto"
)
input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
outputs = model.generate(**input_ids)
print(tokenizer.decode(outputs[0]))
2. 4-bit Precision (int4):
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(load_in_4bit=True)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b-it")
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b-it",
    quantization_config=quantization_config,
    device_map="auto"
)
input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
outputs = model.generate(**input_ids)
print(tokenizer.decode(outputs[0]))
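If you want finer control over the 4-bit setup, BitsAndBytesConfig also accepts an NF4 quantization type and a separate compute dtype. The configuration below is one reasonable combination, not the only valid one:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# NF4 ("normal float 4") usually preserves quality better than plain int4,
# and bfloat16 compute keeps the matrix multiplications fast on modern GPUs
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b-it")
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b-it",
    quantization_config=quantization_config,
    device_map="auto"
)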
For troubleshooting questions or issues, contact our fxis.ai data science expert team.
Conclusion
Gemma offers a versatile and powerful tool for various text generation applications. By following the steps outlined above, you can run Gemma efficiently on your setup and create amazing text-based outputs.
Happy Coding!!

