How to Use SmolLM: Your Guide to Efficient Language Modeling

Aug 2, 2024 | Educational

Welcome, fellow language model enthusiasts! Today, we’ll dive into how you can get started with SmolLM, a series of state-of-the-art small language models designed to run efficiently while delivering impressive results. Whether you’re looking to run models on a CPU, a single GPU, or multiple GPUs, this guide has got you covered. Let’s get started!

Step 1: Install the Necessary Libraries

Begin by installing the transformers library via pip:

pip install transformers

This will allow you to access the SmolLM models and various utilities you’ll need in your project.
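To confirm the installation worked, a quick import check is enough (just a sanity check; any recent transformers release should be fine):

import transformers
# Confirm the library imports cleanly and report its version
print(transformers.__version__)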

Step 2: Running the Model

Using Full Precision

Imagine preparing to cook a gourmet meal. First, you gather all your high-quality ingredients and tools. Similarly, running the SmolLM model requires setting up your “ingredients” (dependencies and configurations) efficiently.

To run the SmolLM model using full precision, follow these steps:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Define the model checkpoint and device
checkpoint = "HuggingFaceTB/SmolLM-1.7B"
device = "cuda" # or "cpu" for CPU usage

# Initialize tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

# Encode input and generate output
inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").to(device)
outputs = model.generate(inputs)

# Print the generated output
print(tokenizer.decode(outputs[0]))
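By default, generate() produces only a short continuation. If you want a longer completion, you can pass standard generation arguments such as max_new_tokens; here is a minimal sketch (the budget of 100 tokens is just an illustrative value, not an official recommendation):

# Request a longer completion with an explicit token budget
outputs = model.generate(inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))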

Using torch.bfloat16

If you think of your computer’s memory as a kitchen pantry, torch.bfloat16 helps you make efficient use of that pantry space, allowing more ingredients (data) to fit without overwhelming its capacity.

First, install the accelerate library:

pip install accelerate

Then adjust your code as follows:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Define the model checkpoint
checkpoint = "HuggingFaceTB/SmolLM-1.7B"

# Initialize tokenizer and model with bfloat16 precision
# device_map="auto" lets accelerate place the weights, so no extra .to() call is needed
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto", torch_dtype=torch.bfloat16)

# Encode input and generate output on the model's device
inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").to(model.device)
outputs = model.generate(inputs)

# Print the generated output
print(tokenizer.decode(outputs[0]))

# Check memory usage
print(f"Memory footprint: {model.get_memory_footprint() / 1e6:.2f} MB")

Step 3: Quantized Versions through `bitsandbytes`

Imagine you need to pack for a trip and only have limited luggage space. Quantizing the model is like organizing your items meticulously so that everything fits into a compact space.

For using 8-bit precision with bitsandbytes, follow these steps:

pip install bitsandbytes accelerate

Then use the following code:

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# Configure quantization settings
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
checkpoint = "HuggingFaceTB/SmolLM-1.7B"

# Initialize tokenizer and model with quantization
# Quantized bitsandbytes models cannot be moved with .to(); let device_map place them on the GPU
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, quantization_config=quantization_config, device_map="auto")

# Encode input and generate output on the model's device
inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").to(model.device)
outputs = model.generate(inputs)

# Print the generated output
print(tokenizer.decode(outputs[0]))

# Check memory usage for the 8-bit model
print(f"Memory footprint (8-bit): {model.get_memory_footprint() / 1e6:.2f} MB")

# For 4-bit precision, adjust the quantization config and reload the model
# model = AutoModelForCausalLM.from_pretrained(checkpoint, quantization_config=BitsAndBytesConfig(load_in_4bit=True), device_map="auto")
# print(f"Memory footprint (4-bit): {model.get_memory_footprint() / 1e6:.2f} MB")

Troubleshooting

1. **Model Loading Issues:** Ensure you have the correct model checkpoint and all necessary packages installed. Double-check that your device setting (`cuda` or `cpu`) matches your hardware.

2. **Memory Errors:** If you encounter memory errors, consider using lower precision formats like `torch.bfloat16` or 8-bit quantization.

3. **Output Quality:** Sometimes, the generated output may not meet expectations. You can tweak the parameters or try different prompts to get improved results.
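For the output-quality point above, switching from greedy decoding to sampling is a common first thing to try. The values below are generic starting points, not tuned settings for SmolLM:

# Sample instead of greedy decoding for more varied completions
outputs = model.generate(
    inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))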

If you run into further issues, you can reach out to the fxis.ai data science team.

That’s it! You now have a robust understanding of how to get started with the SmolLM models, whether you’re running on a basic setup or making the most of advanced hardware configurations. Happy modeling!
