How to Use SmolLM: Your Guide to Efficient Language Modeling

Aug 2, 2024 | Educational

Welcome, fellow language model enthusiasts! Today, we’ll dive into how you can get started with SmolLM, a series of state-of-the-art small language models designed to run efficiently while delivering impressive results. Whether you’re looking to run models on a CPU, a single GPU, or multiple GPUs, this guide has got you covered. Let’s get started!

Step 1: Install the Necessary Libraries

Begin by installing the transformers library via pip:

pip install transformers

This will allow you to access the SmolLM models and various utilities you’ll need in your project.
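To confirm the installation worked, a quick import check is enough (just a sanity check; any recent transformers release should be fine):

import transformers
# Confirm the library imports cleanly and report its version
print(transformers.__version__)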

Step 2: Running the Model

Using Full Precision

Imagine preparing to cook a gourmet meal. First, you gather all your high-quality ingredients and tools. Similarly, running the SmolLM model requires setting up your “ingredients” (dependencies and configurations) efficiently.

To run the SmolLM model using full precision, follow these steps:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Define the model checkpoint and device
checkpoint = "HuggingFaceTB/SmolLM-1.7B"
device = "cuda" # or "cpu" for CPU usage

# Initialize tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

# Encode input and generate output
inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").to(device)
outputs = model.generate(inputs)

# Print the generated output
print(tokenizer.decode(outputs[0]))
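By default, generate() produces only a short continuation. If you want a longer completion, you can pass standard generation arguments such as max_new_tokens; here is a minimal sketch (the budget of 100 tokens is just an illustrative value, not an official recommendation):

# Request a longer completion with an explicit token budget
outputs = model.generate(inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))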

Using torch.bfloat16

If you think of your computer’s memory as a kitchen pantry, torch.bfloat16 helps you make efficient use of that pantry space, allowing more ingredients (data) to fit without overwhelming its capacity.

First, install the accelerate library:

pip install accelerate

Then adjust your code as follows:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Define the model checkpoint
checkpoint = "HuggingFaceTB/SmolLM-1.7B"

# Initialize tokenizer and model with bfloat16 precision
# device_map="auto" lets accelerate place the weights, so no extra .to() call is needed
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto", torch_dtype=torch.bfloat16)

# Encode input and generate output on the model's device
inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").to(model.device)
outputs = model.generate(inputs)

# Print the generated output
print(tokenizer.decode(outputs[0]))

# Check memory usage
print(f"Memory footprint: {model.get_memory_footprint() / 1e6:.2f} MB")

Step 3: Quantized Versions through `bitsandbytes`

Imagine you need to pack for a trip and only have limited luggage space. Quantizing the model is like organizing your items meticulously so that everything fits into a compact space.

For using 8-bit precision with bitsandbytes, follow these steps:

pip install bitsandbytes accelerate

Then use the following code:

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# Configure quantization settings
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
checkpoint = "HuggingFaceTB/SmolLM-1.7B"

# Initialize tokenizer and model with quantization
# Quantized bitsandbytes models cannot be moved with .to(); let device_map place them on the GPU
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, quantization_config=quantization_config, device_map="auto")

# Encode input and generate output on the model's device
inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").to(model.device)
outputs = model.generate(inputs)

# Print the generated output
print(tokenizer.decode(outputs[0]))

# Check memory usage for the 8-bit model
print(f"Memory footprint (8-bit): {model.get_memory_footprint() / 1e6:.2f} MB")

# For 4-bit precision, adjust the quantization config and reload the model
# model = AutoModelForCausalLM.from_pretrained(checkpoint, quantization_config=BitsAndBytesConfig(load_in_4bit=True), device_map="auto")
# print(f"Memory footprint (4-bit): {model.get_memory_footprint() / 1e6:.2f} MB")

Troubleshooting

1. **Model Loading Issues:** Ensure you have the correct model checkpoint and all necessary packages installed. Double-check that your device setting (`cuda` or `cpu`) matches your hardware.

2. **Memory Errors:** If you encounter memory errors, consider using lower precision formats like `torch.bfloat16` or 8-bit quantization.

3. **Output Quality:** Sometimes, the generated output may not meet expectations. You can tweak the parameters or try different prompts to get improved results.
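For the output-quality point above, switching from greedy decoding to sampling is a common first thing to try. The values below are generic starting points, not tuned settings for SmolLM:

# Sample instead of greedy decoding for more varied completions
outputs = model.generate(
    inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))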

If you run into further issues, you can reach out to the fxis.ai data science team.

That’s it! You now have a robust understanding of how to get started with the SmolLM models, whether you’re running on a basic setup or making the most of advanced hardware configurations. Happy modeling!
