In the realm of artificial intelligence, utilizing robust models is essential for delivering insightful and interactive experiences. This guide walks you through the process of employing Yanolja's EEVE-Korean-Instruct-10.8B model, especially when it is quantized in GGUF format and run with llama.cpp. Whether you’re working with a GPU or a CPU, we will cover everything you need to know!
Requirements
- An installation of Python and pip.
- Access to a compatible GPU or CPU.
- Familiarity with downloading and using Hugging Face models.
Installation Instructions
For GPU
To utilize this model on a GPU, follow these steps:
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir --verbose
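Before moving on, you can quickly confirm that the CUDA-enabled build installed correctly. The snippet below is a minimal sketch; it assumes your llama-cpp-python version exposes llama_supports_gpu_offload, which may not be present in older releases:
import llama_cpp
print(llama_cpp.__version__)                   # confirms the package imports
print(llama_cpp.llama_supports_gpu_offload())  # True if this build can offload layers to the GPU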
For CPU
Should you decide to proceed with a CPU setup, run the following command:
pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir --verbose
In both cases, also install huggingface_hub, which is used below to download the model weights:
pip install huggingface_hub
Loading the Model
Next, we will download and instantiate the model. Here’s how:
from huggingface_hub import hf_hub_download
from llama_cpp import Llama
import time
from pprint import pprint
model_name_or_path = "heegyu/EEVE-Korean-Instruct-10.8B-v1.0-GGUF"
model_basename = "ggml-model-Q4_K_M.gguf"
model_path = hf_hub_download(repo_id=model_name_or_path, filename=model_basename)
print(model_path)
# For CPU-only inference (no layers offloaded to a GPU)
# lcpp_llm = Llama(model_path=model_path, n_threads=2)
# For GPU: n_gpu_layers sets how many layers are offloaded to the GPU,
# n_batch is the prompt-processing batch size, and n_ctx is the context window in tokens
lcpp_llm = Llama(model_path=model_path, n_threads=2, n_batch=512, n_gpu_layers=43, n_ctx=4096)
Creating and Sending Prompts
Now, let’s understand how to interact with the model. Think of this process as a conversation between a user and an AI assistant:
Here’s how you prepare your conversation:
prompt_template = "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.\nHuman: {prompt}\nAssistant:\n"
text = "What is the weather today?"
prompt = prompt_template.format(prompt=text)
start = time.time()
response = lcpp_llm(prompt=prompt, max_tokens=256, temperature=0.5, top_p=0.95, top_k=50, stop=['\n'], echo=True)
pprint(response)
print(time.time() - start)
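The call returns a completion dictionary rather than a plain string. Here is a minimal sketch for pulling out just the generated text; note that because echo=True was passed, the returned text includes the prompt itself:
generated = response["choices"][0]["text"]
# echo=True means the prompt is echoed back, so strip it to keep only the answer
answer = generated[len(prompt):].strip()
print(answer)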
Understanding Model Performance
Once you run the above code, you will gain insights into the model’s performance. Just as a chef evaluates their dish by taste, you can evaluate the model’s outputs:
- Load time: how long it took to load the model weights into memory.
- Sample time: the time spent sampling (choosing) output tokens while generating the response.
- Total time: the overall wall-clock time for processing the prompt and generating the response, which you can turn into a rough tokens-per-second figure as in the sketch after this list.
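For a quick throughput estimate, you can combine the token counts reported in the response with the elapsed time measured earlier. This is a minimal sketch and assumes the usage field is populated in your llama-cpp-python version:
elapsed = time.time() - start  # ideally captured immediately after the lcpp_llm(...) call
completion_tokens = response["usage"]["completion_tokens"]
print(f"{completion_tokens} tokens in {elapsed:.2f}s, about {completion_tokens / elapsed:.1f} tokens/sec")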
Adjusting parameters such as temperature and top_k influences the “creativeness” and diversity of the responses: lower values make the output more focused and repeatable, while higher values make it more varied. Experimenting with different values provides a unique recipe for success!
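As a minimal sketch of such an experiment, you can run the same prompt at two different temperatures and compare the outputs (the values below are illustrative, not tuned recommendations):
for temp in (0.1, 0.9):
    result = lcpp_llm(prompt=prompt, max_tokens=128, temperature=temp, top_p=0.95, top_k=50, stop=['\n'])
    print(f"temperature={temp}: {result['choices'][0]['text'].strip()}")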
Troubleshooting Common Issues
While working on this project, you may encounter some hiccups. Here are a few troubleshooting tips:
- If you face installation issues, ensure that your Python version is compatible.
- For memory errors, consider reducing the n_batch or n_gpu_layers parameters (see the sketch after this list).
- Check your internet connectivity if you experience problems downloading the model.
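For example, a lower-memory configuration might look like the following. This is only a sketch; the reduced values are illustrative starting points, and the right numbers depend on your hardware:
lcpp_llm = Llama(model_path=model_path, n_threads=2, n_batch=128, n_gpu_layers=20, n_ctx=2048)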
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Conclusion
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

