If you’re eager to explore natural language processing in Hebrew, you’re in the right place! This guide shows you how to use Hebrew GPT-Neo Small, a Hebrew text generation model based on EleutherAI’s GPT-Neo.
Understanding the Hebrew GPT-Neo Small Model
The Hebrew GPT-Neo Small model acts like a talented artist, ready to paint vibrant pictures with words: give it a Hebrew prompt and it continues the text with newly generated sentences. It was trained on a variety of Hebrew corpora using TPUs.
Step-by-Step Instructions
- Datasets: First, gather the datasets. You can find the various Hebrew corpora here. Additional datasets are highlighted below:
  - OSCAR Unshuffled Deduplicated Hebrew: Explore the dataset here.
  - CC100-Hebrew Dataset: More information can be found on its homepage.
- Training Config: The training configuration files can be found here.
- Usage in Google Colab: To get started, use the Google Colab Notebook available here.
- Sample Code: Copy the following code to load the model and prepare a prompt for text generation:
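# Colab/Jupyter cell: the leading "!" runs pip as a shell command
!pip install tokenizers==0.10.2 transformers==4.6.0

import numpy as np
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model, reusing the end-of-sequence token for padding
tokenizer = AutoTokenizer.from_pretrained("Norod78/hebrew-gpt_neo-small")
model = AutoModelForCausalLM.from_pretrained("Norod78/hebrew-gpt_neo-small", pad_token_id=tokenizer.eos_token_id)

# The prompt means "I love chocolate and cakes"
prompt_text = "אני אוהב שוקולד ועוגות"
max_len = 512  # tokens to generate beyond the prompt
sample_output_num = 3  # number of samples to produce
seed = 1000

# Use a GPU when one is available and seed every RNG for repeatable sampling
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_gpu = 0 if not torch.cuda.is_available() else torch.cuda.device_count()
np.random.seed(seed)
torch.manual_seed(seed)
if n_gpu > 0:
    torch.cuda.manual_seed_all(seed)

model.to(device)

# Tokenize the prompt; an empty prompt means unconditional generation
encoded_prompt = tokenizer.encode(prompt_text, add_special_tokens=False, return_tensors="pt").to(device)
if encoded_prompt.size()[-1] == 0:
    input_ids = None
else:
    input_ids = encoded_prompt

# Count the prompt tokens toward the total length, capped at 2048 tokens
if input_ids is not None:
    max_len = min(max_len + len(encoded_prompt[0]), 2048)

# Generated text is cut at this marker; here it is taken from the tokenizer
stop_token = tokenizer.eos_token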
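The script above stops before any text is actually generated. As a rough sketch of how the sampling step might look, the helper below calls the model's `generate` method and trims each sample at the stop token. The names `generate_samples` and `trim_at_stop` are hypothetical, and the `top_k` and `top_p` values are illustrative assumptions rather than settings confirmed by this guide.

```python
def trim_at_stop(text, stop_token):
    """Return text up to (not including) the first stop_token, if present."""
    if not stop_token:
        return text
    cut = text.find(stop_token)
    # str.find returns -1 when the token is absent; keep the full text then
    return text if cut == -1 else text[:cut]


def generate_samples(model, tokenizer, input_ids, max_len, num_samples, stop_token=None):
    """Sample num_samples continuations and return them as decoded strings."""
    outputs = model.generate(
        input_ids=input_ids,
        do_sample=True,        # sample instead of greedy decoding
        max_length=max_len,    # total length, prompt tokens included
        top_k=50,              # assumption: illustrative sampling setting
        top_p=0.95,            # assumption: illustrative sampling setting
        num_return_sequences=num_samples,
    )
    texts = []
    for sequence in outputs:
        # Convert a tensor row to a plain list of token ids before decoding
        token_ids = sequence.tolist() if hasattr(sequence, "tolist") else list(sequence)
        text = tokenizer.decode(token_ids, clean_up_tokenization_spaces=True)
        texts.append(trim_at_stop(text, stop_token))
    return texts
```

With the variables defined earlier, a call such as `generate_samples(model, tokenizer, input_ids, max_len, sample_output_num, stop_token)` should return a list of `sample_output_num` strings that you can print one by one. Because sampling is random, the output depends on the seed and on the library versions installed.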

