How to Utilize the Hebrew GPT-Neo Small Model

Nov 10, 2022 | Educational

If you’re eager to explore natural language processing in Hebrew, you’re in the right place! This guide shows how to load and run the Hebrew GPT-Neo Small model, a Hebrew text generation model based on EleutherAI’s GPT-Neo.

Understanding the Hebrew GPT-Neo Small Model

The Hebrew GPT-Neo Small model is a causal language model: given a Hebrew prompt, it continues the text one token at a time. It was trained on TPUs using a variety of Hebrew corpora, so it can be used to draft, extend, or experiment with Hebrew text.

Step-by-Step Instructions

Datasets: First, gather the datasets. You can find the various Hebrew corpora here. Additional datasets are highlighted below:

OSCAR Unshuffled Deduplicated Hebrew: Explore the dataset here.
CC100-Hebrew Dataset: More information can be found on its homepage.

Training Config: Ensure you have access to the training configurations which can be found here.
Usage in Google Colab: To get started, use the Google Colab Notebook available here.
Sample Code: Copy the following code to load the pretrained model and prepare a prompt for text generation:

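!pip install tokenizers==0.10.2 transformers==4.6.0

import numpy as np
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model; the end-of-text token doubles as the padding token
tokenizer = AutoTokenizer.from_pretrained("Norod78/hebrew-gpt_neo-small")
model = AutoModelForCausalLM.from_pretrained("Norod78/hebrew-gpt_neo-small", pad_token_id=tokenizer.eos_token_id)

prompt_text = "אני אוהב שוקולד ועוגות"  # "I love chocolate and cakes"
max_len = 512          # tokens to generate beyond the prompt
sample_output_num = 3  # number of samples to generate
seed = 1000            # fixed seed so sampling is repeatable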

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_gpu = 0 if not torch.cuda.is_available() else torch.cuda.device_count()
np.random.seed(seed)
torch.manual_seed(seed)

if n_gpu > 0:
    torch.cuda.manual_seed_all(seed)

model.to(device)
encoded_prompt = tokenizer.encode(prompt_text, add_special_tokens=False, return_tensors="pt").to(device)

if encoded_prompt.size()[-1] == 0:
    input_ids = None
else:
    input_ids = encoded_prompt

if input_ids is not None:
    # Count the prompt tokens as well, but never exceed the 2048-token limit
    max_len = min(max_len + len(encoded_prompt[0]), 2048)
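# Assumption: the string literal is cut off in this listing, so the
# tokenizer's own end-of-text token is used as the stop marker here
stop_token = tokenizer.eos_token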
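
The listing above prepares the prompt, the seed, and the length budget, but the sampling and decoding steps are not shown. The following is a minimal sketch of how those steps might look, not the original author's code. The helper names (`total_length`, `trim_at_stop`, `generate_samples`) are hypothetical and introduced here for illustration, and the `top_k` and `top_p` values are assumptions rather than settings taken from this article. `model.generate` and `tokenizer.decode` are the standard Transformers calls; the model and tokenizer are passed in as arguments so the helpers stay independent of how they were loaded.

```python
CONTEXT_WINDOW = 2048  # token limit used in the listing above


def total_length(prompt_len, new_tokens, limit=CONTEXT_WINDOW):
    """Prompt length plus requested new tokens, capped at the limit."""
    return min(prompt_len + new_tokens, limit)


def trim_at_stop(text, stop_token):
    """Cut text at the first occurrence of stop_token, if there is one."""
    if not stop_token:
        return text
    cut = text.find(stop_token)
    return text if cut == -1 else text[:cut]


def generate_samples(model, tokenizer, input_ids, max_len, num_samples, stop_token):
    """Sample several continuations and return them as trimmed strings."""
    outputs = model.generate(
        input_ids=input_ids,
        do_sample=True,       # sample instead of greedy decoding
        max_length=max_len,   # counts prompt tokens as well as new ones
        top_k=50,             # assumed value, not from the article
        top_p=0.95,           # assumed value, not from the article
        num_return_sequences=num_samples,
    )
    # Decode each sequence and drop everything from the stop token onward
    return [trim_at_stop(tokenizer.decode(seq), stop_token) for seq in outputs]
```

With the variables from the listing, calling `generate_samples(model, tokenizer, input_ids, max_len, sample_output_num, stop_token)` would return a list of `sample_output_num` strings, each beginning with the prompt text. Because sampling is random, the fixed seed set earlier is what makes repeated runs comparable.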