Do you want to summarize lengthy articles or documents with ease? With a GPT-2 model fine-tuned for summarization, you can efficiently process and condense long text. This guide walks you through the steps to implement a document summarization feature on top of the GPT-2 architecture.
Getting Started
To embark on your journey of text summarization, here is what you will need:
- A Python environment with the necessary libraries installed (a quick check is shown after this list).
- The Transformers library by Hugging Face, plus a CUDA-enabled PyTorch build (the example moves tensors to the GPU).
- A sample document you want to summarize.
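Because the example code runs the model on a GPU, it is worth confirming up front that Transformers is installed and that PyTorch can see a CUDA device (for instance after pip install transformers torch). A minimal check, assuming that setup:
import torch
import transformers
# The generation example below calls .cuda(), so a CUDA device must be visible to PyTorch.
print(f"Transformers version: {transformers.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")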
Loading the Model
The first step is to load the GPT-2 model and its tokenizer. The tokenizer will convert text into a format suitable for the model to understand.
from transformers import GPT2LMHeadModel, GPT2TokenizerFast
# Load the GPT-2 checkpoint fine-tuned for summarization, together with its tokenizer.
model = GPT2LMHeadModel.from_pretrained("philippelaban/summary_loop46")
tokenizer = GPT2TokenizerFast.from_pretrained("philippelaban/summary_loop46")
model.cuda()  # move the model to the GPU, matching the .cuda() call on the inputs below
Preparing the Document
Now, let’s prepare the document you want to summarize. Here’s an example:
document = """Bouncing Boulders Point to Quakes on Mars. A preponderance of boulder tracks on the red planet may be evidence of recent seismic activity. If a rock falls on Mars, and no one is there to see it, does it leave a trace? Yes, and its a beautiful herringbone-like pattern, new research reveals."""
Tokenizing the Document
Tokenization converts your document into input IDs the model can process. Setting max_length=300 also truncates the document if it exceeds that limit.
# Encode the document, truncate it to at most 300 tokens, and move the tensor to the GPU.
tokenized_document = tokenizer([document], max_length=300, truncation=True, return_tensors='pt')['input_ids'].cuda()
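If you want to sanity-check the truncation, you can inspect the encoded tensor before generating. A small optional check, not part of the original example:
# The shape is (batch_size, num_tokens); with max_length=300 the token count never exceeds 300.
print(tokenized_document.shape)
# Decode the first few tokens to confirm the text round-trips as expected.
print(tokenizer.decode(tokenized_document[0][:20]))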
Generating Summaries
With the document tokenized, you can now generate summaries. Here’s how it works:
# Beam search with 4 beams, returning all 4 candidate summaries along with their scores.
outputs = model.generate(tokenized_document, do_sample=False, max_length=500, num_beams=4, num_return_sequences=4, no_repeat_ngram_size=6, return_dict_in_generate=True, output_scores=True)
candidate_sequences = outputs.sequences[:, tokenized_document.shape[1]:]  # drop the encoded source text, keep only the summaries
candidate_scores = outputs.sequences_scores.tolist()
for candidate_tokens, score in zip(candidate_sequences, candidate_scores):
    summary = tokenizer.decode(candidate_tokens)
    # This checkpoint ends each summary with an "END" marker; keep only the text before it.
    print(f"[Score: {score:.3f}] {summary[:summary.index('END')]}")