Do you want to summarize lengthy articles or documents with ease? With a GPT-2 model fine-tuned for summarization, you can efficiently process and condense long text. This guide walks you through the steps to implement a document summarization feature on top of the GPT-2 architecture.
Getting Started
To embark on your journey of text summarization, here is what you will need:
- A Python environment with the necessary libraries installed (a quick check is shown after this list).
- The Transformers library by Hugging Face, plus a CUDA-enabled PyTorch build (the example moves tensors to the GPU).
- A sample document you want to summarize.
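Because the example code runs the model on a GPU, it is worth confirming up front that Transformers is installed and that PyTorch can see a CUDA device (for instance after pip install transformers torch). A minimal check, assuming that setup:
import torch
import transformers
# The generation example below calls .cuda(), so a CUDA device must be visible to PyTorch.
print(f"Transformers version: {transformers.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")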
Loading the Model
The first step is to load the GPT-2 model and its tokenizer. The tokenizer will convert text into a format suitable for the model to understand.
from transformers import GPT2LMHeadModel, GPT2TokenizerFast
# Load the GPT-2 checkpoint fine-tuned for summarization, together with its tokenizer.
model = GPT2LMHeadModel.from_pretrained("philippelaban/summary_loop46")
tokenizer = GPT2TokenizerFast.from_pretrained("philippelaban/summary_loop46")
model.cuda()  # move the model to the GPU, matching the .cuda() call on the inputs below
Preparing the Document
Now, let’s prepare the document you want to summarize. Here’s an example:
document = """Bouncing Boulders Point to Quakes on Mars. A preponderance of boulder tracks on the red planet may be evidence of recent seismic activity. If a rock falls on Mars, and no one is there to see it, does it leave a trace? Yes, and its a beautiful herringbone-like pattern, new research reveals."""
Tokenizing the Document
Tokenization converts your document into input IDs the model can process. Setting max_length=300 also truncates the document if it exceeds that limit.
# Encode the document, truncate it to at most 300 tokens, and move the tensor to the GPU.
tokenized_document = tokenizer([document], max_length=300, truncation=True, return_tensors='pt')['input_ids'].cuda()
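If you want to sanity-check the truncation, you can inspect the encoded tensor before generating. A small optional check, not part of the original example:
# The shape is (batch_size, num_tokens); with max_length=300 the token count never exceeds 300.
print(tokenized_document.shape)
# Decode the first few tokens to confirm the text round-trips as expected.
print(tokenizer.decode(tokenized_document[0][:20]))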
Generating Summaries
With the document tokenized, you can now generate summaries. Here’s how it works:
# Beam search with 4 beams, returning all 4 candidate summaries along with their scores.
outputs = model.generate(tokenized_document, do_sample=False, max_length=500, num_beams=4, num_return_sequences=4, no_repeat_ngram_size=6, return_dict_in_generate=True, output_scores=True)
candidate_sequences = outputs.sequences[:, tokenized_document.shape[1]:]  # drop the encoded source text, keep only the summaries
candidate_scores = outputs.sequences_scores.tolist()
for candidate_tokens, score in zip(candidate_sequences, candidate_scores):
    summary = tokenizer.decode(candidate_tokens)
    # This checkpoint ends each summary with an "END" marker; keep only the text before it.
    print(f"[Score: {score:.3f}] {summary[:summary.index('END')]}")