In this article, we will explore how to utilize a GPT-2 based architecture to summarize documents efficiently. If you’re looking to condense lengthy articles into bite-sized summaries, we’ve got you covered!
Getting Started with the Model
The first step to using the model is loading it along with the tokenizer. This process is akin to preparing a chef’s kitchen before a big cooking session—gathering all necessary ingredients and tools will set the groundwork for a successful culinary experience.
from transformers import GPT2LMHeadModel, GPT2TokenizerFast
model = GPT2LMHeadModel.from_pretrained("philippelaban/summary_loop46")
tokenizer = GPT2TokenizerFast.from_pretrained("philippelaban/summary_loop46")
Feeding Your Document for Summary
Once the model is ready, you can input your document to be summarized. Think of the document as the meal you’re preparing—you want to ensure it’s fresh and appropriately portioned before introducing it into the cooking process.
document = "Bouncing Boulders Point to Quakes on Mars..." # Your document here
tokenized_document = tokenizer([document], max_length=300, truncation=True, return_tensors='pt')['input_ids'].cuda()
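Note that the .cuda() call above assumes a GPU is present and will fail on a CPU-only machine. A minimal sketch of a more portable approach, assuming PyTorch is installed, is to pick a device up front and move tensors (and the model) to it:

```python
import torch

# Fall back to CPU when no GPU is available, instead of calling .cuda() directly.
device = "cuda" if torch.cuda.is_available() else "cpu"

# With a device chosen, move both the inputs and the model to it, e.g.:
#   tokenized_document = tokenized_document.to(device)
#   model = model.to(device)
print(device)
```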
Generating the Summary
With your tokenized document in place, it’s time to generate the summary. At this stage, the model comes into play and begins crafting a condensed version of your lengthy document, much like a skilled chef creating an exquisite dish from ordinary ingredients.
input_shape = tokenized_document.shape
outputs = model.generate(tokenized_document, do_sample=False, max_length=500, num_beams=4, num_return_sequences=4, no_repeat_ngram_size=6, return_dict_in_generate=True, output_scores=True)
candidate_sequences = outputs.sequences[:, input_shape[1]:] # Remove the encoded text
candidate_scores = outputs.sequences_scores.tolist()
for candidate_tokens, score in zip(candidate_sequences, candidate_scores):
    summary = tokenizer.decode(candidate_tokens)
    print("[Score: %.3f]" % score, summary[:summary.index("END")])
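Since beam scores are log-probabilities, the candidate closest to zero is the model's preferred summary. If you only want that single best candidate rather than all four, a minimal sketch (the lists below are stand-ins for the decoded summaries and scores produced above) is:

```python
# Stand-ins for the decoded candidate summaries and their beam scores.
candidate_scores = [-0.113, -0.119, -0.214, -0.388]
summaries = ["Summary A", "Summary B", "Summary C", "Summary D"]

# Log-probabilities: the highest (least negative) score wins.
best_index = max(range(len(candidate_scores)), key=lambda i: candidate_scores[i])
best_summary = summaries[best_index]
print(best_summary)
```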
Example Output
Here’s an example of potential summaries generated:
[Score: -0.113] These tracks have been spotted elsewhere in the solar system, including on the red planet...
[Score: -0.119] Now researchers have spotted thousands of tracks on the red planet created by tumbling boulders...
[Score: -0.214] Here are answers to those questions posed by scientists investigating the tracks...
[Score: -0.388] These are the kinds of questions swirling around whether these tracks exist on Mars...
Troubleshooting Tips
If you encounter issues while summarizing documents, consider the following troubleshooting ideas:
- Check if the model and tokenizer are properly loaded without any errors.
- Ensure that your document does not exceed the max length defined in the tokenizer settings.
- Verify the structure of your input document; any formatting inconsistencies can lead to unexpected outputs.
- Experiment with parameters like num_beams and max_length for better results.
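To see why the max_length check above matters: the tokenizer silently truncates anything beyond max_length, so the tail of a long document never reaches the model. A toy illustration, using whitespace-split words as a stand-in for real GPT-2 tokens:

```python
# Toy stand-in for tokenizer truncation: everything past max_length is dropped.
def truncate_tokens(text, max_length=300):
    tokens = text.split()
    return tokens[:max_length]

long_doc = "word " * 500          # 500 "tokens", well over the limit
kept = truncate_tokens(long_doc)  # only the first 300 survive
print(len(kept))
```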
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Conclusion
Summarizing documents using a GPT-2 based model can be efficient and effective when done correctly. By following these steps, you can transform lengthy texts into concise summaries that capture essential information.
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.