How to Get Started with PaperMage: A Unified Toolkit for Document Processing

Oct 5, 2020 | Data Science

This guide walks through setting up and using PaperMage. Whether you are a researcher who needs to work with PDF documents or a developer integrating document processing into an application, it covers installation, testing, a quick start, and saving and loading documents.

1. Setting Up PaperMage on Your Machine

To begin your journey with PaperMage, the first step is to set up your environment. For this, we will be using Conda. Follow these steps:

  • Open your terminal or command prompt.
  • Create a new Conda environment:
  • conda create -n papermage python=3.11
  • Activate the newly created environment:
  • conda activate papermage
  • Install PaperMage from source or from PyPI:
    • If installing from source, run this from the root of the cloned repository:
    • pip install -e '.[dev,predictors,visualizers]'
    • For installation via PyPI:
    • pip install 'papermage[dev,predictors,visualizers]'
  • If you are on macOS, you’ll also want to install Poppler:
  • conda install poppler
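Once the steps above are done, a quick sanity check can confirm that the package is importable and that Poppler is reachable. The following is a minimal sketch, not part of PaperMage: the helper name check_environment is hypothetical, and it assumes that finding the pdftoppm command (one of the command-line tools that ships with Poppler) on your PATH is a reasonable proxy for a working Poppler install.

```python
import importlib.util
import shutil


def check_environment(package="papermage", tool="pdftoppm"):
    """Report whether a Python package is importable and a CLI tool is on PATH.

    Hypothetical helper for illustration; the defaults assume the package is
    named "papermage" and that Poppler provides the "pdftoppm" command.
    """
    return {
        # find_spec returns None when a top-level package cannot be located.
        "package_found": importlib.util.find_spec(package) is not None,
        # shutil.which returns None when the executable is not on PATH.
        "tool_found": shutil.which(tool) is not None,
    }


if __name__ == "__main__":
    print(check_environment())
```

If either value comes back False, re-check that the papermage Conda environment is active before reinstalling anything.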

2. Unit Testing Your Installation

To ensure everything is set up correctly, you can run unit tests:

  • Run all tests:
  • python -m pytest
  • Re-run only the tests that failed on the previous run:
  • python -m pytest --lf --no-cov -n0
  • Run a specific test by name:
  • python -m pytest -k TestPDFPlumberParser --no-cov -n0

3. Quick Start Using PaperMage

Now that you have your setup completed, let’s create your first document from a PDF file:

from papermage.recipes import CoreRecipe
recipe = CoreRecipe()
doc = recipe.run('tests/fixtures/papermage.pdf')

Understanding the Document Output

The recipe returns a Document, which holds the structured data extracted from the PDF. Think of CoreRecipe as a chef's recipe: it takes raw ingredients (the PDF content) and turns them into a finished dish (the Document). At its most basic level, a Document contains just text, which you can access through .symbols. It becomes more useful once the content is segmented into components such as pages and rows:

for page in doc.pages:
    print(f"=== PAGE: {page.id} ===")
    for row in page.rows:
        print(row.text)

Think of it like a book: each page holds text, but we can also dissect it into paragraphs, lines, and sentences. You can explore all these segments through the provided iterables, which helps you understand the structure and relationships within the document.

4. Exploring Entity Objects

Entity objects are the building blocks of the document, storing details about their contents and locations. Here are some key attributes:

  • .spans: Start and end positions that point into the Document's text (.symbols).
  • .boxes: Rectangular regions on the page that correspond to the text.
  • .metadata: Free-form storage for extra information about the entity.
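To make the idea of spans concrete, here is a minimal sketch that does not use PaperMage itself. ToySpan and span_text are hypothetical stand-ins, and the sketch assumes a span behaves like a half-open [start, end) range of character positions, so that its text is simply a slice of the document's symbols.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToySpan:
    """Hypothetical stand-in for a span: a half-open [start, end) character range."""
    start: int
    end: int


def span_text(symbols, spans):
    """Join the text covered by each span, in order, separated by single spaces."""
    return " ".join(symbols[s.start:s.end] for s in spans)


symbols = "PaperMage parses PDFs"
one_piece = [ToySpan(0, 9)]
two_pieces = [ToySpan(0, 9), ToySpan(17, 21)]

print(span_text(symbols, one_piece))   # PaperMage
print(span_text(symbols, two_pieces))  # PaperMage PDFs
```

An entity with several spans, as in the second call, can describe text that is not contiguous in the underlying string, for example a phrase split across a line break.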

5. Create, Save, and Load Documents

Manual Creation

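You can also build a Document manually by combining parsers, rasterizers, and predictors. Each has its own job: a parser extracts the text from the PDF, a rasterizer renders the pages as images, and predictors add further structure on top. They work together much like the components of a car.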

Saving Your Document

Saving a Document can be done simply using:

import json
with open('filename.json', 'w') as f_out:
    json.dump(doc.to_json(), f_out, indent=4)

Loading Your Document

To reconstruct your Document:

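import json
from papermage.magelib import Document

# Read the saved dictionary back and rebuild the Document from it.
with open('filename.json') as f_in:
    doc_dict = json.load(f_in)
    doc = Document.from_json(doc_dict)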
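The save and load steps above form a round trip, which can be wrapped in two small helpers. The sketch below is illustrative only: save_doc, load_doc, and ToyDoc are hypothetical names, and ToyDoc is a stand-in that merely exposes the same to_json and from_json pair used above, so the example runs without PaperMage installed.

```python
import json
import os
import tempfile


class ToyDoc:
    """Hypothetical stand-in exposing a to_json/from_json pair."""

    def __init__(self, symbols):
        self.symbols = symbols

    def to_json(self):
        return {"symbols": self.symbols}

    @classmethod
    def from_json(cls, doc_dict):
        return cls(doc_dict["symbols"])


def save_doc(doc, path):
    """Serialize any object that has a to_json() method to a JSON file."""
    with open(path, "w", encoding="utf-8") as f_out:
        json.dump(doc.to_json(), f_out, indent=4)


def load_doc(path, doc_cls):
    """Rebuild an object from a JSON file using doc_cls.from_json()."""
    with open(path, encoding="utf-8") as f_in:
        return doc_cls.from_json(json.load(f_in))


# Round trip through a temporary file so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "doc.json")
    original = ToyDoc("Hello, PaperMage")
    save_doc(original, path)
    restored = load_doc(path, ToyDoc)

print(restored.symbols == original.symbols)  # True
```

Passing the class into load_doc keeps the helpers independent of any one document type; with PaperMage installed, the same call would presumably take Document in place of ToyDoc.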

Troubleshooting Tips

  • If you encounter issues while creating your environment, ensure you have the latest version of Conda installed.
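  • Install commands can behave differently depending on your shell. Some shells (zsh, for example) treat square brackets as glob patterns, so wrap the extras in quotes, as in 'papermage[dev,predictors,visualizers]'.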
  • For any unresolved questions or to connect with others in the community, don’t hesitate to reach out through **[fxis.ai](https://fxis.ai/edu)**.


Final Thoughts

PaperMage provides a robust toolkit for researchers and developers alike. Now that you’ve learned the basics, we encourage you to dive into experimenting with its powerful features and potentially build your own document-processing tools!
