How to Extract Text from PDF Files Using PDFLayoutTextStripper

Apr 12, 2024 | Programming

PDF files are often a challenge for data extraction, especially when the layout matters. PDFLayoutTextStripper is a text stripper built on Apache PDFBox that converts PDF files into text while preserving the original layout. This is particularly useful when you're trying to extract content from tables or forms embedded in PDF documents. This guide walks you through installation, usage, and some troubleshooting tips for the library.

Use Cases

  • Data extraction from a table in a PDF file: PDFLayoutTextStripper keeps rows and columns aligned in the text output, so you can grab tabular data without losing its structure.
  • Data extraction from a form in a PDF file: Form labels and their entries stay next to each other in the output, which makes them easier to match up.

Here’s a glimpse of what these use cases look like:

Data extraction from a table example

Data extraction from a form example

How to Install PDFLayoutTextStripper

Maven Dependency

If you’re using Maven, add the following dependency to your pom.xml:

<dependency>
    <groupId>io.github.jonathanlink</groupId>
    <artifactId>PDFLayoutTextStripper</artifactId>
    <version>2.2.3</version>
</dependency>

Manual Installation

  1. Download Apache PDFBox manually (this guide uses v2.0.6), and make sure you also download its dependencies: commons-logging.jar and fontbox.
  2. Warning: This version of PDFLayoutTextStripper is compatible only with PDFBox 2.0.0 and later.

How to Use PDFLayoutTextStripper

On Linux or Mac

Navigate to the PDFLayoutTextStripper directory and execute the following commands to compile and run the program:

cd PDFLayoutTextStripper
javac -cp .:/path/to/pdfbox-2.0.6.jar:/path/to/commons-logging-1.2.jar:/path/to/PDFLayoutTextStripper:fontbox-2.0.6.jar *.java
java -cp .:/path/to/pdfbox-2.0.6.jar:/path/to/commons-logging-1.2.jar:/path/to/PDFLayoutTextStripper:fontbox-2.0.6.jar test

On Windows

The process is the same as on Linux or Mac, but replace : with ; in the classpath.
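
If you'd rather not remember which separator goes with which platform, a small Java sketch like the one below can build the classpath string for you. It relies on File.pathSeparator from the standard library; the ClasspathBuilder class name and the jar locations are placeholders made up for this illustration, not part of PDFLayoutTextStripper.

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;

class ClasspathBuilder {
    // Joins classpath entries with the given separator
    // (":" on Linux or Mac, ";" on Windows).
    static String join(List<String> entries, String separator) {
        return String.join(separator, entries);
    }

    public static void main(String[] args) {
        // Placeholder jar locations; replace them with the real paths on your machine.
        List<String> entries = Arrays.asList(
                ".",
                "/path/to/pdfbox-2.0.6.jar",
                "/path/to/commons-logging-1.2.jar",
                "/path/to/fontbox-2.0.6.jar");
        // File.pathSeparator is the separator the current platform expects.
        System.out.println(join(entries, File.pathSeparator));
    }
}
```

Running it prints a string you can paste after -cp on the machine where it ran.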

Sample Code

Here’s a brief sample of how you can implement the extraction:

import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.io.RandomAccessFile;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// If PDFLayoutTextStripper lives in a different package than this class, import it here.
public class test {
    public static void main(String[] args) {
        String string = null;
        try {
            // Parse the PDF with the PDFBox 2.x parser API.
            PDFParser pdfParser = new PDFParser(new RandomAccessFile(new File("samples/bus.pdf"), "r"));
            pdfParser.parse();
            // try-with-resources closes the document once the text has been extracted.
            try (PDDocument pdDocument = new PDDocument(pdfParser.getDocument())) {
                PDFTextStripper pdfTextStripper = new PDFLayoutTextStripper();
                string = pdfTextStripper.getText(pdDocument);
            }
        } catch (IOException e) {
            // Also covers FileNotFoundException, which is a subclass of IOException.
            e.printStackTrace();
        }
        System.out.println(string);
    }
}

Think of PDFLayoutTextStripper as a meticulous librarian who can copy out books and documents arranged in a specific format as readable text without losing their categorization. It scans the PDF, identifies the structure (much like categorizing books on shelves), and extracts the content while maintaining the original layout.
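
Because the layout survives extraction, table columns usually end up separated by runs of spaces, which makes the text easy to post-process. The sketch below splits each line into cells wherever two or more spaces appear in a row. Treat it as a starting point under that assumption: the LayoutColumns class, the two-space threshold, and the sample text are all invented for this illustration and are not real output from the library. It also cannot detect empty cells.

```java
import java.util.ArrayList;
import java.util.List;

class LayoutColumns {
    // Splits one layout-preserved line into cells wherever
    // two or more consecutive spaces appear.
    static List<String> splitCells(String line) {
        List<String> cells = new ArrayList<>();
        for (String cell : line.trim().split(" {2,}")) {
            if (!cell.isEmpty()) {
                cells.add(cell);
            }
        }
        return cells;
    }

    public static void main(String[] args) {
        // Made-up sample standing in for the string returned by getText().
        String text = "Stop            Departure    Arrival\n"
                    + "Main Street     08:15        08:40\n";
        for (String line : text.split("\\R")) {
            System.out.println(splitCells(line));
        }
    }
}
```

Single spaces inside a cell, as in "Main Street", are kept intact. If your PDF has columns that sit closer together, you would need a different rule, such as fixed character offsets taken from the header row.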

Troubleshooting

  • Ensure that all dependencies are on the classpath and compatible with each other. Version incompatibilities may lead to runtime errors.
  • If you encounter file-not-found errors, double-check the path to your PDF document.
  • Make sure your Java version is compatible with Apache PDFBox and PDFLayoutTextStripper.
  • If all else fails, consult the documentation on the Apache PDFBox website for additional resources.
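
A quick pre-flight check can rule out the most common of these problems before PDFBox is involved at all. The sketch below uses only the Java standard library: it prints the running Java version, then checks that the path points to a readable file starting with the usual %PDF- header. The PdfPreflight class and its findProblem method are hypothetical names made up for this example, and the header check is a heuristic, since some parsers accept files with extra bytes before the header.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

class PdfPreflight {
    // Returns a description of the first problem found, or empty if the file looks usable.
    static Optional<String> findProblem(Path path) {
        if (!Files.isRegularFile(path)) {
            return Optional.of("Not a file: " + path.toAbsolutePath());
        }
        if (!Files.isReadable(path)) {
            return Optional.of("Not readable: " + path.toAbsolutePath());
        }
        byte[] header = new byte[5];
        int total = 0;
        try (InputStream in = Files.newInputStream(path)) {
            // Read up to five bytes; a single read() may return fewer.
            while (total < header.length) {
                int n = in.read(header, total, header.length - total);
                if (n < 0) {
                    break;
                }
                total += n;
            }
        } catch (IOException e) {
            return Optional.of("Cannot read file: " + e.getMessage());
        }
        String magic = new String(header, 0, total, StandardCharsets.US_ASCII);
        if (!magic.equals("%PDF-")) {
            return Optional.of("File does not start with %PDF-: " + path);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println("Java version: " + System.getProperty("java.version"));
        if (args.length == 0) {
            System.out.println("Usage: java PdfPreflight <file.pdf>");
            return;
        }
        System.out.println(findProblem(Paths.get(args[0])).orElse("Looks fine"));
    }
}
```

Printing the absolute path is deliberate: a relative path that resolves against an unexpected working directory is a frequent cause of file-not-found errors.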
