How to Use CERMINE for Extracting Metadata from PDF Files

Feb 21, 2024 | Programming

In academic research, pulling metadata and text out of PDF documents by hand is slow and error-prone. CERMINE is a Java library designed to automate this process. This guide walks through using CERMINE as a standalone application, as a Maven dependency, or as a REST service.

What is CERMINE?

CERMINE is a Java library that provides tools for extracting metadata and content from academic PDF files. Developed at the Centre for Open Science (CeON) at ICM, University of Warsaw, it serves researchers and developers who want to automate data extraction in their workflows.

How to Use CERMINE

There are three main ways to utilize CERMINE based on your needs:

  • Standalone Application: Best suited for processing large datasets locally.
  • Maven Dependency: Ideal for integrating CERMINE’s API into your Java or Scala applications.
  • REST Service: Suitable for small datasets, with CERMINE’s functionality accessed through HTTP requests.

Using CERMINE as a Standalone Application

The simplest way to process PDF files is via the standalone application. Here’s how to get started:

  • Download the latest JAR file from the repository.
  • Navigate to the directory where your JAR file is stored.
  • Use the following command in your terminal:
    $ java -cp cermine-impl-VERSION-jar-with-dependencies.jar pl.edu.icm.cermine.ContentExtractor -path path/to/directory/with/pdfs/

Output Types

You can choose the output formats with the -outputs argument, which takes a comma-separated list of one or more of the following types:

  • jats: Document metadata and content in NLM JATS format
  • text: Raw document text
  • zones: Document text zones labeled with functional classes
  • trueviz: Geometric structure of the document
  • images: Extracted images
  • bibtex: References in BibTeX format
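
If you launch the standalone tool from another JVM program, the command line and the -outputs list can be assembled in code. The following is a minimal sketch rather than anything shipped with CERMINE: the class name CermineCommand, the method build, and the "papers" directory are hypothetical. Only the main class name, the -path and -outputs arguments, and the six output type names are taken from this guide.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper (not part of CERMINE): assembles the standalone command line.
final class CermineCommand {
    // Output types listed in this guide.
    static final Set<String> KNOWN_OUTPUTS =
            Set.of("jats", "text", "zones", "trueviz", "images", "bibtex");

    static List<String> build(String jarPath, String pdfDir, List<String> outputs) {
        // Reject unknown output types and drop duplicates while keeping order.
        Set<String> unique = new LinkedHashSet<>();
        for (String output : outputs) {
            if (!KNOWN_OUTPUTS.contains(output)) {
                throw new IllegalArgumentException("Unknown output type: " + output);
            }
            unique.add(output);
        }
        List<String> command = new ArrayList<>(List.of(
                "java", "-cp", jarPath,
                "pl.edu.icm.cermine.ContentExtractor",
                "-path", pdfDir));
        // Only pass -outputs when the caller asked for specific formats.
        if (!unique.isEmpty()) {
            command.add("-outputs");
            command.add(String.join(",", unique));
        }
        return command;
    }

    public static void main(String[] args) {
        // "papers" is a placeholder directory; VERSION is left unfilled as in the guide.
        List<String> command = build(
                "cermine-impl-VERSION-jar-with-dependencies.jar",
                "papers",
                List.of("jats", "text"));
        System.out.println(String.join(" ", command));
    }
}
```

The returned list can be handed to java.lang.ProcessBuilder to run the tool. Validating the output types up front means a typo fails in your own code before a JVM is started.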

Processing References and Affiliations

If you’re dealing with reference strings or affiliation strings, here’s how you can extract metadata:

  • For References:
    $ java -cp cermine-impl-VERSION-jar-with-dependencies.jar pl.edu.icm.cermine.bibref.CRFBibReferenceParser -reference "the text of the reference"
  • For Affiliations:
    $ java -cp cermine-impl-VERSION-jar-with-dependencies.jar pl.edu.icm.cermine.metadata.affiliation.CRFAffiliationParser -affiliation "the text of the affiliation"

Using CERMINE as a Maven Dependency

To add CERMINE to your Java project through Maven, include the following in your pom.xml file:



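<dependency>
  <groupId>pl.edu.icm.cermine</groupId>
  <artifactId>cermine-impl</artifactId>
  <version>${cermine.version}</version>
</dependency>

<repository>
  <id>icm</id>
  <name>ICM repository</name>
  <url>http://maven.icm.edu.pl/artifactory/repo</url>
</repository>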

Example Code

Here’s how you can extract content or metadata using CERMINE’s API:


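import java.io.FileInputStream;
import java.io.InputStream;
import org.jdom.Element;
import pl.edu.icm.cermine.ContentExtractor;
import pl.edu.icm.cermine.bibref.CRFBibReferenceParser;
import pl.edu.icm.cermine.bibref.model.BibEntry;
import pl.edu.icm.cermine.metadata.affiliation.CRFAffiliationParser;

ContentExtractor extractor = new ContentExtractor();
InputStream inputStream = new FileInputStream("path/to/pdf/file");
extractor.setPDF(inputStream);
Element result = extractor.getContentAsNLM();

// referenceText is a String variable holding the raw reference you want parsed
CRFBibReferenceParser referenceParser = CRFBibReferenceParser.getInstance();
BibEntry reference = referenceParser.parseBibReference(referenceText);

// affiliationText is a String variable holding the raw affiliation you want parsed
CRFAffiliationParser affiliationParser = new CRFAffiliationParser();
Element affiliation = affiliationParser.parse(affiliationText);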

Using CERMINE as a REST Service

CERMINE is also available as a REST service, which you can call with cURL. This route is recommended for small datasets only:

  • To extract content from a PDF file:
    $ curl -X POST --data-binary @article.pdf --header "Content-Type: application/pdf" http://cermine.ceon.pl/extract.do
  • To extract metadata from a reference string:
    $ curl -X POST --data "reference=the text of the reference" http://cermine.ceon.pl/parse.do
  • To extract metadata from an affiliation string:
    $ curl -X POST --data "affiliation=the text of the affiliation" http://cermine.ceon.pl/parse.do
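
The same three calls can be made from Java without cURL. The sketch below uses the JDK's java.net.http API (Java 11 or later) and only builds the requests. The class name CermineRestRequests and its methods are hypothetical, and the endpoints are the ones quoted above. One assumption to note: the form value is URL-encoded here, which is the usual form for application/x-www-form-urlencoded bodies, whereas cURL's --data sends the text as given.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not part of CERMINE): builds requests equivalent to the cURL calls.
final class CermineRestRequests {
    static final String BASE_URL = "http://cermine.ceon.pl";

    // POST raw PDF bytes to extract.do, like `curl --data-binary @article.pdf`.
    static HttpRequest extract(byte[] pdfBytes) {
        return HttpRequest.newBuilder(URI.create(BASE_URL + "/extract.do"))
                .header("Content-Type", "application/pdf")
                .POST(HttpRequest.BodyPublishers.ofByteArray(pdfBytes))
                .build();
    }

    // POST one form field ("reference" or "affiliation") to parse.do, like `curl --data`.
    static HttpRequest parse(String field, String text) {
        // Assumption: the service accepts a URL-encoded form value.
        String form = field + "=" + URLEncoder.encode(text, StandardCharsets.UTF_8);
        return HttpRequest.newBuilder(URI.create(BASE_URL + "/parse.do"))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(form))
                .build();
    }
}
```

To send a request, pass it to HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()) and read the response body. Keeping request construction separate from sending makes the helper easy to check without touching the network.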

Troubleshooting

If you encounter any issues during installation or while running CERMINE, check the following:

  • Ensure you have the correct version of Java installed.
  • Verify that you are using the right path to your PDF files.
  • For Maven users, ensure that your pom.xml file is correctly configured.
  • If REST service calls fail, check your internet connection and the server status.
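
The first two checks can be scripted before a long batch run. This is a minimal sketch under stated assumptions: CerminePreflight and listPdfs are hypothetical names, the helper only looks at files directly inside the given directory, and it says nothing about how CERMINE itself walks that directory.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical pre-flight check (not part of CERMINE).
final class CerminePreflight {
    // Returns the PDF files directly inside dir, or fails with a clear message.
    static List<Path> listPdfs(Path dir) throws IOException {
        if (!Files.isDirectory(dir)) {
            throw new IllegalArgumentException("Not a directory: " + dir);
        }
        try (Stream<Path> entries = Files.list(dir)) {
            return entries
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().toLowerCase().endsWith(".pdf"))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Report the running Java version so it can be compared with what your setup expects.
        System.out.println("Java version: " + System.getProperty("java.version"));
        Path dir = Path.of(args.length > 0 ? args[0] : ".");
        System.out.println("PDF files found: " + listPdfs(dir).size());
    }
}
```

Run it with the same path you intend to pass to -path. A count of zero or a "Not a directory" error usually points to a mistyped path.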

A Little Analogy

Think of CERMINE like a professional librarian in a massive library of PDFs. Just as the librarian can quickly locate, summarize, and categorize books (or PDF documents), CERMINE processes and extracts information from the academic literature efficiently, saving you time and effort.

Conclusion

With CERMINE, researchers can automate the tedious and manual task of data extraction from scientific literature. By using this guide, you’ll be well-equipped to harness CERMINE’s capabilities.
