Academic research often depends on extracting information from PDF documents efficiently. CERMINE is a Java library designed to simplify this process. This guide walks you through using CERMINE as a standalone application, as a Maven dependency, or as a REST service.
What is CERMINE?
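CERMINE is a Java library that provides tools for extracting metadata and content from academic PDF files. It was developed at the Centre for Open Science (CeON), part of the Interdisciplinary Centre for Mathematical and Computational Modelling (ICM) at the University of Warsaw, and serves researchers and developers who want to automate data extraction in their workflows.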
How to Use CERMINE
There are three main ways to utilize CERMINE based on your needs:
- Standalone Application: Best suited for processing large datasets locally.
- Maven Dependency: Ideal for integrating CERMINE’s API into your Java or Scala applications.
- REST Service: Suited to small datasets; exposes CERMINE’s functionality through web requests.
Using CERMINE as a Standalone Application
The simplest way to process PDF files is via the standalone application. Here’s how to get started:
- Download the latest JAR file from the repository.
- Navigate to the directory where your JAR file is stored.
- Use the following command in your terminal:
$ java -cp cermine-impl-VERSION-jar-with-dependencies.jar pl.edu.icm.cermine.ContentExtractor -path path/to/directory/with/pdfs/
Output Types
You can select the output formats by passing a comma-separated list to the -outputs argument. The supported values are:
- jats: Document metadata and content in NLM JATS format
- text: Raw document text
- zones: Document text zones labeled with functional classes
- trueviz: Geometric structure of the document
- images: Extracted images
- bibtex: References in BibTeX format
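If you launch the standalone extractor from another JVM program, the command line and its -outputs list can be assembled in code. The following is a minimal sketch under stated assumptions: the class CermineCommand and its methods are hypothetical helpers, not part of CERMINE, and the JAR name and directory are placeholders supplied by the caller. It only reproduces the command shown above with an added -outputs argument.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of CERMINE): assembles the standalone command line.
class CermineCommand {

    // Builds: java -cp <jar> pl.edu.icm.cermine.ContentExtractor -path <dir> [-outputs a,b,c]
    static List<String> build(String jarPath, String pdfDir, List<String> outputs) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-cp");
        cmd.add(jarPath);
        cmd.add("pl.edu.icm.cermine.ContentExtractor");
        cmd.add("-path");
        cmd.add(pdfDir);
        if (!outputs.isEmpty()) {
            cmd.add("-outputs");
            cmd.add(String.join(",", outputs)); // comma-separated, no spaces
        }
        return cmd;
    }

    // Starts the command, forwards its console output, and returns the exit code.
    static int run(List<String> cmd) throws IOException, InterruptedException {
        return new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }

    public static void main(String[] args) {
        // Placeholder JAR name and directory; this only prints the command, it does not run it.
        List<String> cmd = build("cermine-impl-VERSION-jar-with-dependencies.jar",
                "path/to/directory/with/pdfs/", List.of("jats", "text"));
        System.out.println(String.join(" ", cmd));
    }
}
```

Passing the arguments as a list rather than one concatenated string avoids shell-quoting problems when the directory path contains spaces.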
Processing References and Affiliations
If you’re dealing with reference strings or affiliation strings, here’s how you can extract metadata:
- For References:
$ java -cp cermine-impl-VERSION-jar-with-dependencies.jar pl.edu.icm.cermine.bibref.CRFBibReferenceParser -reference "the text of the reference"
- For Affiliations:
$ java -cp cermine-impl-VERSION-jar-with-dependencies.jar pl.edu.icm.cermine.metadata.affiliation.CRFAffiliationParser -affiliation "the text of the affiliation"
Using CERMINE as a Maven Dependency
To add CERMINE to your Java project through Maven, include the following in your pom.xml file:
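<dependency>
    <groupId>pl.edu.icm.cermine</groupId>
    <artifactId>cermine-impl</artifactId>
    <version>${cermine.version}</version>
</dependency>

<repository>
    <id>icm</id>
    <name>ICM repository</name>
    <url>http://maven.icm.edu.pl/artifactory/repo</url>
</repository>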
Example Code
Here’s how you can extract content or metadata using CERMINE’s API:
// Extract the content of a PDF file in NLM JATS format
ContentExtractor extractor = new ContentExtractor();
InputStream inputStream = new FileInputStream("path/to/pdf/file");
extractor.setPDF(inputStream);
Element result = extractor.getContentAsNLM();

// Extract metadata from a reference string
CRFBibReferenceParser referenceParser = CRFBibReferenceParser.getInstance();
BibEntry reference = referenceParser.parseBibReference("referenceText");

// Extract metadata from an affiliation string
CRFAffiliationParser affiliationParser = new CRFAffiliationParser();
Element affiliation = affiliationParser.parse("affiliationText");
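To apply the content-extraction call to many files, one option is to list the PDFs in a directory and hand each open stream to a callback. This is only a sketch: PdfBatch and PdfHandler are hypothetical names, not CERMINE classes, and the CERMINE calls appear only as comments so that the example compiles without the library on the classpath.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical batch helper (not part of CERMINE).
class PdfBatch {

    // Callback that receives one open PDF stream; the CERMINE calls would go here.
    interface PdfHandler {
        void handle(Path file, InputStream pdf) throws Exception;
    }

    // Returns the *.pdf files directly inside dir (case-insensitive extension), sorted by path.
    static List<Path> listPdfs(Path dir) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().toLowerCase().endsWith(".pdf"))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    // Opens each PDF, passes it to the handler, and closes the stream afterwards.
    static int processAll(Path dir, PdfHandler handler) throws Exception {
        int count = 0;
        for (Path file : listPdfs(dir)) {
            try (InputStream in = Files.newInputStream(file)) {
                handler.handle(file, in);
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        Path dir = Path.of(args.length > 0 ? args[0] : ".");
        // With CERMINE on the classpath, the handler body could instead be:
        //   ContentExtractor extractor = new ContentExtractor();
        //   extractor.setPDF(pdf);
        //   Element result = extractor.getContentAsNLM();
        int n = processAll(dir, (file, pdf) -> System.out.println("Would process " + file));
        System.out.println(n + " PDF file(s) found");
    }
}
```

The try-with-resources block closes each input stream even when the handler throws, which matters once a run covers more than a handful of files.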
Using CERMINE as a REST Service
You can also interact with CERMINE via cURL, although it’s recommended for small datasets only:
- To extract content from a PDF file:
$ curl -X POST --data-binary @article.pdf --header "Content-Type: application/pdf" http://cermine.ceon.pl/extract.do
- To extract metadata from a reference string:
$ curl -X POST --data "reference=the text of the reference" http://cermine.ceon.pl/parse.do
- To extract metadata from an affiliation string:
$ curl -X POST --data "affiliation=the text of the affiliation" http://cermine.ceon.pl/parse.do
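The same requests can be issued from Java with the JDK's built-in HTTP client (Java 11 or later). This is a rough sketch, not an official client: CermineRestClient and its methods are hypothetical names, the endpoint URLs are taken from the cURL commands above, and URL-encoding the form value is a choice made here (cURL's --data sends the text as given).

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Hypothetical client helper (not part of CERMINE); needs Java 11+ for java.net.http.
class CermineRestClient {
    static final String BASE = "http://cermine.ceon.pl";

    // POSTs raw PDF bytes to extract.do, mirroring the first cURL command.
    static HttpRequest extractRequest(byte[] pdfBytes) {
        return HttpRequest.newBuilder(URI.create(BASE + "/extract.do"))
                .header("Content-Type", "application/pdf")
                .POST(HttpRequest.BodyPublishers.ofByteArray(pdfBytes))
                .build();
    }

    // Builds the form body for parse.do; field is "reference" or "affiliation".
    static String formBody(String field, String text) {
        return field + "=" + URLEncoder.encode(text, StandardCharsets.UTF_8);
    }

    // POSTs a form-encoded reference or affiliation string to parse.do.
    static HttpRequest parseRequest(String field, String text) {
        return HttpRequest.newBuilder(URI.create(BASE + "/parse.do"))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(formBody(field, text)))
                .build();
    }

    // Sends a request and returns the response body as text.
    static String send(HttpRequest request) throws Exception {
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) {
        // Builds a request without sending it; call send(...) to contact the service.
        HttpRequest request = parseRequest("reference", "the text of the reference");
        System.out.println(request.method() + " " + request.uri());
    }
}
```

Building the request separately from sending it makes it possible to inspect the method, URL, and headers before anything goes over the network.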
Troubleshooting
If you encounter any issues during installation or while running CERMINE, check the following:
- Ensure you have the correct version of Java installed.
- Verify that you are using the right path to your PDF files.
- For Maven users, ensure that your pom.xml file is correctly configured.
- If REST service calls fail, check your internet connection and the server status.
A Little Analogy
Think of CERMINE like a professional librarian in a massive library of PDFs. Just as the librarian can quickly locate, summarize, and categorize books (or PDF documents), CERMINE processes and extracts information from the academic literature efficiently, saving you time and effort.
Conclusion
With CERMINE, researchers can automate the tedious and manual task of data extraction from scientific literature. By using this guide, you’ll be well-equipped to harness CERMINE’s capabilities.

