The Ultimate Guide to Using Docling for PDF Conversion

Dec 24, 2023 | Educational

Welcome to the world of Docling! This powerful tool allows you to effortlessly convert PDF documents into JSON and Markdown formats. Whether you’re dealing with academic papers, reports, or any other text-rich PDFs, Docling is here to make your life easier. Below, we’ll take you through installation, usage, and troubleshooting tips.

Why Use Docling?

  • Stable, fast conversion of PDFs to JSON or Markdown.
  • Comprehensive understanding of page layouts and table structures.
  • Extracts metadata such as titles, authors, and references.
  • Includes OCR capabilities for scanned PDFs.
  • Seamless integration with LLM app frameworks like LlamaIndex and LangChain.

Installation Guide

Getting started with Docling is simple. Install the package with pip:

pip install docling

Note: Docling currently works on macOS and Linux environments. Windows platforms have not been tested.

Using Alternative PyTorch Distributions

Docling depends on the PyTorch library. Depending on your hardware and operating system, you may want a different PyTorch build than the default one; see the PyTorch website for installation details. For example, on a Linux system that only needs CPU support, use:

pip install docling --extra-index-url https://download.pytorch.org/whl/cpu

Getting Started with Docling

Now that you’ve installed Docling, let’s dive into how to use it.

Convert a Single Document

You can convert individual PDF documents with the convert_single() function. Below is an example:

from docling.document_converter import DocumentConverter

source = "https://arxiv.org/pdf/2408.09869"  # PDF path or URL
converter = DocumentConverter()
result = converter.convert_single(source)
print(result.render_as_markdown())  # output: ## Docling Technical Report[...]
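
In practice you will often want the Markdown on disk rather than printed. The following is a minimal sketch, not part of Docling itself: save_markdown and its converter parameter are hypothetical names introduced here, and the only Docling calls used are the convert_single() and render_as_markdown() methods shown above.

```python
from pathlib import Path


def save_markdown(source, out_path, converter=None):
    """Convert `source` and write its Markdown rendering to `out_path`.

    `converter` is any object with a convert_single() method; when omitted,
    a Docling DocumentConverter is created (this requires Docling).
    """
    if converter is None:
        # Imported lazily so this helper can be loaded without Docling installed.
        from docling.document_converter import DocumentConverter

        converter = DocumentConverter()

    result = converter.convert_single(source)

    out_path = Path(out_path)
    # Create the target folder if it does not exist yet.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(result.render_as_markdown(), encoding="utf-8")
    return out_path
```

Calling save_markdown("https://arxiv.org/pdf/2408.09869", "out/report.md") would then write the converted document to out/report.md. Accepting the converter as a parameter lets you reuse one DocumentConverter across many calls.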

Batch Convert Documents

For batch conversions, have a look at the example in batch_convert.py. Run it from your local clone:

python examples/batch_convert.py

The output will be stored in the .scratch directory.
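
If you want to batch convert your own folder of PDFs, the first step is collecting the file paths. Here is a small sketch using only the Python standard library; find_pdfs is a hypothetical helper name, not a Docling function. The resulting list can be passed as the paths argument of DocumentConversionInput.from_paths(), which is shown later in this guide.

```python
from pathlib import Path


def find_pdfs(folder):
    """Return the PDF files directly inside `folder`, sorted by path.

    The extension check is case-insensitive and subfolders are not searched.
    """
    return sorted(
        p
        for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
```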

Adjusting Pipeline Features

You can customize your conversion pipeline by following the custom_convert.py example script. There you can control pipeline options like:

  • Table structure recovery
  • OCR application

Control Table Extraction Options

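By default, table structure recognition maps the recognized structure back to the cells found in the PDF. If extracted tables come out with several columns merged into one, you can turn this matching off so that the text cells predicted by the table structure model are used instead:

pipeline_options = PipelineOptions(do_table_structure=True)
# Use the text cells predicted by the table structure model instead of matching to PDF cells.
pipeline_options.table_structure_options.do_cell_matching = False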

Document Size Limitations

Set limits on the file size and number of pages allowed for processing:

from pathlib import Path

conv_input = DocumentConversionInput.from_paths(
    paths=[Path(".testdata/2206.01062.pdf")],
    # Allow at most 100 pages and 20971520 bytes (20 MiB) per document.
    limits=DocumentLimits(max_num_pages=100, max_file_size=20971520)
)
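
The max_file_size value is given in bytes, so 20971520 corresponds to 20 MiB (20 × 1024 × 1024). The sketch below spells out that arithmetic and adds an optional pre-check you could run before handing files to Docling. MAX_FILE_SIZE and within_size_limit are hypothetical names for illustration, and the check uses only the standard library.

```python
from pathlib import Path

# 20 MiB expressed in bytes: 20 * 1024 * 1024 = 20971520
MAX_FILE_SIZE = 20 * 1024 * 1024


def within_size_limit(path, max_file_size=MAX_FILE_SIZE):
    """Return True if the file at `path` is no larger than `max_file_size` bytes."""
    return Path(path).stat().st_size <= max_file_size
```

Writing the limit as a product makes it easier to adjust, for example 50 * 1024 * 1024 for 50 MiB.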

Convert from Binary PDF Streams

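If your PDFs are already in memory as raw bytes rather than stored as files, you can wrap them in a stream and convert them like this:

from io import BytesIO

buf = BytesIO(your_binary_stream)  # your_binary_stream holds the raw PDF bytes
docs = [DocumentStream(filename="my_doc.pdf", stream=buf)]
conv_input = DocumentConversionInput.from_streams(docs)
results = doc_converter.convert(conv_input)  # doc_converter is a DocumentConverter instance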
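
Before wrapping bytes in a stream, it can help to confirm that they actually look like a PDF. The following sketch assumes a strict check that the data begins with the standard %PDF- header; to_pdf_buffer is a hypothetical helper, not a Docling API, and some real-world PDFs with leading junk bytes would be rejected by it.

```python
from io import BytesIO


def to_pdf_buffer(data: bytes) -> BytesIO:
    """Wrap raw PDF bytes in a BytesIO, rejecting data without a PDF header."""
    if not data.startswith(b"%PDF-"):
        raise ValueError("data does not start with the %PDF- header")
    return BytesIO(data)
```

The returned buffer can be used in place of buf in the snippet above.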

Troubleshooting Tips

If you encounter issues while using Docling, consider the following troubleshooting ideas:

  • Ensure your Python and Poetry versions are compatible.
  • Verify that your installed PyTorch build matches your hardware architecture.
  • If running scripts, check file paths and permissions.
  • Monitor system resources if you’re facing performance issues.
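
Several of the checks above can be gathered in one place. Here is a minimal diagnostic sketch: environment_report is a hypothetical helper name, and it relies only on the standard library plus, when PyTorch is installed, torch.__version__ and torch.cuda.is_available().

```python
import importlib.util
import platform


def environment_report():
    """Collect basic facts about the Python and PyTorch setup."""
    report = {
        "python": platform.python_version(),
        "system": platform.system(),  # "Linux", "Darwin" (macOS), or "Windows"
        "machine": platform.machine(),  # CPU architecture, e.g. x86_64 or arm64
        "torch_installed": importlib.util.find_spec("torch") is not None,
    }
    if report["torch_installed"]:
        import torch

        report["torch_version"] = torch.__version__
        report["cuda_available"] = torch.cuda.is_available()
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

Including this output when you ask for help makes it easier for others to spot a mismatch between your PyTorch build and your machine.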

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Further Exploration

Explore the RAG showcases that use Docling with standard LLM application frameworks such as LlamaIndex and LangChain.

Conclusion

Docling is your go-to tool for PDF processing, offering an array of features for converting documents efficiently and effectively. As you delve into the world of Docling, remember this powerful tool can help streamline your work processes and enhance document management.

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
