Welcome to the world of Docling! This powerful tool allows you to effortlessly convert PDF documents into JSON and Markdown formats. Whether you’re dealing with academic papers, reports, or any other text-rich PDFs, Docling is here to make your life easier. Below, we’ll take you through installation, usage, and troubleshooting tips.
Why Use Docling?
- Stable, fast conversion of PDFs to JSON or Markdown.
- Comprehensive understanding of page layouts and table structures.
- Extracts metadata such as titles, authors, and references.
- Includes OCR capabilities for scanned PDFs.
- Seamless integration with LLM app frameworks like LlamaIndex and LangChain.
Installation Guide
Docling is distributed as a Python package. Install it with pip:
pip install docling
Note: Docling currently works on macOS and Linux environments. Windows platforms have not been tested.
Using Alternative PyTorch Distributions
Docling depends on the PyTorch library. Depending on your hardware, you may want a different PyTorch distribution than the default one; see the PyTorch website for installation details. For example, on a Linux system that only needs CPU support, use:
pip install docling --extra-index-url https://download.pytorch.org/whl/cpu
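After installation, it can help to confirm which PyTorch build pip actually resolved. The snippet below is a minimal sketch and not part of Docling: the helper name `describe_torch` is hypothetical, and it relies only on `torch.__version__` and `torch.cuda.is_available()`.

```python
import importlib.util


def describe_torch() -> dict:
    """Report whether PyTorch is importable and, if so, which build it is."""
    # find_spec returns None when the package is not installed,
    # so machines without PyTorch get a clean answer instead of an ImportError.
    if importlib.util.find_spec("torch") is None:
        return {"installed": False}

    import torch  # imported lazily, only once we know it exists

    return {
        "installed": True,
        # CPU-only wheels often carry a "+cpu" suffix in the version string.
        "version": torch.__version__,
        "cuda_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    print(describe_torch())
```

On a CPU-only install you would expect `cuda_available` to be `False`.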
Getting Started with Docling
Now that you’ve installed Docling, let’s dive into how to use it.
Convert a Single Document
You can convert individual PDF documents with the convert_single() method. Below is an example:
from docling.document_converter import DocumentConverter
source = "https://arxiv.org/pdf/2408.09869" # PDF path or URL
converter = DocumentConverter()
result = converter.convert_single(source)
print(result.render_as_markdown()) # output: ## Docling Technical Report[...]
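If you would rather keep the Markdown than print it, you can write it to a file named after the source. This is a minimal sketch using only the standard library: `output_name_for` and `save_markdown` are hypothetical helpers, not Docling functions, and `markdown_text` stands in for the string returned by `result.render_as_markdown()`.

```python
from pathlib import Path
from urllib.parse import urlparse


def output_name_for(source: str) -> str:
    """Derive a Markdown file name from a local path or a URL."""
    # For URLs, urlparse isolates the path component; typical POSIX paths
    # (without "?" or "#") pass through unchanged.
    name = Path(urlparse(source).path).name
    # Only strip a real ".pdf" suffix, so "2408.09869" is not cut at the dot.
    stem = Path(name).stem if name.lower().endswith(".pdf") else name
    return f"{stem or 'document'}.md"


def save_markdown(markdown_text: str, source: str, out_dir: str = "output") -> Path:
    """Write the Markdown text into out_dir and return the resulting path."""
    target_dir = Path(out_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / output_name_for(source)
    target.write_text(markdown_text, encoding="utf-8")
    return target
```

With the example source above, `output_name_for("https://arxiv.org/pdf/2408.09869")` gives `2408.09869.md`.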
Batch Convert Documents
For batch conversions, have a look at the example in batch_convert.py. Run it from your local clone:
python examples/batch_convert.py
The output is written to the ./scratch directory.
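Before running a batch, you usually need a list of input files. The sketch below shows one way to gather PDFs from a folder with the standard library; `collect_pdfs` is a hypothetical helper and not part of batch_convert.py. The resulting paths could then be handed to a call such as `DocumentConversionInput.from_paths`, which appears later in this guide.

```python
from pathlib import Path
from typing import List


def collect_pdfs(root: str, recursive: bool = True) -> List[Path]:
    """Return a sorted list of PDF files found under root."""
    base = Path(root)
    candidates = base.rglob("*") if recursive else base.glob("*")
    # Compare suffixes case-insensitively so "REPORT.PDF" is not missed.
    return sorted(
        p for p in candidates if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

Sorting keeps the processing order stable between runs, which makes logs easier to compare.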
Adjusting Pipeline Features
You can customize your conversion pipeline with the custom_convert.py script. It lets you control pipeline options such as:
- Table structure recovery
- OCR application
Control Table Extraction Options
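By default, table structure recognition maps the recognized structure back to PDF cells. If extracted tables come out with several columns merged into one, you can turn this matching off so that the text cells predicted by the table structure model are used instead:
pipeline_options = PipelineOptions(do_table_structure=True)
pipeline_options.table_structure_options.do_cell_matching = False  # use text cells predicted by the table structure model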
Document Size Limitations
Set limits on the file size and number of pages allowed for processing:
from pathlib import Path

conv_input = DocumentConversionInput.from_paths(
    paths=[Path(".testdata/2206.01062.pdf")],
    # max_file_size is given in bytes: 20971520 = 20 * 1024 * 1024
    limits=DocumentLimits(max_num_pages=100, max_file_size=20971520),
)
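The max_file_size value above is 20 MiB expressed in bytes. The following sketch spells out that arithmetic and shows how you might separate oversized files before building the conversion input. `split_by_size` is a hypothetical helper, not a Docling function, and treating the limit as inclusive is an assumption of this sketch.

```python
from pathlib import Path
from typing import Iterable, List, Tuple

MIB = 1024 * 1024
MAX_FILE_SIZE = 20 * MIB  # 20971520 bytes


def split_by_size(
    paths: Iterable[Path], max_bytes: int = MAX_FILE_SIZE
) -> Tuple[List[Path], List[Path]]:
    """Partition paths into (within_limit, too_large) by size on disk."""
    within: List[Path] = []
    too_large: List[Path] = []
    for path in paths:
        # Assumption: a file exactly at the limit is still accepted.
        if path.stat().st_size <= max_bytes:
            within.append(path)
        else:
            too_large.append(path)
    return within, too_large
```

Reporting the `too_large` list up front tells you which documents will be skipped, rather than discovering it from the conversion results.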
Convert from Binary PDF Streams
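If your PDFs are available as in-memory binary streams rather than files on disk, you can convert them like this:
from io import BytesIO

# your_binary_stream holds the raw bytes of a PDF
buf = BytesIO(your_binary_stream)
docs = [DocumentStream(filename="my_doc.pdf", stream=buf)]
conv_input = DocumentConversionInput.from_streams(docs)
# doc_converter is a DocumentConverter instance
results = doc_converter.convert(conv_input)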
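When the bytes come from an upload or a network response, a quick sanity check before conversion can save a confusing error later. The sketch below is a standard-library illustration: `load_pdf_stream` is a hypothetical helper, and the check assumes the common case where a PDF begins with the `%PDF-` marker.

```python
from io import BytesIO
from pathlib import Path


def load_pdf_stream(path: str) -> BytesIO:
    """Read a file into memory and return it as a BytesIO positioned at the start."""
    data = Path(path).read_bytes()
    # PDF files normally start with the "%PDF-" marker; reject anything else early.
    if not data.startswith(b"%PDF-"):
        raise ValueError(f"{path} does not look like a PDF")
    return BytesIO(data)
```

The returned buffer can be used wherever the example above uses `buf`.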
Troubleshooting Tips
If you encounter issues while using Docling, consider the following troubleshooting ideas:
- Ensure your Python and Poetry versions are compatible.
- Verify that your PyTorch build matches your hardware architecture.
- If running scripts, check file paths and permissions.
- Monitor system resources if you’re facing performance issues.
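Several of these checks can be gathered in one place. The snippet below is a minimal sketch using only the standard library; `preflight` is a hypothetical helper, not part of Docling, and it simply collects facts you would otherwise look up by hand.

```python
import os
import platform
import sys


def preflight(pdf_path: str) -> dict:
    """Collect basic facts that help narrow down a failing conversion."""
    return {
        "python": ".".join(map(str, sys.version_info[:3])),
        # platform.system() reports "Darwin" on macOS and "Linux" on Linux.
        "system": platform.system(),
        "path_exists": os.path.isfile(pdf_path),
        "path_readable": os.access(pdf_path, os.R_OK),
    }


if __name__ == "__main__":
    print(preflight("my_doc.pdf"))
```

Including this output when you report a problem gives others the context they need to reproduce it.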
Further Exploration
To go further, explore the RAG showcases that use Docling in standard LLM application frameworks such as LlamaIndex and LangChain.
Conclusion
Docling brings together the main pieces of PDF processing: layout and table understanding, metadata extraction, OCR for scanned files, and conversion to JSON or Markdown. Used well, it can streamline document-heavy workflows and feed cleaner input to your LLM applications.