A Comprehensive Guide to Using the docconv Go Library

Jun 14, 2024 | Programming

Extracting text from documents in different formats can be a real challenge. Thankfully, the docconv library for Go provides a robust way to convert PDF, DOC, DOCX, XML, HTML, RTF, ODT, and other formats into plain text. In this article, we will walk through installation, usage, and some common troubleshooting steps to help you integrate the tool into your projects.

Installation

Before we begin, ensure that you have Go installed on your machine. If you haven’t set it up yet, you can do so by visiting the Go installation guide.

Once you have Go ready, fetching and building the docd tool that ships with docconv is as simple as executing the following command:

go install code.sajari.com/docconv/v2/docd@latest

Keep in mind that the directory containing the installed executable ($GOBIN if it is set, otherwise $GOPATH/bin) must be listed in your PATH environment variable so that you can run docd from any location.

Dependencies

To ensure smooth functioning, the following dependencies need to be installed based on your operating system:

For Debian-based Linux:

sudo apt-get install poppler-utils wv unrtf tidy

For macOS:

brew install poppler-qt5 wv unrtf tidy-html5

Don’t forget to fetch the optional dependency for text extraction:

go get github.com/JalfResi/justext
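If you want to confirm that the external tools are actually installed, you can look them up on your PATH from Go. The sketch below assumes the packages above provide binaries named pdftotext, wvText, unrtf, and tidy; adjust the list if your system uses different names. The missingTools helper is hypothetical and not part of docconv.

```go
package main

import (
	"fmt"
	"os/exec"
)

// missingTools returns the names from tools that cannot be found on PATH.
func missingTools(tools []string) []string {
	var missing []string
	for _, t := range tools {
		if _, err := exec.LookPath(t); err != nil {
			missing = append(missing, t)
		}
	}
	return missing
}

func main() {
	// Assumed binary names for poppler-utils, wv, unrtf, and tidy.
	tools := []string{"pdftotext", "wvText", "unrtf", "tidy"}
	if missing := missingTools(tools); len(missing) > 0 {
		fmt.Println("missing tools:", missing)
		return
	}
	fmt.Println("all external tools found")
}
```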

Optional Dependencies for Image Support

If your project requires processing images, first install and build gosseract, then add the ocr build tag when fetching or building docconv:

go get -tags ocr code.sajari.com/docconv/v2

On macOS, you might need to install Tesseract OCR via brew:

brew install tesseract

The docd Tool

The docd tool can operate in three distinct modes:

As a service on port 8888 (default), allowing you to send documents as multipart POST requests.
From within a Docker container with official images available at Docker Hub.
Via the command line by passing document paths directly as arguments.

To start the service on a specific port, run:

docd -addr :8000

Example Usage

Here’s a quick analogy to help you better understand the functionality of the docconv library:

Imagine docconv as a wizard who can transform various scrolls (documents) into simple language (plain text). Just as you would hand a scroll to the wizard, you pass your PDF or DOCX files to docconv and receive a plain-text version in return!

Use Case 1: Run Locally

package main

import (
    "fmt"
    "log"

    "code.sajari.com/docconv/v2"
)

func main() {
    // Convert the file at the given path to plain text.
    res, err := docconv.ConvertPath("your-file.pdf")
    if err != nil {
        log.Fatal(err)
    }
    // res.Body holds the extracted text.
    fmt.Println(res.Body)
}

Use Case 2: Request Over the Network

package main

import (
    "fmt"
    "log"

    "code.sajari.com/docconv/v2/client"
)

func main() {
    // Create a new client, using the default endpoint
    c := client.New()
    // Send the file to the docd service for conversion.
    res, err := client.ConvertPath(c, "your-file.pdf")
    if err != nil {
        log.Fatal(err)
    }
    // res.Body holds the extracted text.
    fmt.Println(res.Body)
}

You can also send a file to a running docd service with a simple curl command:

curl -s -F input=@your-file.pdf http://localhost:8888/convert

Troubleshooting

If you encounter any issues while using docconv, consider the following troubleshooting steps:

Ensure that all dependencies are properly installed.
Verify that the executable file location is included in your PATH environment variable.
Check for any spelling mistakes in file paths during execution.
If you face difficulties while setting up Docker, consult the Docker installation documentation for your operating system.

For further assistance, feel free to visit fxis.ai.
