Converting documents between formats can be quite the challenge. Thankfully, docconv, a Go library, provides a robust way to convert PDF, DOC, DOCX, XML, HTML, RTF, ODT, and more into plain text. In this article, we will walk through installation, usage, and some common troubleshooting steps to help you integrate this tool into your projects.
Installation
Before we begin, ensure that you have Go installed on your machine. If you haven’t set it up yet, you can do so by visiting the Go installation guide.
Once you have Go ready, fetching and building docconv is as simple as executing the following command:
go install code.sajari.com/docconv/v2/docd@latest
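Keep in mind that the directory containing the installed docd executable (GOBIN if you have set it, otherwise the bin directory under your GOPATH) must be listed in your PATH environment variable if you want to run docd without typing its full path.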
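If you are unsure which directory that is, the following Go sketch works it out from the usual rules (GOBIN if set, otherwise bin under the first GOPATH entry, otherwise go/bin in your home directory) and checks whether it appears in PATH. The helper names goBinDir and onPath are made up for this illustration, and the sketch only reads process environment variables, so settings stored with go env -w are not seen; the output of go env GOBIN GOPATH remains authoritative.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// goBinDir returns the directory `go install` is expected to write to:
// GOBIN if set, else <first GOPATH entry>/bin, else <home>/go/bin.
// (Hypothetical helper for this article, not part of docconv.)
func goBinDir(gobin, gopath, home string) string {
	if gobin != "" {
		return gobin
	}
	if gopath != "" {
		// GOPATH may be a list; binaries go under its first entry.
		return filepath.Join(filepath.SplitList(gopath)[0], "bin")
	}
	return filepath.Join(home, "go", "bin")
}

// onPath reports whether dir is one of the entries in a PATH-style list.
func onPath(dir, pathList string) bool {
	for _, p := range filepath.SplitList(pathList) {
		if filepath.Clean(p) == filepath.Clean(dir) {
			return true
		}
	}
	return false
}

func main() {
	home, _ := os.UserHomeDir()
	dir := goBinDir(os.Getenv("GOBIN"), os.Getenv("GOPATH"), home)
	if onPath(dir, os.Getenv("PATH")) {
		fmt.Println("install directory is on PATH:", dir)
	} else {
		fmt.Printf("add %s to PATH to run docd without its full path\n", dir)
	}
}
```

If the second message is printed, adding the reported directory to PATH in your shell profile is usually enough.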
Dependencies
To ensure smooth functioning, the following dependencies need to be installed based on your operating system:
- For Debian-based Linux:
sudo apt-get install poppler-utils wv unrtf tidy
- For macOS:
brew install poppler-qt5 wv unrtf tidy-html5
Don’t forget to fetch the optional dependency for text extraction:
go get github.com/JalfResi/justext
Optional Dependencies for Image Support
If your project needs to extract text from images, first install and build gosseract, then add the ocr build tag when fetching or building docconv:
go get -tags ocr code.sajari.com/docconv/v2
On macOS, you might need to install Tesseract OCR via brew:
brew install tesseract
The docd Tool
The docd tool can operate in three distinct modes:
- As a service on port 8888 (default), allowing you to send documents as multipart POST requests.
- From within a Docker container with official images available at Docker Hub.
- Via the command line by passing document paths directly as arguments.
To start the service on a specific port, run:
docd -addr :8000
Example Usage
Here’s a quick analogy for what the docconv library does:
Think of docconv as a wizard who translates scrolls (documents) into plain language (plain text). Just as you’d hand a scroll to the wizard, you pass your PDF or DOCX file to docconv and receive a text version in return!
Use Case 1: Run Locally
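package main

import (
	"fmt"
	"log"

	"code.sajari.com/docconv/v2"
)

func main() {
	// Convert the file at the given path to plain text.
	res, err := docconv.ConvertPath("your-file.pdf")
	if err != nil {
		// Stop here: res is not usable when the conversion fails.
		log.Fatal(err)
	}
	fmt.Println(res)
}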
Use Case 2: Request Over the Network
package main

import (
	"fmt"
	"log"

	"code.sajari.com/docconv/v2/client"
)

func main() {
	// Create a new client, using the default endpoint
	c := client.New()

	// Send the file to the running docd service for conversion.
	res, err := client.ConvertPath(c, "your-file.pdf")
	if err != nil {
		// Stop here: res is not usable when the request fails.
		log.Fatal(err)
	}
	fmt.Println(res)
}
You can also use a simple curl command:
curl -s -F input=@your-file.pdf http://localhost:8888/convert
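If you would rather not depend on the client package, the same request as the curl command can be sketched with only the Go standard library. The form field name input and the /convert URL come from the curl example; the JSON field names (body, meta, error) and the helper names buildUpload, parseResult, and convert are assumptions made for this illustration, so check them against the response your docd version actually returns.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"mime/multipart"
	"net/http"
	"os"
	"path/filepath"
)

// result mirrors the JSON fields this sketch assumes docd returns.
type result struct {
	Body  string            `json:"body"`
	Meta  map[string]string `json:"meta"`
	Error string            `json:"error"`
}

// buildUpload wraps data in a multipart form under the "input" field,
// the same field name the curl command uses.
func buildUpload(fileName string, data []byte) (*bytes.Buffer, string, error) {
	var buf bytes.Buffer
	w := multipart.NewWriter(&buf)
	part, err := w.CreateFormFile("input", fileName)
	if err != nil {
		return nil, "", err
	}
	if _, err := part.Write(data); err != nil {
		return nil, "", err
	}
	// Close writes the trailing boundary; the body is incomplete without it.
	if err := w.Close(); err != nil {
		return nil, "", err
	}
	return &buf, w.FormDataContentType(), nil
}

// parseResult decodes the service reply into a result.
func parseResult(raw []byte) (result, error) {
	var r result
	err := json.Unmarshal(raw, &r)
	return r, err
}

// convert uploads the file at path to endpoint and returns the decoded reply.
func convert(endpoint, path string) (result, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return result{}, err
	}
	body, contentType, err := buildUpload(filepath.Base(path), data)
	if err != nil {
		return result{}, err
	}
	resp, err := http.Post(endpoint, contentType, body)
	if err != nil {
		return result{}, err
	}
	defer resp.Body.Close()
	raw, err := io.ReadAll(resp.Body)
	if err != nil {
		return result{}, err
	}
	return parseResult(raw)
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: upload <file> [endpoint]")
		return
	}
	endpoint := "http://localhost:8888/convert" // default docd port
	if len(os.Args) > 2 {
		endpoint = os.Args[2]
	}
	res, err := convert(endpoint, os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, "convert failed:", err)
		os.Exit(1)
	}
	if res.Error != "" {
		fmt.Fprintln(os.Stderr, "service error:", res.Error)
		os.Exit(1)
	}
	fmt.Println(res.Body)
}
```

If you started the service with docd -addr :8000, pass http://localhost:8000/convert as the second argument.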
Troubleshooting
If you encounter any issues while using docconv, consider the following troubleshooting steps:
- Ensure that all dependencies are properly installed.
- Verify that the executable file location is included in your PATH environment variable.
- Check for any spelling mistakes in file paths during execution.
- If you face difficulties while setting up Docker, consult the Docker installation documentation for your operating system.
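The first two checks can be automated with a short Go sketch that looks up each external tool on PATH. The binary names below (pdftotext from poppler-utils, wvText from wv, plus unrtf and tidy) are assumptions based on the packages listed under Dependencies and may differ on your platform; missing is a hypothetical helper written for this illustration.

```go
package main

import (
	"fmt"
	"os/exec"
)

// missing returns the names for which found reports false.
func missing(names []string, found func(string) bool) []string {
	var out []string
	for _, n := range names {
		if !found(n) {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	// Assumed binary names; adjust them to match your platform's packages.
	tools := []string{"docd", "pdftotext", "wvText", "unrtf", "tidy"}

	// exec.LookPath searches the directories listed in PATH.
	gone := missing(tools, func(name string) bool {
		_, err := exec.LookPath(name)
		return err == nil
	})

	if len(gone) == 0 {
		fmt.Println("all tools found on PATH")
		return
	}
	for _, n := range gone {
		fmt.Println("not found on PATH:", n)
	}
}
```

Any name reported as not found points to either a missing package or a PATH entry that needs to be added.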