How to Import Open Library Data into PostgreSQL

Mar 30, 2022 | Programming

Welcome to a comprehensive guide on how to import data from the Open Library into a PostgreSQL database. This step-by-step tutorial covers downloading the data dumps, cleaning them, and loading them into PostgreSQL, with troubleshooting advice at the end. Let’s dive in!

Getting Started

Before you can swim in the ocean of bibliographic data from Open Library, you’ll need to set up your environment. Follow these straightforward steps:

Prerequisites

Here are the essential tools you’ll need to have ready:

  • Python 3 – Tested with version 3.10
  • PostgreSQL – Version 15 is tested, but all recent versions should work.
  • Disk Space – The data files are large, and the uncompressed editions file is around 45GB. Ensure you have at least 250GB of free space for a smooth import experience.
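Before downloading anything, it is worth verifying the disk-space prerequisite programmatically. Here is a small sketch using only the Python standard library (the 250GB figure comes from the prerequisites above; the function name is illustrative):

```python
import shutil

def has_enough_space(path: str, required_gb: float) -> bool:
    """Return True if the filesystem holding `path` has at least
    `required_gb` gigabytes free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1024 ** 3

# The full import needs roughly 250GB of headroom.
if not has_enough_space(".", 250):
    print("Warning: less than 250GB free - the import may fail partway.")
```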

Downloading the Data

The Open Library provides bulk downloads that are regularly updated. These can be found on their data dumps page. Available downloads include:

  • Editions (~9GB)
  • Works (~2.5GB)
  • Authors (~0.5GB)
  • All types (~10GB)

To download the data files, run the following commands in your console:

wget https://openlibrary.org/data/ol_dump_editions_latest.txt.gz -P ~/downloads
wget https://openlibrary.org/data/ol_dump_works_latest.txt.gz -P ~/downloads
wget https://openlibrary.org/data/ol_dump_authors_latest.txt.gz -P ~/downloads

After downloading, move the files to your processing directory:

mv ~/downloads/ol_dump_authors_*txt.gz ./data/unprocessed/
mv ~/downloads/ol_dump_works_*txt.gz ./data/unprocessed/
mv ~/downloads/ol_dump_editions_*txt.gz ./data/unprocessed/

Then uncompress the data files using:

gzip -d -c data/unprocessed/ol_dump_editions_latest.txt.gz > data/unprocessed/ol_dump_editions_latest.txt
gzip -d -c data/unprocessed/ol_dump_works_latest.txt.gz > data/unprocessed/ol_dump_works_latest.txt
gzip -d -c data/unprocessed/ol_dump_authors_latest.txt.gz > data/unprocessed/ol_dump_authors_latest.txt
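If you prefer to keep everything in one place, the same decompression can be done from Python with the standard library's gzip module. This is a sketch, assuming the download layout above; streaming with copyfileobj avoids loading the multi-gigabyte files into memory:

```python
import gzip
import os
import shutil

def decompress(src: str, dest: str) -> None:
    """Stream-decompress a .gz dump file without loading it into memory."""
    with gzip.open(src, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)

for name in ("editions", "works", "authors"):
    src = f"data/unprocessed/ol_dump_{name}_latest.txt.gz"
    if os.path.exists(src):  # skip any dump you have not downloaded yet
        decompress(src, src[:-3])  # strip the ".gz" suffix for the output
```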

Processing the Data

Here’s where the fun begins! The raw dump files can’t be imported into PostgreSQL as-is: some rows have a malformed column count or invalid JSON, and a single bad row would abort a bulk load. The processing step filters these out, leaving clean tab-separated files that import without errors.

To clean the data, run the Python script, which writes out only the rows with the correct column count:

python openlibrary_data_process.py

Because the files are so large, you can instead opt for the chunked processing script:

python openlibrary_data_process_chunked.py

This script will create smaller files that can be loaded sequentially, saving you from running into storage issues. After processing, store your cleaned files in the data/processed folder.
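The heart of the cleaning step can be sketched as follows. This is a simplified stand-in for the repository's script, not its actual code; it assumes the documented dump layout of five tab-separated columns (type, key, revision, last-modified, JSON record):

```python
import json

def is_valid_row(line: str) -> bool:
    """A dump row is importable if it has exactly five tab-separated
    columns and the last column parses as JSON."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 5:
        return False
    try:
        json.loads(cols[4])
    except json.JSONDecodeError:
        return False
    return True

def clean_file(src: str, dest: str) -> int:
    """Copy only importable rows from src to dest; return the count kept."""
    kept = 0
    with open(src, encoding="utf-8") as fin, \
         open(dest, "w", encoding="utf-8") as fout:
        for line in fin:
            if is_valid_row(line):
                fout.write(line)
                kept += 1
    return kept
```

The chunked variant applies the same filter but rolls over to a new output file every few million rows, so each piece stays small enough to load comfortably.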

Import into Database

Now that your data is ready, it’s time to bring it into PostgreSQL using the provided database script. Start by creating the necessary database and tables:

psql --set=sslmode=require -f openlibrary-db.sql -h localhost -p 5432 -U username postgres

Database Details & Querying the Data

Once the database is populated, you’ll have access to five main tables:

  • Authors: Individuals who write the works.
  • Works: The creations by the authors, along with title and subtitle details.
  • Author Works: A linking table between works and authors.
  • Editions: Specific editions of the works, including ISBNs.
  • Edition ISBNs: The identifiers for the editions.

You can now execute SQL queries to extract insights from your data. For example, to get details for a specific item using its ISBN:

SELECT
    e.data->>'title' AS EditionTitle,
    w.data->>'title' AS WorkTitle,
    a.data->>'name' AS Name,
    e.data->>'subtitle' AS EditionSubtitle,
    w.data->>'subtitle' AS WorkSubtitle,
    e.data->>'subjects' AS Subjects,
    e.data->'description'->>'value' AS EditionDescription,
    w.data->'description'->>'value' AS WorkDescription,
    e.data->'notes'->>'value' AS EditionNotes,
    w.data->'notes'->>'value' AS WorkNotes
FROM editions e
JOIN edition_isbns ei ON ei.edition_key = e.key
JOIN works w ON w.key = e.work_key
JOIN author_works a_w ON a_w.work_key = w.key
JOIN authors a ON a_w.author_key = a.key
WHERE ei.isbn = '9781551922461';
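Each table stores the raw Open Library record in a JSON `data` column, so fields are extracted with PostgreSQL’s JSON operators: `->>` returns text and `->` returns JSON. The same extraction in plain Python looks like this (the sample record below is illustrative, not a real dump row):

```python
import json

record = json.loads(
    '{"title": "The Hobbit",'
    ' "description": {"type": "/type/text", "value": "A classic."}}'
)

# Equivalent of data->>'title'
title = record.get("title")

# Equivalent of data->'description'->>'value' (in the dumps, description
# may be either a plain string or a {"type": ..., "value": ...} object)
desc = record.get("description")
description = desc.get("value") if isinstance(desc, dict) else desc

print(title, "-", description)
```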

Troubleshooting

During this process, you may encounter a few bumps along the road. Here are some common issues and their solutions:

  • Insufficient Disk Space: Make sure you have at least 250GB of free space for data imports. Consider deleting unnecessary files.
  • Data Import Errors: If you run into errors during the import, it may be due to the data file format. Ensure you’ve processed the files correctly as outlined.
  • PostgreSQL Connection Issues: Double-check your database connection details, such as username, hostname, and port number.
