How to Use Petastorm for Deep Learning Model Datasets

May 5, 2024 | Data Science

Welcome to a deep dive into Petastorm, an open-source data access library developed at Uber ATG! It lets you train and evaluate deep learning models directly from datasets stored in Apache Parquet format. If you’re ready to bridge the gap between your models and large datasets, follow along!

Installation

Installing Petastorm is a breeze. Use the following pip command:

pip install petastorm

Petastorm also offers optional extras that install additional dependencies:

tf – for TensorFlow compatibility
tf_gpu – for GPU support with TensorFlow
torch – for PyTorch
opencv – for image processing
docs – to generate documentation
test – to run tests

For instance, to install the GPU version of TensorFlow alongside OpenCV, you can use:

pip install petastorm[opencv,tf_gpu]
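After installing, it can help to confirm which of the optional packages are actually importable in your environment. The snippet below is a minimal sketch that uses only the standard library. The mapping from extra names to module names (for example, opencv to cv2) is an assumption made for illustration, not something Petastorm exposes.

```python
from importlib.util import find_spec

# Label -> top-level module we expect that install to provide.
# This mapping is an assumption for illustration, not a Petastorm API.
EXTRA_MODULES = {
    "petastorm": "petastorm",
    "tf / tf_gpu": "tensorflow",
    "torch": "torch",
    "opencv": "cv2",
}


def check_modules(modules=EXTRA_MODULES):
    """Return {label: bool} telling whether each module can be found."""
    return {label: find_spec(module) is not None
            for label, module in modules.items()}


if __name__ == "__main__":
    for label, available in check_modules().items():
        print(f"{label:12} {'ok' if available else 'missing'}")
```

Anything reported as missing can be added by re-running pip with the matching extra.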

Creating a Dataset

Imagine you are a chef preparing a feast. You wouldn’t just toss ingredients into a pot; you would carefully measure, chop, and season each element. In the same way, creating a dataset with Petastorm involves defining a schema, generating rows that conform to it, and writing those rows out with Spark.

Here’s a basic example of how you can create a Petastorm dataset using PySpark:


import numpy as np
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType
from petastorm.codecs import ScalarCodec, CompressedImageCodec, NdarrayCodec
from petastorm.etl.dataset_metadata import materialize_dataset
from petastorm.unischema import dict_to_spark_row, Unischema, UnischemaField

# Define schema for the dataset
HelloWorldSchema = Unischema('HelloWorldSchema', [
    UnischemaField('id', np.int32, (), ScalarCodec(IntegerType()), False),
    UnischemaField('image1', np.uint8, (128, 256, 3), CompressedImageCodec('png'), False),
    UnischemaField('array_4d', np.uint8, (None, 128, 30, None), NdarrayCodec(), False),
])

def row_generator(x):
    return {
        'id': x,
        'image1': np.random.randint(0, 255, dtype=np.uint8, size=(128, 256, 3)),
        'array_4d': np.random.randint(0, 255, dtype=np.uint8, size=(4, 128, 30, 3))
    }

def generate_petastorm_dataset(output_url='file:///tmp/hello_world_dataset'):
    spark = SparkSession.builder.config('spark.driver.memory', '2g').master('local[2]').getOrCreate()
    rows_count = 10
    # materialize_dataset adds the Petastorm metadata after the Parquet files are written
    with materialize_dataset(spark, output_url, HelloWorldSchema):
        # Build each row as a dict, then encode it into a Spark Row using the schema's codecs
        rows_rdd = spark.sparkContext.parallelize(range(rows_count)) \
            .map(row_generator) \
            .map(lambda x: dict_to_spark_row(HelloWorldSchema, x))
        spark.createDataFrame(rows_rdd, HelloWorldSchema.as_spark_schema()) \
            .coalesce(10) \
            .write.mode('overwrite').parquet(output_url)

In this example, we worked like a careful chef: the schema lists the ingredients (an integer id, a PNG-compressed image, and a 4-D array whose first and last dimensions can vary in size), row_generator prepares each serving, and Spark combines the rows into a Parquet dataset.
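The None entries in the array_4d shape mark dimensions whose size may differ from row to row, while the fixed entries must match exactly. To make that rule concrete, here is a small sketch of a shape check written in plain Python. The shape_matches helper is hypothetical and exists only for illustration; it is not part of Petastorm and is not how the library validates rows.

```python
def shape_matches(declared, actual):
    """Return True if `actual` fits `declared`, where None means 'any size'.

    Hypothetical helper for illustration; not a Petastorm function.
    """
    if len(declared) != len(actual):
        return False
    return all(d is None or d == a for d, a in zip(declared, actual))


# Shapes taken from the schema and row generator above.
declared_array_4d = (None, 128, 30, None)
print(shape_matches(declared_array_4d, (4, 128, 30, 3)))   # prints True
print(shape_matches(declared_array_4d, (7, 128, 30, 1)))   # prints True
print(shape_matches(declared_array_4d, (4, 64, 30, 3)))    # prints False
```

The third call fails because the second dimension is fixed at 128 in the schema, so only the first and last dimensions are free to change.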

Reading the Dataset

Petastorm provides a simple way to read datasets through its make_reader function, which returns a Reader you can iterate over:


from petastorm import make_reader

with make_reader('file:///path/to/your/dataset') as reader:
    for row in reader:
        print(row)

This is like inviting your guests to savor the meal you’ve prepared. Each row comes back as a named tuple, so the fields defined in the schema are available as attributes such as row.id.
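Because rows expose their fields as attributes, pulling a single column out of a reader takes only a few lines. The sketch below assumes the dataset from the previous section. The collect_field and read_ids helpers are hypothetical names introduced here, and FakeRow is stand-in data so the helper can be tried without a dataset on disk.

```python
from collections import namedtuple
from itertools import islice


def collect_field(rows, field, limit=None):
    """Return the values of `field` from up to `limit` rows (all if None)."""
    return [getattr(row, field) for row in islice(rows, limit)]


def read_ids(dataset_url):
    """Open a Petastorm dataset and return its 'id' values.

    Requires petastorm to be installed; imported lazily so that
    collect_field can be used without it.
    """
    from petastorm import make_reader
    with make_reader(dataset_url) as reader:
        return collect_field(reader, "id")


# Stand-in rows (hypothetical data) that mimic the named tuples a reader yields.
FakeRow = namedtuple("FakeRow", ["id", "label"])
sample_rows = [FakeRow(id=i, label=f"row-{i}") for i in range(5)]
print(collect_field(sample_rows, "id", limit=3))  # prints [0, 1, 2]
```

With a real dataset, calling read_ids with the URL you passed to generate_petastorm_dataset should return the ids that were written, though not necessarily in the order they were generated.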

Troubleshooting

Sometimes, even after following the recipe, things might go awry. Here are some troubleshooting tips:

Verify that your Python environment includes all necessary dependencies.
Ensure your Spark session is configured correctly to handle the dataset.
If you encounter an error while reading the dataset, double-check the path and confirm that the dataset was generated properly.
For more detail, search the Petastorm issues page on GitHub for similar problems.
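The path check from the tips above can be automated for datasets stored on the local filesystem. The following is a rough sketch that relies only on the standard library. The check_local_dataset helper is a hypothetical name, it handles file:// URLs only, and it does not inspect Petastorm's own metadata.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import url2pathname


def check_local_dataset(dataset_url):
    """Return a list of problems found with a file:// dataset URL.

    Hypothetical helper for illustration; an empty list means the
    basic checks passed, not that the dataset is guaranteed readable.
    """
    parsed = urlparse(dataset_url)
    if parsed.scheme != "file":
        return [f"expected a file:// URL, got scheme {parsed.scheme!r}"]

    root = Path(url2pathname(parsed.path))
    if not root.is_dir():
        return [f"{root} is not an existing directory"]

    problems = []
    # Spark writes its output as one or more *.parquet part files.
    if not any(root.rglob("*.parquet")):
        problems.append(f"no .parquet files found under {root}")
    return problems


if __name__ == "__main__":
    # URL used as the default in generate_petastorm_dataset above.
    print(check_local_dataset("file:///tmp/hello_world_dataset"))
```

If the list comes back empty and reading still fails, the cause is more likely the environment or the Spark configuration than the path.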

