How to Use StreamingDataset for Efficient Data Training

Jun 19, 2023 | Data Science

This guide covers StreamingDataset, the MosaicML library for streaming training data directly from cloud storage. It walks you through setup, basic usage, and troubleshooting so you can get up and running quickly.

Why Choose StreamingDataset?

StreamingDataset lets you train on large datasets wherever they are stored, without first copying the whole dataset to local disk. Imagine you’re a chef who wants to prepare a gourmet meal using fresh ingredients from different farmers’ markets. Instead of waiting for all your ingredients to arrive in your kitchen, you get them delivered just as you need them. This is how StreamingDataset works: it streams your data as training needs it, rather than storing it all locally beforehand.

Getting Started with StreamingDataset

1. Installation

To install StreamingDataset, simply use pip:

pip install mosaicml-streaming
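
To confirm that the install succeeded, one option is to query the package metadata from Python. This is a minimal sketch using only the standard library; the helper name `installed_version` is made up for illustration and is not part of the streaming package.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package="mosaicml-streaming"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints a version string if the package is installed, otherwise None.
print(installed_version())
```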

2. Prepare Your Data

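Before you can start streaming, you need to convert your raw dataset into one of the supported formats. The example below uses MDSWriter to write 10,000 random 32x32 images with integer class labels as zstd-compressed MDS shards: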

import numpy as np
from PIL import Image
from streaming import MDSWriter

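# Local directory where the shards and index will be written
data_dir = "path-to-dataset"

# Map each column name to its encoding
columns = {
    "image": "jpeg",
    "class": "int"
}

# Compress shards with Zstandard
compression = "zstd"

with MDSWriter(out=data_dir, columns=columns, compression=compression) as out:
    for i in range(10000):
        # Each sample is a dict whose keys match the declared columns
        sample = {
            "image": Image.fromarray(np.random.randint(0, 256, (32, 32, 3), np.uint8)),
            "class": np.random.randint(10)
        }
        out.write(sample)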
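
Before uploading, it can help to sanity-check what was written. The sketch below assumes the writer leaves an `index.json` file in the output directory with a `shards` list in which each entry records its `samples` count; that layout is an assumption about the on-disk format, so verify it against your installed version. The helper name `count_samples` is hypothetical.

```python
import json
from pathlib import Path


def count_samples(data_dir):
    """Sum the per-shard sample counts recorded in index.json (assumed layout)."""
    index = json.loads((Path(data_dir) / "index.json").read_text())
    return sum(shard["samples"] for shard in index["shards"])


# Usage, after running the writer:
#   count_samples("path-to-dataset")
```

If the total does not match the number of samples you wrote, the writer was probably interrupted before it finished.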

3. Upload Your Data to Cloud Storage

After preparing the data, upload it to your preferred cloud storage service. For example, to upload a directory to an S3 bucket, you can use:

aws s3 cp --recursive path-to-dataset s3://my-bucket/path-to-dataset
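
If you would rather upload from Python than from the AWS CLI, the following is a rough sketch of the same recursive copy. It assumes the third-party boto3 package is installed and AWS credentials are already configured; `plan_uploads` and `upload_dir` are illustrative helper names, and the bucket and prefix are the placeholders from the command above.

```python
from pathlib import Path


def plan_uploads(local_dir, prefix):
    """Pair every file under local_dir with the S3 key it should be stored at."""
    root = Path(local_dir)
    prefix = prefix.rstrip("/")
    return [
        (str(path), f"{prefix}/{path.relative_to(root).as_posix()}")
        for path in sorted(root.rglob("*"))
        if path.is_file()
    ]


def upload_dir(local_dir, bucket, prefix):
    """Upload each planned file; boto3 is imported lazily because it is optional."""
    import boto3  # assumption: installed separately, credentials configured

    s3 = boto3.client("s3")
    for filename, key in plan_uploads(local_dir, prefix):
        s3.upload_file(filename, bucket, key)


# Usage:
#   upload_dir("path-to-dataset", "my-bucket", "path-to-dataset")
```

Keeping the key planning separate from the upload makes it easy to print the plan and check it before any data is transferred.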

4. Build a StreamingDataset and DataLoader

Now, you can construct a StreamingDataset using the uploaded data. Here’s how to set everything up:

from torch.utils.data import DataLoader
from streaming import StreamingDataset

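remote = "s3://my-bucket/path-to-dataset"  # Where the shards were uploaded
local = "/tmp/path-to-dataset"  # Local cache directory for downloaded shards

dataset = StreamingDataset(local=local, remote=remote, shuffle=True)
sample = dataset[1337]  # Random access to a single sample by index
img = sample['image']
cls = sample['class']

# Wrap the dataset in a standard PyTorch DataLoader
dataloader = DataLoader(dataset)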
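
For batched training you will usually want to set a batch size and turn the decoded samples into tensors, since PyTorch's default collation does not handle PIL images. The following is one possible sketch, not the library's prescribed approach: `collate` and `build_loader` are illustrative names, the batch size and worker count are arbitrary, and passing `batch_size` to StreamingDataset assumes a reasonably recent release of the package.

```python
def collate(samples):
    """Stack a list of {"image", "class"} samples into (images, labels) tensors."""
    import numpy as np
    import torch

    # Each image is decoded as a PIL image; convert HWC uint8 to CHW.
    images = torch.stack(
        [torch.from_numpy(np.array(s["image"])).permute(2, 0, 1) for s in samples]
    )
    labels = torch.tensor([int(s["class"]) for s in samples])
    return images.float() / 255.0, labels


def build_loader(remote, local, batch_size=32, num_workers=4):
    """Build a batched DataLoader over a StreamingDataset (third-party imports are lazy)."""
    from streaming import StreamingDataset
    from torch.utils.data import DataLoader

    dataset = StreamingDataset(
        local=local, remote=remote, shuffle=True, batch_size=batch_size
    )
    return DataLoader(
        dataset, batch_size=batch_size, num_workers=num_workers, collate_fn=collate
    )


# Usage:
#   loader = build_loader("s3://my-bucket/path-to-dataset", "/tmp/path-to-dataset")
#   for images, labels in loader:
#       ...
```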

Key Features of StreamingDataset

  • Seamless data mixing: Easily combine datasets and control sampling proportions.
  • True determinism: Samples arrive in the same order regardless of the number of GPUs or nodes.
  • Instant mid-epoch resumption: Resume training without extensive delays after interruptions.
  • High throughput: Keep sample latency low so that data loading does not become the training bottleneck.
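
As an illustration of the first and third features, here is a hedged sketch that mixes two datasets with the package's `Stream` class and wraps the result in `StreamingDataLoader`, whose `state_dict()` and `load_state_dict()` methods support resuming mid-epoch. The bucket paths, the 75/25 split, and the helper names are invented for the example; check the argument names against the version you have installed.

```python
# (remote, local, proportion) for each source; all values are hypothetical.
MIX = [
    ("s3://my-bucket/dataset-a", "/tmp/dataset-a", 0.75),
    ("s3://my-bucket/dataset-b", "/tmp/dataset-b", 0.25),
]


def build_mixed_dataset(mix=MIX, batch_size=32):
    """Combine several streams, sampling from each according to its proportion."""
    from streaming import Stream, StreamingDataset

    streams = [
        Stream(remote=remote, local=local, proportion=proportion)
        for remote, local, proportion in mix
    ]
    return StreamingDataset(streams=streams, shuffle=True, batch_size=batch_size)


def build_resumable_loader(dataset, batch_size=32):
    """Wrap the dataset in a loader that can save and restore its position."""
    from streaming import StreamingDataLoader

    return StreamingDataLoader(dataset, batch_size=batch_size)


# Usage:
#   loader = build_resumable_loader(build_mixed_dataset())
#   state = loader.state_dict()      # save alongside the model checkpoint
#   loader.load_state_dict(state)    # restore before continuing the epoch
```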

Troubleshooting

If you run into issues while using StreamingDataset, here are some troubleshooting steps:

  • Data not loading: Ensure your cloud storage paths are accurate and that your credentials can read them.
  • Errors in data preparation: Check that each sample’s values match the column encodings you declared (for example, "jpeg" for images and "int" for labels).
  • Performance issues: Monitor your cloud storage service for bandwidth limitations.
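
The first two checks can be partly automated. The sketch below validates the shape of a cloud storage URI and looks for shard files that the index lists but the directory lacks. It is meant to be run against the directory you wrote before uploading, and it assumes `index.json` lists each shard's file names under `raw_data` and `zip_data`; that layout and both helper names are assumptions rather than documented API.

```python
import json
from pathlib import Path
from urllib.parse import urlparse


def check_remote(remote):
    """Return a list of problems with a cloud URI such as s3://bucket/path."""
    parts = urlparse(remote)
    problems = []
    if not parts.scheme:
        problems.append("missing scheme such as s3://")
    if not parts.netloc:
        problems.append("missing bucket name")
    return problems


def missing_shards(data_dir):
    """List shards named in index.json with neither raw nor compressed file present."""
    root = Path(data_dir)
    index = json.loads((root / "index.json").read_text())
    missing = []
    for shard in index["shards"]:
        names = [
            info["basename"]
            for key in ("raw_data", "zip_data")
            if (info := shard.get(key))
        ]
        if not any((root / name).exists() for name in names):
            missing.append(names)
    return missing
```

An empty result from both functions does not prove that the bucket is readable, only that the path is well formed and the local copy is complete.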

For more insights and updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

