Welcome to a guide on StreamingDataset by MosaicML, a library for fast, accurate streaming of training data directly from cloud storage. This guide walks you through the setup, usage, and troubleshooting of StreamingDataset so you can get up and running quickly.
Why Choose StreamingDataset?
StreamingDataset lets you train efficiently on large datasets wherever they are stored, without first copying the whole dataset to local disk. Imagine you’re a chef preparing a gourmet meal with fresh ingredients from different farmers’ markets. Instead of waiting for every ingredient to arrive in your kitchen, you have each one delivered just as you need it. StreamingDataset works the same way: it streams your data when you need it, rather than storing it all locally beforehand!
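To make the "delivered just as you need it" idea concrete, here is a conceptual toy in plain Python. It is not how StreamingDataset is implemented; the class name `LazyShardCache` and the in-memory "remote" dict are invented purely for illustration. It only shows the general pattern of fetching a shard the first time one of its samples is requested and reusing the local copy afterwards.

```python
# Conceptual toy only: this is NOT the library's implementation.
# The "remote" store is simulated with an in-memory dict of shards.

class LazyShardCache:
    """Fetch a shard the first time one of its samples is requested."""

    def __init__(self, remote_shards, samples_per_shard):
        self.remote_shards = remote_shards      # shard_id -> list of samples
        self.samples_per_shard = samples_per_shard
        self.local = {}                         # shards already "downloaded"
        self.downloads = 0                      # number of simulated downloads

    def __getitem__(self, index):
        # Work out which shard holds this sample and where it sits inside it.
        shard_id, offset = divmod(index, self.samples_per_shard)
        if shard_id not in self.local:
            # Simulated download: copy the shard from "remote" to "local".
            self.local[shard_id] = list(self.remote_shards[shard_id])
            self.downloads += 1
        return self.local[shard_id][offset]


remote = {0: ["a", "b"], 1: ["c", "d"], 2: ["e", "f"]}
cache = LazyShardCache(remote, samples_per_shard=2)
first = cache[3]   # triggers a simulated download of shard 1 only
second = cache[2]  # same shard, served from the local copy
```

Only the shard that was actually touched ends up in the local copy, which is the property the chef analogy describes.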
Getting Started with StreamingDataset
1. Installation
To install StreamingDataset, simply use pip:
pip install mosaicml-streaming
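If you want to confirm that the install landed in the environment you are actually running, one option is to query the installed distribution with the standard library. This is a small sketch; the helper name `installed_version` is made up for this example.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("mosaicml-streaming")
print(version if version else "mosaicml-streaming is not installed in this environment")
```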
2. Prepare Your Data
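Before you can start streaming, you need to convert your raw dataset into a supported format. The example here uses MDSWriter to write 10,000 random 32x32 images with integer class labels as zstd-compressed shards: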
import numpy as np
from PIL import Image
from streaming import MDSWriter

# Directory in which to store the output shard files
data_dir = "path-to-dataset"

# Mapping from each sample field to its encoding
columns = {
    "image": "jpeg",
    "class": "int"
}

# Shard compression
compression = "zstd"

# Write 10,000 random samples as shards
with MDSWriter(out=data_dir, columns=columns, compression=compression) as out:
    for i in range(10000):
        sample = {
            "image": Image.fromarray(np.random.randint(0, 256, (32, 32, 3), np.uint8)),
            "class": np.random.randint(10)
        }
        out.write(sample)
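Once the writer has finished, a quick sanity check of the output directory can save a wasted upload. The sketch below assumes the writer leaves an `index.json` file in the output directory with a top-level "shards" list whose entries each carry a "samples" count; that layout is an assumption on my part, so adjust the keys if your index looks different. The helper name `summarize_index` is hypothetical.

```python
import json
from pathlib import Path


def summarize_index(data_dir):
    """Return (shard_count, sample_count) read from data_dir/index.json.

    Assumes a top-level "shards" list whose entries each have a "samples"
    count; this layout is an assumption, not a documented guarantee.
    """
    index_path = Path(data_dir) / "index.json"
    if not index_path.is_file():
        raise FileNotFoundError(f"no index.json in {data_dir}; did the writer finish?")
    with index_path.open() as f:
        index = json.load(f)
    shards = index.get("shards", [])
    return len(shards), sum(shard.get("samples", 0) for shard in shards)
```

For the example dataset you would call `summarize_index("path-to-dataset")` and expect the sample count to match the 10,000 samples written.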
3. Upload Your Data to Cloud Storage
After preparing the data, upload it to your preferred cloud storage service. For example, to upload a directory to an S3 bucket, you can use:
aws s3 cp --recursive path-to-dataset s3://my-bucket/path-to-dataset
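If you would rather upload from Python than from the AWS CLI, the same recursive copy can be sketched with boto3. This assumes boto3 is installed and your AWS credentials are already configured; the helper names `s3_key_for` and `upload_directory` are invented for this example, and the sketch does no retrying or parallelism.

```python
from pathlib import Path


def s3_key_for(local_root, file_path, prefix):
    """Map a local file to its S3 key, keeping the relative directory layout."""
    relative = Path(file_path).relative_to(local_root).as_posix()
    clean_prefix = prefix.strip("/")
    return f"{clean_prefix}/{relative}" if clean_prefix else relative


def upload_directory(local_root, bucket, prefix):
    """Upload every file under local_root to s3://bucket/prefix/..."""
    # Imported lazily so the key-mapping helper works without boto3 installed.
    import boto3

    client = boto3.client("s3")
    for path in sorted(Path(local_root).rglob("*")):
        if path.is_file():
            client.upload_file(str(path), bucket, s3_key_for(local_root, path, prefix))
```

Calling `upload_directory("path-to-dataset", "my-bucket", "path-to-dataset")` would mirror the CLI command shown here.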
4. Build a StreamingDataset and DataLoader
Now, you can construct a StreamingDataset using the uploaded data. Here’s how to set everything up:
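from torch.utils.data import DataLoader
from streaming import StreamingDataset

# Remote path where the full dataset is stored
remote = "s3://my-bucket/path-to-dataset"

# Local directory where shards are cached while the dataset is in use
local = "tmp/path-to-dataset"

# Create the streaming dataset
dataset = StreamingDataset(local=local, remote=remote, shuffle=True)

# Look at a single sample by index
sample = dataset[1337]
img = sample['image']
cls = sample['class']

# Wrap the dataset in a PyTorch DataLoader
dataloader = DataLoader(dataset)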
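One thing to watch once you move to batches larger than one: if the "image" column decodes to a PIL image (which I believe is the case for the "jpeg" encoding, but treat that as an assumption), PyTorch's default collate function will not know how to stack it. A minimal workaround is a custom `collate_fn` that groups samples column by column and leaves the values untouched. The function name `collate_samples` is hypothetical.

```python
def collate_samples(samples):
    """Group a list of sample dicts into one dict of lists, column by column."""
    batch = {}
    for sample in samples:
        for key, value in sample.items():
            batch.setdefault(key, []).append(value)
    return batch


# Stand-in samples; real ones would hold images and integer labels.
example = collate_samples([
    {"image": "img-0", "class": 3},
    {"image": "img-1", "class": 7},
])
```

You would pass it as `DataLoader(dataset, batch_size=32, collate_fn=collate_samples)`. In real training you would more likely convert images to tensors first, at which point the default collation works again.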
Key Features of StreamingDataset
- Seamless data mixing: Easily combine datasets and control sampling proportions.
- True determinism: Maintain sample order regardless of the number of GPUs or nodes.
- Instant mid-epoch resumption: Resume training without extensive delays after interruptions.
- High throughput: Achieve lower sample latency with efficient data handling.
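As a rough illustration of the data mixing feature, the sketch below first works out how an epoch would be split across sources for a given set of weights, then shows how such a mix might be handed to the library. The second function relies on my understanding that the library exposes a `Stream` class with a `proportion` argument and that `StreamingDataset` accepts `streams` and `epoch_size`; check the official docs before depending on it. The spec dictionary keys and both function names are hypothetical.

```python
def samples_per_stream(proportions, epoch_size):
    """Split an epoch of epoch_size samples across streams by relative weight.

    Counts are rounded independently, so they may not sum exactly to epoch_size.
    """
    total = sum(proportions.values())
    return {name: round(epoch_size * weight / total)
            for name, weight in proportions.items()}


def build_mixed_dataset(specs, epoch_size):
    """Build one dataset from several sources (assumed Stream/proportion API)."""
    # Imported lazily; requires mosaicml-streaming to be installed.
    from streaming import Stream, StreamingDataset

    streams = [
        Stream(remote=spec["remote"], local=spec["local"], proportion=spec["proportion"])
        for spec in specs
    ]
    return StreamingDataset(streams=streams, epoch_size=epoch_size, shuffle=True)


plan = samples_per_stream({"web": 3, "code": 1}, epoch_size=1000)
```

The first helper is just arithmetic for reasoning about a mix; the library decides the actual sampling.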
Troubleshooting
If you run into issues while using StreamingDataset, try these troubleshooting steps:
- Data not loading: Ensure your cloud storage paths are accurate and accessible.
- Errors in data preparation: Double-check that each sample’s fields match the names and encodings declared in columns.
- Performance issues: Monitor your cloud storage service for bandwidth limitations.
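For the first item, a malformed remote path is an easy thing to rule out before digging into permissions. Here is a small standard-library sketch that splits a remote URI into its parts and rejects obviously broken ones; the helper name `split_remote` is made up for this example, and it checks only the shape of the string, not whether the bucket exists or is accessible.

```python
from urllib.parse import urlparse


def split_remote(uri):
    """Split 's3://bucket/prefix' into (scheme, bucket, prefix), or raise ValueError."""
    parsed = urlparse(uri)
    if not parsed.scheme or not parsed.netloc:
        raise ValueError(f"expected something like s3://bucket/prefix, got {uri!r}")
    return parsed.scheme, parsed.netloc, parsed.path.strip("/")


parts = split_remote("s3://my-bucket/path-to-dataset")
```

A typo such as a single slash after the scheme, or a missing scheme, raises a `ValueError` instead of failing later during download.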
For further assistance, insights, and updates, or to collaborate on AI development projects, stay connected with fxis.ai.