Very Large Datasets in PyTorch

In God we trust. All others must bring data. ~ W. Edwards Deming

Datasets that fit in memory

For simple machine learning problems, your PyTorch dataset class probably looks something like this:

from torch.utils.data import Dataset

class SimpleDataset(Dataset):
    def __init__(self, features, targets):
        self.features = []
        for feature in features:
            self.features.append(self._feature_transform(feature))
        self.targets = targets

    def _feature_transform(self, feature):
        # Optional feature transformation function which
        # converts each feature into its input representation
        # for the model. This might be an expensive operation,
        # so it's best to do it now rather than during training.
        # (some_transformation_fn is a placeholder for your own code.)
        return some_transformation_fn(feature)

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        return self.features[idx], self.targets[idx]

With this method, we load all of the data into RAM at once, which is perfectly fine for small datasets. But sooner or later you’re going to run into a machine learning problem with a large dataset, by which I mean a dataset that can’t easily fit into RAM or VRAM.
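A quick back-of-the-envelope calculation tells you which regime you are in. The sketch below assumes a dense array of fixed-size samples; the helper name `dataset_size_gib` and the example numbers are made up for illustration.

```python
import numpy as np


def dataset_size_gib(n_samples, n_features, dtype="float64"):
    """Estimate the size of a dense (n_samples, n_features) array in GiB."""
    bytes_per_value = np.dtype(dtype).itemsize
    return n_samples * n_features * bytes_per_value / 1024**3


# Hypothetical dataset: 50 million samples with 1024 float64 features each.
size = dataset_size_gib(50_000_000, 1024)
print(f"{size:.1f} GiB")  # roughly 381 GiB, far more than typical RAM
```

Switching to `float32` halves the estimate, which is sometimes enough on its own to make a dataset fit.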

Datasets that fit on your hard drive

There are ways around this issue. One is memory mapping. The idea is that, rather than loading all of the data into memory, we store it on disk and read only the parts we need during training. This also means we can store the transformed tensors on disk and load them as needed, rather than recomputing them every time we load the data.
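As a minimal illustration of the idea, here is a round trip through `np.memmap` using a small array in a temporary directory. The file name and array contents are arbitrary choices for the demo.

```python
import os
import tempfile

import numpy as np

# Write a small array to a memory-mapped file in a temporary directory.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "example.memmap")

data = np.arange(12, dtype="float64").reshape(4, 3)
writer = np.memmap(path, dtype="float64", mode="w+", shape=data.shape)
writer[:] = data
writer.flush()  # make sure the bytes reach the file
del writer

# Reopen read-only. The raw file stores no metadata, so dtype and shape
# must match what was written.
reader = np.memmap(path, dtype="float64", mode="r", shape=(4, 3))
row = np.array(reader[2])  # copies only this row into regular memory
print(row)
```

Because the file is just raw bytes, it is worth recording each file's dtype and shape somewhere (for example in the file name or a small sidecar file) when you write it.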

Now we have a two-step approach. The first step is writing the transformed features and targets to memory-mapped files, and the second is loading them on demand. For the first step we will create a list of memmap files, and for the second step we will use dask to virtually concatenate them together and read them as if they were a contiguous array.

import numpy as np
import uuid
import os

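# store all features in memmap files
# you can do the same thing with targets
# (feature_files and transform_features are placeholders for your own data)

features_dir = 'features/'
os.makedirs(features_dir, exist_ok=True)  # np.memmap won't create the directory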

for feature_file in feature_files:
    with open(feature_file) as f:
        features = transform_features(f.read())
    
    fname = f'{uuid.uuid4()}.memmap'

    # create the file on disk
    memmap_file = np.memmap(os.path.join(features_dir, fname), 
                            dtype='float64', 
                            mode='w+', 
                            shape=features.shape)

    # write features to memmap array (works for any number of dimensions)
    memmap_file[:] = features

    # write changes to disk
    memmap_file.flush()
    

# Load the data from memmaps. We can use dask to create a
# virtual contiguous array for simplicity.

import dask.array as da
import torch
from torch.utils.data import Dataset

class MemmapDataset(Dataset):
    def __init__(self, 
            feature_files, 
            target_files, 
            features_shape, 
            targets_shape):
        self.features = []
        self.targets = []
        
        for file in feature_files:
            self.features.append(np.memmap(file, 
                                           dtype='float64', 
                                           mode='r', 
                                           shape=features_shape))
        for file in target_files:
            self.targets.append(np.memmap(file, 
                                          dtype='float64', 
                                          mode='r', 
                                          shape=targets_shape))

        self.features = da.concatenate(self.features, axis=0)
        self.targets = da.concatenate(self.targets, axis=0)

    def __len__(self):
        return self.features.shape[0]

    def __getitem__(self, idx):
        # dask indexing is lazy; compute() reads just this row from disk
        return (
            torch.tensor(self.features[idx].compute()),
            torch.tensor(self.targets[idx].compute())
        )
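If you would rather not depend on dask, or your files have different row counts, the same lookup can be done with cumulative offsets. This is an alternative sketch, not the approach used above; the class name `MemmapIndex` and its parameters are hypothetical, and it assumes 2D arrays that share a column count.

```python
import bisect
import os
import tempfile

import numpy as np


class MemmapIndex:
    """Maps a global row index onto (file, local row) across several memmaps."""

    def __init__(self, paths, row_counts, n_cols, dtype="float64"):
        self.arrays = [
            np.memmap(p, dtype=dtype, mode="r", shape=(n, n_cols))
            for p, n in zip(paths, row_counts)
        ]
        # offsets[i] is the global index of the first row after file i
        self.offsets = np.cumsum(row_counts).tolist()

    def __len__(self):
        return self.offsets[-1] if self.offsets else 0

    def __getitem__(self, idx):
        if not 0 <= idx < len(self):
            raise IndexError(idx)
        file_no = bisect.bisect_right(self.offsets, idx)
        start = self.offsets[file_no - 1] if file_no > 0 else 0
        return np.array(self.arrays[file_no][idx - start])


# Build two small files with different row counts to try it out.
tmp_dir = tempfile.mkdtemp()
paths, row_counts = [], [3, 2]
next_value = 0
for i, n_rows in enumerate(row_counts):
    path = os.path.join(tmp_dir, f"part{i}.memmap")
    mm = np.memmap(path, dtype="float64", mode="w+", shape=(n_rows, 4))
    mm[:] = np.arange(next_value, next_value + n_rows * 4).reshape(n_rows, 4)
    mm.flush()
    next_value += n_rows * 4
    paths.append(path)

index = MemmapIndex(paths, row_counts, n_cols=4)
print(len(index), index[3])  # row 3 is the first row of the second file
```

Inside a Dataset, `__getitem__` could wrap the returned row with `torch.from_numpy(...)`.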

Sparse vectors

For very sparse vectors, where most of the items are zero, my recommendation is to store the nonzero indices of each vector rather than the whole thing. Whether this pays off is an empirical question, but storing the indices and rebuilding the dense vector at train time can take significantly less space than storing the entire vector. You can use a custom collate function to expand the indices into dense tensors as each batch is built:

def collate_fn(data):
    '''
    Here, data is a list of tuples returned from dataset[idx]
    above. But supposing that "features" actually corresponds
    to the nonzero indices of the features, we can expand the
    tensor for training.

    This example is for a model which expects input dimension
    (bs, 1024) but you can adapt for your needs.
    '''
    feature_idxs, targets = zip(*data)
    bs = len(feature_idxs)
    features = torch.zeros((bs, 1024))
    # set the nonzero positions one sample (row) at a time
    for row, idxs in enumerate(feature_idxs):
        features[row, torch.as_tensor(idxs, dtype=torch.long)] = 1.0
    targets = torch.stack([torch.as_tensor(t) for t in targets])
    return features, targets

You then use the collate function like this:

from torch.utils.data import DataLoader

bs = 64  # batch size
dataset = MemmapDataset(...) # or SimpleDataset
dataloader = DataLoader(dataset,
                        collate_fn=collate_fn,
                        batch_size=bs)
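To get a feel for the space savings, here is a small NumPy comparison for a single 1024-dimensional vector. It assumes binary features (every nonzero value is 1.0), as in the collate function above; the number of active positions is an arbitrary choice for the demo.

```python
import numpy as np

dim = 1024
rng = np.random.default_rng(0)

# Hypothetical binary feature vector with 16 active positions out of 1024.
active = np.sort(rng.choice(dim, size=16, replace=False))
dense = np.zeros(dim, dtype="float64")
dense[active] = 1.0

# Store only the positions of the nonzero entries.
indices = np.flatnonzero(dense).astype("int16")  # int16 is enough for dim <= 32767

# Rebuild the dense vector when it is needed.
rebuilt = np.zeros(dim, dtype="float64")
rebuilt[indices] = 1.0

print(dense.nbytes, indices.nbytes)  # 8192 bytes dense vs 32 bytes of indices
```

If the nonzero values are not all 1.0, you would need to store the values alongside the indices. Index lists also vary in length from sample to sample, so a fixed-shape memmap would need padding or a separate offsets array.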

Datasets that don’t fit on your hard drive

Sometimes the dataset is so big that it doesn’t fit on your hard drive. In that case, you might need to reorganize your code a bit, preprocessing your data into shard files, each identified by a unique UUID-based shard ID (sid), and training on the shards one by one.

sids = [
    # you may have thousands of shard ids here
    '5c91ad9e-4963-4dfa-8885-6a351dd9fbb8',
    '8ecb542f-00df-468e-8f89-34a6608648d6',
    'f4536d05-a8ff-4591-9322-10860cd06942'
]

for sid in sids:
    dataset = VeryLargeDataset(sid)
    dataloader = DataLoader(dataset, batch_size=64, shuffle=True)

    for i, batch in enumerate(dataloader):
        # do your training
        pass

And your Dataset class will look something like this:

class VeryLargeDataset(Dataset):
    def __init__(self, sid):
        self.data = self._load_shard(sid)

    def _load_shard(self, sid):
        # load from your hard drive or remote
        pass

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

Despite the reorganization required, this is a very flexible approach. Only one shard needs to be loaded at a time, so the total amount of data is no longer limited by your RAM or local disk. It is also a good fit for continuous training, where new shards can be added as they arrive.
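One thing to keep in mind is that `shuffle=True` on the DataLoader only shuffles within a single shard, so it can help to visit the shards in a different order each epoch. The sketch below shows the idea with plain NumPy files; the `<sid>.npy` layout, the placeholder shard IDs, and the helper names `load_shard` and `iterate_epoch` are all assumptions made for the demo.

```python
import os
import random
import tempfile

import numpy as np

# Assumed layout: one "<sid>.npy" file per shard in shard_dir.
shard_dir = tempfile.mkdtemp()
sids = ["shard-a", "shard-b", "shard-c"]  # placeholder IDs for the demo
for i, sid in enumerate(sids):
    np.save(os.path.join(shard_dir, f"{sid}.npy"), np.full((4, 2), float(i)))


def load_shard(sid):
    """Load one shard fully into memory; only one shard is resident at a time."""
    return np.load(os.path.join(shard_dir, f"{sid}.npy"))


def iterate_epoch(sids, seed):
    """Yield (sid, rows) with the shards visited in a seeded random order."""
    order = list(sids)
    random.Random(seed).shuffle(order)
    for sid in order:
        yield sid, load_shard(sid)


rows_seen = 0
visited = []
for sid, rows in iterate_epoch(sids, seed=0):
    visited.append(sid)
    rows_seen += len(rows)  # a training pass over this shard would go here
```

Passing a different seed per epoch (for example the epoch number) changes the shard order while keeping runs reproducible.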

Bonus: Use Rust/C Python bindings

Your feature transformation might be pretty slow depending on what you need to do. In my experience I was waiting 10 minutes to process text files in Python, which really slows down the development/training process. A much more effective approach can be to call into a faster language, such as Rust or C, through Python bindings.

Rust bindings can even return native NumPy arrays, making data transformations very efficient. For more on setting up Rust bindings, check out my how-to article.

import my_bindings

files_to_process = [...]
data = []
for file in files_to_process:
    data.append(my_bindings.process_file(file))