Python / PyTorch: how to get all the data and targets of a Subset

zsohkypk · published 2023-04-19 in Python
Follow (0) | Answers (3) | Views (101)

I use the following code to read a dataset from a folder and split it into train and test subsets. I can get all the data and targets of each subset with a list comprehension, but that is very slow for large datasets. Is there a faster way to do this?

from sklearn.model_selection import train_test_split
from torch.utils.data import Subset

def train_test_dataset(dataset, test_split=0.20):
    # Stratified split over the class labels stored in dataset.targets
    train_idx, test_idx = train_test_split(list(range(len(dataset))), test_size=test_split, stratify=dataset.targets)
    train_dataset = Subset(dataset, train_idx)
    test_dataset = Subset(dataset, test_idx)

    return train_dataset, test_dataset

dataset = dset.ImageFolder("/path_to_folder", transform=transform)

train_set, test_set = train_test_dataset(dataset)

# Slow for large datasets: each iteration loads and transforms one image
train_data = [data for data, _ in train_set]
train_labels = [label for _, label in train_set]

I have also tried the DataLoader approach from PyTorch Datasets: Converting entire Dataset to NumPy, which is better but still takes some time.
Thanks, everyone.

relj7zay 1#

The answer in the link you provided essentially defeats the purpose of having a data loader: a data loader is meant to load your data into memory chunk by chunk. This has the clear advantage of not having to hold the entire dataset in memory at any given moment.
For an ImageFolder dataset, you can split the data with the torch.utils.data.random_split function:

>>> def train_test_dataset(dataset, test_split=.2):
...    test_len = int(len(dataset)*test_split)
...    train_len = len(dataset) - test_len 
...    return random_split(dataset, [train_len, test_len])
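One caveat: random_split does not stratify, while the train_test_split call in the question does. If class balance matters, the stratified indices can still be wrapped in Subsets and fed to DataLoaders. A sketch, using a hypothetical imbalanced tensor dataset in place of the ImageFolder:

```python
import torch
from torch.utils.data import Subset, TensorDataset
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced dataset: 80 samples of class 0, 20 of class 1
targets = [0] * 80 + [1] * 20
ds = TensorDataset(torch.randn(100, 4), torch.tensor(targets))

# Stratified index split, then wrap the indices in Subsets
train_idx, test_idx = train_test_split(
    list(range(100)), test_size=0.2, stratify=targets, random_state=0)
train_set, test_set = Subset(ds, train_idx), Subset(ds, test_idx)

# The 80/20 class ratio is preserved inside the 20-sample test split
test_labels = [targets[i] for i in test_idx]
print(test_labels.count(0), test_labels.count(1))  # 16 4
```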

You can then plug these datasets into separate DataLoaders:

>>> train_set, test_set = train_test_dataset(dataset)

>>> train_dl = DataLoader(train_set, batch_size=16, shuffle=True)
>>> test_dl  = DataLoader(test_set, batch_size=32, shuffle=False)
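That said, if you really do need everything in memory at once, letting the DataLoader batch the reads and concatenating once at the end is usually much faster than a per-sample list comprehension. A sketch, with a hypothetical tensor dataset standing in for the ImageFolder split:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in for the train split of the question's ImageFolder
X = torch.randn(100, 3, 8, 8)
y = torch.randint(0, 2, (100,))
train_set = TensorDataset(X, y)

# Read in batches, concatenate once at the end
loader = DataLoader(train_set, batch_size=64, shuffle=False)
data_batches, label_batches = zip(*[(d, l) for d, l in loader])
train_data = torch.cat(data_batches)
train_labels = torch.cat(label_batches)

print(train_data.shape, train_labels.shape)
```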
lmvvr0a8 2#

A simpler approach that doesn't require recreating a data loader for each subset is to use Subset's __getitem__ and __len__ methods. Something like:

train_data = train_set.__getitem__([i for i in range(0, train_set.__len__())])[0]
train_labels = train_set.__getitem__([i for i in range(0, train_set.__len__())])[1]
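If only the labels are needed, it can be faster still to skip __getitem__ entirely and read the parent dataset's targets through Subset.indices, since that never loads an image. A sketch with a hypothetical stand-in dataset (ImageFolder stores its class indices in a plain .targets list):

```python
import torch
from torch.utils.data import Subset, TensorDataset

# Hypothetical dataset mimicking ImageFolder's .targets attribute
full = TensorDataset(torch.randn(10, 4), torch.arange(10))
full.targets = list(range(10))
sub = Subset(full, [2, 5, 7])

# Gather labels directly from the parent; no sample is ever loaded
labels = [full.targets[i] for i in sub.indices]
print(labels)  # [2, 5, 7]
```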
8fsztsew 3#

For a standard image dataset you can exploit 1. the Subset.dataset accessor for the whole dataset, 2. the Subset.indices store of the subset's indices, and 3. the __getitem__ logic of the whole dataset, which you can inspect via from inspect import getsource; getsource(subset.dataset.__getitem__) (where subset is your actual Subset object).

# Setting up the splits
from torch.utils.data import DataLoader
from torch.utils.data import random_split
from torch import Generator
from torchvision.datasets import FashionMNIST
from torchvision.transforms import ToTensor

BATCH_SIZE = 64

# Download and load the training data
dataset_train = FashionMNIST('./data', download=True, train=True, transform=ToTensor())
print(f'Before splitting the full train set into train and valid: len(dataset_train)={len(dataset_train)}')

SEED = 42  # any fixed value; makes the split reproducible

size_valid = 5000
size_train = len(dataset_train) - size_valid
dataset_train, dataset_valid = random_split(dataset_train, [size_train, size_valid], generator=Generator().manual_seed(SEED))
print(f'After splitting the full train set into train and valid: len(dataset_train)={len(dataset_train)}, len(dataset_valid)={len(dataset_valid)}')
dataloader_train = DataLoader(dataset_train, batch_size=BATCH_SIZE, shuffle=True)
dataloader_valid = DataLoader(dataset_valid, batch_size=BATCH_SIZE, shuffle=False)

# Download and load the test data
dataset_test = FashionMNIST('./data', download=True, train=False, transform=ToTensor())
dataloader_test = DataLoader(dataset_test, batch_size=BATCH_SIZE, shuffle=False)

# Accessing the X and y of a split Subset
# Note: this reads the raw .data/.targets tensors directly, so the
# transform pipeline (here ToTensor) is NOT applied to X_valid
X_valid = dataset_valid.dataset.data[dataset_valid.indices]
y_valid = dataset_valid.dataset.targets[dataset_valid.indices].long()
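The same pattern can be checked without downloading FashionMNIST; a minimal sketch, assuming a hypothetical dataset that exposes .data and .targets tensors the way the torchvision MNIST-family datasets do:

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Hypothetical stand-in for FashionMNIST: raw .data and .targets tensors
ds = TensorDataset(torch.randn(20, 2), torch.arange(20))
ds.data = torch.randn(20, 2)    # raw, untransformed samples
ds.targets = torch.arange(20)   # class indices

train, valid = random_split(ds, [15, 5],
                            generator=torch.Generator().manual_seed(0))

# Same pattern as above: index the parent's tensors with Subset.indices
X_valid = valid.dataset.data[valid.indices]
y_valid = valid.dataset.targets[valid.indices].long()
print(X_valid.shape, y_valid.shape)  # torch.Size([5, 2]) torch.Size([5])
```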
