Loading Datasets (Dataset, DataLoader, Sampler), pin_memory, num

Article link: https://blog.csdn.net/weixin_43135178/article/details/124673732

The relationship between Dataset, DataLoader, and Sampler

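Suppose our data is a set of images, each identified by an index. To read data we only need the relevant indices (the indices variable in the code). Indices can be chosen in several ways, in order or shuffled, and that is the Sampler's job. So where do the Dataset and DataLoader come in? Once the Sampler has produced the indices, the Dataset reads the data stored at each index, and the DataLoader drives the whole process: it asks the Sampler for indices, fetches the corresponding items from the Dataset, and assembles them into batches.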
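The division of labor described above can be sketched by doing by hand what a DataLoader does for us. This is a simplified illustration, not the actual DataLoader implementation, and SquaresDataset is a hypothetical toy dataset introduced only for this example.

```python
import torch
from torch.utils.data import Dataset, SequentialSampler, BatchSampler


class SquaresDataset(Dataset):
    """Hypothetical toy dataset: item i is i squared."""

    def __init__(self, n):
        self.values = torch.arange(n) ** 2

    def __getitem__(self, index):
        return self.values[index]

    def __len__(self):
        return len(self.values)


dataset = SquaresDataset(6)
sampler = SequentialSampler(dataset)  # decides the order of the indices
batch_sampler = BatchSampler(sampler, batch_size=4, drop_last=False)  # groups indices

manual_batches = []
for indices in batch_sampler:                    # a list of indices per batch
    samples = [dataset[i] for i in indices]      # the Dataset reads each index
    manual_batches.append(torch.stack(samples))  # collate the samples into one tensor
```

Swapping SequentialSampler for another sampler changes only the order of the indices; the Dataset code stays the same.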

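1. How do we implement a custom Dataset class that a DataLoader can use later?

A custom dataset class must inherit from Dataset. Dataset is an abstract base class: its __getitem__ is not implemented, so it is not useful on its own. We subclass it, implement the required methods, and can then instantiate our own dataset class.

We need to implement three methods in the subclass: __init__(self, filepath), __getitem__(self, index), and __len__(self).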

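from torch.utils.data import Dataset
from torch.utils.data import DataLoader
class MyDataset(Dataset):
    def __init__(self, filepath):
        # Usually defines the attributes that the other methods of the class will use
        pass
    def __getitem__(self, index):
        # Takes an index and returns the data stored at that index
        pass
    def __len__(self):
        # Returns the number of samples; the DataLoader uses it as the range of valid indices
        pass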

A small Dataset example

Initializing the Dataset

import torch
from torch.utils import data
class MyDataset(torch.utils.data.Dataset):
    def __init__(self):
        self.data = torch.arange(0, 20)
    
    def __getitem__(self,index):
        x = self.data[index]
        y=x*2
        return y
    
    def __len__(self):
        return len(self.data)

# Calls __init__
dataset = MyDataset()

# Calls __len__
print(len(dataset))  # 20

# Calls __getitem__
print(dataset[3])  # tensor(6)

# Define and use a DataLoader
dataloader = torch.utils.data.DataLoader(dataset, shuffle=True, batch_size=4)
# len(dataset) / batch_size = 20 / 4 = 5 batches
print(len(dataloader))


# The for loop iterates over the dataset's elements directly; the indices are generated automatically,
# which amounts to calling __getitem__ implicitly
for x in dataloader:
    print(x)

# Example output (the order changes on every run because shuffle=True):
# tensor([ 8, 12, 22,  0])
# tensor([28, 30, 36,  2])
# tensor([ 6, 10,  4, 34])
# tensor([32, 20, 24, 16])
# tensor([18, 26, 14, 38])

Understanding pin_memory and num_workers

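pin_memory refers to page-locked (pinned) memory. Creating a DataLoader with pin_memory=True means the Tensors it produces are placed in page-locked host memory from the start, which makes transferring them from host memory to GPU memory faster.

Host memory comes in two kinds: page-locked and pageable. Page-locked memory is never swapped out to the host's virtual memory (here meaning swap space on disk), whereas pageable memory may be moved out to swap when the host runs low on RAM.

GPU memory, by contrast, is entirely page-locked!

When the machine has plenty of RAM, you can set pin_memory=True. If the system freezes or uses too much swap, set pin_memory=False. Because the benefit of pin_memory depends on the hardware, and the PyTorch developers cannot assume that every practitioner has a high-end machine, pin_memory defaults to False.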

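For our purposes, though, when training on a GPU it is strongly recommended to set pin_memory=True on the DataLoader and to set num_workers to the number of CPU cores on the host as a starting point, so that data loading does not hold back the GPU.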
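A minimal sketch of how these two settings are typically used together, assuming a toy TensorDataset made up for this example. pin_memory is enabled only when a GPU is present, and num_workers is left at 0 so the snippet runs anywhere; in real training it would be raised toward the number of CPU cores.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Toy data invented for illustration: 64 samples with 3 features and a binary label.
dataset = TensorDataset(torch.randn(64, 3), torch.randint(0, 2, (64,)))

loader = DataLoader(
    dataset,
    batch_size=16,
    num_workers=0,                         # 0 = load in the main process; raise for real workloads
    pin_memory=torch.cuda.is_available(),  # pinning only pays off when copying to a GPU
)

seen = 0
for inputs, labels in loader:
    # With a pinned source tensor, non_blocking=True lets the host-to-GPU copy
    # run asynchronously; on CPU it has no effect.
    inputs = inputs.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)
    seen += inputs.shape[0]
```

On platforms that start worker processes with spawn (Windows and macOS), code that uses num_workers > 0 should sit under an if __name__ == "__main__": guard.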

Reference: "Understanding the pin_memory parameter when creating data.DataLoader in PyTorch" (CSDN blog)

Basic DataLoader usage:

API

def __init__(self, dataset, batch_size=1, shuffle=False, sampler=None,
             batch_sampler=None, num_workers=0, collate_fn=None,
             pin_memory=False, drop_last=False, timeout=0,
             worker_init_fn=None, multiprocessing_context=None):

Some of the commonly used parameters:

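dataset: the dataset to load from. It can be map-style (an object that supports index lookup through __getitem__ and has a __len__) or iterable-style.
batch_size: number of samples per batch.
shuffle: whether to reshuffle the samples at every epoch before forming batches; defaults to False.
sampler: defines how indices are drawn from the dataset. It is an iterable that yields one index (key) at a time, which is then used to read a value from the dataset. It cannot be combined with shuffle=True, and is typically specified explicitly only in distributed training.
batch_sampler: also an iterable, but it yields one batch of indices (batch_size keys) at a time.
num_workers: number of worker subprocesses used for loading data (0 means loading happens in the main process). Setting it to the number of CPU cores on the host is a suggested starting point.
collate_fn: function that merges a list of samples into a batch; it can apply extra processing to the data in a batch.
pin_memory: if True, the produced Tensors start out in page-locked host memory, which speeds up transferring them to GPU memory. Recommended when the machine has plenty of RAM.
drop_last: whether to drop the last batch when it contains fewer than batch_size samples.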
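As an illustration of collate_fn and drop_last, here is a minimal sketch that batches variable-length sequences. pad_collate is a hypothetical helper written for this example; pad_sequence is the padding utility from torch.nn.utils.rnn.

```python
import torch
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence

# A plain list works as a map-style dataset: it has __getitem__ and __len__.
sequences = [torch.ones(n) for n in (3, 1, 4, 2, 5)]


def pad_collate(batch):
    """Hypothetical collate_fn: zero-pad to the longest sequence and return the lengths."""
    lengths = torch.tensor([len(seq) for seq in batch])
    padded = pad_sequence(batch, batch_first=True)
    return padded, lengths


# Five sequences with batch_size=2: the last, incomplete batch is dropped.
loader = DataLoader(sequences, batch_size=2, collate_fn=pad_collate, drop_last=True)
shapes = [tuple(padded.shape) for padded, _ in loader]
```

Without a custom collate_fn, the default collation would fail here because tensors of different lengths cannot be stacked.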

Basic usage

Five arguments are most commonly passed to DataLoader: the instantiated dataset (dataset = Dataset()), batch_size, shuffle, sampler, and num_workers.

train_loader = DataLoader(dataset=dataset, batch_size=32, shuffle=True, sampler=None, num_workers=2)

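The figure in the original post illustrates how DataLoader works: in every epoch it shuffles all samples of the dataset and then groups every N samples into one batch (the figure uses N = 2).

Return value:

What is returned depends on how the Dataset's __getitem__ is written, but typically each iteration of a DataLoader yields the data together with its labels.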

for inputs, labels in dataloader:
    print(inputs, labels)

Typically, iterating over a DataLoader yields one batch of images of shape (batch_size, 3, 256, 256) and the corresponding labels of shape (batch_size,).
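The shapes mentioned above can be checked with a small sketch. RandomImageDataset is a hypothetical dataset of random tensors standing in for real images, and the 10 label classes are an arbitrary assumption.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class RandomImageDataset(Dataset):
    """Hypothetical dataset: random 3x256x256 'images' with integer labels."""

    def __init__(self, n):
        self.images = torch.rand(n, 3, 256, 256)
        self.labels = torch.randint(0, 10, (n,))

    def __getitem__(self, index):
        # Returning a tuple makes the DataLoader yield (inputs, labels) pairs.
        return self.images[index], self.labels[index]

    def __len__(self):
        return len(self.labels)


loader = DataLoader(RandomImageDataset(8), batch_size=4)
inputs, labels = next(iter(loader))
print(inputs.shape, labels.shape)
```

The default collation stacks the per-sample tensors along a new first dimension, which is where the batch_size axis comes from.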

2. Creating and using a DataLoader:

Simply instantiate the Dataset, pass it to DataLoader, and set the relevant parameters.

dataset = DiabetesDataset(filepath='xxx')
dataloader = DataLoader(dataset=dataset, batch_size=32, shuffle=True, num_workers=0)
for batch in dataloader:
    # Run training or other processing here
    print(batch)

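Samplers

The first thing to know is that all samplers inherit from the Sampler class, shown below.

It has three main methods:

__init__: the constructor, which simply initializes the sampler
__iter__: produces the indices to iterate over, i.e. it determines which data is read at each step
__len__: returns the length of the iterator, i.e. the number of indices it yields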

class Sampler(object):
    r"""Base class for all Samplers.
    Every Sampler subclass has to provide an __iter__ method, providing a way
    to iterate over indices of dataset elements, and a __len__ method that
    returns the length of the returned iterators.
    """
    # Base class for samplers, which are iterables over indices
    def __init__(self, data_source):
        pass

    def __iter__(self):
        raise NotImplementedError

    def __len__(self):
        raise NotImplementedError
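To make the roles of __iter__ and __len__ concrete, here is a minimal sketch of a custom sampler. ReverseSampler is a hypothetical class written for this example and is not part of PyTorch.

```python
import torch
from torch.utils.data import Sampler, DataLoader


class ReverseSampler(Sampler):
    """Hypothetical sampler: yields indices from the last one down to 0."""

    def __init__(self, data_source):
        self.data_source = data_source

    def __iter__(self):
        return iter(range(len(self.data_source) - 1, -1, -1))

    def __len__(self):
        return len(self.data_source)


data = torch.arange(5) * 10  # a 1-D tensor can serve as a map-style dataset
order = list(ReverseSampler(data))
batches = [b.tolist() for b in DataLoader(data, batch_size=2, sampler=ReverseSampler(data))]
```

The DataLoader only consumes the indices the sampler yields, so any ordering policy can be plugged in this way.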

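Sampler subclasses

Having covered the base class, let's look at the samplers PyTorch provides.

1) SequentialSampler: sequential sampling

As the name suggests, it samples the dataset in order.

It works as follows: the constructor stores the dataset as data_source, and __iter__ returns an iterator over a range with the same length as data_source. Each step yields a single index.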

class SequentialSampler(Sampler):
    r"""Samples elements sequentially, always in the same order.
    Arguments:
        data_source (Dataset): dataset to sample from
    """
    # Produces indices in sequential order
    def __init__(self, data_source):
        self.data_source = data_source

    def __iter__(self):
        return iter(range(len(self.data_source)))

    def __len__(self):
        return len(self.data_source)

Usage example:

import torch

a = [1, 5, 78, 9, 68]
b = torch.utils.data.SequentialSampler(a)
for x in b:
    print(x)

# Output (the indices, not the values):
# 0
# 1
# 2
# 3
# 4

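2) RandomSampler: random sampling

Parameters:

data_source: same as above
num_samples: number of samples to draw; defaults to the size of the dataset.
replacement: if True, sampling is done with replacement, i.e. the same sample can be drawn more than once, which may leave some samples never drawn. In that case num_samples can be set to a larger value so that every sample has a better chance of being drawn.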

class RandomSampler(Sampler):
    r"""Samples elements randomly. If without replacement, then sample from a shuffled dataset.
    If with replacement, then user can specify ``num_samples`` to draw.
    Arguments:
        data_source (Dataset): dataset to sample from
        num_samples (int): number of samples to draw, default=len(dataset)
        replacement (bool): samples are drawn with replacement if ``True``, default=False
    """

    def __init__(self, data_source, replacement=False, num_samples=None):
        self.data_source = data_source
        self.replacement = replacement
        self.num_samples = num_samples

        if self.num_samples is not None and replacement is False:
            raise ValueError("With replacement=False, num_samples should not be specified, "
                             "since a random permute will be performed.")

        if self.num_samples is None:
            self.num_samples = len(self.data_source)

        if not isinstance(self.num_samples, int) or self.num_samples <= 0:
            raise ValueError("num_samples should be a positive integeral "
                             "value, but got num_samples={}".format(self.num_samples))
        if not isinstance(self.replacement, bool):
            raise ValueError("replacement should be a boolean value, but got "
                             "replacement={}".format(self.replacement))

    def __iter__(self):
        n = len(self.data_source)
        if self.replacement:
            return iter(torch.randint(high=n, size=(self.num_samples,), dtype=torch.int64).tolist())
        return iter(torch.randperm(n).tolist())

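    def __len__(self):
        # The iterator yields num_samples indices, which can differ from len(data_source)
        # when replacement=True; newer PyTorch releases return self.num_samples here
        return self.num_samples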
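A short sketch of the two modes, using the public torch.utils.data.RandomSampler on a small list invented for this example. The drawn indices are random, so no specific output is shown.

```python
import torch
from torch.utils.data import RandomSampler

data = list(range(5))

# Without replacement: a random permutation, every index appears exactly once.
perm = list(RandomSampler(data))

# With replacement: num_samples may exceed len(data), and duplicates are allowed.
with_repl = RandomSampler(data, replacement=True, num_samples=12)
drawn = list(with_repl)

print(perm)
print(drawn)
```

Passing shuffle=True to a DataLoader is, in effect, a shortcut for using a RandomSampler without replacement.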

3) SubsetRandomSampler (commonly used)

class SubsetRandomSampler(Sampler):
    r"""Samples elements randomly from a given list of indices, without replacement.
    Arguments:
        indices (sequence): a sequence of indices
    """

    def __init__(self, indices):
        self.indices = indices

    def __iter__(self):
        return (self.indices[i] for i in torch.randperm(len(self.indices)))

    def __len__(self):
        return len(self.indices)

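A common use of this sampler is splitting a dataset into training, validation, and test subsets. The example below splits off one third of the training data for validation: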

import random

n_train = len(train_dataset)
split = n_train // 3
indices = list(range(n_train))
random.shuffle(indices)  # shuffles in place and returns None, so its result must not be assigned
train_sampler = torch.utils.data.sampler.SubsetRandomSampler(indices[split:])
valid_sampler = torch.utils.data.sampler.SubsetRandomSampler(indices[:split])
# Add batch_size, num_workers, etc. as needed
train_loader = DataLoader(train_dataset, sampler=train_sampler)
valid_loader = DataLoader(train_dataset, sampler=valid_sampler)

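4) WeightedRandomSampler

num_samples and replacement have the same meaning as in RandomSampler above, so they are not repeated here. weights gives the relative probability of drawing each index and does not need to sum to one.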

class WeightedRandomSampler(Sampler):
    r"""Samples elements from [0,..,len(weights)-1] with given probabilities (weights).
    Arguments:
        weights (sequence)   : a sequence of weights, not necessary summing up to one
        num_samples (int): number of samples to draw
        replacement (bool): if ``True``, samples are drawn with replacement.
            If not, they are drawn without replacement, which means that when a
            sample index is drawn for a row, it cannot be drawn again for that row.
    """

    def __init__(self, weights, num_samples, replacement=True):
        if not isinstance(num_samples, _int_classes) or isinstance(num_samples, bool) or \
                num_samples <= 0:
            raise ValueError("num_samples should be a positive integeral "
                             "value, but got num_samples={}".format(num_samples))
        if not isinstance(replacement, bool):
            raise ValueError("replacement should be a boolean value, but got "
                             "replacement={}".format(replacement))
        self.weights = torch.tensor(weights, dtype=torch.double)
        self.num_samples = num_samples
        self.replacement = replacement

    def __iter__(self):
        return iter(torch.multinomial(self.weights, self.num_samples, self.replacement).tolist())

    def __len__(self):
        return self.num_samples  # total number of samples drawn in one pass
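A typical application is rebalancing an imbalanced dataset. The sketch below assumes a made-up label vector with a 90/10 class split and weights each sample by the inverse frequency of its class; the seeded generator is there only to make the draw repeatable.

```python
import torch
from torch.utils.data import WeightedRandomSampler

# Hypothetical imbalanced labels: 90 samples of class 0 and 10 of class 1.
labels = torch.cat([torch.zeros(90, dtype=torch.long), torch.ones(10, dtype=torch.long)])

class_counts = torch.bincount(labels)                  # samples per class
sample_weights = 1.0 / class_counts[labels].double()   # rarer class gets a larger weight

generator = torch.Generator().manual_seed(0)           # fixed seed for repeatability
sampler = WeightedRandomSampler(
    sample_weights, num_samples=10000, replacement=True, generator=generator
)

drawn = torch.tensor(list(sampler))
minority_fraction = (labels[drawn] == 1).double().mean().item()
print(minority_fraction)  # close to 0.5 instead of the original 0.1
```

The sampler would then be passed to a DataLoader through its sampler argument, with shuffle left at its default.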

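5) BatchSampler

The samplers above return a single index at a time, but training operates on batches of data, and that is what BatchSampler is for. It collects the indices produced by another Sampler and, once their number reaches the batch size, returns them together as one batch of indices.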

class BatchSampler(Sampler):
    r"""Wraps another sampler to yield a mini-batch of indices.
    Args:
        sampler (Sampler): Base sampler.
        batch_size (int): Size of mini-batch.
        drop_last (bool): If ``True``, the sampler will drop the last batch if
            its size would be less than ``batch_size``
    Example:
        >>> list(BatchSampler(SequentialSampler(range(10)), batch_size=3, drop_last=False))
        [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
        >>> list(BatchSampler(SequentialSampler(range(10)), batch_size=3, drop_last=True))
        [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
    """
    # Batch sampling
    def __init__(self, sampler, batch_size, drop_last):
        if not isinstance(sampler, Sampler):
            raise ValueError("sampler should be an instance of "
                             "torch.utils.data.Sampler, but got sampler={}"
                             .format(sampler))
        if not isinstance(batch_size, _int_classes) or isinstance(batch_size, bool) or \
                batch_size <= 0:
            raise ValueError("batch_size should be a positive integeral value, "
                             "but got batch_size={}".format(batch_size))
        if not isinstance(drop_last, bool):
            raise ValueError("drop_last should be a boolean value, but got "
                             "drop_last={}".format(drop_last))
        self.sampler = sampler
        self.batch_size = batch_size
        self.drop_last = drop_last

    def __iter__(self):
        batch = []
        for idx in self.sampler:
            batch.append(idx)
            if len(batch) == self.batch_size:
                yield batch
                batch = []
        if len(batch) > 0 and not self.drop_last:
            yield batch

    def __len__(self):
        if self.drop_last:
            return len(self.sampler) // self.batch_size
        else:
            return (len(self.sampler) + self.batch_size - 1) // self.batch_size

Usage:

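import torch

a = [1, 5, 78, 9, 68]
sample = torch.utils.data.SequentialSampler(a)
for x in sample:
    print(x)  # prints the indices 0 to 4, one per line

# Wrap the sampler so that indices come out in batches
batch_sampler = torch.utils.data.BatchSampler(sample, batch_size=1, drop_last=True)
print(list(batch_sampler))  # [[0], [1], [2], [3], [4]]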
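A BatchSampler can also be handed to a DataLoader directly through the batch_sampler argument. A minimal sketch, reusing the same five values as above:

```python
import torch
from torch.utils.data import DataLoader, BatchSampler, SequentialSampler

data = torch.tensor([1, 5, 78, 9, 68])
batch_sampler = BatchSampler(SequentialSampler(data), batch_size=2, drop_last=False)

# When batch_sampler is given, batch_size, shuffle, sampler and drop_last
# must be left at their defaults; the batch sampler already decides all of that.
loader = DataLoader(data, batch_sampler=batch_sampler)
batches = [b.tolist() for b in loader]
print(batches)  # [[1, 5], [78, 9], [68]]
```

This is what a DataLoader builds internally when only batch_size and drop_last are specified.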

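6) DistributedSampler

It is used in the same way as the samplers above, except that it is distributed: each process samples only from its own share of the dataset.

Using a sampler in distributed training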

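if args.distributed:
    sampler_train = torch.utils.data.DistributedSampler(
        dataset, num_replicas=num_tasks, rank=global_rank, shuffle=True
    )
    print("Sampler_train = %s" % str(sampler_train))
else:
    # Fallback so that sampler_train is also defined outside distributed runs
    sampler_train = torch.utils.data.RandomSampler(dataset)

# Use the sampler defined above. Do not also pass shuffle=True:
# DataLoader raises a ValueError when both sampler and shuffle are given.
graph_data_loader = DataLoader(dataset, sampler=sampler_train, batch_size=args.batch_size, num_workers=16, pin_memory=True, collate_fn=mol_collator)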
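How the dataset is divided among processes can be seen without launching a distributed job. The sketch below assumes a toy dataset of ten items and two processes; passing num_replicas and rank explicitly means no process group has to be initialized.

```python
import torch
from torch.utils.data import DistributedSampler

dataset = list(range(10))

# One sampler per simulated rank; with shuffle=False the indices are interleaved.
shards = [
    list(DistributedSampler(dataset, num_replicas=2, rank=r, shuffle=False))
    for r in range(2)
]
print(shards)  # [[0, 2, 4, 6, 8], [1, 3, 5, 7, 9]]

# With shuffle=True the permutation is derived from the seed and the epoch,
# so set_epoch should be called at the start of every epoch.
shuffled = DistributedSampler(dataset, num_replicas=2, rank=0, shuffle=True, seed=0)
shuffled.set_epoch(0)
epoch0 = list(shuffled)
shuffled.set_epoch(0)
epoch0_again = list(shuffled)  # same epoch, same order
```

If set_epoch is never called, every epoch reuses the same shuffled order.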

https://www.cnblogs.com/marsggbo/p/11308889.html

"PyTorch Sampler explained" (Zhihu)