The backbone part of Faster R-CNN
The backbone is ResNet-50 with an FPN added on top. The FPN is the feature pyramid network; in short, it fuses features at different scales for detection: it brings in the positional information of low-level convolutional features, which helps a lot when detecting small objects, and it also incorporates the semantic information of high-level convolutional features. The concrete implementation is analyzed in the code below.
def resnet50_fpn_backbone():
    resnet_backbone = ResNet(Bottleneck, [3, 4, 6, 3],
                             include_top=False)
    # freeze layer1 and every layer before it (generic low-level features)
    for name, parameter in resnet_backbone.named_parameters():
        if 'layer2' not in name and 'layer3' not in name and 'layer4' not in name:
            parameter.requires_grad_(False)
    return_layers = {'layer1': '0', 'layer2': '1', 'layer3': '2', 'layer4': '3'}
    # in_channel is the channel count of layer4's output feature map = 2048
    in_channels_stage2 = resnet_backbone.in_channel // 8
    in_channels_list = [
        in_channels_stage2,       # layer1 out_channel=256
        in_channels_stage2 * 2,   # layer2 out_channel=512
        in_channels_stage2 * 4,   # layer3 out_channel=1024
        in_channels_stage2 * 8,   # layer4 out_channel=2048
    ]
    out_channels = 256
    return BackboneWithFPN(resnet_backbone, return_layers, in_channels_list, out_channels)
As you can see, ResNet-50 is used as the backbone. The return_layers dict means that the outputs of layer1, layer2, layer3 and layer4 of ResNet-50 are used for detection. in_channels_list holds the output channel counts of layer1 through layer4, and out_channels means we want the outputs of layer1 through layer4 all converted to 256 channels.
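As a quick sanity check, the backbone can be called on a dummy image to inspect the returned feature maps (a minimal sketch assuming the resnet50_fpn_backbone function above is importable; the input size is arbitrary and chosen to match the debug shapes shown later):

import torch

backbone = resnet50_fpn_backbone()
x = torch.rand(1, 3, 928, 1344)   # dummy image batch
feats = backbone(x)               # OrderedDict of feature maps
for name, feat in feats.items():
    print(name, feat.shape)       # '0'..'3' plus 'pool', all with 256 channels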
Next, look at the BackboneWithFPN class.
class BackboneWithFPN(nn.Module):
    def __init__(self, backbone, return_layers, in_channels_list, out_channels):
        super(BackboneWithFPN, self).__init__()
        self.body = IntermediateLayerGetter(backbone, return_layers=return_layers)
        self.fpn = FeaturePyramidNetwork(
            in_channels_list=in_channels_list,
            out_channels=out_channels,
            extra_blocks=LastLevelMaxPool(),
        )
        self.out_channels = out_channels

    def forward(self, x):
        x = self.body(x)
        x = self.fpn(x)
        return x
BackboneWithFPN defines a body and an fpn.
body uses the IntermediateLayerGetter class to collect the outputs of resnet layer1 through layer4.
fpn uses the FeaturePyramidNetwork class to produce the ordered dict of post-FPN outputs.
Now let's look at the IntermediateLayerGetter class.
class IntermediateLayerGetter(nn.ModuleDict):
    def __init__(self, model, return_layers):
        if not set(return_layers).issubset([name for name, _ in model.named_children()]):
            raise ValueError("return_layers are not present in model")
        orig_return_layers = return_layers
        return_layers = {k: v for k, v in return_layers.items()}
        layers = OrderedDict()
        # iterate over the model's child modules and store them in order
        # keep only layer4 and everything before it, drop the unused layers after it
        for name, module in model.named_children():
            layers[name] = module
            if name in return_layers:
                del return_layers[name]
            # stop once return_layers is empty
            if not return_layers:
                break
        super(IntermediateLayerGetter, self).__init__(layers)
        self.return_layers = orig_return_layers

    def forward(self, x):
        out = OrderedDict()
        # run the forward pass through every child module in order
        # and collect the outputs of layer1, layer2, layer3 and layer4
        for name, module in self.named_children():
            x = module(x)
            if name in self.return_layers:
                out_name = self.return_layers[name]
                out[out_name] = x
        return out
So the IntermediateLayerGetter class puts the layers listed in return_layers, together with every layer before them, into the layers dict, uses that dict to initialize a ModuleDict, and then in forward it walks through all of these child modules, running the forward pass and collecting the outputs of layer1 through layer4.
Debugging shows that the output of IntermediateLayerGetter here is the following (later debug results may come from a different run with a different input image, so the shapes may differ slightly; within a single debug run they are identical):
out: {
    "0": (batch, 256, 232, 336)
    "1": (batch, 512, 116, 168)
    "2": (batch, 1024, 58, 84)
    "3": (batch, 2048, 29, 42)
}
Now that we have the outputs of resnet layer1 through layer4, let's look at the FeaturePyramidNetwork class.
(Feature pyramid illustration)
def forward(self, x):
    names = list(x.keys())
    x = list(x.values())
    # adjust the channels of resnet layer4 to the specified out_channels
    last_inner = self.inner_blocks[-1](x[-1])
    # results holds every prediction feature map
    results = []
    # pass the channel-adjusted layer4 feature through a 3x3 conv to get its prediction feature map
    results.append(self.layer_blocks[-1](last_inner))
    for idx in range(len(x) - 2, -1, -1):
        inner_lateral = self.get_result_from_inner_blocks(x[idx], idx)
        feat_shape = inner_lateral.shape[-2:]
        inner_top_down = F.interpolate(last_inner, size=feat_shape, mode="nearest")
        last_inner = inner_lateral + inner_top_down
        results.insert(0, self.get_result_from_layer_blocks(last_inner, idx))
    # generate prediction feature map 5 ("pool") on top of the one corresponding to layer4
    if self.extra_blocks is not None:
        results, names = self.extra_blocks(results, names)
    # make it back an OrderedDict
    out = OrderedDict([(k, v) for k, v in zip(names, results)])
    return out
Here we go straight to the forward function of the FeaturePyramidNetwork class.
self.inner_blocks is a list of conv layers whose job is to convert the output channels of the different layers to the desired 256 channels:
[(256, 256, 1), (512, 256, 1), (1024, 256, 1), (2048, 256, 1)]
In (256, 256, 1), the first 256 is the input channel count, the second 256 is the output channel count, and 1 means a 1×1 convolution.
self.layer_blocks is also a list of conv layers; it applies a 3×3 convolution to each of the now channel-matched outputs so every FPN level ends up with an output of the same channel count:
[(256, 256, 3), (256, 256, 3), (256, 256, 3), (256, 256, 3)]
The four kernels in the inner_blocks and layer_blocks lists correspond to the convolutions applied to the four backbone output levels; a sketch of how these two lists are built follows.
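Roughly, the two lists are created in the constructor of FeaturePyramidNetwork (a sketch of the relevant part, assuming in_channels_list=[256, 512, 1024, 2048] and out_channels=256 as above):

self.inner_blocks = nn.ModuleList()
self.layer_blocks = nn.ModuleList()
for in_channels in in_channels_list:
    # 1x1 conv that maps this level's channels to out_channels
    self.inner_blocks.append(nn.Conv2d(in_channels, out_channels, kernel_size=1))
    # 3x3 conv applied after the top-down fusion to produce the prediction feature map
    self.layer_blocks.append(nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1))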
The input of forward is x, which is the ordered dict out produced earlier by IntermediateLayerGetter. The last level of x (the fourth level, "3": (batch, 2048, 29, 42)) is processed first: it goes through the last conv layer in inner_blocks, giving (batch, 256, 29, 42), then through the last conv layer in layer_blocks, giving (batch, 256, 29, 42), and that result is appended to the results list. Then x is traversed backwards starting from the third level. Each level is first passed through its inner_blocks conv; the previous level's output last_inner is upsampled with interpolate to this level's size, giving inner_top_down; inner_top_down is added to this level's inner_blocks output to form the new last_inner (this is the feature fusion: high-level semantic information combined with the finer positional information of the lower levels); finally last_inner is passed through the corresponding layer_blocks conv, giving this level's post-FPN result. Every level of x is processed this way, and an extra max-pool level is appended after the last one, giving the final output (a sketch of a single fusion step is given after the output dict below):
out: {
    "0": (batch, 256, 232, 336)
    "1": (batch, 256, 116, 168)
    "2": (batch, 256, 58, 84)
    "3": (batch, 256, 29, 42)
    "pool": (batch, 256, 15, 21)
}
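Here is a single top-down fusion step in isolation, matching the layer3/layer4 shapes above (a sketch with dummy tensors and freshly created convs, just to show the shapes; in the real network the convs come from inner_blocks and layer_blocks):

import torch
import torch.nn.functional as F

c3 = torch.rand(1, 1024, 58, 84)                       # layer3 output
p4 = torch.rand(1, 256, 29, 42)                        # fused layer4 feature (last_inner)
inner3 = torch.nn.Conv2d(1024, 256, 1)(c3)             # 1x1 lateral conv -> (1, 256, 58, 84)
top_down = F.interpolate(p4, size=inner3.shape[-2:], mode="nearest")  # upsample to (58, 84)
last_inner = inner3 + top_down                         # element-wise fusion
p3 = torch.nn.Conv2d(256, 256, 3, padding=1)(last_inner)  # 3x3 conv -> prediction map (1, 256, 58, 84)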
Backbone summary
To sum up, the backbone part keeps the ResNet structure up to and including layer4 and uses IntermediateLayerGetter to collect the outputs of resnet layer1 through layer4, giving the output dict
out: {
    "0": (batch, 256, 232, 336)
    "1": (batch, 512, 116, 168)
    "2": (batch, 1024, 58, 84)
    "3": (batch, 2048, 29, 42)
}
and then passes them through the FPN, which fuses high- and low-level feature information, converts every output to 256 channels, and appends an extra pool level after the last one, giving the backbone's output dict
out: {
    "0": (batch, 256, 232, 336)
    "1": (batch, 256, 116, 168)
    "2": (batch, 256, 58, 84)
    "3": (batch, 256, 29, 42)
    "pool": (batch, 256, 15, 21)
}
(The image below shows handwritten notes on scratch paper, essentially the same as the text above.)
The transform part
That completes the backbone. When the dataloader reads the data, Faster R-CNN only applies a to-tensor conversion, so the loaded images all have different sizes; before they can be fed into the network, a transform step has to pack the data into one batch.
The transform code is not very complicated; here is the logic.
The transform is defined by the GeneralizedRCNNTransform class, which is an nn.Module.
Its forward function takes images and targets as input, where images is a list of tensors and targets is a list of dicts, each dict holding the annotations of one image.
def forward(self, images, targets=None):
    images = [img for img in images]
    for i in range(len(images)):
        image = images[i]
        target_index = targets[i] if targets is not None else None
        image = self.normalize(image)  # normalize the image
        image, target_index = self.resize(image, target_index)  # resize the image and its bboxes into the specified range
        images[i] = image
        if targets is not None and target_index is not None:
            targets[i] = target_index
    # record the image sizes after resizing
    image_sizes = [img.shape[-2:] for img in images]
    images = self.batch_images(images)  # pack the images into one batch
    image_sizes_list = torch.jit.annotate(List[Tuple[int, int]], [])
    for image_size in image_sizes:
        assert len(image_size) == 2
        image_sizes_list.append((image_size[0], image_size[1]))
    image_list = ImageList(images, image_sizes_list)
    return image_list, targets
- Step one iterates over the images and normalizes each one: the image (already a tensor) simply has the pre-computed per-channel means subtracted and is divided by the standard deviations (a sketch of this step is shown after this list).
- Step two iterates over the images again and resizes each one. For each image, resize first reads the height and width and takes their minimum min_size and maximum max_size. The scale factor is the ratio of the specified minimum side length to the image's actual minimum side, and the image is scaled by this factor; if scaling by this factor would make the maximum side larger than the specified maximum side, the scale factor is replaced by the ratio of the specified maximum side to the image's actual maximum side. After the image is scaled, the boxes in target are scaled by the same factor. Once resize is done, each result is put back into the images and targets lists, the post-resize sizes are recorded in image_sizes, and finally the resized images are passed to the batch_images method, which packs them into one batch.
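The normalize in step one is roughly the following (a sketch of what GeneralizedRCNNTransform.normalize does, assuming image_mean and image_std are the per-channel lists passed to the transform); the resize code for step two is shown right after it:

def normalize(self, image):
    dtype, device = image.dtype, image.device
    mean = torch.as_tensor(self.image_mean, dtype=dtype, device=device)
    std = torch.as_tensor(self.image_std, dtype=dtype, device=device)
    # broadcast the per-channel mean/std over [C, H, W]
    return (image - mean[:, None, None]) / std[:, None, None]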
def resize(self, image, target):
    # image shape is [channel, height, width]
    # get the height and width of the image
    h, w = image.shape[-2:]
    im_shape = torch.tensor(image.shape[-2:])
    min_size = float(torch.min(im_shape))  # the smaller of height and width
    max_size = float(torch.max(im_shape))  # the larger of height and width
    if self.training:
        size = float(self.torch_choice(self.min_size))  # the specified minimum side length; note this is self.min_size, not min_size
    else:
        # FIXME assume for now that testing uses the largest scale
        size = float(self.min_size[-1])  # the specified minimum side length; note this is self.min_size, not min_size
    scale_factor = size / min_size  # scale factor from the specified minimum side and the image's minimum side
    # if scaling by this factor would make the image's maximum side exceed the specified maximum side
    if max_size * scale_factor > self.max_size:
        scale_factor = self.max_size / max_size  # use the ratio of the specified maximum side to the image's maximum side instead
    # interpolate resizes the image via interpolation
    # image[None] adds a batch dimension in front: [C, H, W] -> [1, C, H, W]
    # bilinear only supports 4D tensors
    image = torch.nn.functional.interpolate(
        image[None], scale_factor=scale_factor, mode='bilinear', align_corners=False)[0]
    if target is None:
        return image, target
    bbox = target["boxes"]
    # scale the bboxes with the same factor as the image
    bbox = resize_boxes(bbox, (h, w), image.shape[-2:])
    target["boxes"] = bbox
    return image, target
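The resize_boxes call at the end scales the box coordinates with the same ratios as the image; roughly (a sketch consistent with how it is used above):

def resize_boxes(boxes, original_size, new_size):
    # per-axis scale factors: (new_height / old_height, new_width / old_width)
    ratios = [
        torch.tensor(s, dtype=torch.float32, device=boxes.device) /
        torch.tensor(s_orig, dtype=torch.float32, device=boxes.device)
        for s, s_orig in zip(new_size, original_size)
    ]
    ratio_height, ratio_width = ratios
    xmin, ymin, xmax, ymax = boxes.unbind(1)
    xmin = xmin * ratio_width
    xmax = xmax * ratio_width
    ymin = ymin * ratio_height
    ymax = ymax * ratio_height
    return torch.stack((xmin, ymin, xmax, ymax), dim=1)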
- Step three is the batch_images method, which packs the resized images into one batch. It first finds the maximum height and width over the batch, rounds both up to the nearest multiple of 32, and creates an all-zero batched_imgs tensor of shape (batch_size, 3, max_height, max_width). It then iterates over the images list and copies each image into its slot in batched_imgs, aligned to the top-left corner, so the remaining area of each slot stays zero. Finally a new ImageList is created: the batched tensor is assigned to self.tensors and the previously recorded image_sizes to self.image_sizes. This way imagelist.tensors gives the batch of same-shaped image data, and imagelist.image_sizes gives each image's (resized) size.
def batch_images(self, images, size_divisible=32):
    # type: (List[Tensor], int)
    """
    Pack a list of images into one batch (every tensor in the batch ends up with the same shape)
    Args:
        images: the input images
        size_divisible: round the image height and width up to a multiple of this number
    Returns:
        batched_imgs: the batched tensor
    """
    # find the maximum height and width over all images in the batch
    max_size = self.max_by_axis([list(img.shape) for img in images])
    stride = float(size_divisible)
    # max_size = list(max_size)
    # round the height up to a multiple of stride
    max_size[1] = int(math.ceil(float(max_size[1]) / stride) * stride)
    # round the width up to a multiple of stride
    max_size[2] = int(math.ceil(float(max_size[2]) / stride) * stride)
    # [batch, channel, height, width]
    batch_shape = [len(images)] + max_size
    # create a tensor with shape batch_shape filled with zeros
    # images[0] is just a tensor; new_full is called on it to get an all-zero tensor of shape batch_shape
    batched_imgs = images[0].new_full(batch_shape, 0)
    for img, pad_img in zip(images, batched_imgs):
        # copy every input image into its slice of batched_imgs, aligned to the top-left corner,
        # so the bbox coordinates stay valid
        # this guarantees that every image in the batch fed to the network has the same shape
        # copy_: Copies the elements from src into self tensor and returns self
        pad_img[: img.shape[0], : img.shape[1], : img.shape[2]].copy_(img)
    return batched_imgs
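As a concrete example (numbers are illustrative): for two resized images of shape (3, 800, 1066) and (3, 750, 900), max_size is [3, 800, 1066]; rounding up to multiples of 32 gives 800 and 1088 (ceil(1066 / 32) = 34, and 34 × 32 = 1088), so batched_imgs has shape (2, 3, 800, 1088), each image sits in the top-left corner of its slot, and the rest is zero padding.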
(The image below shows handwritten notes on scratch paper, essentially the same as the text above.)
Faster R-CNN RPN: generating proposals
The AnchorGenerator part and the rpn_head part
After the backbone the network has 5 prediction feature maps, and the RPN part then generates a series of anchors.
First look at the AnchorGenerator class, which generates anchors from the sizes and aspect_ratios passed in. The defaults here are
size = ((32,), (64,), (128,), (256,), (512,))
aspect_ratios = ((0.5, 1.0, 2.0),) * 5
meaning that on each prediction feature map, every position gets anchors of one size with the three aspect ratios 0.5, 1.0 and 2.0 (size 32 on the first, highest-resolution feature map, up to 512 on the last). A sketch of how the anchor template for one level is generated from these values follows.
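The template for one level can be generated roughly like this (a sketch of what AnchorGenerator's generate_anchors does; the ratio is interpreted as height/width, so all three anchors have roughly the same area):

import torch

scales = torch.as_tensor([32.0])
aspect_ratios = torch.as_tensor([0.5, 1.0, 2.0])
h_ratios = torch.sqrt(aspect_ratios)
w_ratios = 1.0 / h_ratios
ws = (w_ratios[:, None] * scales[None, :]).view(-1)
hs = (h_ratios[:, None] * scales[None, :]).view(-1)
# anchors centered at (0, 0) in (x1, y1, x2, y2) form
base_anchors = (torch.stack([-ws, -hs, ws, hs], dim=1) / 2).round()
# roughly [[-23., -11., 23., 11.], [-16., -16., 16., 16.], [-11., -23., 11., 23.]]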
Here is the forward function of AnchorGenerator:
def forward(self, image_list, feature_maps):
    # type: (ImageList, List[Tensor])
    # get the size of every prediction feature map
    grid_sizes = list([feature_map.shape[-2:] for feature_map in feature_maps])
    # get the height and width of the input images
    image_size = image_list.tensors.shape[-2:]
    dtype, device = feature_maps[0].dtype, feature_maps[0].device
    # one step in feature map equate n pixel stride in origin image
    # i.e. compute how many pixels on the original image one step on the feature map corresponds to
    strides = [[torch.tensor(image_size[0] / g[0], dtype=torch.int64, device=device),
                torch.tensor(image_size[1] / g[1], dtype=torch.int64, device=device)] for g in grid_sizes]
    # generate the anchor templates from the provided sizes and aspect_ratios
    self.set_cell_anchors(dtype, device)
    # compute/read the coordinates of all anchors (these are the anchors mapped back onto the original image, not the templates)
    # the result is a list: for every prediction feature map, the anchor coordinates mapped back onto the original image
    anchors_over_all_feature_maps = self.cached_grid_anchors(grid_sizes, strides)
    anchors = torch.jit.annotate(List[List[torch.Tensor]], [])
    # iterate over every image in the batch
    for i, (image_height, image_width) in enumerate(image_list.image_sizes):
        anchors_in_image = []
        # iterate over the anchors of every prediction feature map, mapped back onto the original image
        for anchors_per_feature_map in anchors_over_all_feature_maps:
            anchors_in_image.append(anchors_per_feature_map)
        anchors.append(anchors_in_image)
    # concatenate the anchors of all prediction feature maps for each image
    # anchors is a list; each element holds all anchor information for one image
    anchors = [torch.cat(anchors_per_image) for anchors_per_image in anchors]
    # Clear the cache in case that memory leaks.
    self._cache.clear()
    return anchors
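The cached_grid_anchors / grid_anchors step shifts the anchor template across every position of a feature map and uses the stride to map back to original-image coordinates; for a single level it is roughly the following (a sketch; base_anchors is the (3, 4) template from the previous sketch, and the sizes match level "3" above):

grid_height, grid_width = 29, 42        # feature map size of level "3"
stride_height, stride_width = 32, 32    # one cell on this level covers 32x32 input pixels
shifts_x = torch.arange(0, grid_width, dtype=torch.float32) * stride_width
shifts_y = torch.arange(0, grid_height, dtype=torch.float32) * stride_height
shift_y, shift_x = torch.meshgrid(shifts_y, shifts_x)
shifts = torch.stack([shift_x.reshape(-1), shift_y.reshape(-1),
                      shift_x.reshape(-1), shift_y.reshape(-1)], dim=1)
# add every shift to every template anchor: (29*42, 1, 4) + (1, 3, 4) -> (29*42*3, 4)
anchors_level3 = (shifts.view(-1, 1, 4) + base_anchors.view(1, -1, 4)).reshape(-1, 4)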
Here is the RPNHead code as well:
class RPNHead(nn.Module):
    """
    add a RPN head with classification and regression
    Slides a window over the feature maps to predict objectness scores and bbox regression parameters
    Arguments:
        in_channels: number of channels of the input feature
        num_anchors: number of anchors to be predicted
    """

    def __init__(self, in_channels, num_anchors):
        super(RPNHead, self).__init__()
        # 3x3 sliding window
        self.conv = nn.Conv2d(in_channels, in_channels, kernel_size=3, stride=1, padding=1)
        # predicted objectness scores (object here only means foreground vs. background)
        self.cls_logits = nn.Conv2d(in_channels, num_anchors, kernel_size=1, stride=1)
        # predicted bbox regression parameters
        self.bbox_pred = nn.Conv2d(in_channels, num_anchors * 4, kernel_size=1, stride=1)

        for layer in self.children():
            if isinstance(layer, nn.Conv2d):
                torch.nn.init.normal_(layer.weight, std=0.01)
                torch.nn.init.constant_(layer.bias, 0)

    def forward(self, x):
        # type: (List[Tensor])
        logits = []
        bbox_reg = []
        for i, feature in enumerate(x):
            t = F.relu(self.conv(feature))
            logits.append(self.cls_logits(t))
            bbox_reg.append(self.bbox_pred(t))
        return logits, bbox_reg
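A quick shape check of RPNHead (a sketch with dummy feature maps; in_channels=256 and num_anchors=3 match the configuration discussed above):

import torch

head = RPNHead(in_channels=256, num_anchors=3)
feats = [torch.rand(1, 256, 232, 336), torch.rand(1, 256, 29, 42)]  # two of the five levels
logits, bbox_reg = head(feats)
print(logits[1].shape)    # torch.Size([1, 3, 29, 42])   -> one objectness score per anchor per position
print(bbox_reg[1].shape)  # torch.Size([1, 12, 29, 42])  -> four regression parameters per anchor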
The RPN body
Let's go straight to the forward function of the RPN body; the details are spelled out in the handwritten notes below and in the code comments.
def forward(self, images, features, targets=None):
    # type: (ImageList, Dict[str, Tensor], Optional[List[Dict[str, Tensor]]])
    """
    Arguments:
        images (ImageList): images for which we want to compute the predictions
        features (List[Tensor]): features computed from the images that are
            used for computing the predictions. Each tensor in the list
            correspond to different feature levels
        targets (List[Dict[Tensor]): ground-truth boxes present in the image (optional).
            If provided, each element in the dict should contain a field `boxes`,
            with the locations of the ground-truth boxes.
    Returns:
        boxes (List[Tensor]): the predicted boxes from the RPN, one Tensor per
            image.
        losses (Dict[Tensor]): the losses for the model during training. During
            testing, it is an empty dict.
    """
    # RPN uses all feature maps that are available
    # features is the OrderedDict of all prediction feature maps
    features = list(features.values())
    # compute the objectness scores and bbox regression parameters on every prediction feature map
    # objectness and pred_bbox_deltas are both lists
    objectness, pred_bbox_deltas = self.head(features)
    # generate the anchors for the whole batch of images
    anchors = self.anchor_generator(images, features)
    # batch_size
    num_images = len(anchors)
    # numel() Returns the total number of elements in the input tensor.
    # count how many anchors each prediction feature map contributes
    num_anchors_per_level_shape_tensors = [o[0].shape for o in objectness]
    num_anchors_per_level = [s[0] * s[1] * s[2] for s in num_anchors_per_level_shape_tensors]
    # adjust the internal tensor format and shape
    objectness, pred_bbox_deltas = concat_box_prediction_layers(objectness,
                                                                pred_bbox_deltas)
    # apply pred_bbox_deltas to anchors to obtain the decoded proposals
    # note that we detach the deltas because Faster R-CNN do not backprop through
    # the proposals
    # i.e. apply the predicted bbox regression parameters to the anchors to get the predicted box coordinates
    proposals = self.box_coder.decode(pred_bbox_deltas.detach(), anchors)
    proposals = proposals.view(num_images, -1, 4)
    # filter out small boxes, apply nms, and keep the top post_nms_top_n proposals by predicted score
    boxes, scores = self.filter_proposals(proposals, objectness, images.image_sizes, num_anchors_per_level)

    losses = {}
    if self.training:
        assert targets is not None
        # find the best-matching gt for every anchor and label the anchors as foreground, background or discarded
        labels, matched_gt_boxes = self.assign_targets_to_anchors(anchors, targets)
        # compute the regression targets from the anchors and their matched gt boxes
        regression_targets = self.box_coder.encode(matched_gt_boxes, anchors)
        loss_objectness, loss_rpn_box_reg = self.compute_loss(
            objectness, pred_bbox_deltas, labels, regression_targets
        )
        losses = {
            "loss_objectness": loss_objectness,
            "loss_rpn_box_reg": loss_rpn_box_reg
        }
    return boxes, losses
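The decode step (proposals = self.box_coder.decode(...)) turns each anchor plus its predicted (dx, dy, dw, dh) into an actual box. The per-box computation is roughly the following (a simplified sketch: the weights and the clamping of dw/dh used by the real BoxCoder are omitted):

import torch

anchors = torch.tensor([[0.0, 0.0, 64.0, 32.0]])    # (N, 4) anchors as (x1, y1, x2, y2)
deltas = torch.tensor([[0.1, -0.2, 0.0, 0.3]])      # (N, 4) predicted (dx, dy, dw, dh)
widths = anchors[:, 2] - anchors[:, 0]
heights = anchors[:, 3] - anchors[:, 1]
ctr_x = anchors[:, 0] + 0.5 * widths
ctr_y = anchors[:, 1] + 0.5 * heights
pred_ctr_x = deltas[:, 0] * widths + ctr_x           # shift the center
pred_ctr_y = deltas[:, 1] * heights + ctr_y
pred_w = torch.exp(deltas[:, 2]) * widths            # scale the width/height
pred_h = torch.exp(deltas[:, 3]) * heights
proposals = torch.stack([pred_ctr_x - 0.5 * pred_w, pred_ctr_y - 0.5 * pred_h,
                         pred_ctr_x + 0.5 * pred_w, pred_ctr_y + 0.5 * pred_h], dim=1)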
Computing the RPN loss
The roi_head part
After the RPN, each image has 2000 proposals and the RPN losses have been computed. The 2000 proposals are then passed to the subsequent roi_head part, which roi-pools them to the same feature size and feeds them into fully connected layers that predict each box's label and bounding box regression parameters. During training the losses are computed; at test time post-processing is done instead, where NMS keeps the top boxes, which are drawn on the original image.
A brief word on what roi pooling does: it projects each proposal onto the corresponding prediction feature map and then max-pools the projected region down to a fixed size; a configuration sketch follows.
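In torchvision-style implementations box_roi_pool is usually a MultiScaleRoIAlign; a configuration sketch (the parameter values are the common ones, shown here as an assumption rather than taken from this code):

from torchvision.ops import MultiScaleRoIAlign

box_roi_pool = MultiScaleRoIAlign(
    featmap_names=['0', '1', '2', '3'],  # which levels of the backbone's OrderedDict to pool from
    output_size=[7, 7],                  # every proposal becomes a 7x7 feature map
    sampling_ratio=2)
# usage inside the forward below:
# box_features = box_roi_pool(features, proposals, image_shapes)  -> (num_proposals, 256, 7, 7)

The forward function of the roi_heads module, where box_roi_pool is called, is shown next.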
def forward(self, features, proposals, image_shapes, targets=None):
    # type: (Dict[str, Tensor], List[Tensor], List[Tuple[int, int]], Optional[List[Dict[str, Tensor]]])
    """
    Arguments:
        features (List[Tensor])
        proposals (List[Tensor[N, 4]])
        image_shapes (List[Tuple[H, W]])
        targets (List[Dict])
    """
    # check that the targets have the correct dtypes
    if targets is not None:
        for t in targets:
            floating_point_types = (torch.float, torch.double, torch.half)
            assert t["boxes"].dtype in floating_point_types, "target boxes must of float type"
            assert t["labels"].dtype == torch.int64, "target labels must of int64 type"

    if self.training:
        # sample positive and negative proposals and collect the matched gt labels and box regression targets
        proposals, matched_idxs, labels, regression_targets = self.select_training_samples(proposals, targets)
    else:
        labels = None
        regression_targets = None
        matched_idxs = None

    # pass the sampled proposals through the roi_pooling layer
    box_features = self.box_roi_pool(features, proposals, image_shapes)
    # two fully connected layers after roi_pooling
    box_features = self.box_head(box_features)
    # then predict the class and box regression parameters: class_logits (2048, 21), box_regression (2048, 84)
    class_logits, box_regression = self.box_predictor(box_features)

    result = torch.jit.annotate(List[Dict[str, torch.Tensor]], [])
    losses = {}
    if self.training:
        assert labels is not None and regression_targets is not None
        # labels: List[(512,), (512,), (512,), (512,)]
        # regression_targets: Tuple((512, 4), (512, 4), (512, 4), (512, 4))
        loss_classifier, loss_box_reg = fastrcnn_loss(
            class_logits, box_regression, labels, regression_targets)
        losses = {
            "loss_classifier": loss_classifier,
            "loss_box_reg": loss_box_reg
        }
    else:
        boxes, scores, labels = self.postprocess_detections(class_logits, box_regression, proposals, image_shapes)
        num_images = len(boxes)
        for i in range(num_images):
            result.append(
                {
                    "boxes": boxes[i],
                    "labels": labels[i],
                    "scores": scores[i],
                }
            )
    return result, losses
With class_logits and box_regression in hand, the loss of the Faster R-CNN roi_head part can be computed. The fastrcnn_loss function needs four inputs: class_logits, box_regression, labels and regression_targets. class_logits and box_regression were just computed, while labels and regression_targets were obtained earlier.
Here:
labels: List[(512,), (512,), (512,), (512,)]
regression_targets: Tuple((512, 4), (512, 4), (512, 4), (512, 4))
Concatenating labels and regression_targets along dimension 0 gives shapes (2048,) and (2048, 4) respectively.
During training:
- the classification loss is the cross entropy between class_logits and labels;
- the box regression loss is a smooth L1 loss computed only at the proposals matched to positive samples.
At test time, post-processing is applied instead and NMS keeps the desired boxes. A simplified sketch of the training-stage loss computation follows.
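This is not the exact fastrcnn_loss implementation (torchvision uses its own smooth L1 with a small beta and slightly different reduction), but the structure is the same:

import torch
import torch.nn.functional as F

def fastrcnn_loss_sketch(class_logits, box_regression, labels, regression_targets):
    labels = torch.cat(labels, dim=0)                          # (2048,)
    regression_targets = torch.cat(regression_targets, dim=0)  # (2048, 4)
    # classification loss over all sampled proposals
    classification_loss = F.cross_entropy(class_logits, labels)
    # box regression loss only where a proposal was matched to a foreground class
    pos_inds = torch.where(labels > 0)[0]
    labels_pos = labels[pos_inds]
    N = box_regression.shape[0]                                # e.g. 2048
    box_regression = box_regression.reshape(N, -1, 4)          # (2048, 21, 4)
    box_loss = F.smooth_l1_loss(box_regression[pos_inds, labels_pos],
                                regression_targets[pos_inds],
                                reduction='sum') / labels.numel()
    return classification_loss, box_loss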