1. Background
In the first optimization pass, we split the model into a CNN part and a non-CNN part and ran them with TensorRT and Paddle Inference, respectively. While accuracy matched the CPU-FP32 baseline, inference latency was 159 ms, which still fell short of the performance target.
Further analysis of the model structure revealed two new optimization opportunities:
- If the camera extrinsic parameters are fixed, the extrinsics-processing branch can be constant-folded, reducing computation and avoiding operators that TensorRT does not support.
- The LayerNormalization operator in the Transformer module can be replaced with ordinary operators, adapting it better to TensorRT and enabling further acceleration with CUDA Graph.
Note that extrinsics differ across samples in the NuScenes dataset. For consistency, all samples use the extrinsics of the first sample during testing.
2. Optimization Approach
This optimization pass consists of the following components:
- Constant-fold the extrinsics branch
  Since the extrinsics are fixed during inference, their transformation results can be precomputed and the branch folded into constants, reducing computation and dynamic operations and improving TensorRT compatibility.
- Replace the LayerNormalization operator
  Decompose LayerNormalization in the Transformer into basic operators (ReduceMean, Pow, Add, Div, etc.) so that TensorRT can fuse and optimize them more effectively.
- Mixed-precision inference
  - CNN part: INT8+FP16 mixed precision, maximizing speedup while preserving accuracy.
  - Other parts: FP16 for CONV and MLP, FP32 for the remaining operators, balancing accuracy and performance.
- CUDA Graph optimization
  Capture the kernel launch sequence with CUDA Graph to reduce CPU overhead and kernel launch latency, which particularly suits inference workloads with a fixed computation flow.
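As a quick sanity check of the LayerNormalization decomposition described above, the following NumPy sketch (my own illustration, not the project's replacement script) reproduces layer normalization with the same ReduceMean/Sub/Pow/Add/Sqrt/Div chain of basic operators and compares it against a direct implementation:

```python
import numpy as np

def layernorm_decomposed(x, gamma, beta, eps=1e-5, axis=-1):
    # The same basic-op chain used to replace LayerNormalization:
    mean = x.mean(axis=axis, keepdims=True)           # ReduceMean
    diff = x - mean                                   # Sub
    var = (diff ** 2).mean(axis=axis, keepdims=True)  # Pow + ReduceMean
    normed = diff / np.sqrt(var + eps)                # Add + Sqrt + Div
    return normed * gamma + beta                      # Mul + Add

def layernorm_reference(x, gamma, beta, eps=1e-5, axis=-1):
    # Direct layer normalization for comparison
    mean = x.mean(axis=axis, keepdims=True)
    var = x.var(axis=axis, keepdims=True)
    return (x - mean) / np.sqrt(var + eps) * gamma + beta

x = np.random.randn(2, 4, 256).astype(np.float32)
gamma = np.ones(256, dtype=np.float32)
beta = np.zeros(256, dtype=np.float32)
out1 = layernorm_decomposed(x, gamma, beta)
out2 = layernorm_reference(x, gamma, beta)
print(np.abs(out1 - out2).max())
```

The two outputs agree within float32 tolerance, confirming that the replacement does not change the computation, only its operator-level representation.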
3. Test Results and Analysis
Configuration | mAP | Latency (ms) |
---|---|---|
CPU-FP32 | 0.2864 | - |
Orin mixed precision | 0.2873 | 64.48 |
On the Orin platform, the optimized model runs in 64.48 ms, a clear improvement over the previous 159 ms, and meets the 10 Hz real-time requirement. mAP also improved slightly (0.2864 → 0.2873), showing that the optimizations did not degrade accuracy.
4. Detailed Steps
4.1 Baseline Accuracy Test
First, measure the accuracy of the original model in the Paddle3D-PETRv1 environment to establish a correct baseline for the comparisons that follow.
# Clean up previous inference results
rm ../Paddle3D/model_iodata/*labels.bin -f
rm ../Paddle3D/model_iodata/*scores.bin -f
rm ../Paddle3D/model_iodata/*bboxes.bin -f
# Set environment variables (disable TF32 so accuracy stays comparable)
export NVIDIA_TF32_OVERRIDE=0
# Run Paddle inference
python3 0_run_paddle_infer.py
# Prepare the config file and the dataset symlink
cp ../Paddle3D/configs/petr/petr_vovnet_gridmask_p4_800x320.yml ./
ln -s /home/Paddle3D/data .
# Evaluate accuracy
python3 1_run_petrv_eval.py
Output:
mAP: 0.2864
mATE: 0.8415
mASE: 0.4668
mAOE: 0.7016
mAVE: 0.8735
mAAE: 0.3094
NDS: 0.3239
Eval time: 6.0s
Per-class results:
Object Class AP ATE ASE AOE AVE AAE
car 0.509 0.677 0.170 0.147 0.246 0.060
truck 0.357 0.895 0.238 0.105 0.166 0.017
bus 0.357 1.090 0.120 0.424 2.114 0.190
trailer 0.000 1.000 1.000 1.000 1.000 1.000
construction_vehicle 0.000 1.000 1.000 1.000 1.000 1.000
pedestrian 0.497 0.696 0.251 0.674 0.494 0.208
motorcycle 0.415 0.846 0.322 1.272 0.075 0.001
bicycle 0.123 0.751 0.225 0.692 1.893 0.000
traffic_cone 0.607 0.461 0.342 nan nan nan
barrier 0.000 1.000 1.000 1.000 nan nan
4.2 Subgraph Generation and Model Conversion
Step 1: Generate the subgraph before NMSFreeCoder
Use the PaddlePaddle 2.4.2 container to generate the model subgraph, stripping off the post-processing part to simplify front-end optimization.
cd /home/apollo
docker run --gpus all --shm-size=128g -it -e NVIDIA_VISIBLE_DEVICES=all \
--privileged --net=host \
-v $PWD:/home -w /home \
--rm registry.baidubce.com/paddlepaddle/paddle:2.4.2-gpu-cuda11.7-cudnn8.4-trt8.4 /bin/bash
cd /home/bev_opt_ver3
rm -rf graph0
python3 2_gen_graph_0.py
exit
Step 2: Convert to ONNX
Use the PaddlePaddle 3.1 container for the conversion and install the latest Paddle2ONNX tool.
cd /home/apollo
# Create the container
docker run --gpus all --shm-size=128g -it -e NVIDIA_VISIBLE_DEVICES=all \
--privileged --net=host \
-v $PWD:/home -w /home \
--name bev_opt_ver3_paddle31 ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:3.1.0-gpu-cuda12.6-cudnn9.5 /bin/bash
# Enter the project directory
cd /home/bev_opt_ver3/
# Install the latest Paddle2ONNX
git clone https://github.com/PaddlePaddle/Paddle2ONNX.git
cd Paddle2ONNX/
git checkout 3e77ec7d13a055225e1cb8b7a4abb1a7ae7d1d58
git submodule update --init --recursive
bash .github/workflows/scripts/download_protobuf.sh
cp installed_protobuf/include/google /usr/include/x86_64-linux-gnu/ -rf
cp installed_protobuf/lib64/* /usr/lib/x86_64-linux-gnu/ -rf
cp installed_protobuf/bin/protoc* /usr/bin/ -rf
sed -i '/paddlepaddle==3.0.0.*/d' pyproject.toml
sed -i 's#"site-packages"#"dist-packages"#g' CMakeLists.txt
pip3 install .
pip3 install packaging
pip3 install onnxsim
# Convert the model (opset 17 at minimum)
cd /home/bev_opt_ver3/
rm -f graph0.onnx
paddle2onnx \
--model_dir graph0 \
--model_filename petr_inference.pdmodel \
--params_filename petr_inference.pdiparams \
--save_file graph0.onnx
4.3 ONNX Model Processing and Optimization
Step 1: Constant-fold the extrinsics branch
# Save the outputs of the extrinsics branch
python3 3_save_img2lidars_subgraph.py
# Constant-fold the extrinsics branch
rm -f graph0_img2lidars_folding.onnx
python3 4_run_img2lidars_folding.py
Step 2: Split the model and replace operators
rm -f graph0_img2lidars_folding_part1.onnx
rm -f graph0_img2lidars_folding_part2.onnx
python3 5_split_graph.py
# Convert the transformer subgraph to opset 12 and replace unsupported operators
python3 6_replace_unsupported_ops.py graph0_img2lidars_folding_part2.onnx graph0_img2lidars_folding_part2.onnx
Step 3: ONNX inference and accuracy verification
# Remove previously generated inference results
rm ../Paddle3D/model_iodata/*labels.bin -f
rm ../Paddle3D/model_iodata/*scores.bin -f
rm ../Paddle3D/model_iodata/*bboxes.bin -f
# Run ONNX inference and generate results
python3 7_onnx_forward_two_parts.py graph0_img2lidars_folding_part1.onnx graph0_img2lidars_folding_part2.onnx
# Evaluate accuracy in the Paddle3D-PETRv1 test environment
python3 1_run_petrv_eval.py
Output:
mAP: 0.2864
mATE: 0.8415
mASE: 0.4668
mAOE: 0.7016
mAVE: 0.8735
mAAE: 0.3094
NDS: 0.3239
Eval time: 6.1s
Per-class results:
Object Class AP ATE ASE AOE AVE AAE
car 0.509 0.677 0.170 0.147 0.246 0.060
truck 0.357 0.895 0.238 0.105 0.166 0.017
bus 0.357 1.090 0.120 0.424 2.114 0.190
trailer 0.000 1.000 1.000 1.000 1.000 1.000
construction_vehicle 0.000 1.000 1.000 1.000 1.000 1.000
pedestrian 0.497 0.696 0.251 0.674 0.494 0.208
motorcycle 0.415 0.846 0.322 1.272 0.075 0.001
bicycle 0.123 0.751 0.225 0.692 1.893 0.000
traffic_cone 0.607 0.461 0.342 nan nan nan
barrier 0.000 1.000 1.000 1.000 nan nan
4.4 Deployment and Testing on Orin
Step 1: Compile and run the inference program
g++ -O3 -o 8_run_cuda_infer 8_run_cuda_infer.cpp \
-Wno-deprecated-declarations \
-I /usr/local/cuda/include \
-I /usr/local/TensorRT-10.6.0.26/include \
-L /usr/local/cuda/lib64 \
-L /usr/local/TensorRT-10.6.0.26/lib \
-lcudart -ldl -lpthread -lnvinfer -lnvinfer_plugin \
-lnvonnxparser -lopenblas -lstdc++fs
# Run inference
export NVIDIA_TF32_OVERRIDE=0
rm ../Paddle3D/model_iodata/*labels.bin -f
rm ../Paddle3D/model_iodata/*scores.bin -f
rm ../Paddle3D/model_iodata/*bboxes.bin -f
# Run inference and generate results
rm *.engine *.cache -f
./8_run_cuda_infer graph0_img2lidars_folding_part1.onnx graph0_img2lidars_folding_part2.onnx
Output:
80 E2E latency: 64.4814 ms;
Step 2: Accuracy test
# Run in the Paddle3D-PETRv1 accuracy-test environment
python3 1_run_petrv_eval.py
Output:
mAP: 0.2873
mATE: 0.8467
mASE: 0.4654
mAOE: 0.6979
mAVE: 0.8546
mAAE: 0.3123
NDS: 0.3259
Eval time: 6.7s
Per-class results:
Object Class AP ATE ASE AOE AVE AAE
car 0.516 0.660 0.170 0.141 0.253 0.062
truck 0.365 0.922 0.235 0.114 0.179 0.019
bus 0.353 1.114 0.127 0.313 1.962 0.207
trailer 0.000 1.000 1.000 1.000 1.000 1.000
construction_vehicle 0.000 1.000 1.000 1.000 1.000 1.000
pedestrian 0.503 0.682 0.250 0.671 0.493 0.210
motorcycle 0.413 0.852 0.323 1.375 0.073 0.001
bicycle 0.132 0.745 0.204 0.667 1.877 0.000
traffic_cone 0.592 0.492 0.345 nan nan nan
barrier 0.000 1.000 1.000 1.000 nan nan
5. Summary
This article described the second optimization pass for the PETRv1 model. By constant-folding the extrinsics branch, replacing the LayerNormalization operator, using mixed-precision inference, and applying CUDA Graph, inference latency on the Orin platform reached 64.48 ms, meeting the 10 Hz real-time requirement while maintaining accuracy (mAP = 0.2873).
Notes:
- Constant folding relies on fixed extrinsics; adapt it to the actual deployment scenario.
- Mixed-precision results may vary with hardware and driver versions; validate thoroughly on the target platform.
- Use recent TensorRT and CUDA toolchains for best performance.
6. Related Code
0_run_paddle_infer.py
import os
import numpy as np
import paddle.inference as paddle_infer

def main():
    config = paddle_infer.Config("/home/petrv1/petr_inference.pdmodel",
                                 "/home/petrv1/petr_inference.pdiparams")
    config.enable_use_gpu(256, 0)
    config.disable_glog_info()
    predictor = paddle_infer.create_predictor(config)
    input_names = predictor.get_input_names()
    output_names = predictor.get_output_names()
    print("input_names:", input_names)
    print("output_names:", output_names)
    idx = 0
    while True:
        img_path = f'../Paddle3D/model_iodata/{idx}_img.bin'
        if not os.path.exists(img_path):
            break
        with open(img_path, 'rb') as f:
            input_images = np.frombuffer(f.read(), dtype=np.float32).reshape((1, 6, 3, 320, 800))
        # Always use the extrinsics of the first sample
        input_img2lidars_path = "../Paddle3D/model_iodata/0_img2lidars.bin"
        with open(input_img2lidars_path, 'rb') as f:
            input_img2lidars = np.frombuffer(f.read(), dtype=np.float32).reshape((1, 6, 4, 4))
        predictor.get_input_handle(input_names[0]).copy_from_cpu(input_images)
        predictor.get_input_handle(input_names[1]).copy_from_cpu(input_img2lidars)
        predictor.run()
        output0_tensor = predictor.get_output_handle(output_names[0])
        output1_tensor = predictor.get_output_handle(output_names[1])
        output2_tensor = predictor.get_output_handle(output_names[2])
        bboxes = output0_tensor.copy_to_cpu()
        scores = output1_tensor.copy_to_cpu()
        labels = output2_tensor.copy_to_cpu()
        with open(f'../Paddle3D/model_iodata/{idx}_bboxes.bin', 'wb') as f:
            f.write(bboxes.tobytes())
        with open(f'../Paddle3D/model_iodata/{idx}_scores.bin', 'wb') as f:
            f.write(scores.tobytes())
        with open(f'../Paddle3D/model_iodata/{idx}_labels.bin', 'wb') as f:
            f.write(labels.tobytes())
        idx += 1

if __name__ == "__main__":
    main()
1_run_petrv_eval.py
import numpy as np
import paddle
from paddle3d.apis.config import Config
from paddle3d.apis.trainer import Trainer
from paddle3d.utils.logger import logger
from paddle3d.sample import Sample, SampleMeta
from paddle3d.geometries import BBoxes3D

def bbox3d2result(bboxes, scores, labels, attrs=None):
    """Convert detection results to a dict of numpy arrays."""
    result_dict = dict(boxes_3d=bboxes, scores_3d=scores, labels_3d=labels)
    if attrs is not None:
        result_dict['attrs_3d'] = attrs
    return result_dict

class CustomTrainer(Trainer):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def _parse_results_to_sample(self, results: dict, sample: dict):
        num_samples = len(results)
        new_results = []
        for i in range(num_samples):
            data = Sample(None, sample["modality"][i])
            bboxes_3d = results[i]['pts_bbox']["boxes_3d"]
            labels = results[i]['pts_bbox']["labels_3d"]
            confidences = results[i]['pts_bbox']["scores_3d"]
            bottom_center = bboxes_3d[:, :3]
            gravity_center = np.zeros_like(bottom_center)
            gravity_center[:, :2] = bottom_center[:, :2]
            gravity_center[:, 2] = bottom_center[:, 2] + bboxes_3d[:, 5] * 0.5
            bboxes_3d[:, :3] = gravity_center
            data.bboxes_3d = BBoxes3D(bboxes_3d[:, 0:7])
            data.bboxes_3d.coordmode = 'Lidar'
            data.bboxes_3d.origin = [0.5, 0.5, 0.5]
            data.bboxes_3d.rot_axis = 2
            data.bboxes_3d.velocities = bboxes_3d[:, 7:9]
            data['bboxes_3d_numpy'] = bboxes_3d[:, 0:7]
            data['bboxes_3d_coordmode'] = 'Lidar'
            data['bboxes_3d_origin'] = [0.5, 0.5, 0.5]
            data['bboxes_3d_rot_axis'] = 2
            data['bboxes_3d_velocities'] = bboxes_3d[:, 7:9]
            data.labels = labels
            data.confidences = confidences
            data.meta = SampleMeta(id=sample["meta"][i]['id'])
            if "calibs" in sample:
                calib = [calibs.numpy()[i] for calibs in sample["calibs"]]
                data.calibs = calib
            new_results.append(data)
        return new_results

    def simple_test_pts(self, idx):
        with open(f'../Paddle3D/model_iodata/{idx}_bboxes.bin', 'rb') as f:
            bboxes = np.frombuffer(f.read(), dtype=np.float32).reshape((300, 9)).copy()
        with open(f'../Paddle3D/model_iodata/{idx}_scores.bin', 'rb') as f:
            scores = np.frombuffer(f.read(), dtype=np.float32).reshape((300,)).copy()
        with open(f'../Paddle3D/model_iodata/{idx}_labels.bin', 'rb') as f:
            labels = np.frombuffer(f.read(), dtype=np.int64).reshape((300,)).copy()
        bbox_results = [bbox3d2result(bboxes, scores, labels)]
        return bbox_results

    def evaluate(self):
        msg = 'evaluate on validate dataset'
        metric_obj = self.val_dataset.metric
        for idx, sample in self.logger.enumerate(self.eval_dataloader, msg=msg):
            img_metas = sample['meta']
            bbox_list = [dict() for i in range(len(img_metas))]
            bbox_pts = self.simple_test_pts(idx)
            for result_dict, pts_bbox in zip(bbox_list, bbox_pts):
                result_dict['pts_bbox'] = pts_bbox
            preds = self._parse_results_to_sample(bbox_list, sample)
            metric_obj.update(predictions=preds, ground_truths=sample)
        metrics = metric_obj.compute(verbose=True)
        return metrics

batch_size = 1
cfg = Config(path='petr_vovnet_gridmask_p4_800x320.yml', batch_size=batch_size)
dic = cfg.to_dict()
batch_size = dic.pop('batch_size')
dic.update({'dataloader_fn': {'batch_size': batch_size, 'num_workers': 1}})
dic['checkpoint'] = None
dic['resume'] = False
trainer = CustomTrainer(**dic)
trainer.evaluate()
2_gen_graph_0.py
import numpy as np
import paddle
import paddle.static as static
import paddle.fluid.core as core

def new_prepend_feed_ops(inference_program,
                         feed_target_names,
                         feed_holder_name='feed'):
    if len(feed_target_names) == 0:
        return
    global_block = inference_program.global_block()
    feed_var = global_block.create_var(
        name=feed_holder_name,
        type=core.VarDesc.VarType.FEED_MINIBATCH,
        persistable=True)
    for i, name in enumerate(feed_target_names):
        if not global_block.has_var(name):
            continue
        out = global_block.var(name)
        global_block._prepend_op(
            type='feed',
            inputs={'X': [feed_var]},
            outputs={'Out': [out]},
            attrs={'col': i})

def append_fetch_ops(program, fetch_target_names, fetch_holder_name='fetch'):
    global_block = program.global_block()
    fetch_var = global_block.create_var(
        name=fetch_holder_name,
        type=core.VarDesc.VarType.FETCH_LIST,
        persistable=True)
    for i, name in enumerate(fetch_target_names):
        global_block.append_op(
            type='fetch',
            inputs={'X': [name]},
            outputs={'Out': [fetch_var]},
            attrs={'col': i})

def insert_fetch(program, fetchs, fetch_holder_name="fetch"):
    global_block = program.global_block()
    need_to_remove_op_index = list()
    for i, op in enumerate(global_block.ops):
        if op.type == 'fetch':
            need_to_remove_op_index.append(i)
    for index in need_to_remove_op_index[::-1]:
        global_block._remove_op(index)
    program.desc.flush()
    append_fetch_ops(program, fetchs, fetch_holder_name)

def process_old_ops_desc(program):
    for i in range(len(program.blocks[0].ops)):
        if program.blocks[0].ops[i].type == "matmul":
            if not program.blocks[0].ops[i].has_attr("head_number"):
                program.blocks[0].ops[i]._set_attr("head_number", 1)

def infer_shape(program, input_shape_dict):
    paddle.enable_static()
    for k, v in input_shape_dict.items():
        program.blocks[0].var(k).desc.set_shape(v)
    for i in range(len(program.blocks)):
        for j in range(len(program.blocks[0].ops)):
            try:
                program.blocks[i].ops[j].desc.infer_shape(program.blocks[i].desc)
            except Exception:
                pass

def replace_shape_0_with_constant(program, constant_value):
    global_block = program.global_block()
    if not global_block.has_var("shape_0.tmp_0"):
        print("Warning: shape_0.tmp_0 not found in program, skipping replacement.")
        return
    target_var = global_block.var("shape_0.tmp_0")
    # Find the op that produces this variable
    op_index_to_remove = None
    for idx, op in enumerate(global_block.ops):
        if "shape_0.tmp_0" in op.output_arg_names:
            op_index_to_remove = idx
            break
    # Remove the original op
    if op_index_to_remove is not None:
        global_block._remove_op(op_index_to_remove)
    # Create the constant value
    constant_tensor = np.array(constant_value, dtype=np.int32)
    # Insert a constant op right after the feed ops at the start of the graph
    insert_idx = 0
    for idx, op in enumerate(global_block.ops):
        if op.type == 'feed':
            insert_idx = idx + 1
    global_block._insert_op(
        index=insert_idx,