Apollo PETRv1 Model Optimization, Plan 2

1. Background

In the previous optimization plan (Plan 1), we split the model into a CNN part and a non-CNN part and ran them with TensorRT and Paddle Inference, respectively. With accuracy matching the CPU-FP32 baseline, inference latency was 159 ms, which still did not meet the performance target.

After further analysis of the model structure, we identified two new optimization opportunities:

  1. If the camera extrinsic parameters are fixed, the extrinsics-processing branch can be constant-folded, reducing computation and avoiding operators that TensorRT does not support.
  2. The LayerNormalization operator in the Transformer module can be replaced with ordinary operators, making the graph more TensorRT-friendly and enabling further speedup with CUDA Graph.

Note that extrinsics differ across samples in the nuScenes dataset. For consistency, all samples use the extrinsics of the first sample during testing.


2. Optimization Approach in Detail

This optimization plan consists of the following parts:

  1. Constant-fold the extrinsics branch
    Since the extrinsics are fixed during inference, their transformation result can be precomputed and the whole branch folded into a constant, reducing computation and dynamic operations and improving TensorRT compatibility.

  2. Replace the LayerNormalization operator
    Decompose LayerNormalization in the Transformer into basic operators (e.g. ReduceMean, Pow, Add, Div) so that TensorRT can fuse and optimize them more effectively.

  3. Mixed-precision inference

    • CNN part: INT8+FP16 mixed precision, maximizing speedup while preserving accuracy.
    • Other parts: FP16 for CONV and MLP, FP32 for the remaining operators, balancing accuracy and performance.

  4. CUDA Graph optimization
    Capture the kernel launch sequence with CUDA Graph to cut CPU overhead and kernel-launch latency, which suits inference workloads with a fixed computation flow.
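To make item 2 concrete: LayerNormalization over the last axis can be rebuilt from exactly the primitive operators named above. A minimal NumPy sketch of the decomposition (illustrative only; the actual ONNX graph rewrite is done later by 6_replace_unsupported_ops.py):

```python
import numpy as np

def layernorm_decomposed(x, gamma, beta, eps=1e-5, axis=-1):
    """LayerNorm expressed with the primitive ops used in the replacement:
    ReduceMean, Sub, Pow, Add, Sqrt, Div, Mul."""
    mean = x.mean(axis=axis, keepdims=True)               # ReduceMean
    centered = x - mean                                   # Sub
    var = (centered ** 2).mean(axis=axis, keepdims=True)  # Pow + ReduceMean
    normed = centered / np.sqrt(var + eps)                # Add + Sqrt + Div
    return normed * gamma + beta                          # Mul + Add

x = np.random.randn(2, 4, 8).astype(np.float32)
gamma = np.ones(8, dtype=np.float32)
beta = np.zeros(8, dtype=np.float32)
y = layernorm_decomposed(x, gamma, beta)
```

Because each step is a plain elementwise or reduction op, TensorRT can fuse the chain instead of falling back on an unsupported composite operator.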


3. Test Results and Analysis

Configuration         mAP     Latency (ms)
CPU-FP32              0.2864  -
Orin mixed-precision  0.2873  64.48

On Orin, the optimized model runs in 64.48 ms, a significant improvement over the previous 159 ms, and meets the 10 Hz real-time requirement. Meanwhile, mAP improved slightly (0.2864 → 0.2873), showing that the optimizations did not hurt accuracy.


4. Detailed Steps

4.1 Baseline Accuracy Test

First, test the accuracy of the original model in the Paddle3D-PETRv1 environment to establish a correct baseline for the comparisons that follow.

# Clean up previous inference results
rm ../Paddle3D/model_iodata/*labels.bin -f
rm ../Paddle3D/model_iodata/*scores.bin -f
rm ../Paddle3D/model_iodata/*bboxes.bin -f

# Set environment variable (disable TF32 so accuracy is comparable)
export NVIDIA_TF32_OVERRIDE=0

# Run Paddle inference
python3 0_run_paddle_infer.py

# Prepare the config file and dataset symlink
cp ../Paddle3D/configs/petr/petr_vovnet_gridmask_p4_800x320.yml ./
ln -s /home/Paddle3D/data .

# Accuracy evaluation
python3 1_run_petrv_eval.py

Output

mAP: 0.2864
mATE: 0.8415
mASE: 0.4668
mAOE: 0.7016
mAVE: 0.8735
mAAE: 0.3094
NDS: 0.3239
Eval time: 6.0s

Per-class results:
Object Class    AP      ATE     ASE     AOE     AVE     AAE
car     0.509   0.677   0.170   0.147   0.246   0.060
truck   0.357   0.895   0.238   0.105   0.166   0.017
bus     0.357   1.090   0.120   0.424   2.114   0.190
trailer 0.000   1.000   1.000   1.000   1.000   1.000
construction_vehicle    0.000   1.000   1.000   1.000   1.000   1.000
pedestrian      0.497   0.696   0.251   0.674   0.494   0.208
motorcycle      0.415   0.846   0.322   1.272   0.075   0.001
bicycle 0.123   0.751   0.225   0.692   1.893   0.000
traffic_cone    0.607   0.461   0.342   nan     nan     nan
barrier 0.000   1.000   1.000   1.000   nan     nan
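The *_img.bin and result files passed between these scripts are headerless raw float32 dumps (shapes as in 0_run_paddle_infer.py). A small round-trip sketch of the format, using a temporary directory rather than the real model_iodata paths:

```python
import numpy as np
import tempfile, os

# Shapes used by the pipeline (see 0_run_paddle_infer.py):
# images (1, 6, 3, 320, 800) float32; bboxes (300, 9) float32.
bboxes = np.random.rand(300, 9).astype(np.float32)

path = os.path.join(tempfile.mkdtemp(), "0_bboxes.bin")
with open(path, "wb") as f:
    f.write(bboxes.tobytes())  # raw dump, no header or dtype metadata

with open(path, "rb") as f:
    restored = np.frombuffer(f.read(), dtype=np.float32).reshape(300, 9)
```

Because the files carry no metadata, the reader must know the dtype and shape in advance, which is why each script hard-codes the reshape.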

4.2 Subgraph Generation and Model Conversion

Step 1: Generate the subgraph preceding NMSFreeCoder

Use a PaddlePaddle 2.4.2 container to generate the model subgraph, stripping the post-processing part to ease subsequent optimization.

cd /home/apollo
docker run --gpus all --shm-size=128g -it -e NVIDIA_VISIBLE_DEVICES=all \
	--privileged --net=host \
	-v $PWD:/home -w /home \
	--rm registry.baidubce.com/paddlepaddle/paddle:2.4.2-gpu-cuda11.7-cudnn8.4-trt8.4 /bin/bash	
cd /home/bev_opt_ver3
rm -rf graph0
python3 2_gen_graph_0.py
exit

Step 2: Convert to an ONNX model

Use a PaddlePaddle 3.1 container for the conversion, and install the latest Paddle2ONNX from source.

cd /home/apollo
# Create the container
docker run --gpus all --shm-size=128g -it -e NVIDIA_VISIBLE_DEVICES=all \
	--privileged --net=host \
	-v $PWD:/home -w /home \
	--name bev_opt_ver3_paddle31 ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:3.1.0-gpu-cuda12.6-cudnn9.5 /bin/bash	

# Enter the project directory
cd /home/bev_opt_ver3/

# Install Paddle2ONNX from source (pinned commit)
git clone https://github.com/PaddlePaddle/Paddle2ONNX.git
cd Paddle2ONNX/
git checkout 3e77ec7d13a055225e1cb8b7a4abb1a7ae7d1d58
git submodule update --init --recursive
bash .github/workflows/scripts/download_protobuf.sh
cp installed_protobuf/include/google /usr/include/x86_64-linux-gnu/ -rf
cp installed_protobuf/lib64/* /usr/lib/x86_64-linux-gnu/ -rf
cp installed_protobuf/bin/protoc* /usr/bin/ -rf
sed -i '/paddlepaddle==3.0.0.*/d' pyproject.toml
sed -i 's#"site-packages"#"dist-packages"#g' CMakeLists.txt
pip3 install .
pip3 install packaging
pip3 install onnxsim

# Convert the model (the minimum opset is 17)
cd /home/bev_opt_ver3/
rm -f graph0.onnx
paddle2onnx \
  --model_dir graph0 \
  --model_filename petr_inference.pdmodel \
  --params_filename petr_inference.pdiparams \
  --save_file graph0.onnx

4.3 ONNX Model Processing and Optimization

Step 1: Constant-fold the extrinsics branch
# Save the outputs of the extrinsics branch
python3 3_save_img2lidars_subgraph.py

# Constant-fold the extrinsics branch
rm -f graph0_img2lidars_folding.onnx
python3 4_run_img2lidars_folding.py
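Conceptually, the folding in step 1 evaluates the extrinsics branch once offline and bakes its output into the graph as a constant. A NumPy-only sketch of the idea; `frustum` and the einsum below are hypothetical stand-ins for the branch's real fixed inputs and ops:

```python
import numpy as np

# Fixed extrinsics: one 4x4 img2lidar matrix per camera, shape (1, 6, 4, 4).
img2lidars = np.tile(np.eye(4, dtype=np.float32), (1, 6, 1, 1))

# Hypothetical fixed frustum points in homogeneous coordinates, shape (N, 4).
frustum = np.random.rand(100, 4).astype(np.float32)

def extrinsics_branch(img2lidars, frustum):
    # Apply each camera's transform to every frustum point.
    return np.einsum('bcij,nj->bcni', img2lidars, frustum)

# Offline: evaluate the branch once; the result becomes a graph initializer.
folded_constant = extrinsics_branch(img2lidars, frustum)

# Online: every inference reuses folded_constant instead of recomputing
# the branch, so its dynamic ops disappear from the TensorRT graph.
```

With identity extrinsics (as in this toy setup) the folded tensor simply reproduces the input points, which makes the result easy to verify.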
Step 2: Split the model and replace operators
rm -f graph0_img2lidars_folding_part1.onnx
rm -f graph0_img2lidars_folding_part2.onnx
python3 5_split_graph.py

# Convert the transformer subgraph to opset 12 and replace unsupported operators
python3 6_replace_unsupported_ops.py graph0_img2lidars_folding_part2.onnx graph0_img2lidars_folding_part2.onnx
Step 3: ONNX inference and accuracy verification
# Remove previously generated inference results
rm ../Paddle3D/model_iodata/*labels.bin -f
rm ../Paddle3D/model_iodata/*scores.bin -f
rm ../Paddle3D/model_iodata/*bboxes.bin -f

# Run ONNX inference and generate results
python3 7_onnx_forward_two_parts.py graph0_img2lidars_folding_part1.onnx graph0_img2lidars_folding_part2.onnx

# Compute inference accuracy in the Paddle3D-PETRv1 evaluation environment
python3 1_run_petrv_eval.py

Output

mAP: 0.2864
mATE: 0.8415
mASE: 0.4668
mAOE: 0.7016
mAVE: 0.8735
mAAE: 0.3094
NDS: 0.3239
Eval time: 6.1s

Per-class results:
Object Class    AP      ATE     ASE     AOE     AVE     AAE
car     0.509   0.677   0.170   0.147   0.246   0.060
truck   0.357   0.895   0.238   0.105   0.166   0.017
bus     0.357   1.090   0.120   0.424   2.114   0.190
trailer 0.000   1.000   1.000   1.000   1.000   1.000
construction_vehicle    0.000   1.000   1.000   1.000   1.000   1.000
pedestrian      0.497   0.696   0.251   0.674   0.494   0.208
motorcycle      0.415   0.846   0.322   1.272   0.075   0.001
bicycle 0.123   0.751   0.225   0.692   1.893   0.000
traffic_cone    0.607   0.461   0.342   nan     nan     nan
barrier 0.000   1.000   1.000   1.000   nan     nan
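The eval above confirms parity at the metric level; the raw result dumps of two runs can also be diffed directly. A sketch with hypothetical run_a/run_b directories (here filled with synthetic identical data):

```python
import numpy as np
import os, glob, tempfile

def compare_bins(dir_a, dir_b, pattern="*_bboxes.bin", atol=1e-4):
    """Compare same-named raw float32 dumps from two inference runs;
    returns the list of files whose contents differ."""
    mismatches = []
    for path_a in sorted(glob.glob(os.path.join(dir_a, pattern))):
        path_b = os.path.join(dir_b, os.path.basename(path_a))
        a = np.fromfile(path_a, dtype=np.float32)
        b = np.fromfile(path_b, dtype=np.float32)
        if a.shape != b.shape or not np.allclose(a, b, atol=atol):
            mismatches.append(os.path.basename(path_a))
    return mismatches

# Demo with two synthetic, identical result directories.
run_a, run_b = tempfile.mkdtemp(), tempfile.mkdtemp()
data = np.random.rand(300, 9).astype(np.float32)
for d in (run_a, run_b):
    data.tofile(os.path.join(d, "0_bboxes.bin"))

diff = compare_bins(run_a, run_b)
```

A per-file diff like this localizes a regression to a specific sample much faster than rerunning the full nuScenes evaluation.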

4.4 Deployment and Testing on Orin

Step 1: Build and run the inference program
g++ -O3 -o 8_run_cuda_infer 8_run_cuda_infer.cpp  \
    -Wno-deprecated-declarations \
    -I /usr/local/cuda/include \
    -I /usr/local/TensorRT-10.6.0.26/include \
    -L /usr/local/cuda/lib64 \
    -L /usr/local/TensorRT-10.6.0.26/lib \
    -lcudart  -ldl -lpthread -lnvinfer -lnvinfer_plugin \
    -lnvonnxparser -lopenblas  -lstdc++fs 

# Prepare the run (disable TF32, clean old results)
export NVIDIA_TF32_OVERRIDE=0
rm ../Paddle3D/model_iodata/*labels.bin -f
rm ../Paddle3D/model_iodata/*scores.bin -f
rm ../Paddle3D/model_iodata/*bboxes.bin -f

# Run inference and generate results
rm *.engine *.cache -f
./8_run_cuda_infer graph0_img2lidars_folding_part1.onnx graph0_img2lidars_folding_part2.onnx

Output

80  E2E latency: 64.4814 ms;
Step 2: Accuracy test
# Run in the Paddle3D-PETRv1 accuracy-evaluation environment
python3 1_run_petrv_eval.py

Output

mAP: 0.2873
mATE: 0.8467
mASE: 0.4654
mAOE: 0.6979
mAVE: 0.8546
mAAE: 0.3123
NDS: 0.3259
Eval time: 6.7s

Per-class results:
Object Class    AP      ATE     ASE     AOE     AVE     AAE
car     0.516   0.660   0.170   0.141   0.253   0.062
truck   0.365   0.922   0.235   0.114   0.179   0.019
bus     0.353   1.114   0.127   0.313   1.962   0.207
trailer 0.000   1.000   1.000   1.000   1.000   1.000
construction_vehicle    0.000   1.000   1.000   1.000   1.000   1.000
pedestrian      0.503   0.682   0.250   0.671   0.493   0.210
motorcycle      0.413   0.852   0.323   1.375   0.073   0.001
bicycle 0.132   0.745   0.204   0.667   1.877   0.000
traffic_cone    0.592   0.492   0.345   nan     nan     nan
barrier 0.000   1.000   1.000   1.000   nan     nan

5. Summary

This article presented the second optimization plan for the PETRv1 model. Through constant folding of the extrinsics branch, replacement of the LayerNormalization operator, mixed-precision inference, and CUDA Graph optimization, inference latency on Orin was reduced to 64.48 ms, meeting the 10 Hz real-time requirement while maintaining high accuracy (mAP = 0.2873).


Notes

  • Constant folding relies on fixed extrinsics; in real applications, adjust this to the deployment scenario.
  • Mixed-precision inference may be affected by hardware and driver versions; validate thoroughly on the target platform.
  • Use a recent TensorRT and CUDA toolchain for best performance.

6. Related Code

0_run_paddle_infer.py
import numpy as np
import paddle
import os
import sys
import time
import paddle.inference as paddle_infer
import glob
import tqdm

def main():
    config = paddle_infer.Config("/home/petrv1/petr_inference.pdmodel", 
                                 "/home/petrv1/petr_inference.pdiparams")
    config.enable_use_gpu(256, 0) 
    config.disable_glog_info()
    predictor = paddle_infer.create_predictor(config)
    input_names = predictor.get_input_names()
    output_names = predictor.get_output_names()
    print("input_names:",input_names)
    print("output_names:",output_names)
    idx=0
    while True:        
        img_path=f'../Paddle3D/model_iodata/{idx}_img.bin'
        if not os.path.exists(img_path):
            break

        with open(img_path, 'rb') as f:
            input_images = np.frombuffer(f.read(), dtype=np.float32).reshape((1,6, 3, 320, 800))
        
        # Always use the extrinsics of the first sample
        input_img2lidars_path="../Paddle3D/model_iodata/0_img2lidars.bin"
        with open(input_img2lidars_path, 'rb') as f:
            input_img2lidars = np.frombuffer(f.read(), dtype=np.float32).reshape((1,6,4,4))
        
        predictor.get_input_handle(input_names[0]).copy_from_cpu(input_images)
        predictor.get_input_handle(input_names[1]).copy_from_cpu(input_img2lidars)  
        predictor.run()
        output0_tensor = predictor.get_output_handle(output_names[0])
        output1_tensor = predictor.get_output_handle(output_names[1])
        output2_tensor = predictor.get_output_handle(output_names[2])                
        bboxes = output0_tensor.copy_to_cpu()
        scores = output1_tensor.copy_to_cpu()
        labels = output2_tensor.copy_to_cpu()
        with open(f'../Paddle3D/model_iodata/{idx}_bboxes.bin', 'wb') as f:
            f.write(bboxes.tobytes())
        with open(f'../Paddle3D/model_iodata/{idx}_scores.bin', 'wb') as f:
            f.write(scores.tobytes())
        with open(f'../Paddle3D/model_iodata/{idx}_labels.bin', 'wb') as f:
            f.write(labels.tobytes())
        idx+=1
if __name__ == "__main__":
    main()
1_run_petrv_eval.py
import argparse
import os
import random
import numpy as np
import paddle
from paddle3d.apis.config import Config
from paddle3d.apis.trainer import Trainer
from paddle3d.slim import get_qat_config
from paddle3d.utils.logger import logger
from paddle3d.sample import Sample, SampleMeta
from paddle3d.geometries import BBoxes3D

def bbox3d2result(bboxes, scores, labels, attrs=None):
    """Convert detection results to a list of numpy arrays.
    """
    result_dict = dict(
        boxes_3d=bboxes, scores_3d=scores, labels_3d=labels)

    if attrs is not None:
        result_dict['attrs_3d'] = attrs

    return result_dict
    
class CustomTrainer(Trainer):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
           
    def _parse_results_to_sample(self, results: dict, sample: dict):
        num_samples = len(results)
        new_results = []
        for i in range(num_samples):
            data = Sample(None, sample["modality"][i])
            bboxes_3d = results[i]['pts_bbox']["boxes_3d"]
            labels = results[i]['pts_bbox']["labels_3d"]
            confidences = results[i]['pts_bbox']["scores_3d"]
            bottom_center = bboxes_3d[:, :3]
            gravity_center = np.zeros_like(bottom_center)
            gravity_center[:, :2] = bottom_center[:, :2]
            gravity_center[:, 2] = bottom_center[:, 2] + bboxes_3d[:, 5] * 0.5
            bboxes_3d[:, :3] = gravity_center
            data.bboxes_3d = BBoxes3D(bboxes_3d[:, 0:7])
            data.bboxes_3d.coordmode = 'Lidar'
            data.bboxes_3d.origin = [0.5, 0.5, 0.5]
            data.bboxes_3d.rot_axis = 2
            data.bboxes_3d.velocities = bboxes_3d[:, 7:9]
            data['bboxes_3d_numpy'] = bboxes_3d[:, 0:7]
            data['bboxes_3d_coordmode'] = 'Lidar'
            data['bboxes_3d_origin'] = [0.5, 0.5, 0.5]
            data['bboxes_3d_rot_axis'] = 2
            data['bboxes_3d_velocities'] = bboxes_3d[:, 7:9]
            data.labels = labels
            data.confidences = confidences
            data.meta = SampleMeta(id=sample["meta"][i]['id'])
            if "calibs" in sample:
                calib = [calibs.numpy()[i] for calibs in sample["calibs"]]
                data.calibs = calib
            new_results.append(data)
        return new_results
    
    def simple_test_pts(self,idx):
        with open(f'../Paddle3D/model_iodata/{idx}_bboxes.bin', 'rb') as f:
            bboxes = np.frombuffer(f.read(), dtype=np.float32).reshape((300,9)).copy()
        with open(f'../Paddle3D/model_iodata/{idx}_scores.bin', 'rb') as f:
            scores = np.frombuffer(f.read(), dtype=np.float32).reshape((300,)).copy()
        with open(f'../Paddle3D/model_iodata/{idx}_labels.bin', 'rb') as f:
            labels = np.frombuffer(f.read(), dtype=np.int64).reshape((300,)).copy()
        bbox_results = [bbox3d2result(bboxes, scores, labels)]
        return bbox_results
        
    def evaluate(self):
        msg = 'evaluate on validate dataset'
        metric_obj = self.val_dataset.metric
        for idx, sample in self.logger.enumerate(self.eval_dataloader, msg=msg):
            img_metas = sample['meta']
            bbox_list = [dict() for i in range(len(img_metas))]
            bbox_pts = self.simple_test_pts(idx)
            for result_dict, pts_bbox in zip(bbox_list, bbox_pts):
                result_dict['pts_bbox'] = pts_bbox
            results=bbox_list            
            preds=self._parse_results_to_sample(bbox_list,sample)     
            metric_obj.update(predictions=preds, ground_truths=sample)
        metrics = metric_obj.compute(verbose=True)
        return metrics

batch_size=1
cfg = Config(path='petr_vovnet_gridmask_p4_800x320.yml', batch_size=batch_size)
dic = cfg.to_dict()
batch_size = dic.pop('batch_size')
dic.update({
    'dataloader_fn': {
        'batch_size': batch_size,
        'num_workers': 1}})
dic['checkpoint'] = None
dic['resume'] = False
trainer = CustomTrainer(**dic)
trainer.evaluate()
2_gen_graph_0.py
import argparse
import sys
import numpy as np
import paddle
import paddle.static as static
import paddle.fluid.core as core

def new_prepend_feed_ops(inference_program,
                     feed_target_names,
                     feed_holder_name='feed'):
    import paddle.fluid.core as core
    if len(feed_target_names) == 0:
        return
    global_block = inference_program.global_block()
    feed_var = global_block.create_var(
        name=feed_holder_name,
        type=core.VarDesc.VarType.FEED_MINIBATCH,
        persistable=True)
    for i, name in enumerate(feed_target_names):
        if not global_block.has_var(name):
            continue
        out = global_block.var(name)
        global_block._prepend_op(
            type='feed',
            inputs={'X': [feed_var]},
            outputs={'Out': [out]},
            attrs={'col': i})

def append_fetch_ops(program, fetch_target_names, fetch_holder_name='fetch'):
    import paddle.fluid.core as core
    global_block = program.global_block()
    fetch_var = global_block.create_var(
        name=fetch_holder_name,type=core.VarDesc.VarType.FETCH_LIST,persistable=True)
    for i, name in enumerate(fetch_target_names):
        global_block.append_op(
            type='fetch',
            inputs={'X': [name]},
            outputs={'Out': [fetch_var]},
            attrs={'col': i})

def insert_fetch(program, fetchs, fetch_holder_name="fetch"):
    global_block = program.global_block()
    need_to_remove_op_index = list()
    for i, op in enumerate(global_block.ops):
        if op.type == 'fetch':
            need_to_remove_op_index.append(i)
    for index in need_to_remove_op_index[::-1]:
        global_block._remove_op(index)
    program.desc.flush()
    append_fetch_ops(program, fetchs, fetch_holder_name)
    
def process_old_ops_desc(program):
    for i in range(len(program.blocks[0].ops)):
        if program.blocks[0].ops[i].type == "matmul":
            if not program.blocks[0].ops[i].has_attr("head_number"):
                program.blocks[0].ops[i]._set_attr("head_number", 1)


def infer_shape(program, input_shape_dict):
    paddle.enable_static()
    for k, v in input_shape_dict.items():
        program.blocks[0].var(k).desc.set_shape(v)
    for i in range(len(program.blocks)):
        for j in range(len(program.blocks[0].ops)):
            try:
                program.blocks[i].ops[j].desc.infer_shape(program.blocks[i].desc)
            except:
                pass

def replace_shape_0_with_constant(program, constant_value):
    global_block = program.global_block()
    if not global_block.has_var("shape_0.tmp_0"):
        print("Warning: shape_0.tmp_0 not found in program, skipping replacement.")
        return

    target_var = global_block.var("shape_0.tmp_0")
    
    # Find the op that produces this variable
    op_index_to_remove = None
    for idx, op in enumerate(global_block.ops):
        if "shape_0.tmp_0" in op.output_arg_names:
            op_index_to_remove = idx
            break
    
    # Remove the original op
    if op_index_to_remove is not None:
        global_block._remove_op(op_index_to_remove)
    
    # Create the constant value
    constant_tensor = np.array(constant_value, dtype=np.int32)
    
    # Insert the constant op at the start of the graph (after any feed ops)
    insert_idx = 0
    for idx, op in enumerate(global_block.ops):
        if op.type == 'feed':
            insert_idx = idx + 1
    
    global_block._insert_op(
        index=insert_idx,
        