Alibaba's Open-Source SenseVoice: Building a Practical STT (Speech-to-Text) Application

Original article, last modified 2025-06-23 14:09:03

License: CC 4.0 BY-SA

Tags: #STT #SenseVoice #SpeechToText #Alibaba #Python

First published 2025-05-23 09:22:36

1. Introduction

1.1 SenseVoice Overview

Alibaba's Tongyi Lab has released FunAudioLLM, a family of foundation audio models comprising two models: SenseVoice and CosyVoice.

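SenseVoice focuses on high-accuracy multilingual speech recognition, emotion recognition, and audio event detection.

Multilingual recognition

Trained on more than 400,000 hours of data, it supports more than 50 languages and achieves better recognition accuracy than the Whisper model.

Rich transcription

It offers strong emotion recognition, matching or exceeding the current best emotion recognition models on test data.
It also detects audio events, covering common human-computer interaction events such as music, applause, laughter, crying, coughing, and sneezing.

Efficient inference

The SenseVoice-Small model uses a non-autoregressive end-to-end framework with very low inference latency: 10 s of audio takes only 70 ms to process, 15 times faster than Whisper-Large.

Fine-tuning

Convenient fine-tuning scripts and strategies are provided so that users can fix long-tail sample problems for their own business scenarios.

Service deployment

A complete service deployment pipeline is provided. It supports concurrent requests, and client languages include Python, C++, HTML, Java, and C#.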
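The latency figure quoted for SenseVoice-Small (70 ms for 10 s of audio) can be restated as a real-time factor, i.e. processing time divided by audio duration. The sketch below only redoes that arithmetic; the helper name `real_time_factor` is made up for illustration and is not part of SenseVoice or FunASR.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by audio duration (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Figures quoted in the text: 70 ms of compute for a 10 s clip.
rtf = real_time_factor(0.070, 10.0)
print(f"RTF = {rtf:.3f}")                      # RTF = 0.007
print(f"speed relative to real time = {1 / rtf:.0f}x")  # about 143x
```

An RTF well below 1 means the model transcribes much faster than the audio plays, which is what makes it suitable for interactive use.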


1.2 SenseVoice Resources

Open-source repository: https://github.com/FunAudioLLM/SenseVoice


Model page: https://modelscope.cn/models/iic/SenseVoiceSmall/files


Online demo: https://www.modelscope.cn/studios/iic/SenseVoice


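2. Installation

2.1 Install Anaconda

Installing Anaconda on Linux: see the reference article

Installing Anaconda on macOS: see the reference article

Installing Anaconda on Windows: see the reference article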

2.2 Create an Isolated Environment

# Create an environment named wn_sensevoice with Python 3.10 installed in it
conda create -n wn_sensevoice -y python=3.10

# Activate the environment
conda activate wn_sensevoice

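2.3 Download the SenseVoiceSmall Model

Download from the ModelScope platform

# Install the ModelScope client
pip install modelscope

# Download the SenseVoiceSmall model to a local directory (replace with your own local path)
modelscope download --model iic/SenseVoiceSmall --local_dir C:\Users\woniu\model\FunAudioLLM\SenseVoiceSmall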
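Later code has to point at the directory chosen in the download command. As a rough sketch, the helper below returns the local directory when it exists and contains files, and otherwise falls back to the model ID `iic/SenseVoiceSmall`. The helper name `resolve_model` is hypothetical, and the fallback assumes that FunASR's `AutoModel` can also take a hub model ID and fetch the model itself, which this article does not demonstrate.

```python
from pathlib import Path

# Same model ID as in the download command above.
HUB_MODEL_ID = "iic/SenseVoiceSmall"


def resolve_model(local_dir: str) -> str:
    """Return local_dir if it exists and is non-empty, else the hub model ID."""
    path = Path(local_dir).expanduser()
    if path.is_dir() and any(path.iterdir()):
        return str(path)
    # Assumption: the caller passes this ID to a loader that can download it.
    return HUB_MODEL_ID


if __name__ == "__main__":
    print(resolve_model(r"C:\Users\woniu\model\FunAudioLLM\SenseVoiceSmall"))
```

The check is deliberately shallow: it does not verify that the download completed, only that the directory is not empty.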

2.4 Install the Python Project Dependencies

# Install PyTorch, torchvision, and torchaudio (the default PyPI wheels are CPU-only on Windows)
pip install torch torchvision torchaudio

# Install ffmpeg
conda install ffmpeg

# Install funasr
pip install funasr
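After these installs it can help to confirm that everything is visible from the active environment before writing any project code. The following is a minimal sketch using only the standard library; the function name `check_environment` and the default module list are choices made for this example, not something provided by FunASR or ModelScope.

```python
import importlib.util
import shutil


def check_environment(modules=("torch", "torchaudio", "funasr", "modelscope")) -> dict:
    """Map each requirement to True/False depending on whether it was found."""
    status = {name: importlib.util.find_spec(name) is not None for name in modules}
    # ffmpeg is an executable rather than a Python package, so look it up on PATH.
    status["ffmpeg"] = shutil.which("ffmpeg") is not None
    return status


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name:12s} {'ok' if ok else 'MISSING'}")
```

Run it inside the `wn_sensevoice` environment; any line marked MISSING points at the install step to repeat.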

2.5 GPU Acceleration (Depending on Your Hardware)

# Check whether a GPU environment is available
python -c "import torch; print(torch.__version__); print(torch.version.cuda); print(torch.cuda.is_available())"

# Uninstall the CPU build of torch
pip uninstall torch torchvision torchaudio

# Install the GPU build of torch
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
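The same availability check can be folded into application code so that one script runs on machines with or without a GPU. This is a small sketch; `pick_device` is a name invented here, and it treats a missing PyTorch installation as "CPU only" purely for convenience.

```python
def pick_device() -> str:
    """Return "cuda:0" when PyTorch can see a CUDA GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment; report CPU.
        return "cpu"
    return "cuda:0" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print(f"Selected device: {pick_device()}")
```

The returned string has the form that the `device` argument in the example code below expects.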

3. Project

3.1 Create a New Project

Use PyCharm to create a new Python project, and select the environment created earlier as its interpreter.

3.2 Example Code

import torch
from typing import Optional
from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess

class SenseVoiceSmallDemo:
    def __init__(self):
        print("Initializing SenseVoiceSmallDemo")

        # Path to the downloaded model (replace with your own local path)
        model_dir = r"C:\Users\woniu\model\FunAudioLLM\SenseVoiceSmall"

        # Use the first CUDA GPU if PyTorch can see one, otherwise fall back to the CPU
        device = "cuda:0" if torch.cuda.is_available() else "cpu"

        # Initialize the model
        # trust_remote_code: when True, allows loading the model's custom (remote) code
        # remote_code: (not set here) path to a local file such as ./model.py that contains that custom code
        # vad_model: name or identifier of the voice activity detection (VAD) model, e.g. fsmn-vad
        self.model = AutoModel(
            model=model_dir,
            disable_update=True,  # disable the version check
            trust_remote_code=True,
            vad_model="fsmn-vad",
            vad_kwargs={"max_single_segment_time": 30000},
            device=device,  # device selected above
        )

    # Transcribe an audio file to text
    def audio_to_text(self, file_path: Optional[str]):
        # Run speech recognition
        res = self.model.generate(
            input=file_path,
            language="auto",  # "zh", "en", "yue", "ja", "ko", "nospeech"
            use_itn=True,
            batch_size_s=60,
            merge_vad=True,  # merge short segments produced by the VAD model
            merge_length_s=15,
        )

        # Post-process the result
        if res and res[0]["text"]:
            text = rich_transcription_postprocess(res[0]["text"])
            if text:
                return text
        return ""

if __name__ == "__main__":
    wnDemo = SenseVoiceSmallDemo()
    file_path = r"..\CosyVoice2\zero_shot_0.wav"  # replace with your own audio file
    text_content = wnDemo.audio_to_text(file_path)
    print(text_content)
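The example passes `res[0]["text"]` through `rich_transcription_postprocess` before returning it. To look at the language, emotion, and event labels separately, one option is to split them out of the raw text first. The sketch below assumes the raw text carries markers of the form `<|...|>` ahead of the transcript, which is how SenseVoice output is commonly shown but is not stated in this article. The sample string is illustrative, not real model output, and `split_tags` is a hypothetical helper.

```python
import re

# Matches markers such as <|en|> or <|NEUTRAL|> and captures the label inside.
TAG_PATTERN = re.compile(r"<\|([^|]*)\|>")


def split_tags(raw: str):
    """Return (tags, text): the <|...|> labels in order, and the text without them."""
    tags = TAG_PATTERN.findall(raw)
    text = TAG_PATTERN.sub("", raw).strip()
    return tags, text


# Illustrative input only; the exact labels a model emits may differ.
sample = "<|en|><|NEUTRAL|><|Speech|><|withitn|>Hello world."
tags, text = split_tags(sample)
print(tags)  # ['en', 'NEUTRAL', 'Speech', 'withitn']
print(text)  # Hello world.
```

In the demo class, this would be applied to `res[0]["text"]` before the post-processing call, for example to log the detected emotion alongside the transcript.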