I recently wanted to see how the Qwen-VL series is actually deployed locally, but all the code I found on CSDN wraps Qwen-VL as a server before calling it, which I was not quite used to. I finally found a plain local example, which I paste below:
from transformers import AutoProcessor
from vllm import LLM, SamplingParams
from qwen_vl_utils import process_vision_info
MODEL_PATH = "Qwen/Qwen2.5-VL-7B-Instruct"
# Initialize the LLM with multimodal capabilities
llm = LLM(
    model=MODEL_PATH,
    limit_mm_per_prompt={"image": 10, "video": 10},
)
# Define generation parameters
sampling_params = SamplingParams(
    temperature=0.1,
    top_p=0.001,
    repetition_penalty=1.05,
    max_tokens=256,
    stop_token_ids=[],
)
# Example message with video content
video_messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": [
            {"type": "text", "text": "Please summarize this video"},
            {
                "type": "video",
                "video": "https://example.com/video.mp4",
                "total_pixels": 20480 * 28 * 28,
                "min_pixels": 16 * 28 * 28
            }
        ]
    },
]
# Process the message
processor = AutoProcessor.from_pretrained(MODEL_PATH)
prompt = processor.apply_chat_template(
    video_messages,
    tokenize=False,
    add_generation_prompt=True,
)
image_inputs, video_inputs, video_kwargs = process_vision_info(
    video_messages, return_video_kwargs=True
)
# Prepare multimodal data for vLLM
mm_data = {}
if image_inputs is not None:
    mm_data["image"] = image_inputs
if video_inputs is not None:
    mm_data["video"] = video_inputs
# Create input for vLLM
llm_inputs = {
    "prompt": prompt,
    "multi_modal_data": mm_data,
    "mm_processor_kwargs": video_kwargs,
}
# Generate response
outputs = llm.generate([llm_inputs], sampling_params=sampling_params)
generated_text = outputs[0].outputs[0].text
print(generated_text)
I then realized this is simply the example from the official Qwen2.5-VL repository…
The example above only handles a single sample. Below is reference code for running inference on multiple samples at once:
# Assume the model and processor have already been loaded, following the same steps as in the code above.
# In the code below, the vLLM LLM instance is the variable `model` and the processor is the variable `processor`.
# Two variables are not shown here: `image_paths` and `user_query_ls`, two lists holding the image paths and the instructions for the model, respectively.
# Preprocessing: same as before, except that multiple samples are handled at once.
messages = [
    [
        {
            "role": "system",
            "content": "You are a helpful assistant.",
        },
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_paths[i]},  # for video input, change this entry accordingly
                {"type": "text", "text": user_query_ls[i]},
            ],
        }
    ] for i in range(len(image_paths))
]
# Preparation for batch inference
texts = [
    processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
    for msg in messages
]
image_inputs, video_inputs = process_vision_info(messages)
mm_data = [
    {
        "image": image_inputs[i] if image_inputs is not None else None,
        # "video": video_inputs[i] if video_inputs is not None else None,  # commented out because I did not use video; if you do pass video input, this still has to be provided
    } for i in range(len(texts))
]
llm_inputs_ls = [
    {
        "prompt": texts[i],
        "multi_modal_data": mm_data[i],
    } for i in range(len(texts))
]
outputs = model.generate(
    llm_inputs_ls,
    sampling_params=SamplingParams(
        max_tokens=10240,
        temperature=0.1,
        top_p=0.95,
        # top_k=40,
    ),
    use_tqdm=False,
)
output_texts = [
    output.outputs[0].text for output in outputs
]
I saw people mention that even when you feed in a whole batch, Qwen2.5-VL under vLLM still processes the samples one by one: https://github.com/QwenLM/Qwen2.5-VL/issues/882
This is also why Qwen2.5-VL recommends handling data through a server instead.
So from now on I will do the same and call the model through a server.
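As a rough sketch of what that server-based setup could look like (the port, served model name, and image URL below are placeholders, and CLI flags may differ across vLLM versions; vLLM exposes an OpenAI-compatible endpoint, so the official openai client can talk to it):

# 1) Start the server in a shell, e.g.:
#    vllm serve Qwen/Qwen2.5-VL-7B-Instruct
#    (add flags such as a multimodal limit as needed for your vLLM version)
# 2) Send requests through the OpenAI-compatible API:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # default vLLM port; any api_key string works

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-7B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/demo.jpg"}},
            {"type": "text", "text": "Describe this image."},
        ]},
    ],
    temperature=0.1,
    max_tokens=256,
)
print(response.choices[0].message.content)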
References: