align llm_engine and async_engine step method. #1081

esmeetu · 2023-09-18T13:55:01Z

#1059 might make the async engine stop working.
This PR aligns the step behavior of llm_engine and async_engine, like #1029.
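
To illustrate what keeping two step methods aligned means in general, here is a toy sketch. All names (`ToyEngine`, `ToyAsyncEngine`, `step_async`, and the helper methods) are hypothetical and do not reflect vLLM's actual code or the diff in this PR. The only point is that the sync and async paths run the same schedule, run, process sequence, so a change to one shared helper cannot silently break the other path.

```python
import asyncio
from typing import List


class ToyEngine:
    """Hypothetical stand-in for a synchronous engine; not vLLM's LLMEngine."""

    def __init__(self) -> None:
        self.pending: List[str] = []

    def add_request(self, prompt: str) -> None:
        self.pending.append(prompt)

    def _schedule(self) -> List[str]:
        # Take everything currently queued as one batch.
        batch, self.pending = self.pending, []
        return batch

    def _run_model(self, batch: List[str]) -> List[str]:
        # Placeholder for the model call.
        return [p.upper() for p in batch]

    def _process_outputs(self, raw: List[str]) -> List[str]:
        # Post-processing shared by both step variants.
        return [f"out:{r}" for r in raw]

    def step(self) -> List[str]:
        batch = self._schedule()
        if not batch:
            return []
        return self._process_outputs(self._run_model(batch))


class ToyAsyncEngine(ToyEngine):
    """Async variant that mirrors step() one call at a time."""

    async def _run_model_async(self, batch: List[str]) -> List[str]:
        await asyncio.sleep(0)  # yield to the event loop once
        return self._run_model(batch)

    async def step_async(self) -> List[str]:
        batch = self._schedule()
        if not batch:
            return []
        return self._process_outputs(await self._run_model_async(batch))


def demo() -> bool:
    """Feed both engines the same requests and compare their step results."""
    sync_engine, async_engine = ToyEngine(), ToyAsyncEngine()
    for engine in (sync_engine, async_engine):
        engine.add_request("hello")
        engine.add_request("world")
    return sync_engine.step() == asyncio.run(async_engine.step_async())
```

Calling `demo()` compares one step of each toy engine on identical input; a check of this kind is one way to notice when the two paths drift apart.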

zhuohan123

LGTM! Thanks for catching this!

…llm-project#1081)

This PR reverts vllm-project#691 that leads to `AttributeError: 'tuple' object has no attribute 'reshape'` for Qwen2.5-VL.

## Test

server:
```
python -m vllm.entrypoints.openai.api_server --port 8080 --model Qwen/Qwen2.5-VL-3B-Instruct --tensor-parallel-size 1 --max-num-seqs 128 --dtype bfloat16 --gpu-memory-util 0.9 --max-num-batched-tokens 32768 --max-model-len 32768 --block-size 128
```

Client:
```
python benchmark_serving.py --backend openai-chat --model Qwen/Qwen2.5-VL-3B-Instruct --trust-remote-code --port 8080 --endpoint /v1/chat/completions --dataset-path lmarena-ai/vision-arena-bench-v0.1 --dataset-name hf --hf-split train --num-prompts 40 --request-rate inf --seed 0 --ignore_eos
```

### as is
```
ERROR 04-14 16:51:14 engine.py:139] AttributeError("'tuple' object has no attribute 'reshape'")
ERROR 04-14 16:51:14 engine.py:139] Traceback (most recent call last):
ERROR 04-14 16:51:14 engine.py:139]   File "/root/vllm-fork/vllm/engine/multiprocessing/engine.py", line 137, in start
ERROR 04-14 16:51:14 engine.py:139]     self.run_engine_loop()
ERROR 04-14 16:51:14 engine.py:139]   File "/root/vllm-fork/vllm/engine/multiprocessing/engine.py", line 200, in run_engine_loop
ERROR 04-14 16:51:14 engine.py:139]     request_outputs = self.engine_step()
ERROR 04-14 16:51:14 engine.py:139]   File "/root/vllm-fork/vllm/engine/multiprocessing/engine.py", line 218, in engine_step
ERROR 04-14 16:51:14 engine.py:139]     raise e
ERROR 04-14 16:51:14 engine.py:139]   File "/root/vllm-fork/vllm/engine/multiprocessing/engine.py", line 209, in engine_step
ERROR 04-14 16:51:14 engine.py:139]     return self.engine.step()
ERROR 04-14 16:51:14 engine.py:139]   File "/root/vllm-fork/vllm/engine/llm_engine.py", line 1380, in step
ERROR 04-14 16:51:14 engine.py:139]     outputs = self.model_executor.execute_model(
ERROR 04-14 16:51:14 engine.py:139]   File "/root/vllm-fork/vllm/executor/executor_base.py", line 138, in execute_model
ERROR 04-14 16:51:14 engine.py:139]     output = self.collective_rpc("execute_model",
ERROR 04-14 16:51:14 engine.py:139]   File "/root/vllm-fork/vllm/executor/uniproc_executor.py", line 58, in collective_rpc
ERROR 04-14 16:51:14 engine.py:139]     answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 04-14 16:51:14 engine.py:139]   File "/root/vllm-fork/vllm/utils.py", line 2323, in run_method
ERROR 04-14 16:51:14 engine.py:139]     return func(*args, **kwargs)
ERROR 04-14 16:51:14 engine.py:139]   File "/root/vllm-fork/vllm/worker/hpu_worker.py", line 294, in execute_model
ERROR 04-14 16:51:14 engine.py:139]     output = LocalOrDistributedWorkerBase.execute_model(
ERROR 04-14 16:51:14 engine.py:139]   File "/root/vllm-fork/vllm/worker/worker_base.py", line 418, in execute_model
ERROR 04-14 16:51:14 engine.py:139]     output = self.model_runner.execute_model(
ERROR 04-14 16:51:14 engine.py:139]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 04-14 16:51:14 engine.py:139]     return func(*args, **kwargs)
ERROR 04-14 16:51:14 engine.py:139]   File "/root/vllm-fork/vllm/worker/hpu_model_runner.py", line 2697, in execute_model
ERROR 04-14 16:51:14 engine.py:139]     hidden_states = self.model.forward(
ERROR 04-14 16:51:14 engine.py:139]   File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 745, in forward
ERROR 04-14 16:51:14 engine.py:139]     return wrapped_hpugraph_forward(
ERROR 04-14 16:51:14 engine.py:139]   File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 610, in wrapped_hpugraph_forward
ERROR 04-14 16:51:14 engine.py:139]     outputs = orig_fwd(*args, **kwargs)
ERROR 04-14 16:51:14 engine.py:139]   File "/root/vllm-fork/vllm/worker/hpu_model_runner.py", line 423, in forward
ERROR 04-14 16:51:14 engine.py:139]     hidden_states = self.model(*args, **kwargs)
ERROR 04-14 16:51:14 engine.py:139]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1742, in _wrapped_call_impl
ERROR 04-14 16:51:14 engine.py:139]     return self._call_impl(*args, **kwargs)
ERROR 04-14 16:51:14 engine.py:139]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1848, in _call_impl
ERROR 04-14 16:51:14 engine.py:139]     return inner()
ERROR 04-14 16:51:14 engine.py:139]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1796, in inner
ERROR 04-14 16:51:14 engine.py:139]     result = forward_call(*args, **kwargs)
ERROR 04-14 16:51:14 engine.py:139]   File "/root/vllm-fork/vllm/model_executor/models/qwen2_5_vl.py", line 1104, in forward
ERROR 04-14 16:51:14 engine.py:139]     inputs_embeds = self.get_input_embeddings_v0(
ERROR 04-14 16:51:14 engine.py:139]   File "/root/vllm-fork/vllm/model_executor/models/qwen2_5_vl.py", line 1037, in get_input_embeddings_v0
ERROR 04-14 16:51:14 engine.py:139]     inputs_embeds = merge_multimodal_embeddings(
ERROR 04-14 16:51:14 engine.py:139]   File "/root/vllm-fork/vllm/model_executor/models/utils.py", line 448, in merge_multimodal_embeddings
ERROR 04-14 16:51:14 engine.py:139]     return _hpu_merge_multimodal_embeddings(
ERROR 04-14 16:51:14 engine.py:139]   File "/root/vllm-fork/vllm/model_executor/models/utils.py", line 674, in _hpu_merge_multimodal_embeddings
ERROR 04-14 16:51:14 engine.py:139]     multimodal_embeddings = multimodal_embeddings.reshape(-1, hidden_size)
```

### with this PR
```
100%|██████████| 1/1 [00:01<00:00, 1.14s/it]
============ Serving Benchmark Result ============
Successful requests:                     1
Benchmark duration (s):                  1.14
Total input tokens:                      52
Total generated tokens:                  128
Request throughput (req/s):              0.88
Output token throughput (tok/s):         112.62
Total Token throughput (tok/s):          158.37
---------------Time to First Token----------------
Mean TTFT (ms):                          169.75
Median TTFT (ms):                        169.75
P99 TTFT (ms):                           169.75
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          7.61
Median TPOT (ms):                        7.61
P99 TPOT (ms):                           7.61
---------------Inter-token Latency----------------
Mean ITL (ms):                           7.55
Median ITL (ms):                         7.59
P99 ITL (ms):                            8.10
==================================================
```

Co-authored-by: Michał Kuligowski <[email protected]>

align llm_engine and async_engine.

3c7a20b

esmeetu changed the title ~~align llm_engine and async_engine step.~~ align llm_engine and async_engine step method. Sep 18, 2023

zhuohan123 approved these changes Sep 18, 2023

zhuohan123 merged commit 95592fa into vllm-project:main Sep 18, 2023

esmeetu deleted the fix-engine branch September 22, 2023 13:49

hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024

align llm_engine and async_engine. (vllm-project#1081)

ca9c656
