Support sharding for auto_trainer #8164
Conversation
Thanks for your contribution!
Codecov Report
Attention: Patch coverage is …
Additional details and impacted files

@@            Coverage Diff             @@
##           develop    #8164      +/-   ##
===========================================
- Coverage    55.15%   55.15%    -0.01%
===========================================
  Files          601       601
  Lines        91764     91764
===========================================
- Hits         50614     50611        -3
- Misses       41150     41153        +3

☔ View full report in Codecov by Sentry.
…nto dev/support_sharding_auto_trainer
…nto dev/support_sharding_auto_trainer
if self.sharding_parallel_degree == -1:
    if len(self.sharding) > 0:
        self.sharding_parallel_degree = self.data_parallel_degree
        self.sharding_parallel_degree = world_size // (
            self.tensor_parallel_degree * self.sep_parallel_degree * self.pipeline_parallel_degree
        )
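For readers following the arithmetic, here is a small worked example of this degree computation. The concrete values are assumptions taken from the test topology in the PR description (PP4 / MP1 / sharding degree 2 on one 8-GPU node), not from the diff itself:

```python
# Worked example of sharding_parallel_degree = world_size // (tp * sep * pp).
world_size = 8                   # assumed: one A100 node with 8 GPUs
tensor_parallel_degree = 1       # MP1
sep_parallel_degree = 1          # sep parallelism not used
pipeline_parallel_degree = 4     # PP4

sharding_parallel_degree = world_size // (
    tensor_parallel_degree * sep_parallel_degree * pipeline_parallel_degree
)
print(sharding_parallel_degree)  # -> 2, i.e. Sharding_degree2
```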
sep_parallel_degree is not supported yet; raise an error if it is set in auto_parallel mode.
It can be fixed in the next PR.
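A minimal sketch of the kind of validation suggested above, assuming it would live in the auto-parallel branch of the training-args post-processing; the helper name and the exact fields it reads are hypothetical, not code from this PR:

```python
def _check_sep_parallel_not_set(args):
    # Hypothetical check: sep (sequence) parallelism is not supported on the
    # auto-parallel path yet, so fail fast instead of silently folding it into
    # the degree computation above.
    if getattr(args, "enable_auto_parallel", False) and getattr(args, "sep_parallel_degree", 1) > 1:
        raise NotImplementedError(
            "sep_parallel_degree > 1 is not supported when enable_auto_parallel is on; "
            "please leave sep_parallel_degree at its default."
        )
```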
LGTM
paddlenlp/trainer/trainer.py (outdated)
if (
    ShardingOption.SHARD_OP in self.args.sharding
    and not is_new_version_sharding_stage1_optimizer()
    and not self.args.enable_auto_parallel
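For readers of this thread, a self-contained restatement of what the guard above expresses; the function and argument names are placeholders, not code from the PR:

```python
def uses_legacy_stage1_optimizer_path(stage1_requested, new_stage1_optimizer, auto_parallel_enabled):
    # Placeholder restatement of the guard: the legacy stage-1 (SHARD_OP)
    # optimizer handling is only needed when stage-1 sharding is requested,
    # the installed Paddle still ships the old stage-1 optimizer, and the
    # unified dynamic/static auto-parallel path is off (that path configures
    # sharding through fleet.auto.Strategy instead, as seen later in this PR).
    return stage1_requested and not new_stage1_optimizer and not auto_parallel_enabled
```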
Shouldn't is_new_version_sharding_stage1_optimizer always be True on current versions of Paddle?
I'm not sure about that for now. This change targets the unified dynamic/static graph case, i.e. when enable_auto_parallel is turned on.
Did you run into a problem here during testing?
if ShardingOption.OFFLOAD in self.sharding:
    warnings.warn("`offload` is not supported NOW!")

strategy = fleet.auto.Strategy()
if self.data_parallel_degree > 1:
if self.dataset_world_size > 1:
Why? Are sharding and dp both treated as dp here?
Before this PR, data_parallel_degree = world_size // (tensor_parallel_degree * pipeline_parallel_degree). After this PR, data_parallel_degree = world_size // (tensor_parallel_degree * pipeline_parallel_degree * sharding_parallel_degree). So, to stay consistent with the previous logic, this check now uses dataset_world_size (data_parallel_degree * sharding_parallel_degree).
Besides, as I understand it, enabling sharding effectively adds that much data parallelism, so dataset_world_size should be used here as well.
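A small numeric illustration of the relationship described above. The formulas are restated from this comment; the concrete values are assumptions matching the PP4 / sharding-degree-2 test setup:

```python
world_size = 8
tensor_parallel_degree = 1
pipeline_parallel_degree = 4
sharding_parallel_degree = 2

# After this PR, sharding is factored out of data_parallel_degree ...
data_parallel_degree = world_size // (
    tensor_parallel_degree * pipeline_parallel_degree * sharding_parallel_degree
)                                                    # -> 1

# ... so the dataset-splitting width is dp * sharding, matching the pre-PR
# behavior where data_parallel_degree alone was world_size // (tp * pp) = 2.
dataset_world_size = data_parallel_degree * sharding_parallel_degree
print(data_parallel_degree, dataset_world_size)      # -> 1 2
```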
LGTM
LGTM
PR types
New features
PR changes
Others
Description
Adapt the sharding strategy for the auto-parallel unified dynamic/static graph (enable_auto_parallel) path.


Testing: on an A100-40G machine, with a Llama2 PP4-VPP2-MP1-Sharding_degree2 configuration, verified accuracy and performance under sharding stage 1, 2, and 3 (with num_hidden_layers reduced to 8).
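For context, a hedged sketch of how a sharding stage is typically turned on for the auto-parallel strategy object seen earlier in this PR; the config fields follow Paddle's auto-parallel sharding config as I understand it, and the values are illustrative, not the exact settings used here:

```python
import paddle.distributed.fleet as fleet

# Assumed sharding fields on the auto-parallel Strategy; exact wiring in the
# trainer may differ from this sketch.
strategy = fleet.auto.Strategy()
strategy.sharding.enable = True
strategy.sharding.stage = 1      # stages 1, 2 and 3 were validated in this PR
strategy.sharding.degree = 2     # Sharding_degree2 from the test setup
```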