Support sharding for auto_trainer #8164
Conversation
Thanks for your contribution!
Codecov Report
Attention: Patch coverage is …
Additional details and impacted files

@@            Coverage Diff             @@
##           develop    #8164      +/-   ##
===========================================
- Coverage    55.15%   55.15%    -0.01%
===========================================
  Files          601       601
  Lines        91764     91764
===========================================
- Hits         50614     50611        -3
- Misses       41150     41153        +3

☔ View full report in Codecov by Sentry.
…nto dev/support_sharding_auto_trainer
…nto dev/support_sharding_auto_trainer
if self.sharding_parallel_degree == -1:
    if len(self.sharding) > 0:
        self.sharding_parallel_degree = self.data_parallel_degree
        self.sharding_parallel_degree = world_size // (
            self.tensor_parallel_degree * self.sep_parallel_degree * self.pipeline_parallel_degree
        )
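For readers following the arithmetic, here is a small worked example of this degree computation. The concrete values are assumptions taken from the test topology in the PR description (PP4 / MP1 / sharding degree 2 on one 8-GPU node), not from the diff itself:

```python
# Worked example of sharding_parallel_degree = world_size // (tp * sep * pp).
world_size = 8                   # assumed: one A100 node with 8 GPUs
tensor_parallel_degree = 1       # MP1
sep_parallel_degree = 1          # sep parallelism not used
pipeline_parallel_degree = 4     # PP4

sharding_parallel_degree = world_size // (
    tensor_parallel_degree * sep_parallel_degree * pipeline_parallel_degree
)
print(sharding_parallel_degree)  # -> 2, i.e. Sharding_degree2
```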
sep_parallel_degree is not supported yet; raise an error if it is set in auto_parallel mode.
It can be fixed in the next PR.
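A minimal sketch of the kind of validation suggested above, assuming it would live in the auto-parallel branch of the training-args post-processing; the helper name and the exact fields it reads are hypothetical, not code from this PR:

```python
def _check_sep_parallel_not_set(args):
    # Hypothetical check: sep (sequence) parallelism is not supported on the
    # auto-parallel path yet, so fail fast instead of silently folding it into
    # the degree computation above.
    if getattr(args, "enable_auto_parallel", False) and getattr(args, "sep_parallel_degree", 1) > 1:
        raise NotImplementedError(
            "sep_parallel_degree > 1 is not supported when enable_auto_parallel is on; "
            "please leave sep_parallel_degree at its default."
        )
```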
LGTM
paddlenlp/trainer/trainer.py (outdated)
if (
    ShardingOption.SHARD_OP in self.args.sharding
    and not is_new_version_sharding_stage1_optimizer()
    and not self.args.enable_auto_parallel
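For readers of this thread, a self-contained restatement of what the guard above expresses; the function and argument names are placeholders, not code from the PR:

```python
def uses_legacy_stage1_optimizer_path(stage1_requested, new_stage1_optimizer, auto_parallel_enabled):
    # Placeholder restatement of the guard: the legacy stage-1 (SHARD_OP)
    # optimizer handling is only needed when stage-1 sharding is requested,
    # the installed Paddle still ships the old stage-1 optimizer, and the
    # unified dynamic/static auto-parallel path is off (that path configures
    # sharding through fleet.auto.Strategy instead, as seen later in this PR).
    return stage1_requested and not new_stage1_optimizer and not auto_parallel_enabled
```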
Shouldn't is_new_version_sharding_stage1_optimizer always be True on current versions of Paddle?
I'm not sure about that for now. This change targets the unified dynamic/static graph case, i.e. when enable_auto_parallel is turned on.
Did you run into a problem here during testing?
if ShardingOption.OFFLOAD in self.sharding:
    warnings.warn("`offload` is not supported NOW!")

strategy = fleet.auto.Strategy()
if self.data_parallel_degree > 1:
if self.dataset_world_size > 1:
Why? Are sharding and dp both treated as dp here?
Before this PR, data_parallel_degree = world_size // (tensor_parallel_degree * pipeline_parallel_degree). After this PR, data_parallel_degree = world_size // (tensor_parallel_degree * pipeline_parallel_degree * sharding_parallel_degree). So, to stay consistent with the previous logic, this check now uses dataset_world_size (data_parallel_degree * sharding_parallel_degree).
Besides, as I understand it, enabling sharding effectively adds that much data parallelism, so dataset_world_size should be used here as well.
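A small numeric illustration of the relationship described above. The formulas are restated from this comment; the concrete values are assumptions matching the PP4 / sharding-degree-2 test setup:

```python
world_size = 8
tensor_parallel_degree = 1
pipeline_parallel_degree = 4
sharding_parallel_degree = 2

# After this PR, sharding is factored out of data_parallel_degree ...
data_parallel_degree = world_size // (
    tensor_parallel_degree * pipeline_parallel_degree * sharding_parallel_degree
)                                                    # -> 1

# ... so the dataset-splitting width is dp * sharding, matching the pre-PR
# behavior where data_parallel_degree alone was world_size // (tp * pp) = 2.
dataset_world_size = data_parallel_degree * sharding_parallel_degree
print(data_parallel_degree, dataset_world_size)      # -> 1 2
```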
LGTM
LGTM
PR types
New features
PR changes
Others
Description
Adapt the sharding strategy for the auto-parallel unified dynamic/static graph (enable_auto_parallel) path.


Testing: on an A100-40G machine, with a Llama2 PP4-VPP2-MP1-Sharding_degree2 configuration, verified accuracy and performance under sharding stage 1, 2, and 3 (with num_hidden_layers reduced to 8).
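For context, a hedged sketch of how a sharding stage is typically turned on for the auto-parallel strategy object seen earlier in this PR; the config fields follow Paddle's auto-parallel sharding config as I understand it, and the values are illustrative, not the exact settings used here:

```python
import paddle.distributed.fleet as fleet

# Assumed sharding fields on the auto-parallel Strategy; exact wiring in the
# trainer may differ from this sketch.
strategy = fleet.auto.Strategy()
strategy.sharding.enable = True
strategy.sharding.stage = 1      # stages 1, 2 and 3 were validated in this PR
strategy.sharding.degree = 2     # Sharding_degree2 from the test setup
```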