Closed
Description
With num_nodes = 32 and with num_nodes = 64, training crashes at the very first step, and the following error is logged:
2023-08-25 11:06:39.330561: E external/xla/xla/service/gpu/nccl_utils.cc:191] Aborting communicator: 0x5555a108ff90 due to async NCCL error: remote process exited or there was a network error
Only 2-3 of the processes segfault with the above error; the rest abort.