When I try to fine-tune stageB with some additional parameters, the process gets stuck after a certain number of iterations. I am also unsure how to interpret the error messages. Please refer to the error information below.
Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 7] NCCL watchdog thread terminated with exception: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=17013092, OpType=_REDUCE_SCATTER_BASE, NumelIn=83200, NumelOut=10400, Timeout(ms)=1800000) ran for 1800931 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 7] NCCL watchdog thread terminated with exception: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=17013092, OpType=_REDUCE_SCATTER_BASE, NumelIn=83200, NumelOut=10400, Timeout(ms)=1800000) ran for 1800931 milliseconds before timing out.
e01-cn-wwo3eteze1j:26:205 [0] NCCL INFO comm 0x2dc99930 rank 0 nranks 8 cudaDev 0 busId 8000 - Abort COMPLETE
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=17013092, OpType=_REDUCE_SCATTER_BASE, NumelIn=83200, NumelOut=10400, Timeout(ms)=1800000) ran for 1800477 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=17013092, OpType=_REDUCE_SCATTER_BASE, NumelIn=83200, NumelOut=10400, Timeout(ms)=1800000) ran for 1800477 milliseconds before timing out.
e01-cn-wwo3eteze1j:32:204 [0] NCCL INFO comm 0x2eca4930 rank 6 nranks 8 cudaDev 6 busId 1a3000 - Abort COMPLETE
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 6] NCCL watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=17013092, OpType=_REDUCE_SCATTER_BASE, NumelIn=83200, NumelOut=10400, Timeout(ms)=1800000) ran for 1800483 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 6] NCCL watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=17013092, OpType=_REDUCE_SCATTER_BASE, NumelIn=83200, NumelOut=10400, Timeout(ms)=1800000) ran for 1800483 milliseconds before timing out.
e01-cn-wwo3eteze1j:27:213 [0] NCCL INFO comm 0x2e9dad70 rank 1 nranks 8 cudaDev 1 busId 7e000 - Abort COMPLETE
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=17013092, OpType=_REDUCE_SCATTER_BASE, NumelIn=83200, NumelOut=10400, Timeout(ms)=1800000) ran for 1800496 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=17013092, OpType=_REDUCE_SCATTER_BASE, NumelIn=83200, NumelOut=10400, Timeout(ms)=1800000) ran for 1800496 milliseconds before timing out.
e01-cn-wwo3eteze1j:28:228 [0] NCCL INFO comm 0x8ca8c770 rank 2 nranks 8 cudaDev 2 busId a2000 - Abort COMPLETE
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 2] NCCL watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=17013092, OpType=_REDUCE_SCATTER_BASE, NumelIn=83200, NumelOut=10400, Timeout(ms)=1800000) ran for 1800483 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 2] NCCL watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=17013092, OpType=_REDUCE_SCATTER_BASE, NumelIn=83200, NumelOut=10400, Timeout(ms)=1800000) ran for 1800483 milliseconds before timing out.
e01-cn-wwo3eteze1j:29:195 [0] NCCL INFO comm 0x39b401e0 rank 3 nranks 8 cudaDev 3 busId c6000 - Abort COMPLETE
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 3] NCCL watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=17013092, OpType=_REDUCE_SCATTER_BASE, NumelIn=83200, NumelOut=10400, Timeout(ms)=1800000) ran for 1800479 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 3] NCCL watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=17013092, OpType=_REDUCE_SCATTER_BASE, NumelIn=83200, NumelOut=10400, Timeout(ms)=1800000) ran for 1800479 milliseconds before timing out.
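To make the log easier to read, here is a minimal sketch that extracts the watchdog timeout entries and lists which ranks did not report one. It is only a reading aid, not a fix; the function name `summarize_timeouts` and the regular expression are my own assumptions, written against the message format shown in the log, and are not part of PyTorch or NCCL.

```python
import re

# Matches the watchdog timeout message as it appears in the pasted log.
# This pattern is an assumption based on that log, not an official format.
WATCHDOG = re.compile(
    r"\[Rank (?P<rank>\d+)\] Watchdog caught collective operation timeout: "
    r"WorkNCCL\(SeqNum=(?P<seq>\d+), OpType=(?P<op>\w+), .*?"
    r"Timeout\(ms\)=(?P<timeout>\d+)\) ran for (?P<elapsed>\d+) milliseconds"
)


def summarize_timeouts(log_text, world_size):
    """Return (details per rank that timed out, ranks with no timeout entry)."""
    seen = {}
    for m in WATCHDOG.finditer(log_text):
        # Each message is printed twice per rank; the dict keeps one copy.
        seen[int(m["rank"])] = {
            "seq": int(m["seq"]),
            "op": m["op"],
            "timeout_ms": int(m["timeout"]),
            "elapsed_ms": int(m["elapsed"]),
        }
    missing = [r for r in range(world_size) if r not in seen]
    return seen, missing


# One line copied from the log, used here as sample input.
sample = (
    "[E ProcessGroupNCCL.cpp:916] [Rank 7] NCCL watchdog thread terminated "
    "with exception: [Rank 7] Watchdog caught collective operation timeout: "
    "WorkNCCL(SeqNum=17013092, OpType=_REDUCE_SCATTER_BASE, NumelIn=83200, "
    "NumelOut=10400, Timeout(ms)=1800000) ran for 1800931 milliseconds "
    "before timing out."
)
seen, missing = summarize_timeouts(sample, world_size=8)
print(seen[7], missing)
```

Applied to the full log, ranks 0, 1, 2, 3, 6 and 7 all report the same `SeqNum=17013092` `_REDUCE_SCATTER_BASE` collective exceeding the 1800000 ms (30 minute) timeout, while ranks 4 and 5 have no entry in this excerpt. That may simply be because the paste is truncated, but it could be worth checking what those two ranks were doing when the others were waiting.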