- [rank1]:[E925 12:18:45.826862671 ProcessGroupNCCL.cpp:607] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1237, OpType=_ALLGATHER_BASE, NumelIn=70795904, NumelOut=141591808, Timeout(ms)=3600000) ran for 3600008 milliseconds before timing out.
- [rank1]:[E925 12:18:45.864918762 ProcessGroupNCCL.cpp:670] [Rank 1] Work WorkNCCL(SeqNum=1237, OpType=_ALLGATHER_BASE, NumelIn=70795904, NumelOut=141591808, Timeout(ms)=3600000) timed out in blocking wait (TORCH_NCCL_BLOCKING_WAIT=1).
- [rank0]:[E925 12:18:45.008346520 ProcessGroupNCCL.cpp:607] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1191, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=3600000) ran for 3600008 milliseconds before timing out.
- [rank0]:[E925 12:18:45.008562142 ProcessGroupNCCL.cpp:670] [Rank 0] Work WorkNCCL(SeqNum=1191, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=3600000) timed out in blocking wait (TORCH_NCCL_BLOCKING_WAIT=1).
- x2-h100:3217:4573 [0] NCCL INFO [Service thread] Connection closed by localRank 0
- x2-h100:3218:4571 [1] NCCL INFO [Service thread] Connection closed by localRank 1
- x2-h100:3218:5134 [1] NCCL INFO comm 0x2f5704c0 rank 1 nranks 2 cudaDev 1 busId 200000 - Abort COMPLETE
- [rank1]:[E925 12:18:45.422686935 ProcessGroupNCCL.cpp:621] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
- [rank1]:[E925 12:18:45.422707275 ProcessGroupNCCL.cpp:627] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
- [rank1]:[E925 12:18:45.432676449 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 1237, last enqueued NCCL work: 1237, last completed NCCL work: 1236.
- [rank1]:[E925 12:18:45.432702799 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 1] Timeout at NCCL work: 1237, last enqueued NCCL work: 1237, last completed NCCL work: 1236.
- [rank1]:[E925 12:18:45.432711599 ProcessGroupNCCL.cpp:621] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
- x2-h100:3217:3217 [0] NCCL INFO comm 0x27143e10 rank 0 nranks 2 cudaDev 0 busId 100000 - Abort COMPLETE
- [rank0]:[E925 12:18:45.747463316 ProcessGroupNCCL.cpp:621] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
- [rank0]:[E925 12:18:45.747476796 ProcessGroupNCCL.cpp:627] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
- [rank0]:[E925 12:18:45.748780190 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 0] Exception (either an error or timeout) detected by watchdog at work: 1191, last enqueued NCCL work: 1191, last completed NCCL work: 1190.
- [rank0]:[E925 12:18:45.748807500 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 0] Timeout at NCCL work: 1191, last enqueued NCCL work: 1191, last completed NCCL work: 1190.
- [rank0]:[E925 12:18:45.748820300 ProcessGroupNCCL.cpp:621] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
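
Note on the watchdog block above: the two ranks are not stuck on the same collective. Rank 1 is blocked on _ALLGATHER_BASE with SeqNum=1237 and a ~70M-element payload, while rank 0 is blocked on ALLGATHER with SeqNum=1191 and a 1-element payload, and both sat in a blocking wait (TORCH_NCCL_BLOCKING_WAIT=1) for the full Timeout(ms)=3600000 window. That pattern points to the ranks desynchronizing rather than to a slow interconnect. If more headroom is genuinely needed (for example a very long first step), the NCCL timeout can be raised when the Accelerator is constructed; this is a minimal sketch using accelerate's InitProcessGroupKwargs, and it only buys time, it does not repair a collective mismatch:

from datetime import timedelta
from accelerate import Accelerator, InitProcessGroupKwargs

# Raise the NCCL collective timeout from the 1 h seen in the log to 2 h.
pg_kwargs = InitProcessGroupKwargs(timeout=timedelta(hours=2))
accelerator = Accelerator(kwargs_handlers=[pg_kwargs])
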
- [rank1]: Traceback (most recent call last):
- [rank1]: File "examples/dreambooth/train_dreambooth_flux.py", line 1795, in <module>
- [rank1]: main(args)
- [rank1]: File "examples/dreambooth/train_dreambooth_flux.py", line 1643, in main
- [rank1]: accelerator.backward(loss)
- [rank1]: File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 2196, in backward
- [rank1]: loss.backward(**kwargs)
- [rank1]: File "/usr/local/lib/python3.8/dist-packages/torch/_tensor.py", line 521, in backward
- [rank1]: torch.autograd.backward(
- [rank1]: File "/usr/local/lib/python3.8/dist-packages/torch/autograd/__init__.py", line 289, in backward
- [rank1]: _engine_run_backward(
- [rank1]: File "/usr/local/lib/python3.8/dist-packages/torch/autograd/graph.py", line 769, in _engine_run_backward
- [rank1]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
- [rank1]: File "/usr/local/lib/python3.8/dist-packages/torch/utils/checkpoint.py", line 1116, in unpack_hook
- [rank1]: frame.recompute_fn(*args)
- [rank1]: File "/usr/local/lib/python3.8/dist-packages/torch/utils/checkpoint.py", line 1400, in recompute_fn
- [rank1]: fn(*args, **kwargs)
- [rank1]: File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
- [rank1]: return self._call_impl(*args, **kwargs)
- [rank1]: File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
- [rank1]: return forward_call(*args, **kwargs)
- [rank1]: File "/usr/local/lib/python3.8/dist-packages/accelerate/utils/operations.py", line 820, in forward
- [rank1]: return model_forward(*args, **kwargs)
- [rank1]: File "/usr/local/lib/python3.8/dist-packages/accelerate/utils/operations.py", line 808, in __call__
- [rank1]: return convert_to_fp32(self.model_forward(*args, **kwargs))
- [rank1]: File "/usr/local/lib/python3.8/dist-packages/torch/amp/autocast_mode.py", line 43, in decorate_autocast
- [rank1]: return func(*args, **kwargs)
- [rank1]: File "/diffusers/src/diffusers/models/transformers/transformer_flux.py", line 532, in forward
- [rank1]: hidden_states = block(
- [rank1]: File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
- [rank1]: return self._call_impl(*args, **kwargs)
- [rank1]: File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
- [rank1]: return forward_call(*args, **kwargs)
- [rank1]: File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 849, in forward
- [rank1]: args, kwargs = _pre_forward(
- [rank1]: File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_runtime_utils.py", line 381, in _pre_forward
- [rank1]: unshard_fn(state, handle)
- [rank1]: File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_runtime_utils.py", line 416, in _pre_forward_unshard
- [rank1]: _unshard(state, handle, state._unshard_stream, state._pre_unshard_stream)
- [rank1]: File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_runtime_utils.py", line 300, in _unshard
- [rank1]: handle.unshard()
- [rank1]: File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_flat_param.py", line 1311, in unshard
- [rank1]: padded_unsharded_flat_param = self._all_gather_flat_param(unsharded_flat_param)
- [rank1]: File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_flat_param.py", line 1402, in _all_gather_flat_param
- [rank1]: dist.all_gather_into_tensor(
- [rank1]: File "/usr/local/lib/python3.8/dist-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
- [rank1]: return func(*args, **kwargs)
- [rank1]: File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 3205, in all_gather_into_tensor
- [rank1]: work.wait()
- [rank1]: torch.distributed.DistBackendError: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1237, OpType=_ALLGATHER_BASE, NumelIn=70795904, NumelOut=141591808, Timeout(ms)=3600000) ran for 3600008 milliseconds before timing out.
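
Rank 1's traceback shows where it was waiting: the activation-checkpoint recompute inside accelerator.backward(loss) re-runs a transformer block's forward, FSDP's pre-forward unshard tries to all-gather that block's flat parameter via dist.all_gather_into_tensor, and the call never completes because the peer rank is not posting the matching collective. To get more evidence on the next run, PyTorch's distributed debug switches can be set at the very top of the training script, before the Accelerator / process group is created. The exact variable names depend on the PyTorch version (older builds use NCCL_DESYNC_DEBUG instead of TORCH_NCCL_DESYNC_DEBUG), so treat this as an assumption-level sketch:

import os

# Set before any process group is created, then relaunch the job.
os.environ.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")   # cross-rank consistency checks on collectives
os.environ.setdefault("TORCH_NCCL_DESYNC_DEBUG", "1")        # report which rank fell behind on timeout
os.environ.setdefault("NCCL_DEBUG", "INFO")                  # NCCL-side connection/communicator logging
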
- [rank0]: Traceback (most recent call last):
- [rank0]: File "examples/dreambooth/train_dreambooth_flux.py", line 1795, in <module>
- [rank0]: main(args)
- [rank0]: File "examples/dreambooth/train_dreambooth_flux.py", line 1684, in main
- [rank0]: accelerator.save_state(save_path)
- [rank0]: File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 2991, in save_state
- [rank0]: save_fsdp_model(self.state.fsdp_plugin, self, model, output_dir, i)
- [rank0]: File "/usr/local/lib/python3.8/dist-packages/accelerate/utils/fsdp_utils.py", line 107, in save_fsdp_model
- [rank0]: dist_cp.save_state_dict(
- [rank0]: File "/usr/local/lib/python3.8/dist-packages/typing_extensions.py", line 2853, in wrapper
- [rank0]: return arg(*args, **kwargs)
- [rank0]: File "/usr/local/lib/python3.8/dist-packages/torch/distributed/checkpoint/state_dict_saver.py", line 47, in save_state_dict
- [rank0]: return _save_state_dict(
- [rank0]: File "/usr/local/lib/python3.8/dist-packages/torch/distributed/checkpoint/state_dict_saver.py", line 316, in _save_state_dict
- [rank0]: central_plan: SavePlan = distW.reduce_scatter("plan", local_step, global_step)
- [rank0]: File "/usr/local/lib/python3.8/dist-packages/torch/distributed/checkpoint/utils.py", line 169, in reduce_scatter
- [rank0]: all_data = self.gather_object(local_data)
- [rank0]: File "/usr/local/lib/python3.8/dist-packages/torch/distributed/checkpoint/utils.py", line 108, in gather_object
- [rank0]: dist.gather_object(
- [rank0]: File "/usr/local/lib/python3.8/dist-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
- [rank0]: return func(*args, **kwargs)
- [rank0]: File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 2608, in gather_object
- [rank0]: all_gather(object_size_list, local_size, group=group)
- [rank0]: File "/usr/local/lib/python3.8/dist-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
- [rank0]: return func(*args, **kwargs)
- [rank0]: File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 3113, in all_gather
- [rank0]: work.wait()
- [rank0]: torch.distributed.DistBackendError: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1191, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=3600000) ran for 3600008 milliseconds before timing out.
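
Read the two tracebacks together and the deadlock is explicit: rank 0 has finished its step and is inside accelerator.save_state(save_path), whose checkpoint planner issues dist.gather_object / all_gather (SeqNum=1191, the 1-element ALLGATHER), while rank 1 is still inside accelerator.backward(loss) waiting on an FSDP parameter all-gather (SeqNum=1237). Each rank blocks on a collective the other never posts, so both sit until the 1 h watchdog fires. With FSDP, save_state is itself a collective and must be reached by every rank at the same point; a common cause of exactly this hang is wrapping it in a rank-dependent branch such as a main-process guard. A minimal sketch of the safe pattern, with illustrative variable names (global_step, checkpointing_steps, save_path are not taken from the script):

# Checkpointing with FSDP: every rank must call save_state together.
if global_step % checkpointing_steps == 0:
    accelerator.wait_for_everyone()      # sync before the collective checkpoint
    accelerator.save_state(save_path)    # called on ALL ranks, not only the main process
    # Only rank-local side effects (logging, uploading) belong behind
    # an `if accelerator.is_main_process:` guard.
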
- Steps: 50%|███████ | 1/2 [1:01:47<1:01:47, 3707.41s/it, loss=0.327, lr=1]
- W0925 12:19:09.102940 139813630109504 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 3217 closing signal SIGTERM
- E0925 12:19:19.095820 139813630109504 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 1 (pid: 3218) of binary: /usr/bin/python
- Traceback (most recent call last):
- File "/usr/local/bin/accelerate", line 8, in <module>
- sys.exit(main())
- File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/accelerate_cli.py", line 48, in main
- args.func(args)
- File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 1161, in launch_command
- multi_gpu_launcher(args)
- File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 799, in multi_gpu_launcher
- distrib_run.run(args)
- File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 892, in run
- elastic_launch(
- File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 133, in __call__
- return launch_agent(self._config, self._entrypoint, list(args))
- File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
- raise ChildFailedError(
- torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
- ============================================================
- examples/dreambooth/train_dreambooth_flux.py FAILED
- ------------------------------------------------------------
- Failures:
- <NO_OTHER_FAILURES>
- ------------------------------------------------------------
- Root Cause (first observed failure):
- [0]:
- time : 2024-09-25_12:19:09
- host : x2-h100.internal.cloudapp.net
- rank : 1 (local_rank: 1)
- exitcode : 1 (pid: 3218)
- error_file: <N/A>
- traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
- ============================================================
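
The elastic summary reports error_file: <N/A> and points at the traceback-enabling docs because the failing child never wrote a structured error file. Decorating the script's entry point with torchrun's record helper makes the next failure show up in this summary instead of only in the interleaved rank logs. A minimal sketch, assuming the script keeps its existing main(args) entry point (parse_args here stands in for whatever argument parsing the script already does):

from torch.distributed.elastic.multiprocessing.errors import record

@record
def main(args):
    ...  # existing training loop

if __name__ == "__main__":
    args = parse_args()  # placeholder for the script's own argument parsing
    main(args)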