kopyl

Untitled

Sep 25th, 2024
[rank1]:[E925 12:18:45.826862671 ProcessGroupNCCL.cpp:607] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1237, OpType=_ALLGATHER_BASE, NumelIn=70795904, NumelOut=141591808, Timeout(ms)=3600000) ran for 3600008 milliseconds before timing out.
[rank1]:[E925 12:18:45.864918762 ProcessGroupNCCL.cpp:670] [Rank 1] Work WorkNCCL(SeqNum=1237, OpType=_ALLGATHER_BASE, NumelIn=70795904, NumelOut=141591808, Timeout(ms)=3600000) timed out in blocking wait (TORCH_NCCL_BLOCKING_WAIT=1).
[rank0]:[E925 12:18:45.008346520 ProcessGroupNCCL.cpp:607] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1191, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=3600000) ran for 3600008 milliseconds before timing out.
[rank0]:[E925 12:18:45.008562142 ProcessGroupNCCL.cpp:670] [Rank 0] Work WorkNCCL(SeqNum=1191, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=3600000) timed out in blocking wait (TORCH_NCCL_BLOCKING_WAIT=1).
x2-h100:3217:4573 [0] NCCL INFO [Service thread] Connection closed by localRank 0
x2-h100:3218:4571 [1] NCCL INFO [Service thread] Connection closed by localRank 1
x2-h100:3218:5134 [1] NCCL INFO comm 0x2f5704c0 rank 1 nranks 2 cudaDev 1 busId 200000 - Abort COMPLETE
[rank1]:[E925 12:18:45.422686935 ProcessGroupNCCL.cpp:621] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E925 12:18:45.422707275 ProcessGroupNCCL.cpp:627] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E925 12:18:45.432676449 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 1237, last enqueued NCCL work: 1237, last completed NCCL work: 1236.
[rank1]:[E925 12:18:45.432702799 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 1] Timeout at NCCL work: 1237, last enqueued NCCL work: 1237, last completed NCCL work: 1236.
[rank1]:[E925 12:18:45.432711599 ProcessGroupNCCL.cpp:621] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
x2-h100:3217:3217 [0] NCCL INFO comm 0x27143e10 rank 0 nranks 2 cudaDev 0 busId 100000 - Abort COMPLETE
[rank0]:[E925 12:18:45.747463316 ProcessGroupNCCL.cpp:621] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E925 12:18:45.747476796 ProcessGroupNCCL.cpp:627] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E925 12:18:45.748780190 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 0] Exception (either an error or timeout) detected by watchdog at work: 1191, last enqueued NCCL work: 1191, last completed NCCL work: 1190.
[rank0]:[E925 12:18:45.748807500 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 0] Timeout at NCCL work: 1191, last enqueued NCCL work: 1191, last completed NCCL work: 1190.
[rank0]:[E925 12:18:45.748820300 ProcessGroupNCCL.cpp:621] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]: Traceback (most recent call last):
[rank1]:   File "examples/dreambooth/train_dreambooth_flux.py", line 1795, in <module>
[rank1]:     main(args)
[rank1]:   File "examples/dreambooth/train_dreambooth_flux.py", line 1643, in main
[rank1]:     accelerator.backward(loss)
[rank1]:   File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 2196, in backward
[rank1]:     loss.backward(**kwargs)
[rank1]:   File "/usr/local/lib/python3.8/dist-packages/torch/_tensor.py", line 521, in backward
[rank1]:     torch.autograd.backward(
[rank1]:   File "/usr/local/lib/python3.8/dist-packages/torch/autograd/__init__.py", line 289, in backward
[rank1]:     _engine_run_backward(
[rank1]:   File "/usr/local/lib/python3.8/dist-packages/torch/autograd/graph.py", line 769, in _engine_run_backward
[rank1]:     return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank1]:   File "/usr/local/lib/python3.8/dist-packages/torch/utils/checkpoint.py", line 1116, in unpack_hook
[rank1]:     frame.recompute_fn(*args)
[rank1]:   File "/usr/local/lib/python3.8/dist-packages/torch/utils/checkpoint.py", line 1400, in recompute_fn
[rank1]:     fn(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.8/dist-packages/accelerate/utils/operations.py", line 820, in forward
[rank1]:     return model_forward(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.8/dist-packages/accelerate/utils/operations.py", line 808, in __call__
[rank1]:     return convert_to_fp32(self.model_forward(*args, **kwargs))
[rank1]:   File "/usr/local/lib/python3.8/dist-packages/torch/amp/autocast_mode.py", line 43, in decorate_autocast
[rank1]:     return func(*args, **kwargs)
[rank1]:   File "/diffusers/src/diffusers/models/transformers/transformer_flux.py", line 532, in forward
[rank1]:     hidden_states = block(
[rank1]:   File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 849, in forward
[rank1]:     args, kwargs = _pre_forward(
[rank1]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_runtime_utils.py", line 381, in _pre_forward
[rank1]:     unshard_fn(state, handle)
[rank1]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_runtime_utils.py", line 416, in _pre_forward_unshard
[rank1]:     _unshard(state, handle, state._unshard_stream, state._pre_unshard_stream)
[rank1]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_runtime_utils.py", line 300, in _unshard
[rank1]:     handle.unshard()
[rank1]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_flat_param.py", line 1311, in unshard
[rank1]:     padded_unsharded_flat_param = self._all_gather_flat_param(unsharded_flat_param)
[rank1]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_flat_param.py", line 1402, in _all_gather_flat_param
[rank1]:     dist.all_gather_into_tensor(
[rank1]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
[rank1]:     return func(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 3205, in all_gather_into_tensor
[rank1]:     work.wait()
[rank1]: torch.distributed.DistBackendError: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1237, OpType=_ALLGATHER_BASE, NumelIn=70795904, NumelOut=141591808, Timeout(ms)=3600000) ran for 3600008 milliseconds before timing out.
[rank0]: Traceback (most recent call last):
[rank0]:   File "examples/dreambooth/train_dreambooth_flux.py", line 1795, in <module>
[rank0]:     main(args)
[rank0]:   File "examples/dreambooth/train_dreambooth_flux.py", line 1684, in main
[rank0]:     accelerator.save_state(save_path)
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 2991, in save_state
[rank0]:     save_fsdp_model(self.state.fsdp_plugin, self, model, output_dir, i)
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/accelerate/utils/fsdp_utils.py", line 107, in save_fsdp_model
[rank0]:     dist_cp.save_state_dict(
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/typing_extensions.py", line 2853, in wrapper
[rank0]:     return arg(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/checkpoint/state_dict_saver.py", line 47, in save_state_dict
[rank0]:     return _save_state_dict(
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/checkpoint/state_dict_saver.py", line 316, in _save_state_dict
[rank0]:     central_plan: SavePlan = distW.reduce_scatter("plan", local_step, global_step)
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/checkpoint/utils.py", line 169, in reduce_scatter
[rank0]:     all_data = self.gather_object(local_data)
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/checkpoint/utils.py", line 108, in gather_object
[rank0]:     dist.gather_object(
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 2608, in gather_object
[rank0]:     all_gather(object_size_list, local_size, group=group)
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 3113, in all_gather
[rank0]:     work.wait()
[rank0]: torch.distributed.DistBackendError: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1191, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=3600000) ran for 3600008 milliseconds before timing out.
Steps: 50%|███████ | 1/2 [1:01:47<1:01:47, 3707.41s/it, loss=0.327, lr=1]
W0925 12:19:09.102940 139813630109504 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 3217 closing signal SIGTERM
E0925 12:19:19.095820 139813630109504 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 1 (pid: 3218) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 1161, in launch_command
    multi_gpu_launcher(args)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 799, in multi_gpu_launcher
    distrib_run.run(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
examples/dreambooth/train_dreambooth_flux.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-09-25_12:19:09
  host      : x2-h100.internal.cloudapp.net
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 3218)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
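
Note (added below the log, not part of the original output): rank 0 timed out inside accelerator.save_state (the small ALLGATHER with NumelIn=1 comes from dist.gather_object during checkpoint-save planning), while rank 1 timed out inside an FSDP _ALLGATHER_BASE issued from accelerator.backward, and the two ranks were at different collective sequence numbers (1191 vs 1237). That pattern usually means the ranks diverged on whether to checkpoint at that step, so each side waited on a collective the other never launched until the one-hour NCCL timeout (Timeout(ms)=3600000) expired. The rank 0 traceback shows that under FSDP, accelerator.save_state itself runs collectives, so every rank has to reach it; one common trigger for this kind of hang is gating the save on accelerator.is_main_process, which is harmless with plain DDP but deadlocks with FSDP. Below is a minimal sketch of a rank-consistent checkpoint guard; the loop structure, checkpointing_steps, and output_dir are illustrative assumptions, not taken from the log.

    # Sketch only: a checkpoint condition that every rank evaluates identically.
    # Run with `accelerate launch this_file.py`; the values below are placeholders.
    import os
    from accelerate import Accelerator

    accelerator = Accelerator()
    checkpointing_steps = 500            # assumed value, not from the log
    output_dir = "dreambooth-flux-out"   # assumed path, not from the log

    for global_step in range(1, 2001):
        # ... forward pass, accelerator.backward(loss), optimizer.step() go here ...

        if global_step % checkpointing_steps == 0:
            # Every rank must call save_state under FSDP: it issues collectives
            # (gather_object / all_gather) that would otherwise leave the other
            # rank stuck in its next all-gather, exactly as in the log above.
            save_path = os.path.join(output_dir, f"checkpoint-{global_step}")
            accelerator.save_state(save_path)
            accelerator.wait_for_everyone()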
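
Separately, if a step can legitimately take longer than the one-hour limit seen above, the process-group timeout can be raised when the Accelerator is constructed. A minimal sketch, assuming the script is started with accelerate launch; the two-hour value is an arbitrary example, not a recommendation from the log:

    # Sketch: raise the NCCL process-group timeout (example value only).
    from datetime import timedelta
    from accelerate import Accelerator
    from accelerate.utils import InitProcessGroupKwargs

    pg_kwargs = InitProcessGroupKwargs(timeout=timedelta(hours=2))
    accelerator = Accelerator(kwargs_handlers=[pg_kwargs])

A longer timeout only delays the failure if the ranks have genuinely diverged; the rank-consistent checkpoint guard sketched earlier addresses the divergence itself.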