kopyl

Untitled

Sep 25th, 2024
[rank1]:[E925 12:18:45.826862671 ProcessGroupNCCL.cpp:607] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1237, OpType=_ALLGATHER_BASE, NumelIn=70795904, NumelOut=141591808, Timeout(ms)=3600000) ran for 3600008 milliseconds before timing out.
[rank1]:[E925 12:18:45.864918762 ProcessGroupNCCL.cpp:670] [Rank 1] Work WorkNCCL(SeqNum=1237, OpType=_ALLGATHER_BASE, NumelIn=70795904, NumelOut=141591808, Timeout(ms)=3600000) timed out in blocking wait (TORCH_NCCL_BLOCKING_WAIT=1).
[rank0]:[E925 12:18:45.008346520 ProcessGroupNCCL.cpp:607] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1191, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=3600000) ran for 3600008 milliseconds before timing out.
[rank0]:[E925 12:18:45.008562142 ProcessGroupNCCL.cpp:670] [Rank 0] Work WorkNCCL(SeqNum=1191, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=3600000) timed out in blocking wait (TORCH_NCCL_BLOCKING_WAIT=1).
x2-h100:3217:4573 [0] NCCL INFO [Service thread] Connection closed by localRank 0
x2-h100:3218:4571 [1] NCCL INFO [Service thread] Connection closed by localRank 1
x2-h100:3218:5134 [1] NCCL INFO comm 0x2f5704c0 rank 1 nranks 2 cudaDev 1 busId 200000 - Abort COMPLETE
[rank1]:[E925 12:18:45.422686935 ProcessGroupNCCL.cpp:621] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E925 12:18:45.422707275 ProcessGroupNCCL.cpp:627] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E925 12:18:45.432676449 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 1237, last enqueued NCCL work: 1237, last completed NCCL work: 1236.
[rank1]:[E925 12:18:45.432702799 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 1] Timeout at NCCL work: 1237, last enqueued NCCL work: 1237, last completed NCCL work: 1236.
[rank1]:[E925 12:18:45.432711599 ProcessGroupNCCL.cpp:621] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
x2-h100:3217:3217 [0] NCCL INFO comm 0x27143e10 rank 0 nranks 2 cudaDev 0 busId 100000 - Abort COMPLETE
[rank0]:[E925 12:18:45.747463316 ProcessGroupNCCL.cpp:621] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E925 12:18:45.747476796 ProcessGroupNCCL.cpp:627] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E925 12:18:45.748780190 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 0] Exception (either an error or timeout) detected by watchdog at work: 1191, last enqueued NCCL work: 1191, last completed NCCL work: 1190.
[rank0]:[E925 12:18:45.748807500 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 0] Timeout at NCCL work: 1191, last enqueued NCCL work: 1191, last completed NCCL work: 1190.
[rank0]:[E925 12:18:45.748820300 ProcessGroupNCCL.cpp:621] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]: Traceback (most recent call last):
[rank1]:   File "examples/dreambooth/train_dreambooth_flux.py", line 1795, in <module>
[rank1]:     main(args)
[rank1]:   File "examples/dreambooth/train_dreambooth_flux.py", line 1643, in main
[rank1]:     accelerator.backward(loss)
[rank1]:   File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 2196, in backward
[rank1]:     loss.backward(**kwargs)
[rank1]:   File "/usr/local/lib/python3.8/dist-packages/torch/_tensor.py", line 521, in backward
[rank1]:     torch.autograd.backward(
[rank1]:   File "/usr/local/lib/python3.8/dist-packages/torch/autograd/__init__.py", line 289, in backward
[rank1]:     _engine_run_backward(
[rank1]:   File "/usr/local/lib/python3.8/dist-packages/torch/autograd/graph.py", line 769, in _engine_run_backward
[rank1]:     return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank1]:   File "/usr/local/lib/python3.8/dist-packages/torch/utils/checkpoint.py", line 1116, in unpack_hook
[rank1]:     frame.recompute_fn(*args)
[rank1]:   File "/usr/local/lib/python3.8/dist-packages/torch/utils/checkpoint.py", line 1400, in recompute_fn
[rank1]:     fn(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.8/dist-packages/accelerate/utils/operations.py", line 820, in forward
[rank1]:     return model_forward(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.8/dist-packages/accelerate/utils/operations.py", line 808, in __call__
[rank1]:     return convert_to_fp32(self.model_forward(*args, **kwargs))
[rank1]:   File "/usr/local/lib/python3.8/dist-packages/torch/amp/autocast_mode.py", line 43, in decorate_autocast
[rank1]:     return func(*args, **kwargs)
[rank1]:   File "/diffusers/src/diffusers/models/transformers/transformer_flux.py", line 532, in forward
[rank1]:     hidden_states = block(
[rank1]:   File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 849, in forward
[rank1]:     args, kwargs = _pre_forward(
[rank1]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_runtime_utils.py", line 381, in _pre_forward
[rank1]:     unshard_fn(state, handle)
[rank1]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_runtime_utils.py", line 416, in _pre_forward_unshard
[rank1]:     _unshard(state, handle, state._unshard_stream, state._pre_unshard_stream)
[rank1]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_runtime_utils.py", line 300, in _unshard
[rank1]:     handle.unshard()
[rank1]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_flat_param.py", line 1311, in unshard
[rank1]:     padded_unsharded_flat_param = self._all_gather_flat_param(unsharded_flat_param)
[rank1]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_flat_param.py", line 1402, in _all_gather_flat_param
[rank1]:     dist.all_gather_into_tensor(
[rank1]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
[rank1]:     return func(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 3205, in all_gather_into_tensor
[rank1]:     work.wait()
[rank1]: torch.distributed.DistBackendError: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1237, OpType=_ALLGATHER_BASE, NumelIn=70795904, NumelOut=141591808, Timeout(ms)=3600000) ran for 3600008 milliseconds before timing out.
[rank0]: Traceback (most recent call last):
[rank0]:   File "examples/dreambooth/train_dreambooth_flux.py", line 1795, in <module>
[rank0]:     main(args)
[rank0]:   File "examples/dreambooth/train_dreambooth_flux.py", line 1684, in main
[rank0]:     accelerator.save_state(save_path)
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 2991, in save_state
[rank0]:     save_fsdp_model(self.state.fsdp_plugin, self, model, output_dir, i)
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/accelerate/utils/fsdp_utils.py", line 107, in save_fsdp_model
[rank0]:     dist_cp.save_state_dict(
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/typing_extensions.py", line 2853, in wrapper
[rank0]:     return arg(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/checkpoint/state_dict_saver.py", line 47, in save_state_dict
[rank0]:     return _save_state_dict(
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/checkpoint/state_dict_saver.py", line 316, in _save_state_dict
[rank0]:     central_plan: SavePlan = distW.reduce_scatter("plan", local_step, global_step)
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/checkpoint/utils.py", line 169, in reduce_scatter
[rank0]:     all_data = self.gather_object(local_data)
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/checkpoint/utils.py", line 108, in gather_object
[rank0]:     dist.gather_object(
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 2608, in gather_object
[rank0]:     all_gather(object_size_list, local_size, group=group)
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 3113, in all_gather
[rank0]:     work.wait()
[rank0]: torch.distributed.DistBackendError: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1191, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=3600000) ran for 3600008 milliseconds before timing out.
Steps: 50%|███████ | 1/2 [1:01:47<1:01:47, 3707.41s/it, loss=0.327, lr=1]
W0925 12:19:09.102940 139813630109504 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 3217 closing signal SIGTERM
E0925 12:19:19.095820 139813630109504 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 1 (pid: 3218) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 1161, in launch_command
    multi_gpu_launcher(args)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 799, in multi_gpu_launcher
    distrib_run.run(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
examples/dreambooth/train_dreambooth_flux.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-09-25_12:19:09
  host      : x2-h100.internal.cloudapp.net
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 3218)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
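
Note (added below the log, not part of the original output): rank 0 timed out inside accelerator.save_state (the small ALLGATHER with NumelIn=1 comes from dist.gather_object during checkpoint-save planning), while rank 1 timed out inside an FSDP _ALLGATHER_BASE issued from accelerator.backward, and the two ranks were at different collective sequence numbers (1191 vs 1237). That pattern usually means the ranks diverged on whether to checkpoint at that step, so each side waited on a collective the other never launched until the one-hour NCCL timeout (Timeout(ms)=3600000) expired. The rank 0 traceback shows that under FSDP, accelerator.save_state itself runs collectives, so every rank has to reach it; one common trigger for this kind of hang is gating the save on accelerator.is_main_process, which is harmless with plain DDP but deadlocks with FSDP. Below is a minimal sketch of a rank-consistent checkpoint guard; the loop structure, checkpointing_steps, and output_dir are illustrative assumptions, not taken from the log.

    # Sketch only: a checkpoint condition that every rank evaluates identically.
    # Run with `accelerate launch this_file.py`; the values below are placeholders.
    import os
    from accelerate import Accelerator

    accelerator = Accelerator()
    checkpointing_steps = 500            # assumed value, not from the log
    output_dir = "dreambooth-flux-out"   # assumed path, not from the log

    for global_step in range(1, 2001):
        # ... forward pass, accelerator.backward(loss), optimizer.step() go here ...

        if global_step % checkpointing_steps == 0:
            # Every rank must call save_state under FSDP: it issues collectives
            # (gather_object / all_gather) that would otherwise leave the other
            # rank stuck in its next all-gather, exactly as in the log above.
            save_path = os.path.join(output_dir, f"checkpoint-{global_step}")
            accelerator.save_state(save_path)
            accelerator.wait_for_everyone()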
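
Separately, if a step can legitimately take longer than the one-hour limit seen above, the process-group timeout can be raised when the Accelerator is constructed. A minimal sketch, assuming the script is started with accelerate launch; the two-hour value is an arbitrary example, not a recommendation from the log:

    # Sketch: raise the NCCL process-group timeout (example value only).
    from datetime import timedelta
    from accelerate import Accelerator
    from accelerate.utils import InitProcessGroupKwargs

    pg_kwargs = InitProcessGroupKwargs(timeout=timedelta(hours=2))
    accelerator = Accelerator(kwargs_handlers=[pg_kwargs])

A longer timeout only delays the failure if the ranks have genuinely diverged; the rank-consistent checkpoint guard sketched earlier addresses the divergence itself.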