kopyl

Untitled

Sep 25th, 2024
The following values were not passed to `accelerate launch` and had defaults used instead:
`--num_cpu_threads_per_process` was set to `40` to improve out-of-box performance when training on CPUs
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
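If the default is not wanted, the value can be passed explicitly on the command line instead of relying on the fallback, or `accelerate config` can be run once to persist a choice. A minimal sketch of the explicit form, with the thread count picked arbitrarily for illustration and the remaining arguments elided:

    accelerate launch --num_cpu_threads_per_process 8 examples/dreambooth/train_dreambooth_flux.py ...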
[W925 12:22:21.945906518 Utils.hpp:164] Warning: Environment variable NCCL_BLOCKING_WAIT is deprecated; use TORCH_NCCL_BLOCKING_WAIT instead (function operator())
[W925 12:22:21.945939778 Utils.hpp:135] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function operator())
[W925 12:22:21.963256156 Utils.hpp:164] Warning: Environment variable NCCL_BLOCKING_WAIT is deprecated; use TORCH_NCCL_BLOCKING_WAIT instead (function operator())
[W925 12:22:21.963280176 Utils.hpp:135] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function operator())
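These two warnings are harmless: the old `NCCL_BLOCKING_WAIT` / `NCCL_ASYNC_ERROR_HANDLING` names are still honored, they are just deprecated in favor of the `TORCH_`-prefixed variants. If the old names come from your own environment rather than from the launcher, renaming them silences the warning; a minimal sketch, assuming it runs at the very top of the training script before the `Accelerator` is created:

    import os

    # Hedged sketch: mirror the deprecated NCCL_* variables onto their TORCH_NCCL_*
    # replacements before torch.distributed initializes the NCCL process group.
    for old, new in [("NCCL_BLOCKING_WAIT", "TORCH_NCCL_BLOCKING_WAIT"),
                     ("NCCL_ASYNC_ERROR_HANDLING", "TORCH_NCCL_ASYNC_ERROR_HANDLING")]:
        if old in os.environ and new not in os.environ:
            os.environ[new] = os.environ.pop(old)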
09/25/2024 12:22:21 - INFO - __main__ - Distributed environment: FSDP Backend: nccl
Num processes: 2
Process index: 0
Local process index: 0
Device: cuda:0

Mixed precision type: bf16

You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
You are using a model of type t5 to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
09/25/2024 12:22:21 - INFO - __main__ - Distributed environment: FSDP Backend: nccl
Num processes: 2
Process index: 1
Local process index: 1
Device: cuda:1

Mixed precision type: bf16

Downloading shards: 100%|██████████████████████| 2/2 [00:00<00:00, 14899.84it/s]
Downloading shards: 100%|███████████████████████| 2/2 [00:00<00:00, 6615.62it/s]
Loading checkpoint shards: 100%|██████████████████| 2/2 [00:00<00:00, 7.74it/s]
Fetching 3 files: 100%|████████████████████████| 3/3 [00:00<00:00, 29888.15it/s]
Loading checkpoint shards: 100%|██████████████████| 2/2 [00:01<00:00, 1.33it/s]
Fetching 3 files: 100%|████████████████████████| 3/3 [00:00<00:00, 10098.65it/s]
{'axes_dims_rope'} was not found in config. Values will be initialized to default values.
Using decoupled weight decay
Using decoupled weight decay
x2-h100:118401:118401 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ^docker0,lo
x2-h100:118401:118401 [0] NCCL INFO Bootstrap : Using eth0:10.0.0.16<0>
x2-h100:118401:118401 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
x2-h100:118401:118401 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.20.5+cuda12.4
x2-h100:118402:118402 [1] NCCL INFO cudaDriverVersion 12020
x2-h100:118402:118402 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ^docker0,lo
x2-h100:118402:118402 [1] NCCL INFO Bootstrap : Using eth0:10.0.0.16<0>
x2-h100:118402:118402 [1] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
x2-h100:118401:119172 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
x2-h100:118401:119172 [0] NCCL INFO Failed to open libibverbs.so[.1]
x2-h100:118401:119172 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ^docker0,lo
x2-h100:118401:119172 [0] NCCL INFO NET/Socket : Using [0]eth0:10.0.0.16<0>
x2-h100:118402:119173 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
x2-h100:118401:119172 [0] NCCL INFO Using non-device net plugin version 0
x2-h100:118401:119172 [0] NCCL INFO Using network Socket
x2-h100:118402:119173 [1] NCCL INFO Failed to open libibverbs.so[.1]
x2-h100:118402:119173 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ^docker0,lo
x2-h100:118402:119173 [1] NCCL INFO NET/Socket : Using [0]eth0:10.0.0.16<0>
x2-h100:118402:119173 [1] NCCL INFO Using non-device net plugin version 0
x2-h100:118402:119173 [1] NCCL INFO Using network Socket
x2-h100:118402:119173 [1] NCCL INFO comm 0x160e5020 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 200000 commId 0x793d473d6144a18a - Init START
x2-h100:118401:119172 [0] NCCL INFO comm 0x24e3ff20 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 100000 commId 0x793d473d6144a18a - Init START
x2-h100:118401:119172 [0] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
x2-h100:118401:119172 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffffffff
x2-h100:118402:119173 [1] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
x2-h100:118402:119173 [1] NCCL INFO Setting affinity for GPU 1 to ffff,ffffff00,00000000
x2-h100:118402:119173 [1] NCCL INFO comm 0x160e5020 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
x2-h100:118402:119173 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1 [2] -1/-1/-1->1->0 [3] 0/-1/-1->1->-1
x2-h100:118402:119173 [1] NCCL INFO P2P Chunksize set to 131072
x2-h100:118401:119172 [0] NCCL INFO comm 0x24e3ff20 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
x2-h100:118401:119172 [0] NCCL INFO Channel 00/04 : 0 1
x2-h100:118401:119172 [0] NCCL INFO Channel 01/04 : 0 1
x2-h100:118401:119172 [0] NCCL INFO Channel 02/04 : 0 1
x2-h100:118401:119172 [0] NCCL INFO Channel 03/04 : 0 1
x2-h100:118401:119172 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->-1 [3] -1/-1/-1->0->1
x2-h100:118401:119172 [0] NCCL INFO P2P Chunksize set to 131072
x2-h100:118402:119173 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct
x2-h100:118402:119173 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct
x2-h100:118402:119173 [1] NCCL INFO Channel 02 : 1[1] -> 0[0] via SHM/direct/direct
x2-h100:118402:119173 [1] NCCL INFO Channel 03 : 1[1] -> 0[0] via SHM/direct/direct
x2-h100:118401:119172 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
x2-h100:118401:119172 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
x2-h100:118401:119172 [0] NCCL INFO Channel 02 : 0[0] -> 1[1] via SHM/direct/direct
x2-h100:118401:119172 [0] NCCL INFO Channel 03 : 0[0] -> 1[1] via SHM/direct/direct
x2-h100:118401:119172 [0] NCCL INFO Connected all rings
x2-h100:118401:119172 [0] NCCL INFO Connected all trees
x2-h100:118402:119173 [1] NCCL INFO Connected all rings
x2-h100:118402:119173 [1] NCCL INFO Connected all trees
x2-h100:118402:119173 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
x2-h100:118402:119173 [1] NCCL INFO 4 coll channels, 0 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
x2-h100:118401:119172 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
x2-h100:118401:119172 [0] NCCL INFO 4 coll channels, 0 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
x2-h100:118402:119173 [1] NCCL INFO comm 0x160e5020 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 200000 commId 0x793d473d6144a18a - Init COMPLETE
x2-h100:118401:119172 [0] NCCL INFO comm 0x24e3ff20 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 100000 commId 0x793d473d6144a18a - Init COMPLETE
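A side note on the NCCL init block above: no net plugin (`libnccl-net.so`) or `libibverbs` is found, so NCCL falls back to the plain Socket transport over eth0, and `NCCL_P2P_LEVEL` set to LOC disables GPU peer-to-peer, which is why every channel is routed via SHM/direct/direct. This is unrelated to the failure further down, but it explains the transport choices. A small sketch for checking whether direct peer access between the two devices is possible at all (assumes both GPUs are visible to the process):

    import torch

    # Hedged sketch: reports whether CUDA peer access between GPU 0 and GPU 1 is possible.
    # Even when this prints True, NCCL will not use P2P while NCCL_P2P_LEVEL=LOC is in effect.
    print(torch.cuda.can_device_access_peer(0, 1))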
09/25/2024 12:22:40 - INFO - __main__ - ***** Running training *****
09/25/2024 12:22:40 - INFO - __main__ - Num examples = 10
09/25/2024 12:22:40 - INFO - __main__ - Num batches each epoch = 5
09/25/2024 12:22:40 - INFO - __main__ - Num Epochs = 1
09/25/2024 12:22:40 - INFO - __main__ - Instantaneous batch size per device = 1
09/25/2024 12:22:40 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 8
09/25/2024 12:22:40 - INFO - __main__ - Gradient Accumulation steps = 4
09/25/2024 12:22:40 - INFO - __main__ - Total optimization steps = 2
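For reference, the totals above follow from the smaller numbers: total train batch size = 1 (per device) × 2 (processes) × 4 (gradient accumulation steps) = 8, and total optimization steps = ceil(5 batches per epoch / 4 accumulation steps) × 1 epoch = 2.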
Steps: 0%| | 0/2 [00:00<?, ?it/s]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
/usr/local/lib/python3.8/dist-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context: # type: ignore[attr-defined]
Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
/usr/local/lib/python3.8/dist-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context: # type: ignore[attr-defined]
Steps: 0%| | 0/2 [00:54<?, ?it/s, loss=0.4, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps: 0%| | 0/2 [01:00<?, ?it/s, loss=0.416, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps: 0%| | 0/2 [01:06<?, ?it/s, loss=0.327, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
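The repeated `txt_ids` deprecation comes from the diffusers Flux transformer, which now expects `txt_ids` without a batch dimension. A minimal sketch of the adjustment on the caller's side, assuming the tensor currently has shape (batch, seq_len, 3) and using `text_ids` as a hypothetical variable name:

    # Hedged sketch: the deprecation asks for a 2d tensor, so drop the batch dimension
    # before handing txt_ids to the transformer call.
    if text_ids.ndim == 3:
        text_ids = text_ids[0]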
Steps: 50%|██████████ | 1/2 [01:13<01:13, 73.36s/it, loss=0.327, lr=1]09/25/2024 12:23:53 - INFO - accelerate.accelerator - Saving current state to /flux-dreambooth-outputs/dreamboot-yaremovaa/checkpoint-1
09/25/2024 12:23:53 - INFO - accelerate.accelerator - Saving FSDP model
/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:689: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html .
warnings.warn(
/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_state_dict_utils.py:737: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
local_shape = tensor.shape
/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_state_dict_utils.py:749: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
tensor.shape,
/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_state_dict_utils.py:751: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
tensor.dtype,
/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_state_dict_utils.py:752: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
tensor.device,
09/25/2024 12:24:00 - INFO - accelerate.utils.fsdp_utils - Saving model to /flux-dreambooth-outputs/dreamboot-yaremovaa/checkpoint-1/pytorch_model_fsdp_0
/usr/local/lib/python3.8/dist-packages/accelerate/utils/fsdp_utils.py:107: FutureWarning: `save_state_dict` is deprecated and will be removed in future versions.Please use `save` instead.
dist_cp.save_state_dict(
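All of the FutureWarnings in this block are raised inside torch and accelerate themselves (`FSDP.state_dict_type`, `ShardedTensor`, `dist_cp.save_state_dict`), not by the training script, so short of upgrading those packages there is nothing to change in user code. For code that saves an FSDP state dict directly, the replacement the warnings point to is the `torch.distributed.checkpoint` API; a minimal sketch, assuming `model` and `optimizer` are the FSDP-wrapped module and its optimizer:

    import torch.distributed.checkpoint as dist_cp
    from torch.distributed.checkpoint.state_dict import get_state_dict

    # Hedged sketch of the newer checkpoint APIs the warnings point to; this is not
    # what accelerate does internally in save_fsdp_model.
    model_sd, optim_sd = get_state_dict(model, optimizer)
    dist_cp.save({"model": model_sd, "optimizer": optim_sd}, checkpoint_id="checkpoint-1")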
[rank1]:[E925 13:24:00.349880861 ProcessGroupNCCL.cpp:607] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1248, OpType=_ALLGATHER_BASE, NumelIn=70795904, NumelOut=141591808, Timeout(ms)=3600000) ran for 3600001 milliseconds before timing out.
[rank1]:[E925 13:24:00.350172793 ProcessGroupNCCL.cpp:670] [Rank 1] Work WorkNCCL(SeqNum=1248, OpType=_ALLGATHER_BASE, NumelIn=70795904, NumelOut=141591808, Timeout(ms)=3600000) timed out in blocking wait (TORCH_NCCL_BLOCKING_WAIT=1).
[rank0]:[E925 13:24:00.531450681 ProcessGroupNCCL.cpp:607] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1191, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=3600000) ran for 3600009 milliseconds before timing out.
[rank0]:[E925 13:24:00.531679792 ProcessGroupNCCL.cpp:670] [Rank 0] Work WorkNCCL(SeqNum=1191, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=3600000) timed out in blocking wait (TORCH_NCCL_BLOCKING_WAIT=1).
x2-h100:118401:119200 [0] NCCL INFO [Service thread] Connection closed by localRank 0
x2-h100:118401:118401 [0] NCCL INFO comm 0x24e3ff20 rank 0 nranks 2 cudaDev 0 busId 100000 - Abort COMPLETE
[rank0]:[E925 13:24:01.314158835 ProcessGroupNCCL.cpp:621] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E925 13:24:01.314175675 ProcessGroupNCCL.cpp:627] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E925 13:24:01.315466644 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 0] Exception (either an error or timeout) detected by watchdog at work: 1191, last enqueued NCCL work: 1191, last completed NCCL work: 1190.
[rank0]:[E925 13:24:01.315500284 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 0] Timeout at NCCL work: 1191, last enqueued NCCL work: 1191, last completed NCCL work: 1190.
[rank0]:[E925 13:24:01.315507284 ProcessGroupNCCL.cpp:621] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]: Traceback (most recent call last):
[rank0]:   File "examples/dreambooth/train_dreambooth_flux.py", line 1795, in <module>
[rank0]:     main(args)
[rank0]:   File "examples/dreambooth/train_dreambooth_flux.py", line 1684, in main
[rank0]:     accelerator.save_state(save_path)
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 2991, in save_state
[rank0]:     save_fsdp_model(self.state.fsdp_plugin, self, model, output_dir, i)
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/accelerate/utils/fsdp_utils.py", line 107, in save_fsdp_model
[rank0]:     dist_cp.save_state_dict(
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/typing_extensions.py", line 2853, in wrapper
[rank0]:     return arg(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/checkpoint/state_dict_saver.py", line 47, in save_state_dict
[rank0]:     return _save_state_dict(
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/checkpoint/state_dict_saver.py", line 316, in _save_state_dict
[rank0]:     central_plan: SavePlan = distW.reduce_scatter("plan", local_step, global_step)
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/checkpoint/utils.py", line 169, in reduce_scatter
[rank0]:     all_data = self.gather_object(local_data)
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/checkpoint/utils.py", line 108, in gather_object
[rank0]:     dist.gather_object(
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 2608, in gather_object
[rank0]:     all_gather(object_size_list, local_size, group=group)
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 3113, in all_gather
[rank0]:     work.wait()
[rank0]: torch.distributed.DistBackendError: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1191, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=3600000) ran for 3600009 milliseconds before timing out.
Steps: 50%|███████ | 1/2 [1:01:21<1:01:21, 3681.47s/it, loss=0.327, lr=1]
x2-h100:118402:119198 [1] NCCL INFO [Service thread] Connection closed by localRank 1
x2-h100:118402:121387 [1] NCCL INFO comm 0x160e5020 rank 1 nranks 2 cudaDev 1 busId 200000 - Abort COMPLETE
[rank1]:[E925 13:24:02.286785745 ProcessGroupNCCL.cpp:621] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E925 13:24:02.286809745 ProcessGroupNCCL.cpp:627] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E925 13:24:02.288169044 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 1248, last enqueued NCCL work: 1248, last completed NCCL work: 1247.
[rank1]:[E925 13:24:02.288200024 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 1] Timeout at NCCL work: 1248, last enqueued NCCL work: 1248, last completed NCCL work: 1247.
[rank1]:[E925 13:24:02.288208014 ProcessGroupNCCL.cpp:621] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]: Traceback (most recent call last):
[rank1]:   File "examples/dreambooth/train_dreambooth_flux.py", line 1795, in <module>
[rank1]:     main(args)
[rank1]:   File "examples/dreambooth/train_dreambooth_flux.py", line 1643, in main
[rank1]:     accelerator.backward(loss)
[rank1]:   File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 2196, in backward
[rank1]:     loss.backward(**kwargs)
[rank1]:   File "/usr/local/lib/python3.8/dist-packages/torch/_tensor.py", line 521, in backward
[rank1]:     torch.autograd.backward(
[rank1]:   File "/usr/local/lib/python3.8/dist-packages/torch/autograd/__init__.py", line 289, in backward
[rank1]:     _engine_run_backward(
[rank1]:   File "/usr/local/lib/python3.8/dist-packages/torch/autograd/graph.py", line 769, in _engine_run_backward
[rank1]:     return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank1]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_runtime_utils.py", line 666, in _pre_backward_hook
[rank1]:     _unshard(
[rank1]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_runtime_utils.py", line 300, in _unshard
[rank1]:     handle.unshard()
[rank1]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_flat_param.py", line 1311, in unshard
[rank1]:     padded_unsharded_flat_param = self._all_gather_flat_param(unsharded_flat_param)
[rank1]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_flat_param.py", line 1402, in _all_gather_flat_param
[rank1]:     dist.all_gather_into_tensor(
[rank1]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
[rank1]:     return func(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 3205, in all_gather_into_tensor
[rank1]:     work.wait()
[rank1]: torch.distributed.DistBackendError: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1248, OpType=_ALLGATHER_BASE, NumelIn=70795904, NumelOut=141591808, Timeout(ms)=3600000) ran for 3600001 milliseconds before timing out.
W0925 13:24:22.405389 140053788743488 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 118402 closing signal SIGTERM
E0925 13:24:31.946576 140053788743488 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 118401) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 1161, in launch_command
    multi_gpu_launcher(args)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 799, in multi_gpu_launcher
    distrib_run.run(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
examples/dreambooth/train_dreambooth_flux.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-09-25_13:24:22
host : x2-h100.internal.cloudapp.net
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 118401)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
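The failure itself is the pair of watchdog timeouts above, and the two tracebacks show the ranks stuck in different places: rank 0 is blocked in the small ALLGATHER that `accelerator.save_state` issues while planning the checkpoint (`gather_object` inside `dist_cp.save_state_dict`), while rank 1 is still blocked in the large `_ALLGATHER_BASE` of an FSDP unshard inside `accelerator.backward`. The mismatched sequence numbers (1191 vs 1248) suggest the two ranks diverged before the save, so each waits on a collective the other never issues until the 3600000 ms NCCL timeout expires and both processes are torn down. Raising the timeout only helps if the collective is genuinely slow rather than mismatched, but it is the usual first knob; a minimal sketch of widening it through accelerate, with the 2-hour value chosen arbitrarily for illustration:

    from datetime import timedelta
    from accelerate import Accelerator, InitProcessGroupKwargs

    # Hedged sketch: widen the NCCL collective timeout (the run above hit the 3600000 ms limit).
    # This does not fix a rank desynchronization like the one the mismatched SeqNums suggest.
    pg_kwargs = InitProcessGroupKwargs(timeout=timedelta(hours=2))
    accelerator = Accelerator(kwargs_handlers=[pg_kwargs])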