The following values were not passed to `accelerate launch` and had defaults used instead:
`--num_cpu_threads_per_process` was set to `40` to improve out-of-box performance when training on CPUs
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
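[Note: the three lines above are Accelerate's standard notice that an unset launcher argument fell back to a default; it is informational, not an error. It can be silenced by running `accelerate config` once, or by passing the value explicitly, e.g. (the training-script name below is a placeholder, not taken from this log):

    accelerate launch --num_cpu_threads_per_process 40 train_dreambooth_flux.py ...]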
[W925 11:14:22.847626929 Utils.hpp:164] Warning: Environment variable NCCL_BLOCKING_WAIT is deprecated; use TORCH_NCCL_BLOCKING_WAIT instead (function operator())
[W925 11:14:22.847654179 Utils.hpp:135] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function operator())
[W925 11:14:22.848692186 Utils.hpp:164] Warning: Environment variable NCCL_BLOCKING_WAIT is deprecated; use TORCH_NCCL_BLOCKING_WAIT instead (function operator())
[W925 11:14:22.848717216 Utils.hpp:135] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function operator())
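[Note: these four warnings come from PyTorch's NCCL bindings; recent PyTorch releases renamed the NCCL_* watchdog variables to TORCH_NCCL_*, and older Accelerate versions still export the old names. A minimal sketch of mapping them forward before the process group is created, assuming you control the launcher environment:

    import os

    # Copy the deprecated names onto the new TORCH_NCCL_* names if they are set.
    # Assumption: this runs before torch.distributed initializes the NCCL backend.
    for old, new in [("NCCL_BLOCKING_WAIT", "TORCH_NCCL_BLOCKING_WAIT"),
                     ("NCCL_ASYNC_ERROR_HANDLING", "TORCH_NCCL_ASYNC_ERROR_HANDLING")]:
        if old in os.environ and new not in os.environ:
            os.environ[new] = os.environ[old]]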
09/25/2024 11:14:22 - INFO - __main__ - Distributed environment: FSDP Backend: nccl
Num processes: 2
Process index: 0
Local process index: 0
Device: cuda:0

Mixed precision type: bf16

You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
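[Note: this tokenizer notice is printed by transformers when `add_prefix_space` is requested on a fast tokenizer that must be rebuilt from its slow counterpart; it is informational. One plausible (hedged) reproduction, with a placeholder checkpoint name:

    from transformers import AutoTokenizer

    # Assumption: requesting add_prefix_space forces a slow->fast conversion,
    # which emits the notice seen in the log.
    tok = AutoTokenizer.from_pretrained("t5-small", add_prefix_space=True)]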
09/25/2024 11:14:22 - INFO - __main__ - Distributed environment: FSDP Backend: nccl
Num processes: 2
Process index: 1
Local process index: 1
Device: cuda:1

Mixed precision type: bf16

You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
You are using a model of type t5 to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
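[Note: the two "model of type ... to instantiate a model of type ." warnings are a benign side effect of how the script inspects the text-encoder configs. Loading the encoders with their concrete classes is the usual pattern; a minimal sketch, assuming the standard FLUX.1-dev repository layout (the repo id is an assumption, not named in this log):

    from transformers import CLIPTextModel, T5EncoderModel

    repo = "black-forest-labs/FLUX.1-dev"  # assumption: gated repo, needs HF auth
    text_encoder = CLIPTextModel.from_pretrained(repo, subfolder="text_encoder")
    text_encoder_2 = T5EncoderModel.from_pretrained(repo, subfolder="text_encoder_2")]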
Downloading shards: 100%|███████████████████████| 2/2 [00:00<00:00, 4888.47it/s]
Downloading shards: 100%|██████████████████████| 2/2 [00:00<00:00, 15307.68it/s]
Loading checkpoint shards: 100%|██████████████████| 2/2 [00:01<00:00, 1.31it/s]
Fetching 3 files: 100%|████████████████████████| 3/3 [00:00<00:00, 73584.28it/s]
Loading checkpoint shards: 100%|██████████████████| 2/2 [00:57<00:00, 28.72s/it]
Fetching 3 files: 100%|█████████████████████████| 3/3 [00:00<00:00, 8726.01it/s]
{'axes_dims_rope'} was not found in config. Values will be initialized to default values.
Using decoupled weight decay
Using decoupled weight decay
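[Note: "Using decoupled weight decay" is printed once per rank by the optimizer. Together with the `lr=1` readout in the progress bars further down, it is consistent with (though not proof of) the Prodigy optimizer, which is conventionally run with a nominal learning rate of 1.0 and AdamW-style decoupled weight decay. A hedged sketch, assuming prodigyopt is installed and in use:

    import torch
    from prodigyopt import Prodigy  # assumption: prodigyopt is the optimizer here

    params = [torch.nn.Parameter(torch.zeros(4))]  # stand-in for the model's params
    optimizer = Prodigy(params, lr=1.0, weight_decay=1e-3, decouple=True)]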
x2-h100:3217:3217 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ^docker0,lo
x2-h100:3217:3217 [0] NCCL INFO Bootstrap : Using eth0:10.0.0.16<0>
x2-h100:3217:3217 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
x2-h100:3217:3217 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.20.5+cuda12.4
x2-h100:3218:3218 [1] NCCL INFO cudaDriverVersion 12020
x2-h100:3218:3218 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ^docker0,lo
x2-h100:3218:3218 [1] NCCL INFO Bootstrap : Using eth0:10.0.0.16<0>
x2-h100:3218:3218 [1] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
x2-h100:3218:4557 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
x2-h100:3218:4557 [1] NCCL INFO Failed to open libibverbs.so[.1]
x2-h100:3218:4557 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ^docker0,lo
x2-h100:3218:4557 [1] NCCL INFO NET/Socket : Using [0]eth0:10.0.0.16<0>
x2-h100:3217:4556 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
x2-h100:3217:4556 [0] NCCL INFO Failed to open libibverbs.so[.1]
x2-h100:3217:4556 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ^docker0,lo
x2-h100:3218:4557 [1] NCCL INFO Using non-device net plugin version 0
x2-h100:3218:4557 [1] NCCL INFO Using network Socket
x2-h100:3217:4556 [0] NCCL INFO NET/Socket : Using [0]eth0:10.0.0.16<0>
x2-h100:3217:4556 [0] NCCL INFO Using non-device net plugin version 0
x2-h100:3217:4556 [0] NCCL INFO Using network Socket
x2-h100:3218:4557 [1] NCCL INFO comm 0x2f5704c0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 200000 commId 0xe3ead5030c28e9f0 - Init START
x2-h100:3217:4556 [0] NCCL INFO comm 0x27143e10 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 100000 commId 0xe3ead5030c28e9f0 - Init START
x2-h100:3217:4556 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffffffff
x2-h100:3218:4557 [1] NCCL INFO Setting affinity for GPU 1 to ffff,ffffff00,00000000
x2-h100:3217:4556 [0] NCCL INFO comm 0x27143e10 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
x2-h100:3218:4557 [1] NCCL INFO comm 0x2f5704c0 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
x2-h100:3218:4557 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0 [4] 0/-1/-1->1->-1 [5] 0/-1/-1->1->-1 [6] 0/-1/-1->1->-1 [7] 0/-1/-1->1->-1 [8] -1/-1/-1->1->0 [9] -1/-1/-1->1->0 [10] -1/-1/-1->1->0 [11] -1/-1/-1->1->0 [12] 0/-1/-1->1->-1 [13] 0/-1/-1->1->-1 [14] 0/-1/-1->1->-1 [15] 0/-1/-1->1->-1
x2-h100:3218:4557 [1] NCCL INFO P2P Chunksize set to 524288
x2-h100:3217:4556 [0] NCCL INFO Channel 00/16 : 0 1
x2-h100:3217:4556 [0] NCCL INFO Channel 01/16 : 0 1
x2-h100:3217:4556 [0] NCCL INFO Channel 02/16 : 0 1
x2-h100:3217:4556 [0] NCCL INFO Channel 03/16 : 0 1
x2-h100:3217:4556 [0] NCCL INFO Channel 04/16 : 0 1
x2-h100:3217:4556 [0] NCCL INFO Channel 05/16 : 0 1
x2-h100:3217:4556 [0] NCCL INFO Channel 06/16 : 0 1
x2-h100:3217:4556 [0] NCCL INFO Channel 07/16 : 0 1
x2-h100:3217:4556 [0] NCCL INFO Channel 08/16 : 0 1
x2-h100:3217:4556 [0] NCCL INFO Channel 09/16 : 0 1
x2-h100:3217:4556 [0] NCCL INFO Channel 10/16 : 0 1
x2-h100:3217:4556 [0] NCCL INFO Channel 11/16 : 0 1
x2-h100:3217:4556 [0] NCCL INFO Channel 12/16 : 0 1
x2-h100:3217:4556 [0] NCCL INFO Channel 13/16 : 0 1
x2-h100:3217:4556 [0] NCCL INFO Channel 14/16 : 0 1
x2-h100:3217:4556 [0] NCCL INFO Channel 15/16 : 0 1
x2-h100:3217:4556 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] -1/-1/-1->0->1 [5] -1/-1/-1->0->1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] -1/-1/-1->0->1 [13] -1/-1/-1->0->1 [14] -1/-1/-1->0->1 [15] -1/-1/-1->0->1
x2-h100:3217:4556 [0] NCCL INFO P2P Chunksize set to 524288
x2-h100:3217:4556 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM
x2-h100:3218:4557 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/CUMEM
x2-h100:3218:4557 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/CUMEM
x2-h100:3217:4556 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/CUMEM
x2-h100:3218:4557 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/CUMEM
x2-h100:3217:4556 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/CUMEM
x2-h100:3218:4557 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/CUMEM
x2-h100:3217:4556 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/CUMEM
x2-h100:3218:4557 [1] NCCL INFO Channel 04/0 : 1[1] -> 0[0] via P2P/CUMEM
x2-h100:3217:4556 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/CUMEM
x2-h100:3218:4557 [1] NCCL INFO Channel 05/0 : 1[1] -> 0[0] via P2P/CUMEM
x2-h100:3217:4556 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/CUMEM
x2-h100:3218:4557 [1] NCCL INFO Channel 06/0 : 1[1] -> 0[0] via P2P/CUMEM
x2-h100:3217:4556 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/CUMEM
x2-h100:3218:4557 [1] NCCL INFO Channel 07/0 : 1[1] -> 0[0] via P2P/CUMEM
x2-h100:3217:4556 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/CUMEM
x2-h100:3218:4557 [1] NCCL INFO Channel 08/0 : 1[1] -> 0[0] via P2P/CUMEM
x2-h100:3217:4556 [0] NCCL INFO Channel 08/0 : 0[0] -> 1[1] via P2P/CUMEM
x2-h100:3218:4557 [1] NCCL INFO Channel 09/0 : 1[1] -> 0[0] via P2P/CUMEM
x2-h100:3217:4556 [0] NCCL INFO Channel 09/0 : 0[0] -> 1[1] via P2P/CUMEM
x2-h100:3218:4557 [1] NCCL INFO Channel 10/0 : 1[1] -> 0[0] via P2P/CUMEM
x2-h100:3217:4556 [0] NCCL INFO Channel 10/0 : 0[0] -> 1[1] via P2P/CUMEM
x2-h100:3218:4557 [1] NCCL INFO Channel 11/0 : 1[1] -> 0[0] via P2P/CUMEM
x2-h100:3217:4556 [0] NCCL INFO Channel 11/0 : 0[0] -> 1[1] via P2P/CUMEM
x2-h100:3218:4557 [1] NCCL INFO Channel 12/0 : 1[1] -> 0[0] via P2P/CUMEM
x2-h100:3217:4556 [0] NCCL INFO Channel 12/0 : 0[0] -> 1[1] via P2P/CUMEM
x2-h100:3218:4557 [1] NCCL INFO Channel 13/0 : 1[1] -> 0[0] via P2P/CUMEM
x2-h100:3217:4556 [0] NCCL INFO Channel 13/0 : 0[0] -> 1[1] via P2P/CUMEM
x2-h100:3218:4557 [1] NCCL INFO Channel 14/0 : 1[1] -> 0[0] via P2P/CUMEM
x2-h100:3217:4556 [0] NCCL INFO Channel 14/0 : 0[0] -> 1[1] via P2P/CUMEM
x2-h100:3218:4557 [1] NCCL INFO Channel 15/0 : 1[1] -> 0[0] via P2P/CUMEM
x2-h100:3217:4556 [0] NCCL INFO Channel 15/0 : 0[0] -> 1[1] via P2P/CUMEM
x2-h100:3217:4556 [0] NCCL INFO Connected all rings
x2-h100:3218:4557 [1] NCCL INFO Connected all rings
x2-h100:3217:4556 [0] NCCL INFO Connected all trees
x2-h100:3218:4557 [1] NCCL INFO Connected all trees
x2-h100:3218:4557 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
x2-h100:3218:4557 [1] NCCL INFO 16 coll channels, 0 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer
x2-h100:3217:4556 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
x2-h100:3217:4556 [0] NCCL INFO 16 coll channels, 0 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer
x2-h100:3218:4557 [1] NCCL INFO comm 0x2f5704c0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 200000 commId 0xe3ead5030c28e9f0 - Init COMPLETE
x2-h100:3217:4556 [0] NCCL INFO comm 0x27143e10 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 100000 commId 0xe3ead5030c28e9f0 - Init COMPLETE
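[Note: the NCCL block above shows a healthy single-node init. "Failed to open libibverbs.so" only means no InfiniBand is present, so the bootstrap falls back to sockets; all 16 channels then connect GPU 0 and GPU 1 directly via P2P/CUMEM. A quick way to confirm peer access from Python on the same box:

    import torch

    # True on this machine if GPU 0 can read/write GPU 1's memory directly,
    # which is what the "via P2P/CUMEM" channel lines indicate.
    print(torch.cuda.can_device_access_peer(0, 1))]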
09/25/2024 11:16:58 - INFO - __main__ - ***** Running training *****
09/25/2024 11:16:58 - INFO - __main__ - Num examples = 10
09/25/2024 11:16:58 - INFO - __main__ - Num batches each epoch = 5
09/25/2024 11:16:58 - INFO - __main__ - Num Epochs = 1
09/25/2024 11:16:58 - INFO - __main__ - Instantaneous batch size per device = 1
09/25/2024 11:16:58 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 8
09/25/2024 11:16:58 - INFO - __main__ - Gradient Accumulation steps = 4
09/25/2024 11:16:58 - INFO - __main__ - Total optimization steps = 2
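[Note: the summary above is internally consistent and worth sanity-checking: total batch size = per-device batch × processes × accumulation steps, and optimization steps = ceil(batches per epoch / accumulation) × epochs. A worked check with the values from this run:

    import math

    per_device, num_processes, grad_accum = 1, 2, 4
    batches_per_epoch, epochs = 5, 1

    total_batch = per_device * num_processes * grad_accum       # 1 * 2 * 4 = 8
    steps = math.ceil(batches_per_epoch / grad_accum) * epochs  # ceil(5/4) = 2
    print(total_batch, steps)  # 8 2, matching the log]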
Steps: 0%| | 0/2 [00:00<?, ?it/s]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
/usr/local/lib/python3.8/dist-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context: # type: ignore[attr-defined]
/usr/local/lib/python3.8/dist-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context: # type: ignore[attr-defined]
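[Note: this FutureWarning is raised inside PyTorch's own torch/utils/checkpoint.py, so the training script cannot patch it directly; upgrading PyTorch removes it. For user code, the renamed API looks like this (sketch):

    import torch

    # Old, deprecated spelling: torch.cpu.amp.autocast(dtype=torch.bfloat16)
    # New, device-generic spelling:
    with torch.amp.autocast("cpu", dtype=torch.bfloat16):
        y = torch.ones(2) + 1  # any autocast-eligible work]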
Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps: 0%| | 0/2 [01:27<?, ?it/s, loss=0.4, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps: 0%| | 0/2 [01:31<?, ?it/s, loss=0.416, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps: 0%| | 0/2 [01:35<?, ?it/s, loss=0.327, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
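[Note: the repeated `txt_ids` deprecation comes from diffusers' Flux transformer, which now expects `txt_ids` (and `img_ids`) as 2-d tensors of shape (sequence_length, 3) rather than batched 3-d tensors, since the position ids are identical across the batch. A sketch of the shape fix, with placeholder dimensions:

    import torch

    batch, seq_len = 2, 512                      # placeholder sizes
    txt_ids_3d = torch.zeros(batch, seq_len, 3)  # old, deprecated shape
    txt_ids_2d = txt_ids_3d[0]                   # drop the batch dim -> (seq_len, 3)
    assert txt_ids_2d.shape == (seq_len, 3)]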
Steps: 50%|█████████▌ | 1/2 [01:40<01:40, 100.45s/it, loss=0.327, lr=1]09/25/2024 11:18:39 - INFO - accelerate.accelerator - Saving current state to /flux-dreambooth-outputs/dreamboot-yaremovaa/checkpoint-1
09/25/2024 11:18:39 - INFO - accelerate.accelerator - Saving FSDP model
/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:689: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html .
warnings.warn(
/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_state_dict_utils.py:737: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
local_shape = tensor.shape
/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_state_dict_utils.py:749: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
tensor.shape,
/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_state_dict_utils.py:751: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
tensor.dtype,
/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_state_dict_utils.py:752: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
tensor.device,
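[Note: this cluster of FutureWarnings is Accelerate saving the FSDP model through the older FSDP.state_dict_type()/ShardedTensor path; the message itself names the replacement. A minimal sketch of the newer API the warning points to, using a toy module in place of the FSDP-wrapped model from this run:

    import torch
    from torch.distributed.checkpoint.state_dict import get_state_dict

    model = torch.nn.Linear(4, 4)  # stand-in for the FSDP-wrapped transformer
    optim = torch.optim.AdamW(model.parameters())

    # Returns (model_state_dict, optimizer_state_dict) and handles plain,
    # DDP, and FSDP modules uniformly.
    model_sd, optim_sd = get_state_dict(model, optim)]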
09/25/2024 11:18:45 - INFO - accelerate.utils.fsdp_utils - Saving model to /flux-dreambooth-outputs/dreamboot-yaremovaa/checkpoint-1/pytorch_model_fsdp_0
/usr/local/lib/python3.8/dist-packages/accelerate/utils/fsdp_utils.py:107: FutureWarning: `save_state_dict` is deprecated and will be removed in future versions.Please use `save` instead.
dist_cp.save_state_dict(
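[Note: `dist_cp.save_state_dict` is deprecated inside accelerate's fsdp_utils; the replacement named by the warning is `torch.distributed.checkpoint.save`. A hedged sketch of the modern call, with a toy state dict and a relative placeholder path standing in for the checkpoint directory seen in the log:

    import torch
    import torch.distributed.checkpoint as dist_cp

    state_dict = {"w": torch.zeros(2)}  # stand-in for the sharded FSDP state
    dist_cp.save(state_dict, checkpoint_id="checkpoint-1/pytorch_model_fsdp_0")]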