The following values were not passed to `accelerate launch` and had defaults used instead:
        `--num_processes` was set to a value of `8`
                More than one GPU was found, enabling multi-GPU training.
                If this was unintended please pass in `--num_processes=1`.
        `--num_machines` was set to a value of `1`
        `--mixed_precision` was set to a value of `'no'`
        `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
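
Pinning these values explicitly both silences the warning and makes the run reproducible. A minimal launch sketch, assuming a single node with 8 GPUs and using the script name that appears in the traceback further down (the launcher defaulted `--mixed_precision` to 'no', yet every process below reports fp16, presumably because the script configures mixed precision itself; passing the flag at launch keeps the two in agreement):

    accelerate launch \
        --num_processes=8 \
        --num_machines=1 \
        --mixed_precision=fp16 \
        --dynamo_backend=no \
        train_text_to_image_sdxl_timeout_increased.py <script arguments>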
11/16/2023 07:06:26 - INFO - __main__ - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 8
Process index: 7
Local process index: 7
Device: cuda:7

Mixed precision type: fp16

11/16/2023 07:06:26 - INFO - __main__ - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 8
Process index: 1
Local process index: 1
Device: cuda:1

Mixed precision type: fp16

11/16/2023 07:06:26 - INFO - __main__ - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 8
Process index: 6
Local process index: 6
Device: cuda:6

Mixed precision type: fp16

11/16/2023 07:06:26 - INFO - __main__ - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 8
Process index: 5
Local process index: 5
Device: cuda:5

Mixed precision type: fp16

11/16/2023 07:06:26 - INFO - __main__ - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 8
Process index: 0
Local process index: 0
Device: cuda:0

Mixed precision type: fp16

11/16/2023 07:06:26 - INFO - __main__ - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 8
Process index: 3
Local process index: 3
Device: cuda:3

Mixed precision type: fp16

11/16/2023 07:06:26 - INFO - __main__ - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 8
Process index: 4
Local process index: 4
Device: cuda:4

Mixed precision type: fp16

11/16/2023 07:06:26 - INFO - __main__ - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 8
Process index: 2
Local process index: 2
Device: cuda:2

Mixed precision type: fp16

You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
{'clip_sample_range', 'variance_type', 'dynamic_thresholding_ratio', 'thresholding'} was not found in config. Values will be initialized to default values.
{'attention_type', 'reverse_transformer_layers_per_block', 'dropout'} was not found in config. Values will be initialized to default values.
/usr/local/lib/python3.10/dist-packages/datasets/table.py:1421: FutureWarning: promote has been superseded by mode='default'.
  table = cls._concat_blocks(blocks, axis=0)
/usr/local/lib/python3.10/dist-packages/datasets/table.py:1421: FutureWarning: promote has been superseded by mode='default'.
  table = cls._concat_blocks(blocks, axis=0)
/usr/local/lib/python3.10/dist-packages/datasets/table.py:1421: FutureWarning: promote has been superseded by mode='default'.
  table = cls._concat_blocks(blocks, axis=0)
/usr/local/lib/python3.10/dist-packages/datasets/table.py:1421: FutureWarning: promote has been superseded by mode='default'.
  table = cls._concat_blocks(blocks, axis=0)
/usr/local/lib/python3.10/dist-packages/datasets/table.py:1421: FutureWarning: promote has been superseded by mode='default'.
  table = cls._concat_blocks(blocks, axis=0)
/usr/local/lib/python3.10/dist-packages/datasets/table.py:1421: FutureWarning: promote has been superseded by mode='default'.
  table = cls._concat_blocks(blocks, axis=0)
/usr/local/lib/python3.10/dist-packages/datasets/table.py:1421: FutureWarning: promote has been superseded by mode='default'.
  table = cls._concat_blocks(blocks, axis=0)
/usr/local/lib/python3.10/dist-packages/datasets/table.py:1421: FutureWarning: promote has been superseded by mode='default'.
  table = cls._concat_blocks(blocks, axis=0)
Half mapped dataset was not saved Object of type function is not JSON serializable
The format kwargs must be JSON serializable, but key 'transform' isn't.
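
The two serialization messages above point at the same root cause: `datasets` tries to record a JSON-serializable description of the dataset's format so the mapped result can be saved and reloaded, and a raw Python function (the 'transform' key) cannot be represented in JSON, so the half-mapped dataset is not saved. A minimal sketch reproducing the exact error text (`transform` here is a hypothetical stand-in for whatever callable the script registered):

    import json

    def transform(batch):  # hypothetical stand-in for the script's transform callable
        return batch

    try:
        json.dumps({"transform": transform})
    except TypeError as err:
        print(err)  # Object of type function is not JSON serializable

The practical cost is lost caching only: the map re-runs from scratch on every launch, which matters here because the next step is the expensive (and, below, fatal) VAE-encoding pass.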
Map:   0%|          | 0/74568 [00:06<?, ? examples/s]
Traceback (most recent call last):
  File "/workspace/train_text_to_image_sdxl_timeout_increased.py", line 1278, in <module>
    main(args)
  File "/workspace/train_text_to_image_sdxl_timeout_increased.py", line 893, in main
    train_dataset = train_dataset.map(
  File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 592, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 557, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 3097, in map
    for rank, done, content in Dataset._map_single(**dataset_kwargs):
  File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 3474, in _map_single
    batch = apply_function_on_filtered_inputs(
  File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 3353, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "/workspace/train_text_to_image_sdxl_timeout_increased.py", line 528, in compute_vae_encodings
    model_input = vae.encode(pixel_values).latent_dist.sample()
  File "/usr/local/lib/python3.10/dist-packages/diffusers/utils/accelerate_utils.py", line 46, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/autoencoder_kl.py", line 274, in encode
    h = self.encoder(x)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/vae.py", line 165, in forward
    sample = down_block(sample)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/unet_2d_blocks.py", line 1323, in forward
    hidden_states = resnet(hidden_states, temb=None, scale=scale)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/resnet.py", line 693, in forward
    hidden_states = self.nonlinearity(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py", line 393, in forward
    return F.silu(input, inplace=self.inplace)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py", line 2072, in silu
    return torch._C._nn.silu(input)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 GiB. GPU 0 has a total capacty of 79.15 GiB of which 11.80 GiB is free. Process 1506108 has 67.34 GiB memory in use. Of the allocated memory 66.60 GiB is allocated by PyTorch, and 120.58 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2023-11-16 07:06:54,116] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 7560 closing signal SIGTERM
[2023-11-16 07:06:54,117] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 7561 closing signal SIGTERM
[2023-11-16 07:06:54,118] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 7562 closing signal SIGTERM
[2023-11-16 07:06:54,119] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 7563 closing signal SIGTERM
[2023-11-16 07:06:54,119] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 7564 closing signal SIGTERM
[2023-11-16 07:06:54,120] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 7565 closing signal SIGTERM
[2023-11-16 07:06:54,121] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 7566 closing signal SIGTERM
[2023-11-16 07:06:56,105] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 7559) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 985, in launch_command
    multi_gpu_launcher(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 654, in multi_gpu_launcher
    distrib_run.run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train_text_to_image_sdxl_timeout_increased.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-11-16_07:06:54
  host      : 5fc1ae3a0e3c
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 7559)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
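
The root failure is the CUDA out-of-memory inside compute_vae_encodings: a single 32.00 GiB allocation request during vae.encode() while only 11.80 GiB of the 80 GiB card is free. The max_split_size_mb hint in the message does not apply here, since only ~120 MiB is reserved-but-unallocated, so fragmentation is not the issue; the batch handed to the VAE in one map() call is simply too large. A hedged mitigation sketch, not the script's actual code (the names train_dataset and compute_vae_encodings_fn mirror the traceback at train_text_to_image_sdxl_timeout_increased.py lines 893 and 528; the batch_size value is an assumption, a knob to tune down until one VAE forward pass fits):

    import torch

    with torch.no_grad():  # encoding is inference-only; tracking gradients would only waste memory
        train_dataset = train_dataset.map(
            compute_vae_encodings_fn,  # assumed wrapper around the script's compute_vae_encodings
            batched=True,
            batch_size=8,  # assumption: small enough for one vae.encode() call to fit in memory
        )

Casting the VAE to fp16 for this encoding pass is another plausible lever; whether either change fits this particular script is untested.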