kopyl — Nov 16th, 2023
Map: 100%|█████████████████████████| 74568/74568 [34:12<00:00, 36.32 examples/s]
Half-mapped dataset was not saved: Object of type function is not JSON serializable
The format kwargs must be JSON serializable, but key 'transform' isn't.
Map:   0%|          | 0/74568 [00:04<?, ? examples/s]
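The two warnings above appear to mean the mapped dataset could not be persisted: a Python callable attached via with_transform()/set_transform() sits in the dataset's format kwargs, and 🤗 Datasets cannot JSON-serialize a function, so the half-finished map is discarded and restarts from 0%. A minimal sketch of the usual workaround, with placeholder data (the dataset and transform below are illustrative, not taken from the training script):

```python
from datasets import Dataset

# Illustrative placeholder dataset; the real script maps 74,568 examples.
ds = Dataset.from_dict({"pixel_values": [[0.0], [1.0]]})
ds = ds.with_transform(lambda batch: batch)  # on-the-fly transform; a function is not JSON serializable

# Detach the custom format/transform so only the Arrow data gets persisted.
plain = ds.with_format(None)
plain.save_to_disk("cached_dataset")
```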
Traceback (most recent call last):
  File "/workspace/train_text_to_image_sdxl_timeout_increased.py", line 1278, in <module>
    main(args)
  File "/workspace/train_text_to_image_sdxl_timeout_increased.py", line 893, in main
    train_dataset = train_dataset.map(
  File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 591, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 556, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 3089, in map
    for rank, done, content in Dataset._map_single(**dataset_kwargs):
  File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 3466, in _map_single
    batch = apply_function_on_filtered_inputs(
  File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 3345, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "/workspace/train_text_to_image_sdxl_timeout_increased.py", line 528, in compute_vae_encodings
    model_input = vae.encode(pixel_values).latent_dist.sample()
  File "/usr/local/lib/python3.10/dist-packages/diffusers/utils/accelerate_utils.py", line 46, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/autoencoder_kl.py", line 274, in encode
    h = self.encoder(x)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/vae.py", line 165, in forward
    sample = down_block(sample)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/unet_2d_blocks.py", line 1323, in forward
    hidden_states = resnet(hidden_states, temb=None, scale=scale)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/resnet.py", line 755, in forward
    output_tensor = (input_tensor + hidden_states) / self.output_scale_factor
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.00 GiB. GPU 0 has a total capacity of 79.15 GiB of which 12.07 GiB is free. Process 1512076 has 67.07 GiB memory in use. Of the allocated memory 66.24 GiB is allocated by PyTorch, and 200.45 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
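The failing allocation is the single-shot VAE encode in compute_vae_encodings (line 528 of the traceback): the whole pixel_values batch goes through vae.encode at once, which needs a 16 GiB activation tensor. One common mitigation, sketched below, is to encode in smaller slices so peak activation memory is capped; vae and pixel_values are the objects from the traceback, while the chunk size is a tunable guess, not a value from the script:

```python
import torch

@torch.no_grad()
def encode_in_chunks(vae, pixel_values, chunk_size=8):
    # Encode a large batch slice by slice to cap peak activation memory.
    latents = []
    for start in range(0, pixel_values.shape[0], chunk_size):
        chunk = pixel_values[start : start + chunk_size]
        latents.append(vae.encode(chunk).latent_dist.sample())
    return torch.cat(latents, dim=0)

# Possible drop-in for the failing call:
# model_input = encode_in_chunks(vae, pixel_values)
```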
[2023-11-16 08:26:54,835] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 511 closing signal SIGTERM
[2023-11-16 08:26:54,835] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 512 closing signal SIGTERM
[2023-11-16 08:26:54,836] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 513 closing signal SIGTERM
[2023-11-16 08:26:54,837] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 514 closing signal SIGTERM
[2023-11-16 08:26:54,838] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 515 closing signal SIGTERM
[2023-11-16 08:26:54,839] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 516 closing signal SIGTERM
[2023-11-16 08:26:54,839] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 517 closing signal SIGTERM
[2023-11-16 08:26:56,859] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 510) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 985, in launch_command
    multi_gpu_launcher(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 654, in multi_gpu_launcher
    distrib_run.run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train_text_to_image_sdxl_timeout_increased.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-11-16_08:26:54
  host      : 47d9d483a16a
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 510)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
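Two follow-ups the log itself points at. First, the summary shows error_file: <N/A>, meaning the child's exception never reached the launcher; the linked elastic docs describe wrapping the entrypoint in the record decorator so the child traceback is written to an error file. Second, the OOM message recommends tuning the caching allocator via PYTORCH_CUDA_ALLOC_CONF, which must be set before CUDA is first used. A combined sketch, where the main body stands in for the real script:

```python
import os

# Must be set before any CUDA allocation so the caching allocator picks it up.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:512")

from torch.distributed.elastic.multiprocessing.errors import record

@record  # writes the child's traceback to an error file the launcher can report
def main(args):
    ...  # training loop, as in train_text_to_image_sdxl_timeout_increased.py

if __name__ == "__main__":
    main(None)  # the real script passes its parsed CLI args here
```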