Map: 100%|█████████████████████████| 74568/74568 [34:12<00:00, 36.32 examples/s]
Half mapped dataset was not saved Object of type function is not JSON serializable
The format kwargs must be JSON serializable, but key 'transform' isn't.
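The two warnings above most likely mean the Dataset still carried an on-the-fly format transform (set via set_transform/with_transform) when .map() ran: a Python function cannot be JSON-serialized into the cache fingerprint, so the half-mapped result could not be saved and the map restarts from zero on the next attempt. A minimal sketch of one workaround, assuming a datasets.Dataset named train_dataset as in the traceback below (the mapped function name and batch size are illustrative stand-ins, not the script's actual values):

    # Drop the on-the-fly transform so map() can fingerprint and cache its
    # output; functions cannot be JSON-serialized into the cache key.
    train_dataset.reset_format()

    train_dataset = train_dataset.map(
        compute_vae_encodings_fn,  # hypothetical name for the script's batched function
        batched=True,
        batch_size=8,
    )

    # If the training loop relies on the transform, reattach it afterwards:
    # train_dataset.set_transform(preprocess_fn)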
Map: 0%| | 0/74568 [00:04<?, ? examples/s]
Traceback (most recent call last):
  File "/workspace/train_text_to_image_sdxl_timeout_increased.py", line 1278, in <module>
    main(args)
  File "/workspace/train_text_to_image_sdxl_timeout_increased.py", line 893, in main
    train_dataset = train_dataset.map(
  File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 591, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 556, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 3089, in map
    for rank, done, content in Dataset._map_single(**dataset_kwargs):
  File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 3466, in _map_single
    batch = apply_function_on_filtered_inputs(
  File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 3345, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "/workspace/train_text_to_image_sdxl_timeout_increased.py", line 528, in compute_vae_encodings
    model_input = vae.encode(pixel_values).latent_dist.sample()
  File "/usr/local/lib/python3.10/dist-packages/diffusers/utils/accelerate_utils.py", line 46, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/autoencoder_kl.py", line 274, in encode
    h = self.encoder(x)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/vae.py", line 165, in forward
    sample = down_block(sample)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/unet_2d_blocks.py", line 1323, in forward
    hidden_states = resnet(hidden_states, temb=None, scale=scale)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/resnet.py", line 755, in forward
    output_tensor = (input_tensor + hidden_states) / self.output_scale_factor
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.00 GiB. GPU 0 has a total capacty of 79.15 GiB of which 12.07 GiB is free. Process 1512076 has 67.07 GiB memory in use. Of the allocated memory 66.24 GiB is allocated by PyTorch, and 200.45 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
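The OOM fires inside compute_vae_encodings, where the whole pixel batch goes through vae.encode at once; a single 16 GiB allocation request suggests the per-map batch is too large for the VAE encoder on an 80 GiB card that already holds 67 GiB. A sketch of a more memory-frugal version, assuming vae is the script's diffusers AutoencoderKL (enable_slicing, latent_dist, and config.scaling_factor are real diffusers APIs; the function signature, column names, and chunk size here are illustrative):

    import torch

    vae.enable_slicing()  # diffusers AutoencoderKL: process images one at a time internally

    @torch.no_grad()
    def compute_vae_encodings(batch, vae, chunk_size=4):
        pixel_values = batch["pixel_values"].to(vae.device, dtype=vae.dtype)
        latents = []
        # Encode in small chunks so the encoder's activations never need 16 GiB at once.
        for chunk in pixel_values.split(chunk_size):
            latents.append(vae.encode(chunk).latent_dist.sample())
        batch["model_input"] = torch.cat(latents) * vae.config.scaling_factor
        return batch

Independently of that, the allocator hint at the end of the error can be tried by exporting PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512 before accelerate launch; that mitigates fragmentation of reserved memory but does not shrink the 16 GiB request itself.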
[2023-11-16 08:26:54,835] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 511 closing signal SIGTERM
[2023-11-16 08:26:54,835] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 512 closing signal SIGTERM
[2023-11-16 08:26:54,836] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 513 closing signal SIGTERM
[2023-11-16 08:26:54,837] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 514 closing signal SIGTERM
[2023-11-16 08:26:54,838] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 515 closing signal SIGTERM
[2023-11-16 08:26:54,839] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 516 closing signal SIGTERM
[2023-11-16 08:26:54,839] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 517 closing signal SIGTERM
[2023-11-16 08:26:56,859] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 510) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 985, in launch_command
    multi_gpu_launcher(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 654, in multi_gpu_launcher
    distrib_run.run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train_text_to_image_sdxl_timeout_increased.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-11-16_08:26:54
  host      : 47d9d483a16a
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 510)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
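The "error_file: <N/A>" and "To enable traceback" lines mean torchelastic could not capture the failing rank's traceback because the script's entry point was not wrapped. Per the linked docs, decorating main with record fixes that on the next run; a minimal sketch against the entry point visible in the first traceback (parse_args is a hypothetical stand-in for the script's own argument parsing):

    from torch.distributed.elastic.multiprocessing.errors import record

    @record  # lets torchelastic write the failing rank's traceback to error_file
    def main(args):
        ...

    if __name__ == "__main__":
        args = parse_args()  # hypothetical; use the script's own parser
        main(args)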