The following values were not passed to `accelerate launch` and had defaults used instead:
        `--num_processes` was set to a value of `8`
                More than one GPU was found, enabling multi-GPU training.
                If this was unintended please pass in `--num_processes=1`.
        `--num_machines` was set to a value of `1`
        `--mixed_precision` was set to a value of `'no'`
        `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
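(If those defaults are not what the run actually needs, the launch can be made explicit. The flags below are just the ones named in the warning, with fp16 matching the mixed-precision type the processes report further down; the remaining training arguments after the script name are whatever this run actually used and are left elided:

    accelerate launch \
        --num_processes=8 \
        --num_machines=1 \
        --mixed_precision=fp16 \
        --dynamo_backend=no \
        train_text_to_image_sdxl_timeout_increased.py ...

Alternatively, running `accelerate config` once stores these choices and silences the warning.)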
11/16/2023 07:06:26 - INFO - __main__ - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 8
Process index: 7
Local process index: 7
Device: cuda:7
Mixed precision type: fp16
11/16/2023 07:06:26 - INFO - __main__ - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 8
Process index: 1
Local process index: 1
Device: cuda:1
Mixed precision type: fp16
11/16/2023 07:06:26 - INFO - __main__ - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 8
Process index: 6
Local process index: 6
Device: cuda:6
Mixed precision type: fp16
11/16/2023 07:06:26 - INFO - __main__ - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 8
Process index: 5
Local process index: 5
Device: cuda:5
Mixed precision type: fp16
11/16/2023 07:06:26 - INFO - __main__ - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 8
Process index: 0
Local process index: 0
Device: cuda:0
Mixed precision type: fp16
11/16/2023 07:06:26 - INFO - __main__ - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 8
Process index: 3
Local process index: 3
Device: cuda:3
Mixed precision type: fp16
11/16/2023 07:06:26 - INFO - __main__ - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 8
Process index: 4
Local process index: 4
Device: cuda:4
Mixed precision type: fp16
11/16/2023 07:06:26 - INFO - __main__ - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 8
Process index: 2
Local process index: 2
Device: cuda:2
Mixed precision type: fp16
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
{'clip_sample_range', 'variance_type', 'dynamic_thresholding_ratio', 'thresholding'} was not found in config. Values will be initialized to default values.
{'attention_type', 'reverse_transformer_layers_per_block', 'dropout'} was not found in config. Values will be initialized to default values.
/usr/local/lib/python3.10/dist-packages/datasets/table.py:1421: FutureWarning: promote has been superseded by mode='default'.
  table = cls._concat_blocks(blocks, axis=0)
/usr/local/lib/python3.10/dist-packages/datasets/table.py:1421: FutureWarning: promote has been superseded by mode='default'.
  table = cls._concat_blocks(blocks, axis=0)
/usr/local/lib/python3.10/dist-packages/datasets/table.py:1421: FutureWarning: promote has been superseded by mode='default'.
  table = cls._concat_blocks(blocks, axis=0)
/usr/local/lib/python3.10/dist-packages/datasets/table.py:1421: FutureWarning: promote has been superseded by mode='default'.
  table = cls._concat_blocks(blocks, axis=0)
/usr/local/lib/python3.10/dist-packages/datasets/table.py:1421: FutureWarning: promote has been superseded by mode='default'.
  table = cls._concat_blocks(blocks, axis=0)
/usr/local/lib/python3.10/dist-packages/datasets/table.py:1421: FutureWarning: promote has been superseded by mode='default'.
  table = cls._concat_blocks(blocks, axis=0)
/usr/local/lib/python3.10/dist-packages/datasets/table.py:1421: FutureWarning: promote has been superseded by mode='default'.
  table = cls._concat_blocks(blocks, axis=0)
/usr/local/lib/python3.10/dist-packages/datasets/table.py:1421: FutureWarning: promote has been superseded by mode='default'.
  table = cls._concat_blocks(blocks, axis=0)
Half mapped dataset was not saved Object of type function is not JSON serializable
The format kwargs must be JSON serializable, but key 'transform' isn't.
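(The two lines above typically mean the dataset had an on-the-fly transform attached via `set_transform`/`with_transform`; a format transform is a plain Python function, which cannot be JSON-serialized, so the partially mapped dataset could not be written to disk. This usually only affects caching of the preprocessed data, not the transform itself. A minimal sketch of the difference, using a hypothetical `add_length` function that is not part of the training script:

    from datasets import Dataset

    ds = Dataset.from_dict({"text": ["a", "bb"]})

    def add_length(batch):
        # hypothetical preprocessing step, not the script's actual code
        batch["length"] = [len(t) for t in batch["text"]]
        return batch

    # .map materializes its output, so the result can be cached and saved to disk
    ds_mapped = ds.map(add_length, batched=True)

    # set_transform attaches the function as a lazy format; the function ends up
    # in the dataset's format kwargs, which are not JSON serializable -- hence
    # the "key 'transform' isn't" message when the script tries to save state.
    ds.set_transform(add_length)
)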
Map: 0%| | 0/74568 [00:06<?, ? examples/s]
Traceback (most recent call last):
  File "/workspace/train_text_to_image_sdxl_timeout_increased.py", line 1278, in <module>
    main(args)
  File "/workspace/train_text_to_image_sdxl_timeout_increased.py", line 893, in main
    train_dataset = train_dataset.map(
  File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 592, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 557, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 3097, in map
    for rank, done, content in Dataset._map_single(**dataset_kwargs):
  File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 3474, in _map_single
    batch = apply_function_on_filtered_inputs(
  File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 3353, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "/workspace/train_text_to_image_sdxl_timeout_increased.py", line 528, in compute_vae_encodings
    model_input = vae.encode(pixel_values).latent_dist.sample()
  File "/usr/local/lib/python3.10/dist-packages/diffusers/utils/accelerate_utils.py", line 46, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/autoencoder_kl.py", line 274, in encode
    h = self.encoder(x)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/vae.py", line 165, in forward
    sample = down_block(sample)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/unet_2d_blocks.py", line 1323, in forward
    hidden_states = resnet(hidden_states, temb=None, scale=scale)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/resnet.py", line 693, in forward
    hidden_states = self.nonlinearity(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py", line 393, in forward
    return F.silu(input, inplace=self.inplace)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py", line 2072, in silu
    return torch._C._nn.silu(input)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 GiB. GPU 0 has a total capacty of 79.15 GiB of which 11.80 GiB is free. Process 1506108 has 67.34 GiB memory in use. Of the allocated memory 66.60 GiB is allocated by PyTorch, and 120.58 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
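(The allocation that fails here is the VAE forward pass inside compute_vae_encodings, reached from the train_dataset.map(...) call at line 893: one map batch is pushed through vae.encode() in a single call and asks for 32 GiB of activations. Possible mitigations, none of which are taken from the script itself: pass a smaller batch_size to that .map() call, run the encode in smaller gradient-free chunks if it is not already, and/or set the allocator option the error message mentions. A standalone sketch under those assumptions; the model id, chunk size, and image shapes are illustrative only:

    import os
    # Allocator hint from the OOM message; must be set before CUDA is first initialized.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:512")

    import torch
    from diffusers import AutoencoderKL

    # Load an SDXL-style VAE in fp16 (illustrative model id, not taken from the log).
    vae = AutoencoderKL.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0", subfolder="vae", torch_dtype=torch.float16
    ).to("cuda")
    vae.requires_grad_(False)

    def encode_in_chunks(pixel_values, chunk_size=8):
        # Encode a large image tensor through the VAE in small, gradient-free chunks,
        # so no single vae.encode() call needs a 32 GiB activation buffer.
        latents = []
        with torch.no_grad():
            for chunk in pixel_values.split(chunk_size):
                chunk = chunk.to(vae.device, dtype=vae.dtype)
                latents.append(vae.encode(chunk).latent_dist.sample().cpu())
        return torch.cat(latents)

    # Example: a fake batch of 64 RGB images at 1024x1024, roughly one map() batch.
    images = torch.rand(64, 3, 1024, 1024)
    latents = encode_in_chunks(images)

Lowering the batch_size argument of the .map() call that wraps compute_vae_encodings has the same effect as the chunking above, since it directly shrinks the tensor handed to vae.encode().)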
[2023-11-16 07:06:54,116] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 7560 closing signal SIGTERM
[2023-11-16 07:06:54,117] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 7561 closing signal SIGTERM
[2023-11-16 07:06:54,118] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 7562 closing signal SIGTERM
[2023-11-16 07:06:54,119] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 7563 closing signal SIGTERM
[2023-11-16 07:06:54,119] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 7564 closing signal SIGTERM
[2023-11-16 07:06:54,120] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 7565 closing signal SIGTERM
[2023-11-16 07:06:54,121] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 7566 closing signal SIGTERM
[2023-11-16 07:06:56,105] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 7559) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 985, in launch_command
    multi_gpu_launcher(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 654, in multi_gpu_launcher
    distrib_run.run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train_text_to_image_sdxl_timeout_increased.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-11-16_07:06:54
  host      : 5fc1ae3a0e3c
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 7559)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================