The following values were not passed to `accelerate launch` and had defaults used instead:
        `--num_processes` was set to a value of `8`
                More than one GPU was found, enabling multi-GPU training.
                If this was unintended please pass in `--num_processes=1`.
        `--num_machines` was set to a value of `1`
        `--mixed_precision` was set to a value of `'no'`
        `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
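(If those defaults are not what the run actually needs, the launch can be made explicit. The flags below are just the ones named in the warning, with fp16 matching the mixed-precision type the processes report further down; the remaining training arguments after the script name are whatever this run actually used and are left elided:

    accelerate launch \
        --num_processes=8 \
        --num_machines=1 \
        --mixed_precision=fp16 \
        --dynamo_backend=no \
        train_text_to_image_sdxl_timeout_increased.py ...

Alternatively, running `accelerate config` once stores these choices and silences the warning.)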
11/16/2023 07:06:26 - INFO - __main__ - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 8
Process index: 7
Local process index: 7
Device: cuda:7
Mixed precision type: fp16
11/16/2023 07:06:26 - INFO - __main__ - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 8
Process index: 1
Local process index: 1
Device: cuda:1
Mixed precision type: fp16
11/16/2023 07:06:26 - INFO - __main__ - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 8
Process index: 6
Local process index: 6
Device: cuda:6
Mixed precision type: fp16
11/16/2023 07:06:26 - INFO - __main__ - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 8
Process index: 5
Local process index: 5
Device: cuda:5
Mixed precision type: fp16
11/16/2023 07:06:26 - INFO - __main__ - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 8
Process index: 0
Local process index: 0
Device: cuda:0
Mixed precision type: fp16
11/16/2023 07:06:26 - INFO - __main__ - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 8
Process index: 3
Local process index: 3
Device: cuda:3
Mixed precision type: fp16
11/16/2023 07:06:26 - INFO - __main__ - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 8
Process index: 4
Local process index: 4
Device: cuda:4
Mixed precision type: fp16
11/16/2023 07:06:26 - INFO - __main__ - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 8
Process index: 2
Local process index: 2
Device: cuda:2
Mixed precision type: fp16
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
{'clip_sample_range', 'variance_type', 'dynamic_thresholding_ratio', 'thresholding'} was not found in config. Values will be initialized to default values.
{'attention_type', 'reverse_transformer_layers_per_block', 'dropout'} was not found in config. Values will be initialized to default values.
/usr/local/lib/python3.10/dist-packages/datasets/table.py:1421: FutureWarning: promote has been superseded by mode='default'.
  table = cls._concat_blocks(blocks, axis=0)
/usr/local/lib/python3.10/dist-packages/datasets/table.py:1421: FutureWarning: promote has been superseded by mode='default'.
  table = cls._concat_blocks(blocks, axis=0)
/usr/local/lib/python3.10/dist-packages/datasets/table.py:1421: FutureWarning: promote has been superseded by mode='default'.
  table = cls._concat_blocks(blocks, axis=0)
/usr/local/lib/python3.10/dist-packages/datasets/table.py:1421: FutureWarning: promote has been superseded by mode='default'.
  table = cls._concat_blocks(blocks, axis=0)
/usr/local/lib/python3.10/dist-packages/datasets/table.py:1421: FutureWarning: promote has been superseded by mode='default'.
  table = cls._concat_blocks(blocks, axis=0)
/usr/local/lib/python3.10/dist-packages/datasets/table.py:1421: FutureWarning: promote has been superseded by mode='default'.
  table = cls._concat_blocks(blocks, axis=0)
/usr/local/lib/python3.10/dist-packages/datasets/table.py:1421: FutureWarning: promote has been superseded by mode='default'.
  table = cls._concat_blocks(blocks, axis=0)
/usr/local/lib/python3.10/dist-packages/datasets/table.py:1421: FutureWarning: promote has been superseded by mode='default'.
  table = cls._concat_blocks(blocks, axis=0)
Half mapped dataset was not saved Object of type function is not JSON serializable
The format kwargs must be JSON serializable, but key 'transform' isn't.
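(The two lines above typically mean the dataset had an on-the-fly transform attached via `set_transform`/`with_transform`; a format transform is a plain Python function, which cannot be JSON-serialized, so the partially mapped dataset could not be written to disk. This usually only affects caching of the preprocessed data, not the transform itself. A minimal sketch of the difference, using a hypothetical `add_length` function that is not part of the training script:

    from datasets import Dataset

    ds = Dataset.from_dict({"text": ["a", "bb"]})

    def add_length(batch):
        # hypothetical preprocessing step, not the script's actual code
        batch["length"] = [len(t) for t in batch["text"]]
        return batch

    # .map materializes its output, so the result can be cached and saved to disk
    ds_mapped = ds.map(add_length, batched=True)

    # set_transform attaches the function as a lazy format; the function ends up
    # in the dataset's format kwargs, which are not JSON serializable -- hence
    # the "key 'transform' isn't" message when the script tries to save state.
    ds.set_transform(add_length)
)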
Map: 0%| | 0/74568 [00:06<?, ? examples/s]
Traceback (most recent call last):
  File "/workspace/train_text_to_image_sdxl_timeout_increased.py", line 1278, in <module>
    main(args)
  File "/workspace/train_text_to_image_sdxl_timeout_increased.py", line 893, in main
    train_dataset = train_dataset.map(
  File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 592, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 557, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 3097, in map
    for rank, done, content in Dataset._map_single(**dataset_kwargs):
  File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 3474, in _map_single
    batch = apply_function_on_filtered_inputs(
  File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 3353, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "/workspace/train_text_to_image_sdxl_timeout_increased.py", line 528, in compute_vae_encodings
    model_input = vae.encode(pixel_values).latent_dist.sample()
  File "/usr/local/lib/python3.10/dist-packages/diffusers/utils/accelerate_utils.py", line 46, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/autoencoder_kl.py", line 274, in encode
    h = self.encoder(x)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/vae.py", line 165, in forward
    sample = down_block(sample)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/unet_2d_blocks.py", line 1323, in forward
    hidden_states = resnet(hidden_states, temb=None, scale=scale)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/resnet.py", line 693, in forward
    hidden_states = self.nonlinearity(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py", line 393, in forward
    return F.silu(input, inplace=self.inplace)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py", line 2072, in silu
    return torch._C._nn.silu(input)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 GiB. GPU 0 has a total capacty of 79.15 GiB of which 11.80 GiB is free. Process 1506108 has 67.34 GiB memory in use. Of the allocated memory 66.60 GiB is allocated by PyTorch, and 120.58 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
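(The allocation that fails here is the VAE forward pass inside compute_vae_encodings, reached from the train_dataset.map(...) call at line 893: one map batch is pushed through vae.encode() in a single call and asks for 32 GiB of activations. Possible mitigations, none of which are taken from the script itself: pass a smaller batch_size to that .map() call, run the encode in smaller gradient-free chunks if it is not already, and/or set the allocator option the error message mentions. A standalone sketch under those assumptions; the model id, chunk size, and image shapes are illustrative only:

    import os
    # Allocator hint from the OOM message; must be set before CUDA is first initialized.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:512")

    import torch
    from diffusers import AutoencoderKL

    # Load an SDXL-style VAE in fp16 (illustrative model id, not taken from the log).
    vae = AutoencoderKL.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0", subfolder="vae", torch_dtype=torch.float16
    ).to("cuda")
    vae.requires_grad_(False)

    def encode_in_chunks(pixel_values, chunk_size=8):
        # Encode a large image tensor through the VAE in small, gradient-free chunks,
        # so no single vae.encode() call needs a 32 GiB activation buffer.
        latents = []
        with torch.no_grad():
            for chunk in pixel_values.split(chunk_size):
                chunk = chunk.to(vae.device, dtype=vae.dtype)
                latents.append(vae.encode(chunk).latent_dist.sample().cpu())
        return torch.cat(latents)

    # Example: a fake batch of 64 RGB images at 1024x1024, roughly one map() batch.
    images = torch.rand(64, 3, 1024, 1024)
    latents = encode_in_chunks(images)

Lowering the batch_size argument of the .map() call that wraps compute_vae_encodings has the same effect as the chunking above, since it directly shrinks the tensor handed to vae.encode().)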
[2023-11-16 07:06:54,116] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 7560 closing signal SIGTERM
[2023-11-16 07:06:54,117] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 7561 closing signal SIGTERM
[2023-11-16 07:06:54,118] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 7562 closing signal SIGTERM
[2023-11-16 07:06:54,119] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 7563 closing signal SIGTERM
[2023-11-16 07:06:54,119] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 7564 closing signal SIGTERM
[2023-11-16 07:06:54,120] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 7565 closing signal SIGTERM
[2023-11-16 07:06:54,121] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 7566 closing signal SIGTERM
[2023-11-16 07:06:56,105] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 7559) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 985, in launch_command
    multi_gpu_launcher(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 654, in multi_gpu_launcher
    distrib_run.run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train_text_to_image_sdxl_timeout_increased.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-11-16_07:06:54
  host      : 5fc1ae3a0e3c
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 7559)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================