kopyl

Untitled

Sep 25th, 2024
The following values were not passed to `accelerate launch` and had defaults used instead:
`--num_cpu_threads_per_process` was set to `40` to improve out-of-box performance when training on CPUs
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
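If the default is not wanted, the value can be passed explicitly on the command line instead of relying on the fallback, or `accelerate config` can be run once to persist a choice. A minimal sketch of the explicit form, with the thread count picked arbitrarily for illustration and the remaining arguments elided:

    accelerate launch --num_cpu_threads_per_process 8 examples/dreambooth/train_dreambooth_flux.py ...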
[W925 12:22:21.945906518 Utils.hpp:164] Warning: Environment variable NCCL_BLOCKING_WAIT is deprecated; use TORCH_NCCL_BLOCKING_WAIT instead (function operator())
[W925 12:22:21.945939778 Utils.hpp:135] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function operator())
[W925 12:22:21.963256156 Utils.hpp:164] Warning: Environment variable NCCL_BLOCKING_WAIT is deprecated; use TORCH_NCCL_BLOCKING_WAIT instead (function operator())
[W925 12:22:21.963280176 Utils.hpp:135] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function operator())
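These two warnings are harmless: the old `NCCL_BLOCKING_WAIT` / `NCCL_ASYNC_ERROR_HANDLING` names are still honored, they are just deprecated in favor of the `TORCH_`-prefixed variants. If the old names come from your own environment rather than from the launcher, renaming them silences the warning; a minimal sketch, assuming it runs at the very top of the training script before the `Accelerator` is created:

    import os

    # Hedged sketch: mirror the deprecated NCCL_* variables onto their TORCH_NCCL_*
    # replacements before torch.distributed initializes the NCCL process group.
    for old, new in [("NCCL_BLOCKING_WAIT", "TORCH_NCCL_BLOCKING_WAIT"),
                     ("NCCL_ASYNC_ERROR_HANDLING", "TORCH_NCCL_ASYNC_ERROR_HANDLING")]:
        if old in os.environ and new not in os.environ:
            os.environ[new] = os.environ.pop(old)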
09/25/2024 12:22:21 - INFO - __main__ - Distributed environment: FSDP Backend: nccl
Num processes: 2
Process index: 0
Local process index: 0
Device: cuda:0

Mixed precision type: bf16

You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
You are using a model of type t5 to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
09/25/2024 12:22:21 - INFO - __main__ - Distributed environment: FSDP Backend: nccl
Num processes: 2
Process index: 1
Local process index: 1
Device: cuda:1

Mixed precision type: bf16

Downloading shards: 100%|██████████████████████| 2/2 [00:00<00:00, 14899.84it/s]
Downloading shards: 100%|███████████████████████| 2/2 [00:00<00:00, 6615.62it/s]
Loading checkpoint shards: 100%|██████████████████| 2/2 [00:00<00:00, 7.74it/s]
Fetching 3 files: 100%|████████████████████████| 3/3 [00:00<00:00, 29888.15it/s]
Loading checkpoint shards: 100%|██████████████████| 2/2 [00:01<00:00, 1.33it/s]
Fetching 3 files: 100%|████████████████████████| 3/3 [00:00<00:00, 10098.65it/s]
{'axes_dims_rope'} was not found in config. Values will be initialized to default values.
Using decoupled weight decay
Using decoupled weight decay
x2-h100:118401:118401 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ^docker0,lo
x2-h100:118401:118401 [0] NCCL INFO Bootstrap : Using eth0:10.0.0.16<0>
x2-h100:118401:118401 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
x2-h100:118401:118401 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.20.5+cuda12.4
x2-h100:118402:118402 [1] NCCL INFO cudaDriverVersion 12020
x2-h100:118402:118402 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ^docker0,lo
x2-h100:118402:118402 [1] NCCL INFO Bootstrap : Using eth0:10.0.0.16<0>
x2-h100:118402:118402 [1] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
x2-h100:118401:119172 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
x2-h100:118401:119172 [0] NCCL INFO Failed to open libibverbs.so[.1]
x2-h100:118401:119172 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ^docker0,lo
x2-h100:118401:119172 [0] NCCL INFO NET/Socket : Using [0]eth0:10.0.0.16<0>
x2-h100:118402:119173 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
x2-h100:118401:119172 [0] NCCL INFO Using non-device net plugin version 0
x2-h100:118401:119172 [0] NCCL INFO Using network Socket
x2-h100:118402:119173 [1] NCCL INFO Failed to open libibverbs.so[.1]
x2-h100:118402:119173 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ^docker0,lo
x2-h100:118402:119173 [1] NCCL INFO NET/Socket : Using [0]eth0:10.0.0.16<0>
x2-h100:118402:119173 [1] NCCL INFO Using non-device net plugin version 0
x2-h100:118402:119173 [1] NCCL INFO Using network Socket
x2-h100:118402:119173 [1] NCCL INFO comm 0x160e5020 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 200000 commId 0x793d473d6144a18a - Init START
x2-h100:118401:119172 [0] NCCL INFO comm 0x24e3ff20 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 100000 commId 0x793d473d6144a18a - Init START
x2-h100:118401:119172 [0] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
x2-h100:118401:119172 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffffffff
x2-h100:118402:119173 [1] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
x2-h100:118402:119173 [1] NCCL INFO Setting affinity for GPU 1 to ffff,ffffff00,00000000
x2-h100:118402:119173 [1] NCCL INFO comm 0x160e5020 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
x2-h100:118402:119173 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1 [2] -1/-1/-1->1->0 [3] 0/-1/-1->1->-1
x2-h100:118402:119173 [1] NCCL INFO P2P Chunksize set to 131072
x2-h100:118401:119172 [0] NCCL INFO comm 0x24e3ff20 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
x2-h100:118401:119172 [0] NCCL INFO Channel 00/04 : 0 1
x2-h100:118401:119172 [0] NCCL INFO Channel 01/04 : 0 1
x2-h100:118401:119172 [0] NCCL INFO Channel 02/04 : 0 1
x2-h100:118401:119172 [0] NCCL INFO Channel 03/04 : 0 1
x2-h100:118401:119172 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->-1 [3] -1/-1/-1->0->1
x2-h100:118401:119172 [0] NCCL INFO P2P Chunksize set to 131072
x2-h100:118402:119173 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct
x2-h100:118402:119173 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct
x2-h100:118402:119173 [1] NCCL INFO Channel 02 : 1[1] -> 0[0] via SHM/direct/direct
x2-h100:118402:119173 [1] NCCL INFO Channel 03 : 1[1] -> 0[0] via SHM/direct/direct
x2-h100:118401:119172 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
x2-h100:118401:119172 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
x2-h100:118401:119172 [0] NCCL INFO Channel 02 : 0[0] -> 1[1] via SHM/direct/direct
x2-h100:118401:119172 [0] NCCL INFO Channel 03 : 0[0] -> 1[1] via SHM/direct/direct
x2-h100:118401:119172 [0] NCCL INFO Connected all rings
x2-h100:118401:119172 [0] NCCL INFO Connected all trees
x2-h100:118402:119173 [1] NCCL INFO Connected all rings
x2-h100:118402:119173 [1] NCCL INFO Connected all trees
x2-h100:118402:119173 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
x2-h100:118402:119173 [1] NCCL INFO 4 coll channels, 0 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
x2-h100:118401:119172 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
x2-h100:118401:119172 [0] NCCL INFO 4 coll channels, 0 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
x2-h100:118402:119173 [1] NCCL INFO comm 0x160e5020 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 200000 commId 0x793d473d6144a18a - Init COMPLETE
x2-h100:118401:119172 [0] NCCL INFO comm 0x24e3ff20 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 100000 commId 0x793d473d6144a18a - Init COMPLETE
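A side note on the NCCL init block above: no net plugin (`libnccl-net.so`) or `libibverbs` is found, so NCCL falls back to the plain Socket transport over eth0, and `NCCL_P2P_LEVEL` set to LOC disables GPU peer-to-peer, which is why every channel is routed via SHM/direct/direct. This is unrelated to the failure further down, but it explains the transport choices. A small sketch for checking whether direct peer access between the two devices is possible at all (assumes both GPUs are visible to the process):

    import torch

    # Hedged sketch: reports whether CUDA peer access between GPU 0 and GPU 1 is possible.
    # Even when this prints True, NCCL will not use P2P while NCCL_P2P_LEVEL=LOC is in effect.
    print(torch.cuda.can_device_access_peer(0, 1))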
09/25/2024 12:22:40 - INFO - __main__ - ***** Running training *****
09/25/2024 12:22:40 - INFO - __main__ - Num examples = 10
09/25/2024 12:22:40 - INFO - __main__ - Num batches each epoch = 5
09/25/2024 12:22:40 - INFO - __main__ - Num Epochs = 1
09/25/2024 12:22:40 - INFO - __main__ - Instantaneous batch size per device = 1
09/25/2024 12:22:40 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 8
09/25/2024 12:22:40 - INFO - __main__ - Gradient Accumulation steps = 4
09/25/2024 12:22:40 - INFO - __main__ - Total optimization steps = 2
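For reference, the totals above follow from the smaller numbers: total train batch size = 1 (per device) × 2 (processes) × 4 (gradient accumulation steps) = 8, and total optimization steps = ceil(5 batches per epoch / 4 accumulation steps) × 1 epoch = 2.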
Steps: 0%| | 0/2 [00:00<?, ?it/s]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
/usr/local/lib/python3.8/dist-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context: # type: ignore[attr-defined]
Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
/usr/local/lib/python3.8/dist-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context: # type: ignore[attr-defined]
Steps: 0%| | 0/2 [00:54<?, ?it/s, loss=0.4, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps: 0%| | 0/2 [01:00<?, ?it/s, loss=0.416, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps: 0%| | 0/2 [01:06<?, ?it/s, loss=0.327, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
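The repeated `txt_ids` deprecation comes from the diffusers Flux transformer, which now expects `txt_ids` without a batch dimension. A minimal sketch of the adjustment on the caller's side, assuming the tensor currently has shape (batch, seq_len, 3) and using `text_ids` as a hypothetical variable name:

    # Hedged sketch: the deprecation asks for a 2d tensor, so drop the batch dimension
    # before handing txt_ids to the transformer call.
    if text_ids.ndim == 3:
        text_ids = text_ids[0]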
Steps: 50%|██████████ | 1/2 [01:13<01:13, 73.36s/it, loss=0.327, lr=1]09/25/2024 12:23:53 - INFO - accelerate.accelerator - Saving current state to /flux-dreambooth-outputs/dreamboot-yaremovaa/checkpoint-1
09/25/2024 12:23:53 - INFO - accelerate.accelerator - Saving FSDP model
/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:689: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html .
warnings.warn(
/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_state_dict_utils.py:737: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
local_shape = tensor.shape
/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_state_dict_utils.py:749: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
tensor.shape,
/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_state_dict_utils.py:751: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
tensor.dtype,
/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_state_dict_utils.py:752: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
tensor.device,
09/25/2024 12:24:00 - INFO - accelerate.utils.fsdp_utils - Saving model to /flux-dreambooth-outputs/dreamboot-yaremovaa/checkpoint-1/pytorch_model_fsdp_0
/usr/local/lib/python3.8/dist-packages/accelerate/utils/fsdp_utils.py:107: FutureWarning: `save_state_dict` is deprecated and will be removed in future versions.Please use `save` instead.
dist_cp.save_state_dict(
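All of the FutureWarnings in this block are raised inside torch and accelerate themselves (`FSDP.state_dict_type`, `ShardedTensor`, `dist_cp.save_state_dict`), not by the training script, so short of upgrading those packages there is nothing to change in user code. For code that saves an FSDP state dict directly, the replacement the warnings point to is the `torch.distributed.checkpoint` API; a minimal sketch, assuming `model` and `optimizer` are the FSDP-wrapped module and its optimizer:

    import torch.distributed.checkpoint as dist_cp
    from torch.distributed.checkpoint.state_dict import get_state_dict

    # Hedged sketch of the newer checkpoint APIs the warnings point to; this is not
    # what accelerate does internally in save_fsdp_model.
    model_sd, optim_sd = get_state_dict(model, optimizer)
    dist_cp.save({"model": model_sd, "optimizer": optim_sd}, checkpoint_id="checkpoint-1")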
[rank1]:[E925 13:24:00.349880861 ProcessGroupNCCL.cpp:607] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1248, OpType=_ALLGATHER_BASE, NumelIn=70795904, NumelOut=141591808, Timeout(ms)=3600000) ran for 3600001 milliseconds before timing out.
[rank1]:[E925 13:24:00.350172793 ProcessGroupNCCL.cpp:670] [Rank 1] Work WorkNCCL(SeqNum=1248, OpType=_ALLGATHER_BASE, NumelIn=70795904, NumelOut=141591808, Timeout(ms)=3600000) timed out in blocking wait (TORCH_NCCL_BLOCKING_WAIT=1).
[rank0]:[E925 13:24:00.531450681 ProcessGroupNCCL.cpp:607] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1191, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=3600000) ran for 3600009 milliseconds before timing out.
[rank0]:[E925 13:24:00.531679792 ProcessGroupNCCL.cpp:670] [Rank 0] Work WorkNCCL(SeqNum=1191, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=3600000) timed out in blocking wait (TORCH_NCCL_BLOCKING_WAIT=1).
x2-h100:118401:119200 [0] NCCL INFO [Service thread] Connection closed by localRank 0
x2-h100:118401:118401 [0] NCCL INFO comm 0x24e3ff20 rank 0 nranks 2 cudaDev 0 busId 100000 - Abort COMPLETE
[rank0]:[E925 13:24:01.314158835 ProcessGroupNCCL.cpp:621] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E925 13:24:01.314175675 ProcessGroupNCCL.cpp:627] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E925 13:24:01.315466644 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 0] Exception (either an error or timeout) detected by watchdog at work: 1191, last enqueued NCCL work: 1191, last completed NCCL work: 1190.
[rank0]:[E925 13:24:01.315500284 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 0] Timeout at NCCL work: 1191, last enqueued NCCL work: 1191, last completed NCCL work: 1190.
[rank0]:[E925 13:24:01.315507284 ProcessGroupNCCL.cpp:621] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]: Traceback (most recent call last):
[rank0]:   File "examples/dreambooth/train_dreambooth_flux.py", line 1795, in <module>
[rank0]:     main(args)
[rank0]:   File "examples/dreambooth/train_dreambooth_flux.py", line 1684, in main
[rank0]:     accelerator.save_state(save_path)
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 2991, in save_state
[rank0]:     save_fsdp_model(self.state.fsdp_plugin, self, model, output_dir, i)
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/accelerate/utils/fsdp_utils.py", line 107, in save_fsdp_model
[rank0]:     dist_cp.save_state_dict(
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/typing_extensions.py", line 2853, in wrapper
[rank0]:     return arg(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/checkpoint/state_dict_saver.py", line 47, in save_state_dict
[rank0]:     return _save_state_dict(
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/checkpoint/state_dict_saver.py", line 316, in _save_state_dict
[rank0]:     central_plan: SavePlan = distW.reduce_scatter("plan", local_step, global_step)
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/checkpoint/utils.py", line 169, in reduce_scatter
[rank0]:     all_data = self.gather_object(local_data)
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/checkpoint/utils.py", line 108, in gather_object
[rank0]:     dist.gather_object(
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 2608, in gather_object
[rank0]:     all_gather(object_size_list, local_size, group=group)
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 3113, in all_gather
[rank0]:     work.wait()
[rank0]: torch.distributed.DistBackendError: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1191, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=3600000) ran for 3600009 milliseconds before timing out.
Steps: 50%|███████ | 1/2 [1:01:21<1:01:21, 3681.47s/it, loss=0.327, lr=1]
x2-h100:118402:119198 [1] NCCL INFO [Service thread] Connection closed by localRank 1
x2-h100:118402:121387 [1] NCCL INFO comm 0x160e5020 rank 1 nranks 2 cudaDev 1 busId 200000 - Abort COMPLETE
[rank1]:[E925 13:24:02.286785745 ProcessGroupNCCL.cpp:621] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E925 13:24:02.286809745 ProcessGroupNCCL.cpp:627] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E925 13:24:02.288169044 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 1248, last enqueued NCCL work: 1248, last completed NCCL work: 1247.
[rank1]:[E925 13:24:02.288200024 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 1] Timeout at NCCL work: 1248, last enqueued NCCL work: 1248, last completed NCCL work: 1247.
[rank1]:[E925 13:24:02.288208014 ProcessGroupNCCL.cpp:621] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]: Traceback (most recent call last):
[rank1]:   File "examples/dreambooth/train_dreambooth_flux.py", line 1795, in <module>
[rank1]:     main(args)
[rank1]:   File "examples/dreambooth/train_dreambooth_flux.py", line 1643, in main
[rank1]:     accelerator.backward(loss)
[rank1]:   File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 2196, in backward
[rank1]:     loss.backward(**kwargs)
[rank1]:   File "/usr/local/lib/python3.8/dist-packages/torch/_tensor.py", line 521, in backward
[rank1]:     torch.autograd.backward(
[rank1]:   File "/usr/local/lib/python3.8/dist-packages/torch/autograd/__init__.py", line 289, in backward
[rank1]:     _engine_run_backward(
[rank1]:   File "/usr/local/lib/python3.8/dist-packages/torch/autograd/graph.py", line 769, in _engine_run_backward
[rank1]:     return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank1]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_runtime_utils.py", line 666, in _pre_backward_hook
[rank1]:     _unshard(
[rank1]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_runtime_utils.py", line 300, in _unshard
[rank1]:     handle.unshard()
[rank1]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_flat_param.py", line 1311, in unshard
[rank1]:     padded_unsharded_flat_param = self._all_gather_flat_param(unsharded_flat_param)
[rank1]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_flat_param.py", line 1402, in _all_gather_flat_param
[rank1]:     dist.all_gather_into_tensor(
[rank1]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
[rank1]:     return func(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 3205, in all_gather_into_tensor
[rank1]:     work.wait()
[rank1]: torch.distributed.DistBackendError: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1248, OpType=_ALLGATHER_BASE, NumelIn=70795904, NumelOut=141591808, Timeout(ms)=3600000) ran for 3600001 milliseconds before timing out.
W0925 13:24:22.405389 140053788743488 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 118402 closing signal SIGTERM
E0925 13:24:31.946576 140053788743488 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 118401) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 1161, in launch_command
    multi_gpu_launcher(args)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 799, in multi_gpu_launcher
    distrib_run.run(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
examples/dreambooth/train_dreambooth_flux.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-09-25_13:24:22
host : x2-h100.internal.cloudapp.net
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 118401)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
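The failure itself is the pair of watchdog timeouts above, and the two tracebacks show the ranks stuck in different places: rank 0 is blocked in the small ALLGATHER that `accelerator.save_state` issues while planning the checkpoint (`gather_object` inside `dist_cp.save_state_dict`), while rank 1 is still blocked in the large `_ALLGATHER_BASE` of an FSDP unshard inside `accelerator.backward`. The mismatched sequence numbers (1191 vs 1248) suggest the two ranks diverged before the save, so each waits on a collective the other never issues until the 3600000 ms NCCL timeout expires and both processes are torn down. Raising the timeout only helps if the collective is genuinely slow rather than mismatched, but it is the usual first knob; a minimal sketch of widening it through accelerate, with the 2-hour value chosen arbitrarily for illustration:

    from datetime import timedelta
    from accelerate import Accelerator, InitProcessGroupKwargs

    # Hedged sketch: widen the NCCL collective timeout (the run above hit the 3600000 ms limit).
    # This does not fix a rank desynchronization like the one the mismatched SeqNums suggest.
    pg_kwargs = InitProcessGroupKwargs(timeout=timedelta(hours=2))
    accelerator = Accelerator(kwargs_handlers=[pg_kwargs])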