- The following values were not passed to `accelerate launch` and had defaults used instead:
- `--num_cpu_threads_per_process` was set to `40` to improve out-of-box performance when training on CPUs
- To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
- [W925 12:22:21.945906518 Utils.hpp:164] Warning: Environment variable NCCL_BLOCKING_WAIT is deprecated; use TORCH_NCCL_BLOCKING_WAIT instead (function operator())
- [W925 12:22:21.945939778 Utils.hpp:135] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function operator())
- [W925 12:22:21.963256156 Utils.hpp:164] Warning: Environment variable NCCL_BLOCKING_WAIT is deprecated; use TORCH_NCCL_BLOCKING_WAIT instead (function operator())
- [W925 12:22:21.963280176 Utils.hpp:135] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function operator())
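The four warnings above come from the job's environment exporting the old NCCL_BLOCKING_WAIT / NCCL_ASYNC_ERROR_HANDLING names; recent PyTorch reads the TORCH_NCCL_* variables instead. A minimal sketch of the rename, assuming it is done at the very top of the training script (it can equally be done in the shell that runs `accelerate launch`), before any process group is created:

    import os

    # Old names still work but warn; the TORCH_NCCL_* names are the supported ones.
    # "1" for blocking wait mirrors what the later timeout messages report
    # (TORCH_NCCL_BLOCKING_WAIT=1); adjust values to your setup.
    os.environ.setdefault("TORCH_NCCL_BLOCKING_WAIT", "1")
    os.environ.setdefault("TORCH_NCCL_ASYNC_ERROR_HANDLING", "1")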
- 09/25/2024 12:22:21 - INFO - __main__ - Distributed environment: FSDP Backend: nccl
- Num processes: 2
- Process index: 0
- Local process index: 0
- Device: cuda:0
- Mixed precision type: bf16
- You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
- You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
- You are using a model of type t5 to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
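The two "model of type ... to instantiate a model of type ." lines are a benign transformers notice: the script probes each text encoder's config through the generic PretrainedConfig base class (whose own model_type is empty, hence the dangling "type ."), then instantiates the concrete CLIP and T5 classes. A hedged sketch of the pattern that produces the notice; the model id is an assumption, it is not shown in the log:

    from transformers import PretrainedConfig

    model_id = "black-forest-labs/FLUX.1-dev"  # assumption; not visible in the log
    # Probing the subfolder config with the generic base class triggers the notice;
    # the architecture name it returns is what the script uses to pick the real class.
    cfg = PretrainedConfig.from_pretrained(model_id, subfolder="text_encoder")
    print(cfg.architectures)  # e.g. ['CLIPTextModel']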
- 09/25/2024 12:22:21 - INFO - __main__ - Distributed environment: FSDP Backend: nccl
- Num processes: 2
- Process index: 1
- Local process index: 1
- Device: cuda:1
- Mixed precision type: bf16
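Both ranks report an FSDP distributed environment with bf16 mixed precision. In Accelerate this normally comes from `accelerate config` / the launch config rather than from code; a minimal in-code sketch that matches the values reported above (the FSDP wrapping details themselves live in the config file, which the log does not show):

    from accelerate import Accelerator

    # bf16 and the gradient-accumulation factor match the values logged in this run;
    # the FSDP plugin configuration is supplied by the accelerate config file.
    accelerator = Accelerator(mixed_precision="bf16", gradient_accumulation_steps=4)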
- Downloading shards: 100%|██████████████████████| 2/2 [00:00<00:00, 14899.84it/s]
- Downloading shards: 100%|███████████████████████| 2/2 [00:00<00:00, 6615.62it/s]
- Loading checkpoint shards: 100%|██████████████████| 2/2 [00:00<00:00, 7.74it/s]
- Fetching 3 files: 100%|████████████████████████| 3/3 [00:00<00:00, 29888.15it/s]
- Loading checkpoint shards: 100%|██████████████████| 2/2 [00:01<00:00, 1.33it/s]
- Fetching 3 files: 100%|████████████████████████| 3/3 [00:00<00:00, 10098.65it/s]
- {'axes_dims_rope'} was not found in config. Values will be initialized to default values.
- Using decoupled weight decay
- Using decoupled weight decay
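"Using decoupled weight decay" (printed once per rank) is typically emitted by the optimizer, and together with the lr=1 shown later in the progress bar it points at the script's Prodigy option rather than plain AdamW. A hypothetical instantiation, assuming `--optimizer prodigy` and the prodigyopt package; the parameter source `transformer` and the weight_decay value are illustrative:

    from prodigyopt import Prodigy  # assumption: Prodigy is the optimizer in use

    # Prodigy is normally run with lr=1.0 and adapts the effective step size itself,
    # which matches the lr=1 in the progress bar below; decouple=True applies
    # AdamW-style (decoupled) weight decay.
    optimizer = Prodigy(transformer.parameters(), lr=1.0, weight_decay=1e-2, decouple=True)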
- x2-h100:118401:118401 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ^docker0,lo
- x2-h100:118401:118401 [0] NCCL INFO Bootstrap : Using eth0:10.0.0.16<0>
- x2-h100:118401:118401 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
- x2-h100:118401:118401 [0] NCCL INFO cudaDriverVersion 12020
- NCCL version 2.20.5+cuda12.4
- x2-h100:118402:118402 [1] NCCL INFO cudaDriverVersion 12020
- x2-h100:118402:118402 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ^docker0,lo
- x2-h100:118402:118402 [1] NCCL INFO Bootstrap : Using eth0:10.0.0.16<0>
- x2-h100:118402:118402 [1] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
- x2-h100:118401:119172 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
- x2-h100:118401:119172 [0] NCCL INFO Failed to open libibverbs.so[.1]
- x2-h100:118401:119172 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ^docker0,lo
- x2-h100:118401:119172 [0] NCCL INFO NET/Socket : Using [0]eth0:10.0.0.16<0>
- x2-h100:118402:119173 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
- x2-h100:118401:119172 [0] NCCL INFO Using non-device net plugin version 0
- x2-h100:118401:119172 [0] NCCL INFO Using network Socket
- x2-h100:118402:119173 [1] NCCL INFO Failed to open libibverbs.so[.1]
- x2-h100:118402:119173 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ^docker0,lo
- x2-h100:118402:119173 [1] NCCL INFO NET/Socket : Using [0]eth0:10.0.0.16<0>
- x2-h100:118402:119173 [1] NCCL INFO Using non-device net plugin version 0
- x2-h100:118402:119173 [1] NCCL INFO Using network Socket
- x2-h100:118402:119173 [1] NCCL INFO comm 0x160e5020 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 200000 commId 0x793d473d6144a18a - Init START
- x2-h100:118401:119172 [0] NCCL INFO comm 0x24e3ff20 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 100000 commId 0x793d473d6144a18a - Init START
- x2-h100:118401:119172 [0] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
- x2-h100:118401:119172 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffffffff
- x2-h100:118402:119173 [1] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
- x2-h100:118402:119173 [1] NCCL INFO Setting affinity for GPU 1 to ffff,ffffff00,00000000
- x2-h100:118402:119173 [1] NCCL INFO comm 0x160e5020 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
- x2-h100:118402:119173 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1 [2] -1/-1/-1->1->0 [3] 0/-1/-1->1->-1
- x2-h100:118402:119173 [1] NCCL INFO P2P Chunksize set to 131072
- x2-h100:118401:119172 [0] NCCL INFO comm 0x24e3ff20 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
- x2-h100:118401:119172 [0] NCCL INFO Channel 00/04 : 0 1
- x2-h100:118401:119172 [0] NCCL INFO Channel 01/04 : 0 1
- x2-h100:118401:119172 [0] NCCL INFO Channel 02/04 : 0 1
- x2-h100:118401:119172 [0] NCCL INFO Channel 03/04 : 0 1
- x2-h100:118401:119172 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->-1 [3] -1/-1/-1->0->1
- x2-h100:118401:119172 [0] NCCL INFO P2P Chunksize set to 131072
- x2-h100:118402:119173 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct
- x2-h100:118402:119173 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct
- x2-h100:118402:119173 [1] NCCL INFO Channel 02 : 1[1] -> 0[0] via SHM/direct/direct
- x2-h100:118402:119173 [1] NCCL INFO Channel 03 : 1[1] -> 0[0] via SHM/direct/direct
- x2-h100:118401:119172 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
- x2-h100:118401:119172 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
- x2-h100:118401:119172 [0] NCCL INFO Channel 02 : 0[0] -> 1[1] via SHM/direct/direct
- x2-h100:118401:119172 [0] NCCL INFO Channel 03 : 0[0] -> 1[1] via SHM/direct/direct
- x2-h100:118401:119172 [0] NCCL INFO Connected all rings
- x2-h100:118401:119172 [0] NCCL INFO Connected all trees
- x2-h100:118402:119173 [1] NCCL INFO Connected all rings
- x2-h100:118402:119173 [1] NCCL INFO Connected all trees
- x2-h100:118402:119173 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
- x2-h100:118402:119173 [1] NCCL INFO 4 coll channels, 0 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
- x2-h100:118401:119172 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
- x2-h100:118401:119172 [0] NCCL INFO 4 coll channels, 0 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
- x2-h100:118402:119173 [1] NCCL INFO comm 0x160e5020 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 200000 commId 0x793d473d6144a18a - Init COMPLETE
- x2-h100:118401:119172 [0] NCCL INFO comm 0x24e3ff20 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 100000 commId 0x793d473d6144a18a - Init COMPLETE
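The NCCL init above is worth reading: libibverbs is missing, so inter-node traffic would fall back to sockets (harmless here, both ranks share one node), but NCCL_P2P_LEVEL=LOC also disables GPU peer-to-peer, so the two H100s talk over shared memory (SHM/direct) rather than NVLink. If the GPUs are NVLink-connected, relaxing that restriction is an easy thing to try; it must be set in the environment before NCCL initialises. A hedged sketch, assuming NVLink is actually present:

    import os

    # Assumption: the two H100s are NVLink/NVSwitch-connected. "NVL" allows P2P over
    # NVLink; the current "LOC" setting forces the SHM path seen in the channels above.
    os.environ["NCCL_P2P_LEVEL"] = "NVL"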
- 09/25/2024 12:22:40 - INFO - __main__ - ***** Running training *****
- 09/25/2024 12:22:40 - INFO - __main__ - Num examples = 10
- 09/25/2024 12:22:40 - INFO - __main__ - Num batches each epoch = 5
- 09/25/2024 12:22:40 - INFO - __main__ - Num Epochs = 1
- 09/25/2024 12:22:40 - INFO - __main__ - Instantaneous batch size per device = 1
- 09/25/2024 12:22:40 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 8
- 09/25/2024 12:22:40 - INFO - __main__ - Gradient Accumulation steps = 4
- 09/25/2024 12:22:40 - INFO - __main__ - Total optimization steps = 2
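The training summary is internally consistent; a quick sanity check of how the numbers combine, with every input taken from the lines above:

    import math

    num_examples = 10
    per_device_batch_size = 1
    num_processes = 2
    grad_accum_steps = 4
    num_epochs = 1

    batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_processes))    # 5
    total_train_batch_size = per_device_batch_size * num_processes * grad_accum_steps        # 8
    total_optimization_steps = num_epochs * math.ceil(batches_per_epoch / grad_accum_steps)  # 2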
- Steps: 0%| | 0/2 [00:00<?, ?it/s]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
- /usr/local/lib/python3.8/dist-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
- with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context: # type: ignore[attr-defined]
- Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
- /usr/local/lib/python3.8/dist-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
- with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context: # type: ignore[attr-defined]
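Two recurring deprecation warnings start here. The `torch.cpu.amp.autocast` FutureWarning is raised inside torch.utils.checkpoint itself and needs no action in the training script. The `txt_ids` warning, repeated on every forward pass, means the Flux transformer now expects 2-D text ids of shape (sequence_length, 3) instead of (batch, sequence_length, 3). A minimal sketch of the fix, assuming `txt_ids` is the tensor handed to the transformer (the ids are identical across the batch, so dropping the leading dimension loses nothing):

    # Hypothetical: silence the `txt_ids` deprecation by passing 2-D ids.
    if txt_ids.ndim == 3:
        txt_ids = txt_ids[0]  # (batch, seq, 3) -> (seq, 3)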
- Steps: 0%| | 0/2 [00:54<?, ?it/s, loss=0.4, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
- Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
- Steps: 0%| | 0/2 [01:00<?, ?it/s, loss=0.416, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
- Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
- Steps: 0%| | 0/2 [01:06<?, ?it/s, loss=0.327, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
- Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
- Steps: 50%|██████████ | 1/2 [01:13<01:13, 73.36s/it, loss=0.327, lr=1]09/25/2024 12:23:53 - INFO - accelerate.accelerator - Saving current state to /flux-dreambooth-outputs/dreamboot-yaremovaa/checkpoint-1
- 09/25/2024 12:23:53 - INFO - accelerate.accelerator - Saving FSDP model
- /usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:689: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html .
- warnings.warn(
- /usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_state_dict_utils.py:737: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
- local_shape = tensor.shape
- /usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_state_dict_utils.py:749: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
- tensor.shape,
- /usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_state_dict_utils.py:751: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
- tensor.dtype,
- /usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_state_dict_utils.py:752: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
- tensor.device,
- 09/25/2024 12:24:00 - INFO - accelerate.utils.fsdp_utils - Saving model to /flux-dreambooth-outputs/dreamboot-yaremovaa/checkpoint-1/pytorch_model_fsdp_0
- /usr/local/lib/python3.8/dist-packages/accelerate/utils/fsdp_utils.py:107: FutureWarning: `save_state_dict` is deprecated and will be removed in future versions.Please use `save` instead.
- dist_cp.save_state_dict(
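The block of FutureWarnings above is emitted while `accelerator.save_state` walks the deprecated FSDP state-dict path (FSDP.state_dict_type, ShardedTensor, dist_cp.save_state_dict). They come from inside accelerate/torch, not from the training script, and the warnings themselves name the replacements. Roughly, a hand-rolled save on the non-deprecated path would look like the sketch below; `model` and `optimizer` are assumed names and the output path is taken from the log:

    import torch.distributed.checkpoint as dist_cp
    from torch.distributed.checkpoint.state_dict import get_state_dict

    # Sketch of the APIs the warnings recommend; accelerate will adopt them upstream,
    # so this is illustrative rather than a required change to the script.
    model_sd, optim_sd = get_state_dict(model, optimizer)
    dist_cp.save(
        {"model": model_sd, "optimizer": optim_sd},
        checkpoint_id="/flux-dreambooth-outputs/dreamboot-yaremovaa/checkpoint-1/pytorch_model_fsdp_0",
    )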
- [rank1]:[E925 13:24:00.349880861 ProcessGroupNCCL.cpp:607] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1248, OpType=_ALLGATHER_BASE, NumelIn=70795904, NumelOut=141591808, Timeout(ms)=3600000) ran for 3600001 milliseconds before timing out.
- [rank1]:[E925 13:24:00.350172793 ProcessGroupNCCL.cpp:670] [Rank 1] Work WorkNCCL(SeqNum=1248, OpType=_ALLGATHER_BASE, NumelIn=70795904, NumelOut=141591808, Timeout(ms)=3600000) timed out in blocking wait (TORCH_NCCL_BLOCKING_WAIT=1).
- [rank0]:[E925 13:24:00.531450681 ProcessGroupNCCL.cpp:607] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1191, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=3600000) ran for 3600009 milliseconds before timing out.
- [rank0]:[E925 13:24:00.531679792 ProcessGroupNCCL.cpp:670] [Rank 0] Work WorkNCCL(SeqNum=1191, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=3600000) timed out in blocking wait (TORCH_NCCL_BLOCKING_WAIT=1).
- x2-h100:118401:119200 [0] NCCL INFO [Service thread] Connection closed by localRank 0
- x2-h100:118401:118401 [0] NCCL INFO comm 0x24e3ff20 rank 0 nranks 2 cudaDev 0 busId 100000 - Abort COMPLETE
- [rank0]:[E925 13:24:01.314158835 ProcessGroupNCCL.cpp:621] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
- [rank0]:[E925 13:24:01.314175675 ProcessGroupNCCL.cpp:627] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
- [rank0]:[E925 13:24:01.315466644 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 0] Exception (either an error or timeout) detected by watchdog at work: 1191, last enqueued NCCL work: 1191, last completed NCCL work: 1190.
- [rank0]:[E925 13:24:01.315500284 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 0] Timeout at NCCL work: 1191, last enqueued NCCL work: 1191, last completed NCCL work: 1190.
- [rank0]:[E925 13:24:01.315507284 ProcessGroupNCCL.cpp:621] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
- [rank0]: Traceback (most recent call last):
- [rank0]: File "examples/dreambooth/train_dreambooth_flux.py", line 1795, in <module>
- [rank0]: main(args)
- [rank0]: File "examples/dreambooth/train_dreambooth_flux.py", line 1684, in main
- [rank0]: accelerator.save_state(save_path)
- [rank0]: File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 2991, in save_state
- [rank0]: save_fsdp_model(self.state.fsdp_plugin, self, model, output_dir, i)
- [rank0]: File "/usr/local/lib/python3.8/dist-packages/accelerate/utils/fsdp_utils.py", line 107, in save_fsdp_model
- [rank0]: dist_cp.save_state_dict(
- [rank0]: File "/usr/local/lib/python3.8/dist-packages/typing_extensions.py", line 2853, in wrapper
- [rank0]: return arg(*args, **kwargs)
- [rank0]: File "/usr/local/lib/python3.8/dist-packages/torch/distributed/checkpoint/state_dict_saver.py", line 47, in save_state_dict
- [rank0]: return _save_state_dict(
- [rank0]: File "/usr/local/lib/python3.8/dist-packages/torch/distributed/checkpoint/state_dict_saver.py", line 316, in _save_state_dict
- [rank0]: central_plan: SavePlan = distW.reduce_scatter("plan", local_step, global_step)
- [rank0]: File "/usr/local/lib/python3.8/dist-packages/torch/distributed/checkpoint/utils.py", line 169, in reduce_scatter
- [rank0]: all_data = self.gather_object(local_data)
- [rank0]: File "/usr/local/lib/python3.8/dist-packages/torch/distributed/checkpoint/utils.py", line 108, in gather_object
- [rank0]: dist.gather_object(
- [rank0]: File "/usr/local/lib/python3.8/dist-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
- [rank0]: return func(*args, **kwargs)
- [rank0]: File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 2608, in gather_object
- [rank0]: all_gather(object_size_list, local_size, group=group)
- [rank0]: File "/usr/local/lib/python3.8/dist-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
- [rank0]: return func(*args, **kwargs)
- [rank0]: File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 3113, in all_gather
- [rank0]: work.wait()
- [rank0]: torch.distributed.DistBackendError: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1191, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=3600000) ran for 3600009 milliseconds before timing out.
- Steps: 50%|███████ | 1/2 [1:01:21<1:01:21, 3681.47s/it, loss=0.327, lr=1]
- x2-h100:118402:119198 [1] NCCL INFO [Service thread] Connection closed by localRank 1
- x2-h100:118402:121387 [1] NCCL INFO comm 0x160e5020 rank 1 nranks 2 cudaDev 1 busId 200000 - Abort COMPLETE
- [rank1]:[E925 13:24:02.286785745 ProcessGroupNCCL.cpp:621] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
- [rank1]:[E925 13:24:02.286809745 ProcessGroupNCCL.cpp:627] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
- [rank1]:[E925 13:24:02.288169044 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 1248, last enqueued NCCL work: 1248, last completed NCCL work: 1247.
- [rank1]:[E925 13:24:02.288200024 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 1] Timeout at NCCL work: 1248, last enqueued NCCL work: 1248, last completed NCCL work: 1247.
- [rank1]:[E925 13:24:02.288208014 ProcessGroupNCCL.cpp:621] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
- [rank1]: Traceback (most recent call last):
- [rank1]: File "examples/dreambooth/train_dreambooth_flux.py", line 1795, in <module>
- [rank1]: main(args)
- [rank1]: File "examples/dreambooth/train_dreambooth_flux.py", line 1643, in main
- [rank1]: accelerator.backward(loss)
- [rank1]: File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 2196, in backward
- [rank1]: loss.backward(**kwargs)
- [rank1]: File "/usr/local/lib/python3.8/dist-packages/torch/_tensor.py", line 521, in backward
- [rank1]: torch.autograd.backward(
- [rank1]: File "/usr/local/lib/python3.8/dist-packages/torch/autograd/__init__.py", line 289, in backward
- [rank1]: _engine_run_backward(
- [rank1]: File "/usr/local/lib/python3.8/dist-packages/torch/autograd/graph.py", line 769, in _engine_run_backward
- [rank1]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
- [rank1]: File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_runtime_utils.py", line 666, in _pre_backward_hook
- [rank1]: _unshard(
- [rank1]: File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_runtime_utils.py", line 300, in _unshard
- [rank1]: handle.unshard()
- [rank1]: File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_flat_param.py", line 1311, in unshard
- [rank1]: padded_unsharded_flat_param = self._all_gather_flat_param(unsharded_flat_param)
- [rank1]: File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_flat_param.py", line 1402, in _all_gather_flat_param
- [rank1]: dist.all_gather_into_tensor(
- [rank1]: File "/usr/local/lib/python3.8/dist-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
- [rank1]: return func(*args, **kwargs)
- [rank1]: File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 3205, in all_gather_into_tensor
- [rank1]: work.wait()
- [rank1]: torch.distributed.DistBackendError: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1248, OpType=_ALLGATHER_BASE, NumelIn=70795904, NumelOut=141591808, Timeout(ms)=3600000) ran for 3600001 milliseconds before timing out.
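The two tracebacks tell the real story: rank 0 timed out inside `accelerator.save_state` on a small ALLGATHER used by the checkpoint planner (SeqNum 1191), while rank 1 timed out in `loss.backward()` waiting to all-gather FSDP parameters (SeqNum 1248). The mismatched sequence numbers mean the ranks stopped issuing the same collectives in the same order, so each sat in a different collective until the one-hour watchdog fired. If the checkpoint save is gated on `accelerator.is_main_process`, it must instead run on every rank under FSDP, because FSDP checkpointing is itself a collective; that pattern is consistent with the two stacks above. Raising the process-group timeout only buys time, but for completeness, a hedged sketch of that knob:

    from datetime import timedelta
    from accelerate import Accelerator
    from accelerate.utils import InitProcessGroupKwargs

    # Mitigation only: lengthen the NCCL watchdog window beyond the 3600000 ms seen
    # above. It does not fix a genuine collective mismatch between ranks.
    accelerator = Accelerator(kwargs_handlers=[InitProcessGroupKwargs(timeout=timedelta(hours=2))])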
- W0925 13:24:22.405389 140053788743488 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 118402 closing signal SIGTERM
- E0925 13:24:31.946576 140053788743488 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 118401) of binary: /usr/bin/python
- Traceback (most recent call last):
- File "/usr/local/bin/accelerate", line 8, in <module>
- sys.exit(main())
- File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/accelerate_cli.py", line 48, in main
- args.func(args)
- File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 1161, in launch_command
- multi_gpu_launcher(args)
- File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 799, in multi_gpu_launcher
- distrib_run.run(args)
- File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 892, in run
- elastic_launch(
- File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 133, in __call__
- return launch_agent(self._config, self._entrypoint, list(args))
- File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
- raise ChildFailedError(
- torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
- ============================================================
- examples/dreambooth/train_dreambooth_flux.py FAILED
- ------------------------------------------------------------
- Failures:
- <NO_OTHER_FAILURES>
- ------------------------------------------------------------
- Root Cause (first observed failure):
- [0]:
- time : 2024-09-25_13:24:22
- host : x2-h100.internal.cloudapp.net
- rank : 0 (local_rank: 0)
- exitcode : 1 (pid: 118401)
- error_file: <N/A>
- traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
- ============================================================
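The elastic summary reports `error_file: <N/A>`, so the launcher could not persist the child's traceback. The page linked on the last line describes decorating the entry point with `@record` so the next failure is written to an error file; a minimal sketch against this script's `main(args)`:

    from torch.distributed.elastic.multiprocessing.errors import record

    @record
    def main(args):
        ...  # existing training entry point in train_dreambooth_flux.py

    # torchrun / accelerate launch will then write the decorated function's traceback
    # to the error_file referenced in the failure summary above.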