kopyl

DeepSpeed wrong config

Sep 27th, 2024
[2024-09-27 13:40:31,969] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] On Ampere and higher architectures please use CUDA 11+
[WARNING] On Ampere and higher architectures please use CUDA 11+
[WARNING] On Ampere and higher architectures please use CUDA 11+
[WARNING] On Ampere and higher architectures please use CUDA 11+
[WARNING] On Ampere and higher architectures please use CUDA 11+
[WARNING] On Ampere and higher architectures please use CUDA 11+
W0927 13:40:33.228857 140401003063104 torch/distributed/run.py:779]
W0927 13:40:33.228857 140401003063104 torch/distributed/run.py:779] *****************************************
W0927 13:40:33.228857 140401003063104 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0927 13:40:33.228857 140401003063104 torch/distributed/run.py:779] *****************************************
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/accelerate/utils/deepspeed.py", line 51, in __init__
    config_decoded = base64.urlsafe_b64decode(config_file_or_dict).decode("utf-8")
  File "/usr/lib/python3.8/base64.py", line 133, in urlsafe_b64decode
    return b64decode(s)
  File "/usr/lib/python3.8/base64.py", line 87, in b64decode
    return binascii.a2b_base64(s)
binascii.Error: Incorrect padding

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "examples/dreambooth/train_dreambooth_flux.py", line 1801, in <module>
    main(args)
  File "examples/dreambooth/train_dreambooth_flux.py", line 998, in main
    accelerator = Accelerator(
  File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 285, in __init__
    DeepSpeedPlugin() if os.environ.get("ACCELERATE_USE_DEEPSPEED", "false") == "true" else None
  File "<string>", line 15, in __init__
  File "/usr/local/lib/python3.8/dist-packages/accelerate/utils/dataclasses.py", line 1025, in __post_init__
    self.hf_ds_config = HfDeepSpeedConfig(self.hf_ds_config)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/utils/deepspeed.py", line 54, in __init__
    raise ValueError(
ValueError: Expected a string path to an existing deepspeed config, or a dictionary, or a base64 encoded string. Received: ./deepspeed.json

W0927 13:40:36.132854 140401003063104 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 10890 closing signal SIGTERM
E0927 13:40:36.133734 140401003063104 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 1 (pid: 10891) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 1159, in launch_command
    deepspeed_launcher(args)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 852, in deepspeed_launcher
    distrib_run.run(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
examples/dreambooth/train_dreambooth_flux.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-09-27_13:40:36
  host      : x2-h100.internal.cloudapp.net
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 10891)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
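
The ValueError in the log lists three accepted inputs (an existing path, a dictionary, or a base64 string), and the first traceback shows the base64 attempt failing on the literal string ./deepspeed.json. That suggests the path did not resolve to an existing file from the working directory of the launched processes. The Python sketch below is a reconstruction of that fallback order based only on the traceback and the error message, not a copy of accelerate's code; load_ds_config is a hypothetical name. The guarded section at the bottom prints where the relative path resolves, which may help confirm whether the file is simply not where the launcher looks for it.

```python
import base64
import json
import os
from copy import deepcopy


def load_ds_config(config_file_or_dict):
    """Hypothetical helper: mimics the fallback order implied by the log above.

    Assumption: dict first, then an existing file path, then base64-encoded JSON.
    """
    if isinstance(config_file_or_dict, dict):
        return deepcopy(config_file_or_dict)
    if os.path.exists(config_file_or_dict):
        with open(config_file_or_dict, encoding="utf-8") as f:
            return json.load(f)
    try:
        # A plain path such as "./deepspeed.json" is not valid base64,
        # so a missing file ends up failing here.
        decoded = base64.urlsafe_b64decode(config_file_or_dict).decode("utf-8")
        return json.loads(decoded)
    except (UnicodeDecodeError, AttributeError, ValueError):
        raise ValueError(
            "Expected a string path to an existing deepspeed config, or a "
            f"dictionary, or a base64 encoded string. Received: {config_file_or_dict}"
        )


if __name__ == "__main__":
    path = "./deepspeed.json"  # the value reported in the log
    print("working directory:", os.getcwd())
    print("resolves to:", os.path.abspath(path))
    print("exists:", os.path.exists(path))
```

If the last line prints False, passing an absolute path to the config, or launching from the directory that contains deepspeed.json, would be the first thing to try.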