kopyl

with torch.distributed.run

Jul 21st, 2023 (edited)
| distributed init (rank 0, world 1): env://
2023-07-21 12:04:15,964 [INFO]
===== Running Parameters =====
2023-07-21 12:04:15,965 [INFO] {
    "amp": true,
    "batch_size_eval": 1,
    "batch_size_train": 3,
    "device": "cuda",
    "dist_backend": "nccl",
    "dist_url": "env://",
    "distributed": true,
    "evaluate": false,
    "gpu": 0,
    "init_lr": 5e-06,
    "iters_per_inner_epoch": 40,
    "lr_sched": "constant_lr",
    "max_iters": 40,
    "min_lr": 0,
    "num_workers": 4,
    "output_dir": "train_output",
    "rank": 0,
    "resume_ckpt_path": null,
    "runner": "runner_iter",
    "seed": 42,
    "task": "text-to-image-generation",
    "train_splits": [
        "train"
    ],
    "weight_decay": 0.01,
    "world_size": 1
}
2023-07-21 12:04:15,965 [INFO]
====== Dataset Attributes ======
2023-07-21 12:04:15,965 [INFO]
======== blip_diffusion_finetune =======
2023-07-21 12:04:15,966 [INFO] {
    "build_info": {
        "images": {
            "storage": "train_images"
        },
        "subject_text": "feigin"
    },
    "data_type": "images",
    "kw_processor": {
        "inp_vis_processor": {
            "name": "blip_diffusion_inp_image_train"
        },
        "tgt_vis_processor": {
            "name": "blip_diffusion_tgt_image_train"
        }
    },
    "text_processor": {
        "eval": {
            "name": "blip_caption"
        },
        "train": {
            "name": "blip_caption"
        }
    }
}
2023-07-21 12:04:15,966 [INFO]
====== Model Attributes ======
2023-07-21 12:04:15,966 [INFO] {
    "arch": "blip_diffusion",
    "load_finetuned": false,
    "load_pretrained": true,
    "model_type": "base",
    "pretrained": "https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP-Diffusion/blip-diffusion.tar.gz",
    "qformer_cross_attention_freq": 1,
    "qformer_num_query_token": 16,
    "qformer_train": false,
    "sd_pretrained_model_name_or_path": "runwayml/stable-diffusion-v1-5",
    "sd_train_text_encoder": false,
    "vae_half_precision": true,
    "vit_model": "clip_L"
}
/workspace/LAVIS/lavis/datasets/builders/base_dataset_builder.py:164: UserWarning:
The specified path /export/home/.cache/lavis/train_images for visual inputs does not exist.
Please provide a correct path to the visual inputs or
refer to datasets/download_scripts/README.md for downloading instructions.

  warnings.warn(
2023-07-21 12:04:15,968 [INFO] Building datasets...
2023-07-21 12:04:18,763 [INFO] freeze vision encoder
Cannot initialize model with low cpu memory usage because `accelerate` was not found in the environment. Defaulting to `low_cpu_mem_usage=False`. It is strongly recommended to install `accelerate` for faster and less memory-intense model loading. You can do so with:
```
pip install accelerate
```
.
Cannot initialize model with low cpu memory usage because `accelerate` was not found in the environment. Defaulting to `low_cpu_mem_usage=False`. It is strongly recommended to install `accelerate` for faster and less memory-intense model loading. You can do so with:
```
pip install accelerate
```
.
/usr/local/lib/python3.10/dist-packages/diffusers/configuration_utils.py:215: FutureWarning: It is deprecated to pass a pretrained model name or path to `from_config`.If you were trying to load a scheduler, please use <class 'diffusers.schedulers.scheduling_ddpm.DDPMScheduler'>.from_pretrained(...) instead. Otherwise, please make sure to pass a configuration dictionary instead. This functionality will be removed in v1.0.0.
  deprecate("config-passed-as-path", "1.0.0", deprecation_message, standard_warn=False)
2023-07-21 12:04:30,695 [INFO] Loading pretrained model from /root/.cache/torch/hub/checkpoints/blip-diffusion
No ctx_embeddings_cache found in /root/.cache/torch/hub/checkpoints/blip-diffusion
2023-07-21 12:04:33,740 [INFO] Start training, max_iters=40, in total 1 inner epochs.
2023-07-21 12:04:34,531 [INFO] dataset_ratios not specified, datasets will be concatenated (map-style datasets) or chained (webdataset.DataPipeline).
2023-07-21 12:04:34,532 [INFO] Loaded 2200000 records for train split from the dataset.
2023-07-21 12:04:34,574 [INFO] number of trainable parameters: 859520964
2023-07-21 12:04:34,575 [INFO] Start training epoch 0, 40 iters per inner epoch.
Traceback (most recent call last):
  File "/workspace/LAVIS/train.py", line 103, in <module>
    main()
  File "/workspace/LAVIS/train.py", line 99, in main
    runner.train()
  File "/workspace/LAVIS/lavis/runners/runner_iter.py", line 99, in train
    train_stats = self.train_iters(self.cur_epoch, start_iters)
  File "/workspace/LAVIS/lavis/runners/runner_iter.py", line 145, in train_iters
    return self.task.train_iters(
  File "/workspace/LAVIS/lavis/tasks/base_task.py", line 144, in train_iters
    return self._train_inner_loop(
  File "/workspace/LAVIS/lavis/tasks/base_task.py", line 222, in _train_inner_loop
    loss, loss_dict = self.train_step(model=model, samples=samples)
  File "/workspace/LAVIS/lavis/tasks/base_task.py", line 64, in train_step
    output = model(samples)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1156, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1110, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0]) # type: ignore[index]
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 363, in _forward_unimplemented
    raise NotImplementedError(f"Module [{type(self).__name__}] is missing the required \"forward\" function")
NotImplementedError: Module [BlipDiffusion] is missing the required "forward" function
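
The traceback above ends in `_forward_unimplemented`, which is what `torch.nn.Module` runs when a subclass is called without defining `forward`. The sketch below reproduces that behaviour on a toy module. `ToyModel`, `compute_loss`, and `ToyModelWithForward` are hypothetical names chosen for illustration; nothing here is taken from the LAVIS `BlipDiffusion` class, and the subclass only shows the general shape of a `forward` method, not the project's actual fix.

```python
import torch
from torch import nn


class ToyModel(nn.Module):
    """Hypothetical stand-in: has parameters and a loss helper, but no forward()."""

    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(4, 1)

    def compute_loss(self, samples):
        # Illustrative loss only; a real model would compute something meaningful.
        return self.proj(samples["x"]).mean()


samples = {"x": torch.ones(2, 4)}

# Calling the module dispatches to forward(); without one, nn.Module raises.
raised = False
try:
    ToyModel()(samples)
except NotImplementedError:
    raised = True


class ToyModelWithForward(ToyModel):
    """Same toy model with a forward() that returns the loss in a dict."""

    def forward(self, samples):
        return {"loss": self.compute_loss(samples)}


output = ToyModelWithForward()(samples)
```

Wrapping the module in `DistributedDataParallel` does not change this: as the traceback shows, the wrapper calls the wrapped module, so the missing method surfaces one frame later.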
Exception in thread Thread-1 (_pin_memory_loop):
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2413) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 798, in <module>
    main()
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-07-21_12:04:40
  host      : badad68d43b7
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 2413)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
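
The launcher summary above reports `error_file: <N/A>` and no traceback, so in this log the actual failure is the first traceback printed by the worker, not the `ChildFailedError` at the end. A minimal sketch of pulling that first exception out of captured output follows. The helper name `first_exception` and the shortened `excerpt` are assumptions made for illustration; the heuristic only recognises exception names ending in `Error` or `Exception`, so it would miss others such as `KeyboardInterrupt`.

```python
import re
from typing import Optional, Tuple

# Matches a final traceback line such as "ValueError: bad input".
_EXC_LINE = re.compile(r"^([A-Za-z_][\w.]*(?:Error|Exception)):\s*(.*)$")


def first_exception(log_text: str) -> Optional[Tuple[str, str]]:
    """Return (exception name, message) for the first traceback in the log."""
    in_traceback = False
    for raw in log_text.splitlines():
        line = raw.strip()
        if line.startswith("Traceback (most recent call last):"):
            in_traceback = True
            continue
        if in_traceback:
            match = _EXC_LINE.match(line)
            if match:
                return match.group(1), match.group(2)
    return None


# Shortened excerpt of the log above, used only to exercise the helper.
excerpt = """\
2023-07-21 12:04:34,575 [INFO] Start training epoch 0, 40 iters per inner epoch.
Traceback (most recent call last):
  File "/workspace/LAVIS/train.py", line 103, in <module>
    main()
NotImplementedError: Module [BlipDiffusion] is missing the required "forward" function
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2413) of binary: /usr/bin/python
Traceback (most recent call last):
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
"""

result = first_exception(excerpt)
```

Stripping each line before matching lets the helper work whether or not the traceback indentation survived copying.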