- | distributed init (rank 0, world 1): env://
- 2023-07-21 12:04:15,964 [INFO]
- ===== Running Parameters =====
- 2023-07-21 12:04:15,965 [INFO] {
- "amp": true,
- "batch_size_eval": 1,
- "batch_size_train": 3,
- "device": "cuda",
- "dist_backend": "nccl",
- "dist_url": "env://",
- "distributed": true,
- "evaluate": false,
- "gpu": 0,
- "init_lr": 5e-06,
- "iters_per_inner_epoch": 40,
- "lr_sched": "constant_lr",
- "max_iters": 40,
- "min_lr": 0,
- "num_workers": 4,
- "output_dir": "train_output",
- "rank": 0,
- "resume_ckpt_path": null,
- "runner": "runner_iter",
- "seed": 42,
- "task": "text-to-image-generation",
- "train_splits": [
- "train"
- ],
- "weight_decay": 0.01,
- "world_size": 1
- }
- 2023-07-21 12:04:15,965 [INFO]
- ====== Dataset Attributes ======
- 2023-07-21 12:04:15,965 [INFO]
- ======== blip_diffusion_finetune =======
- 2023-07-21 12:04:15,966 [INFO] {
- "build_info": {
- "images": {
- "storage": "train_images"
- },
- "subject_text": "feigin"
- },
- "data_type": "images",
- "kw_processor": {
- "inp_vis_processor": {
- "name": "blip_diffusion_inp_image_train"
- },
- "tgt_vis_processor": {
- "name": "blip_diffusion_tgt_image_train"
- }
- },
- "text_processor": {
- "eval": {
- "name": "blip_caption"
- },
- "train": {
- "name": "blip_caption"
- }
- }
- }
- 2023-07-21 12:04:15,966 [INFO]
- ====== Model Attributes ======
- 2023-07-21 12:04:15,966 [INFO] {
- "arch": "blip_diffusion",
- "load_finetuned": false,
- "load_pretrained": true,
- "model_type": "base",
- "pretrained": "https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP-Diffusion/blip-diffusion.tar.gz",
- "qformer_cross_attention_freq": 1,
- "qformer_num_query_token": 16,
- "qformer_train": false,
- "sd_pretrained_model_name_or_path": "runwayml/stable-diffusion-v1-5",
- "sd_train_text_encoder": false,
- "vae_half_precision": true,
- "vit_model": "clip_L"
- }
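Quick sanity check on the model attributes above: the checkpoint named in `pretrained` can be loaded outside the training harness with LAVIS's model factory, using the same `arch`/`model_type` values ("blip_diffusion", "base"). A minimal sketch, assuming the model is registered under those names in this LAVIS build:
```
# Sketch: load the BLIP-Diffusion checkpoint outside the trainer to confirm
# the download/cache step works. Names mirror the arch/model_type logged above.
import torch
from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model, vis_processors, txt_processors = load_model_and_preprocess(
    name="blip_diffusion", model_type="base", is_eval=True, device=device
)
print(type(model).__name__)  # expect: BlipDiffusion
```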
- /workspace/LAVIS/lavis/datasets/builders/base_dataset_builder.py:164: UserWarning:
- The specified path /export/home/.cache/lavis/train_images for visual inputs does not exist.
- Please provide a correct path to the visual inputs or
- refer to datasets/download_scripts/README.md for downloading instructions.
- warnings.warn(
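The warning above comes from the relative `storage: "train_images"` in `build_info` being resolved against the LAVIS cache root quoted in the message (`/export/home/.cache/lavis`). A minimal pre-flight check, assuming that resolution rule (the cache-root value is taken from the warning, not verified against this install):
```
# Sketch: verify the resolved image directory before launching training.
# Assumes a relative build_info.images.storage is joined onto the LAVIS cache
# root, as the warning above implies; an absolute path bypasses that.
import os

cache_root = "/export/home/.cache/lavis"   # path quoted in the warning
storage = "train_images"                   # build_info.images.storage

resolved = storage if os.path.isabs(storage) else os.path.join(cache_root, storage)
if not os.path.isdir(resolved):
    raise FileNotFoundError(f"No training images at {resolved}; fix build_info.images.storage")
```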
- 2023-07-21 12:04:15,968 [INFO] Building datasets...
- 2023-07-21 12:04:18,763 [INFO] freeze vision encoder
- Cannot initialize model with low cpu memory usage because `accelerate` was not found in the environment. Defaulting to `low_cpu_mem_usage=False`. It is strongly recommended to install `accelerate` for faster and less memory-intense model loading. You can do so with:
- ```
- pip install accelerate
- ```
- .
- Cannot initialize model with low cpu memory usage because `accelerate` was not found in the environment. Defaulting to `low_cpu_mem_usage=False`. It is strongly recommended to install `accelerate` for faster and less memory-intense model loading. You can do so with:
- ```
- pip install accelerate
- ```
- .
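The two warnings above are harmless but repeat on every weight load; a quick check (plain importlib, nothing LAVIS-specific) confirms whether `accelerate` is present before retrying:
```
# Sketch: confirm accelerate is importable so diffusers/transformers can use
# low_cpu_mem_usage=True; otherwise `pip install accelerate` as suggested above.
import importlib.util

if importlib.util.find_spec("accelerate") is None:
    print("accelerate not found in this environment")
```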
- /usr/local/lib/python3.10/dist-packages/diffusers/configuration_utils.py:215: FutureWarning: It is deprecated to pass a pretrained model name or path to `from_config`.If you were trying to load a scheduler, please use <class 'diffusers.schedulers.scheduling_ddpm.DDPMScheduler'>.from_pretrained(...) instead. Otherwise, please make sure to pass a configuration dictionary instead. This functionality will be removed in v1.0.0.
- deprecate("config-passed-as-path", "1.0.0", deprecation_message, standard_warn=False)
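This deprecation is raised inside library code, not by the user config; the pattern the warning asks for is `from_pretrained` with a `subfolder`, sketched here against the SD 1.5 repo named in `sd_pretrained_model_name_or_path` (illustrative only, not a patch to LAVIS):
```
# Sketch of the non-deprecated scheduler load the FutureWarning points to.
from diffusers import DDPMScheduler

noise_scheduler = DDPMScheduler.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="scheduler"
)
```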
- 2023-07-21 12:04:30,695 [INFO] Loading pretrained model from /root/.cache/torch/hub/checkpoints/blip-diffusion
- No ctx_embeddings_cache found in /root/.cache/torch/hub/checkpoints/blip-diffusion
- 2023-07-21 12:04:33,740 [INFO] Start training, max_iters=40, in total 1 inner epochs.
- 2023-07-21 12:04:34,531 [INFO] dataset_ratios not specified, datasets will be concatenated (map-style datasets) or chained (webdataset.DataPipeline).
- 2023-07-21 12:04:34,532 [INFO] Loaded 2200000 records for train split from the dataset.
- 2023-07-21 12:04:34,574 [INFO] number of trainable parameters: 859520964
- 2023-07-21 12:04:34,575 [INFO] Start training epoch 0, 40 iters per inner epoch.
- Traceback (most recent call last):
- File "/workspace/LAVIS/train.py", line 103, in <module>
- main()
- File "/workspace/LAVIS/train.py", line 99, in main
- runner.train()
- File "/workspace/LAVIS/lavis/runners/runner_iter.py", line 99, in train
- train_stats = self.train_iters(self.cur_epoch, start_iters)
- File "/workspace/LAVIS/lavis/runners/runner_iter.py", line 145, in train_iters
- return self.task.train_iters(
- File "/workspace/LAVIS/lavis/tasks/base_task.py", line 144, in train_iters
- return self._train_inner_loop(
- File "/workspace/LAVIS/lavis/tasks/base_task.py", line 222, in _train_inner_loop
- loss, loss_dict = self.train_step(model=model, samples=samples)
- File "/workspace/LAVIS/lavis/tasks/base_task.py", line 64, in train_step
- output = model(samples)
- File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
- return forward_call(*args, **kwargs)
- File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1156, in forward
- output = self._run_ddp_forward(*inputs, **kwargs)
- File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1110, in _run_ddp_forward
- return module_to_run(*inputs[0], **kwargs[0]) # type: ignore[index]
- File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
- return forward_call(*args, **kwargs)
- File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 363, in _forward_unimplemented
- raise NotImplementedError(f"Module [{type(self).__name__}] is missing the required \"forward\" function")
- NotImplementedError: Module [BlipDiffusion] is missing the required "forward" function
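Root cause of the crash: `train_step` calls `output = model(samples)`, the DDP wrapper forwards that to the underlying `BlipDiffusion`, which defines no `forward` in this LAVIS checkout, so `nn.Module`'s placeholder raises. The real fix is a LAVIS revision whose `BlipDiffusion` implements a training `forward`; the sketch below only illustrates the shape such a method needs (return a dict with a scalar `"loss"`). Every attribute and key name in it (`vae`, `unet`, `noise_scheduler`, `tgt_image`, the conditioning helper) is an assumption for illustration, not BlipDiffusion's actual internals.
```
# Illustrative sketch only: the contract base_task.train_step expects from
# forward(samples) is a dict containing a scalar "loss". A typical latent-
# diffusion fine-tuning loss looks roughly like this; names are assumed.
import torch
import torch.nn.functional as F

def forward(self, samples):
    tgt = samples["tgt_image"]                              # assumed batch key
    latents = self.vae.encode(tgt).latent_dist.sample() * 0.18215

    noise = torch.randn_like(latents)
    timesteps = torch.randint(
        0, self.noise_scheduler.config.num_train_timesteps,
        (latents.shape[0],), device=latents.device,
    ).long()
    noisy_latents = self.noise_scheduler.add_noise(latents, noise, timesteps)

    cond = self._build_conditioning(samples)                # hypothetical helper
    pred = self.unet(noisy_latents, timesteps, encoder_hidden_states=cond).sample

    loss = F.mse_loss(pred.float(), noise.float())
    return {"loss": loss}
```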
- Exception in thread Thread-1 (_pin_memory_loop):
- Traceback (most recent call last):
- File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
- self.run()
- File "/usr/lib/python3.10/threading.py", line 953, in run
- ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2413) of binary: /usr/bin/python
- Traceback (most recent call last):
- File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
- return _run_code(code, main_globals, None,
- File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
- exec(code, run_globals)
- File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 798, in <module>
- main()
- File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
- return f(*args, **kwargs)
- File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 794, in main
- run(args)
- File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 785, in run
- elastic_launch(
- File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
- return launch_agent(self._config, self._entrypoint, list(args))
- File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
- raise ChildFailedError(
- torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
- ============================================================
- train.py FAILED
- ------------------------------------------------------------
- Failures:
- <NO_OTHER_FAILURES>
- ------------------------------------------------------------
- Root Cause (first observed failure):
- [0]:
- time : 2023-07-21_12:04:40
- host : badad68d43b7
- rank : 0 (local_rank: 0)
- exitcode : 1 (pid: 2413)
- error_file: <N/A>
- traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
- ============================================================
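As the summary notes (`error_file : <N/A>`, `traceback : To enable traceback see ...`), the elastic launcher did not capture the child traceback. Wrapping the entrypoint with `@record` from `torch.distributed.elastic.multiprocessing.errors` writes it to an error file on the next failure:
```
# Sketch: decorate train.py's entrypoint so torchrun records the child
# process traceback (see the URL in the failure summary above).
from torch.distributed.elastic.multiprocessing.errors import record

@record
def main():
    ...  # existing training entrypoint in train.py

if __name__ == "__main__":
    main()
```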