- ***
- Welcome to KoboldCpp - Version 1.79.1.yr0-ROCm
- For command line arguments, please refer to --help
- ***
- Auto Selected HIP Backend...
- Auto Recommended GPU Layers: 24
- Attempting to use hipBLAS library for faster prompt ingestion. A compatible AMD GPU will be required.
- Initializing dynamic library: koboldcpp_hipblas.dll
- ==========
- Namespace(model='', model_param='E:/LargeLanguageModels/EVA-Qwen2.5-32B-v0.2-Q4_K_S.gguf', port=5001, port_param=5001, host='', launch=False, config=None, threads=7, usecublas=['normal', '0'], usevulkan=None, useclblast=None, usecpu=False, contextsize=16384, gpulayers=24, tensor_split=None, checkforupdates=False, ropeconfig=[0.0, 10000.0], blasbatchsize=512, blasthreads=7, lora=None, noshift=False, nofastforward=False, nommap=False, usemlock=False, noavx2=False, debugmode=0, onready='', benchmark=None, prompt='', promptlimit=100, multiuser=1, multiplayer=False, remotetunnel=False, highpriority=False, foreground=False, preloadstory=None, quiet=False, ssl=None, nocertify=False, mmproj=None, draftmodel=None, draftamount=8, password=None, ignoremissing=False, chatcompletionsadapter=None, flashattention=False, quantkv=0, forceversion=0, smartcontext=False, unpack='', nomodel=False, showgui=False, skiplauncher=False, hordemodelname='', hordeworkername='', hordekey='', hordemaxctx=0, hordegenlen=0, sdmodel='', sdthreads=7, sdclamped=0, sdt5xxl='', sdclipl='', sdclipg='', sdvae='', sdvaeauto=False, sdquant=False, sdlora='', sdloramult=1.0, whispermodel='', hordeconfig=None, sdconfig=None, noblas=False)
- ==========
- Loading model: E:\LargeLanguageModels\EVA-Qwen2.5-32B-v0.2-Q4_K_S.gguf
- The reported GGUF Arch is: qwen2
- Arch Category: 5
- ---
- Identified as GGUF model: (ver 6)
- Attempting to Load...
- ---
- Using automatic RoPE scaling for GGUF. If the model has custom RoPE settings, they'll be used directly instead!
- It means that the RoPE values written above will be replaced by the RoPE values indicated after loading.
- System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | AMX_INT8 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 |
- CUBLAS: Warning, you are running Qwen2 without Flash Attention and may observe incoherent output.
- ---
- Initializing CUDA/HIP, please wait, the following step may take a few minutes for first launch...
- ---
- ggml_cuda_init: found 1 ROCm devices:
- Device 0: AMD Radeon RX 7900 XT, compute capability 11.0, VMM: no
- llama_load_model_from_file: using device ROCm0 (AMD Radeon RX 7900 XT) - 20106 MiB free
- llama_model_loader: loaded meta data with 37 key-value pairs and 771 tensors from E:\LargeLanguageModels\EVA-Qwen2.5-32B-v0.2-Q4_K_S.gguf
- llm_load_vocab: special tokens cache size = 22
- llm_load_vocab: token to piece cache size = 0.9310 MB
- llm_load_print_meta: format = GGUF V3 (latest)
- llm_load_print_meta: arch = qwen2
- llm_load_print_meta: vocab type = BPE
- llm_load_print_meta: n_vocab = 152064
- llm_load_print_meta: n_merges = 151387
- llm_load_print_meta: vocab_only = 0
- llm_load_print_meta: n_ctx_train = 131072
- llm_load_print_meta: n_embd = 5120
- llm_load_print_meta: n_layer = 64
- llm_load_print_meta: n_head = 40
- llm_load_print_meta: n_head_kv = 8
- llm_load_print_meta: n_rot = 128
- llm_load_print_meta: n_swa = 0
- llm_load_print_meta: n_embd_head_k = 128
- llm_load_print_meta: n_embd_head_v = 128
- llm_load_print_meta: n_gqa = 5
- llm_load_print_meta: n_embd_k_gqa = 1024
- llm_load_print_meta: n_embd_v_gqa = 1024
- llm_load_print_meta: f_norm_eps = 0.0e+00
- llm_load_print_meta: f_norm_rms_eps = 1.0e-05
- llm_load_print_meta: f_clamp_kqv = 0.0e+00
- llm_load_print_meta: f_max_alibi_bias = 0.0e+00
- llm_load_print_meta: f_logit_scale = 0.0e+00
- llm_load_print_meta: n_ff = 27648
- llm_load_print_meta: n_expert = 0
- llm_load_print_meta: n_expert_used = 0
- llm_load_print_meta: causal attn = 1
- llm_load_print_meta: pooling type = 0
- llm_load_print_meta: rope type = 2
- llm_load_print_meta: rope scaling = linear
- llm_load_print_meta: freq_base_train = 1000000.0
- llm_load_print_meta: freq_scale_train = 1
- llm_load_print_meta: n_ctx_orig_yarn = 131072
- llm_load_print_meta: rope_finetuned = unknown
- llm_load_print_meta: ssm_d_conv = 0
- llm_load_print_meta: ssm_d_inner = 0
- llm_load_print_meta: ssm_d_state = 0
- llm_load_print_meta: ssm_dt_rank = 0
- llm_load_print_meta: ssm_dt_b_c_rms = 0
- llm_load_print_meta: model type = 32B
- llm_load_print_meta: model ftype = all F32
- llm_load_print_meta: model params = 32.76 B
- llm_load_print_meta: model size = 17.49 GiB (4.59 BPW)
- llm_load_print_meta: general.name = Qwen2.5 32B
- llm_load_print_meta: BOS token = 11 ','
- llm_load_print_meta: EOS token = 151643 '<|endoftext|>'
- llm_load_print_meta: EOT token = 151645 '<|im_end|>'
- llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
- llm_load_print_meta: LF token = 148848 'ÄĬ'
- llm_load_print_meta: FIM PRE token = 151659 '<|fim_prefix|>'
- llm_load_print_meta: FIM SUF token = 151661 '<|fim_suffix|>'
- llm_load_print_meta: FIM MID token = 151660 '<|fim_middle|>'
- llm_load_print_meta: FIM PAD token = 151662 '<|fim_pad|>'
- llm_load_print_meta: FIM REP token = 151663 '<|repo_name|>'
- llm_load_print_meta: FIM SEP token = 151664 '<|file_sep|>'
- llm_load_print_meta: EOG token = 151643 '<|endoftext|>'
- llm_load_print_meta: EOG token = 151645 '<|im_end|>'
- llm_load_print_meta: EOG token = 151662 '<|fim_pad|>'
- llm_load_print_meta: EOG token = 151663 '<|repo_name|>'
- llm_load_print_meta: EOG token = 151664 '<|file_sep|>'
- llm_load_print_meta: max token length = 256
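A few of the hyperparameters printed above can be cross-checked against one another. The sketch below (Python, illustrative only, not part of the log or of KoboldCpp) recomputes the GQA ratio, the per-layer KV width, and the approximate file size from the reported parameter count and bits per weight:

    # Back-of-the-envelope checks using only values from llm_load_print_meta above.
    n_head, n_head_kv, n_embd_head_k = 40, 8, 128
    print(n_head // n_head_kv)             # 5, matches n_gqa
    print(n_head_kv * n_embd_head_k)       # 1024, matches n_embd_k_gqa / n_embd_v_gqa

    params, bpw = 32.76e9, 4.59            # model params and bits per weight as reported
    print(params * bpw / 8 / 2**30)        # ~17.5 GiB, matches "model size = 17.49 GiB"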
- llm_load_tensors: tensor 'token_embd.weight' (q4_K) (and 482 others) cannot be used with preferred buffer type ROCm_Host, using CPU instead (This is not an error, it just means some tensors will use CPU instead.)
- llm_load_tensors: offloading 24 repeating layers to GPU
- llm_load_tensors: offloaded 24/65 layers to GPU
- llm_load_tensors: CPU_Mapped model buffer size = 11629.41 MiB
- llm_load_tensors: ROCm0 model buffer size = 6279.09 MiB
- .................................................................................................
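The two tensor buffers reported above should together account for the whole model file; a quick consistency check (assuming the printed MiB values are only subject to rounding):

    # Consistency check, not KoboldCpp code.
    cpu_mib, rocm_mib = 11629.41, 6279.09
    print(cpu_mib + rocm_mib)              # 17908.50 MiB
    print(17.49 * 1024)                    # 17909.76 MiB, so the CPU/GPU split adds up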
- Automatic RoPE Scaling: Using model internal value.
- llama_new_context_with_model: n_seq_max = 1
- llama_new_context_with_model: n_ctx = 16512
- llama_new_context_with_model: n_ctx_per_seq = 16512
- llama_new_context_with_model: n_batch = 512
- llama_new_context_with_model: n_ubatch = 512
- llama_new_context_with_model: flash_attn = 0
- llama_new_context_with_model: freq_base = 1000000.0
- llama_new_context_with_model: freq_scale = 1
- llama_new_context_with_model: n_ctx_per_seq (16512) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
- llama_kv_cache_init: CPU KV buffer size = 2580.00 MiB
- llama_kv_cache_init: ROCm0 KV buffer size = 1548.00 MiB
- llama_new_context_with_model: KV self size = 4128.00 MiB, K (f16): 2064.00 MiB, V (f16): 2064.00 MiB
- llama_new_context_with_model: CPU output buffer size = 0.58 MiB
- llama_new_context_with_model: ROCm0 compute buffer size = 1416.77 MiB
- llama_new_context_with_model: ROCm_Host compute buffer size = 42.26 MiB
- llama_new_context_with_model: graph nodes = 2246
- llama_new_context_with_model: graph splits = 564 (with bs=512), 3 (with bs=1)
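The KV-cache figures follow directly from the context length, layer count, and GQA width logged earlier, assuming f16 cache entries (2 bytes per element) and 24 of the 64 layers resident on the GPU:

    # KV-cache arithmetic from the logged values; illustrative only.
    n_ctx, n_layer, n_embd_kv = 16512, 64, 1024
    bytes_per_layer = 2 * n_ctx * n_embd_kv * 2   # K and V tensors, f16
    print(n_layer * bytes_per_layer / 2**20)      # 4128.0 MiB -> "KV self size"
    print(24 * bytes_per_layer / 2**20)           # 1548.0 MiB -> ROCm0 KV buffer
    print(40 * bytes_per_layer / 2**20)           # 2580.0 MiB -> CPU KV buffer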
- Load Text Model OK: True
- Embedded KoboldAI Lite loaded.
- Embedded API docs loaded.
- Starting Kobold API on port 5001 at http://localhost:5001/api/
- Starting OpenAI Compatible API on port 5001 at http://localhost:5001/v1/
- ======
- Please connect to custom endpoint at http://localhost:5001
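With the server listening, either endpoint announced above can be queried. A minimal client sketch in Python using the requests library; the /v1/chat/completions route and payload shape are the standard OpenAI ones, and the "model" string is a placeholder since the server already has one model loaded:

    import requests

    resp = requests.post(
        "http://localhost:5001/v1/chat/completions",
        json={
            "model": "koboldcpp",   # placeholder name, not validated by this sketch
            "messages": [{"role": "user", "content": "Hello!"}],
            "max_tokens": 64,
        },
        timeout=120,
    )
    print(resp.json()["choices"][0]["message"]["content"])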