- ***
- Welcome to KoboldCpp - Version 1.79.1.yr0-ROCm
- For command line arguments, please refer to --help
- ***
- Auto Selected HIP Backend...
- Auto Recommended GPU Layers: 24
- Attempting to use hipBLAS library for faster prompt ingestion. A compatible AMD GPU will be required.
- Initializing dynamic library: koboldcpp_hipblas.dll
- ==========
- Namespace(model='', model_param='E:/LargeLanguageModels/EVA-Qwen2.5-32B-v0.2-Q4_K_S.gguf', port=5001, port_param=5001, host='', launch=False, config=None, threads=7, usecublas=['normal', '0'], usevulkan=None, useclblast=None, usecpu=False, contextsize=16384, gpulayers=24, tensor_split=None, checkforupdates=False, ropeconfig=[0.0, 10000.0], blasbatchsize=512, blasthreads=7, lora=None, noshift=False, nofastforward=False, nommap=False, usemlock=False, noavx2=False, debugmode=0, onready='', benchmark=None, prompt='', promptlimit=100, multiuser=1, multiplayer=False, remotetunnel=False, highpriority=False, foreground=False, preloadstory=None, quiet=False, ssl=None, nocertify=False, mmproj=None, draftmodel=None, draftamount=8, password=None, ignoremissing=False, chatcompletionsadapter=None, flashattention=False, quantkv=0, forceversion=0, smartcontext=False, unpack='', nomodel=False, showgui=False, skiplauncher=False, hordemodelname='', hordeworkername='', hordekey='', hordemaxctx=0, hordegenlen=0, sdmodel='', sdthreads=7, sdclamped=0, sdt5xxl='', sdclipl='', sdclipg='', sdvae='', sdvaeauto=False, sdquant=False, sdlora='', sdloramult=1.0, whispermodel='', hordeconfig=None, sdconfig=None, noblas=False)
- ==========
- Loading model: E:\LargeLanguageModels\EVA-Qwen2.5-32B-v0.2-Q4_K_S.gguf
- The reported GGUF Arch is: qwen2
- Arch Category: 5
- ---
- Identified as GGUF model: (ver 6)
- Attempting to Load...
- ---
- Using automatic RoPE scaling for GGUF. If the model has custom RoPE settings, they'll be used directly instead!
- It means that the RoPE values written above will be replaced by the RoPE values indicated after loading.
- System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | AMX_INT8 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 |
- CUBLAS: Warning, you are running Qwen2 without Flash Attention and may observe incoherent output.
- ---
- Initializing CUDA/HIP, please wait, the following step may take a few minutes for first launch...
- ---
- ggml_cuda_init: found 1 ROCm devices:
- Device 0: AMD Radeon RX 7900 XT, compute capability 11.0, VMM: no
- llama_load_model_from_file: using device ROCm0 (AMD Radeon RX 7900 XT) - 20106 MiB free
- llama_model_loader: loaded meta data with 37 key-value pairs and 771 tensors from E:\LargeLanguageModels\EVA-Qwen2.5-32B-v0.2-Q4_K_S.gguf
- llm_load_vocab: special tokens cache size = 22
- llm_load_vocab: token to piece cache size = 0.9310 MB
- llm_load_print_meta: format = GGUF V3 (latest)
- llm_load_print_meta: arch = qwen2
- llm_load_print_meta: vocab type = BPE
- llm_load_print_meta: n_vocab = 152064
- llm_load_print_meta: n_merges = 151387
- llm_load_print_meta: vocab_only = 0
- llm_load_print_meta: n_ctx_train = 131072
- llm_load_print_meta: n_embd = 5120
- llm_load_print_meta: n_layer = 64
- llm_load_print_meta: n_head = 40
- llm_load_print_meta: n_head_kv = 8
- llm_load_print_meta: n_rot = 128
- llm_load_print_meta: n_swa = 0
- llm_load_print_meta: n_embd_head_k = 128
- llm_load_print_meta: n_embd_head_v = 128
- llm_load_print_meta: n_gqa = 5
- llm_load_print_meta: n_embd_k_gqa = 1024
- llm_load_print_meta: n_embd_v_gqa = 1024
- llm_load_print_meta: f_norm_eps = 0.0e+00
- llm_load_print_meta: f_norm_rms_eps = 1.0e-05
- llm_load_print_meta: f_clamp_kqv = 0.0e+00
- llm_load_print_meta: f_max_alibi_bias = 0.0e+00
- llm_load_print_meta: f_logit_scale = 0.0e+00
- llm_load_print_meta: n_ff = 27648
- llm_load_print_meta: n_expert = 0
- llm_load_print_meta: n_expert_used = 0
- llm_load_print_meta: causal attn = 1
- llm_load_print_meta: pooling type = 0
- llm_load_print_meta: rope type = 2
- llm_load_print_meta: rope scaling = linear
- llm_load_print_meta: freq_base_train = 1000000.0
- llm_load_print_meta: freq_scale_train = 1
- llm_load_print_meta: n_ctx_orig_yarn = 131072
- llm_load_print_meta: rope_finetuned = unknown
- llm_load_print_meta: ssm_d_conv = 0
- llm_load_print_meta: ssm_d_inner = 0
- llm_load_print_meta: ssm_d_state = 0
- llm_load_print_meta: ssm_dt_rank = 0
- llm_load_print_meta: ssm_dt_b_c_rms = 0
- llm_load_print_meta: model type = 32B
- llm_load_print_meta: model ftype = all F32
- llm_load_print_meta: model params = 32.76 B
- llm_load_print_meta: model size = 17.49 GiB (4.59 BPW)
- llm_load_print_meta: general.name = Qwen2.5 32B
- llm_load_print_meta: BOS token = 11 ','
- llm_load_print_meta: EOS token = 151643 '<|endoftext|>'
- llm_load_print_meta: EOT token = 151645 '<|im_end|>'
- llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
- llm_load_print_meta: LF token = 148848 'ÄĬ'
- llm_load_print_meta: FIM PRE token = 151659 '<|fim_prefix|>'
- llm_load_print_meta: FIM SUF token = 151661 '<|fim_suffix|>'
- llm_load_print_meta: FIM MID token = 151660 '<|fim_middle|>'
- llm_load_print_meta: FIM PAD token = 151662 '<|fim_pad|>'
- llm_load_print_meta: FIM REP token = 151663 '<|repo_name|>'
- llm_load_print_meta: FIM SEP token = 151664 '<|file_sep|>'
- llm_load_print_meta: EOG token = 151643 '<|endoftext|>'
- llm_load_print_meta: EOG token = 151645 '<|im_end|>'
- llm_load_print_meta: EOG token = 151662 '<|fim_pad|>'
- llm_load_print_meta: EOG token = 151663 '<|repo_name|>'
- llm_load_print_meta: EOG token = 151664 '<|file_sep|>'
- llm_load_print_meta: max token length = 256
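A few of the hyperparameters printed above can be cross-checked against one another. The sketch below (Python, illustrative only, not part of the log or of KoboldCpp) recomputes the GQA ratio, the per-layer KV width, and the approximate file size from the reported parameter count and bits per weight:

    # Back-of-the-envelope checks using only values from llm_load_print_meta above.
    n_head, n_head_kv, n_embd_head_k = 40, 8, 128
    print(n_head // n_head_kv)             # 5, matches n_gqa
    print(n_head_kv * n_embd_head_k)       # 1024, matches n_embd_k_gqa / n_embd_v_gqa

    params, bpw = 32.76e9, 4.59            # model params and bits per weight as reported
    print(params * bpw / 8 / 2**30)        # ~17.5 GiB, matches "model size = 17.49 GiB"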
- llm_load_tensors: tensor 'token_embd.weight' (q4_K) (and 482 others) cannot be used with preferred buffer type ROCm_Host, using CPU instead (This is not an error, it just means some tensors will use CPU instead.)
- llm_load_tensors: offloading 24 repeating layers to GPU
- llm_load_tensors: offloaded 24/65 layers to GPU
- llm_load_tensors: CPU_Mapped model buffer size = 11629.41 MiB
- llm_load_tensors: ROCm0 model buffer size = 6279.09 MiB
- .................................................................................................
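The two tensor buffers reported above should together account for the whole model file; a quick consistency check (assuming the printed MiB values are only subject to rounding):

    # Consistency check, not KoboldCpp code.
    cpu_mib, rocm_mib = 11629.41, 6279.09
    print(cpu_mib + rocm_mib)              # 17908.50 MiB
    print(17.49 * 1024)                    # 17909.76 MiB, so the CPU/GPU split adds up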
- Automatic RoPE Scaling: Using model internal value.
- llama_new_context_with_model: n_seq_max = 1
- llama_new_context_with_model: n_ctx = 16512
- llama_new_context_with_model: n_ctx_per_seq = 16512
- llama_new_context_with_model: n_batch = 512
- llama_new_context_with_model: n_ubatch = 512
- llama_new_context_with_model: flash_attn = 0
- llama_new_context_with_model: freq_base = 1000000.0
- llama_new_context_with_model: freq_scale = 1
- llama_new_context_with_model: n_ctx_per_seq (16512) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
- llama_kv_cache_init: CPU KV buffer size = 2580.00 MiB
- llama_kv_cache_init: ROCm0 KV buffer size = 1548.00 MiB
- llama_new_context_with_model: KV self size = 4128.00 MiB, K (f16): 2064.00 MiB, V (f16): 2064.00 MiB
- llama_new_context_with_model: CPU output buffer size = 0.58 MiB
- llama_new_context_with_model: ROCm0 compute buffer size = 1416.77 MiB
- llama_new_context_with_model: ROCm_Host compute buffer size = 42.26 MiB
- llama_new_context_with_model: graph nodes = 2246
- llama_new_context_with_model: graph splits = 564 (with bs=512), 3 (with bs=1)
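The KV-cache figures follow directly from the context length, layer count, and GQA width logged earlier, assuming f16 cache entries (2 bytes per element) and 24 of the 64 layers resident on the GPU:

    # KV-cache arithmetic from the logged values; illustrative only.
    n_ctx, n_layer, n_embd_kv = 16512, 64, 1024
    bytes_per_layer = 2 * n_ctx * n_embd_kv * 2   # K and V tensors, f16
    print(n_layer * bytes_per_layer / 2**20)      # 4128.0 MiB -> "KV self size"
    print(24 * bytes_per_layer / 2**20)           # 1548.0 MiB -> ROCm0 KV buffer
    print(40 * bytes_per_layer / 2**20)           # 2580.0 MiB -> CPU KV buffer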
- Load Text Model OK: True
- Embedded KoboldAI Lite loaded.
- Embedded API docs loaded.
- Starting Kobold API on port 5001 at http://localhost:5001/api/
- Starting OpenAI Compatible API on port 5001 at http://localhost:5001/v1/
- ======
- Please connect to custom endpoint at http://localhost:5001
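With the server listening, either endpoint announced above can be queried. A minimal client sketch in Python using the requests library; the /v1/chat/completions route and payload shape are the standard OpenAI ones, and the "model" string is a placeholder since the server already has one model loaded:

    import requests

    resp = requests.post(
        "http://localhost:5001/v1/chat/completions",
        json={
            "model": "koboldcpp",   # placeholder name, not validated by this sketch
            "messages": [{"role": "user", "content": "Hello!"}],
            "max_tokens": 64,
        },
        timeout=120,
    )
    print(resp.json()["choices"][0]["message"]["content"])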