Replies: 2 comments
-
llama.cpp provides a quantize tool to convert an fp16 checkpoint to lower precision.
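For reference, a minimal sketch of that step, assuming an f16 GGUF has already been produced by the convert script and that the quantize binary was built from the llama.cpp source tree (file names are placeholders):

# quantize an f16 GGUF down to 8-bit; usage: quantize <input.gguf> <output.gguf> <type>
./quantize ./models/starcoder-1b-f16.gguf ./models/starcoder-1b-q8_0.gguf q8_0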
-
I cloned StarCoder, then converted it to the GGUF file format using convert-hf-to-gguf.py from llama.cpp.
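Roughly, the conversion command looked like the following sketch (the local model directory is a placeholder, and the exact flags may differ between llama.cpp versions):

# convert the Hugging Face checkpoint to an f16 GGUF
python convert-hf-to-gguf.py ./starcoderbase-1b --outfile startcoder1b.gguf --outtype f16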
Finally, when I tried to run StarCoder-1B with llama.cpp, it failed.
Log:
Log start
main: build = 1699 (b9f4795)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 1703645466
llama_model_loader: loaded meta data with 17 key-value pairs and 292 tensors from startcoder1b.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = starcoder
llama_model_loader: - kv 1: general.name str = StarCoder
llama_model_loader: - kv 2: starcoder.context_length u32 = 8192
llama_model_loader: - kv 3: starcoder.embedding_length u32 = 2048
llama_model_loader: - kv 4: starcoder.feed_forward_length u32 = 8192
llama_model_loader: - kv 5: starcoder.block_count u32 = 24
llama_model_loader: - kv 6: starcoder.attention.head_count u32 = 16
llama_model_loader: - kv 7: starcoder.attention.head_count_kv u32 = 1
llama_model_loader: - kv 8: starcoder.attention.layer_norm_epsilon f32 = 0.000010
llama_model_loader: - kv 9: general.file_type u32 = 1
llama_model_loader: - kv 10: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 11: tokenizer.ggml.tokens arr[str,49152] = ["<|endoftext|>", "<fim_prefix>", "<f...
llama_model_loader: - kv 12: tokenizer.ggml.token_type arr[i32,49152] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv 13: tokenizer.ggml.merges arr[str,48891] = ["Ġ Ġ", "ĠĠ ĠĠ", "ĠĠĠĠ ĠĠ...
llama_model_loader: - kv 14: tokenizer.ggml.bos_token_id u32 = 0
llama_model_loader: - kv 15: tokenizer.ggml.eos_token_id u32 = 0
llama_model_loader: - kv 16: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - type f32: 194 tensors
llama_model_loader: - type f16: 98 tensors
llm_load_vocab: special tokens definition check successful ( 19/49152 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = starcoder
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 49152
llm_load_print_meta: n_merges = 48891
llm_load_print_meta: n_ctx_train = 8192
llm_load_print_meta: n_embd = 2048
llm_load_print_meta: n_head = 16
llm_load_print_meta: n_head_kv = 1
llm_load_print_meta: n_layer = 24
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 16
llm_load_print_meta: f_norm_eps = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 0.0e+00
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 8192
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 8192
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 1B
llm_load_print_meta: model ftype = F16
llm_load_print_meta: model params = 1.14 B
llm_load_print_meta: model size = 2.12 GiB (16.01 BPW)
llm_load_print_meta: general.name = StarCoder
llm_load_print_meta: BOS token = 0 '<|endoftext|>'
llm_load_print_meta: EOS token = 0 '<|endoftext|>'
llm_load_print_meta: UNK token = 0 '<|endoftext|>'
llm_load_print_meta: LF token = 145 'Ä'
llm_load_tensors: ggml ctx size = 0.11 MiB
error loading model: create_tensor: tensor 'output.weight' not found
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model 'startcoder1b.gguf'
main: error: unable to load model
So, why did it fail? How did you convert StarCoder to GGUF?
-
Hello,
After reviewing the documentation on the model registry, I understand that Tabby has been using llama.cpp for inference since version 0.5.0, which supports GGUF format model files. However, I've noticed that Tabby ships 8-bit quantized models, whereas the convert-hf-to-gguf.py script provided with llama.cpp only supports float16 and float32.
Could you please explain how the StarCoderBase-1B model was converted into the 8-bit quantized format q8_0.v2.gguf?
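My guess, based on the earlier comment about the quantize tool, is a two-step flow along these lines (a sketch only; the model directory and output file names are assumptions):

# step 1: convert the Hugging Face checkpoint to an f16 GGUF
python convert-hf-to-gguf.py ./starcoderbase-1b --outfile starcoderbase-1b-f16.gguf --outtype f16
# step 2: quantize the f16 GGUF down to 8-bit with the llama.cpp quantize tool
./quantize starcoderbase-1b-f16.gguf starcoderbase-1b-q8_0.gguf q8_0

Is that how the published q8_0.v2.gguf was produced?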
Thank you for your assistance.