Hi there!
I'm having some trouble when using the `libllama` API from Python via `ctypes`, and I'd really appreciate it if anybody could help me figure this out. I'll try to explain the issue as effectively as I can within this post, but if you'd like to look at the full code, it's these two files:

- `libllama.py`
- `llama.py`
Setup
The model I'm using is a q6_K GGUF quant of Llama-3.1-8B-Instruct. The quant is confirmed working with `llama-cli` and other llama.cpp examples. I'm loading it with `n_ctx` = 8192 and `n_batch` = 2048.

The prompt that I'm using to test the model is as follows. The `\n` characters are actual newlines, not a literal `"\n"` string. I'm using the Llama 3 instruct template:
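Putting this together, the setup looks roughly like the sketch below. (This is simplified: `lib` is the ctypes handle to libllama, the wrapper names mirror the functions in llama.h and may differ slightly between builds, `tokenize` stands in for the call through my `llama_tokenize` wrapper, and the prompt wording is paraphrased; it just asks for a silly story about two potatoes in love.)

```python
lib.llama_backend_init()

# load the model (newer llama.h builds name this llama_model_load_from_file)
mparams = lib.llama_model_default_params()
model = lib.llama_load_model_from_file(b"/path/to/Llama-3.1-8B-Instruct-Q6_K.gguf", mparams)

# create the context with n_ctx = 8192 and n_batch = 2048
cparams = lib.llama_context_default_params()
cparams.n_ctx = 8192
cparams.n_batch = 2048
ctx = lib.llama_new_context_with_model(model, cparams)

# Llama 3 instruct template; the \n are real newline characters
prompt = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    "Please write a silly story about two potatoes in love."
    "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
)

# tokenize() stands in for my llama_tokenize wrapper (special tokens parsed)
tokens: list[int] = tokenize(model, prompt)
```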
This tokenizes to the expected token IDs. 👍 So far so good. This `list[int]` of tokens is stored in a variable called `tokens` from here on out.

Next, I set up the batch:
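In outline it does something like this (a simplified sketch; the fields are the ones from the `llama_batch` struct in llama.h, accessed through ctypes):

```python
n_tokens = len(tokens)

# one sequence, no embeddings, room for the whole prompt
batch = lib.llama_batch_init(n_tokens, 0, 1)
batch.n_tokens = n_tokens

for i, tok in enumerate(tokens):
    batch.token[i]     = tok   # token ID
    batch.pos[i]       = i     # position in the sequence
    batch.n_seq_id[i]  = 1
    batch.seq_id[i][0] = 0     # everything belongs to sequence 0
    batch.logits[i]    = 0     # no logits for prompt tokens ...

batch.logits[n_tokens - 1] = 1  # ... except the very last one
```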
Then I call `llama_decode` with this batch:
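That call is essentially just (with `ctx` being the context handle from the setup sketch):

```python
ret = lib.llama_decode(ctx, batch)
print(ret)  # prints 0
```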
`llama_decode` returns 0 - all good. No errors or warnings in the terminal. 👍

The problem
The problem is that after calling `llama_decode`, the logits are not what I expect. Whether I get the logits via `llama_get_logits` or `llama_get_logits_ith`, or even if I initialize and use a greedy sampler, the top token ID is always 1839. This token is `' href'`, which does not make sense as the first word in a story about two potatoes in love. Using this model, and using the above tokens as the only context, I consistently get `' href'` as the most likely next token, no matter how many times I try.

I would expect the output to be something like "Once" (as in "Once upon a time..."), "Sure" (as in "Sure, here's a silly story..."), or something like that. Not " href".
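Concretely, the kind of check I'm doing looks like this (a simplified sketch; the function names come from llama.h and go through the ctypes wrappers):

```python
n_vocab = lib.llama_n_vocab(model)  # vocab size

# logits for the last position (the only one with its logits flag set);
# index -1 means "the last output" in llama_get_logits_ith
logits_ptr = lib.llama_get_logits_ith(ctx, -1)
logits = [logits_ptr[i] for i in range(n_vocab)]

top_id = max(range(n_vocab), key=lambda i: logits[i])
print(top_id)  # always 1839, i.e. ' href'
```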
What I've tried
- Using `logits_all = True`
- Using `logits_all = False`
- Using a `llama_sampler`, like so (a sketch is below this list), which prints the same ' href' token
- Using a different model: replacing Llama-3.1-8B-Instruct with Llama-3.2-1B-Instruct
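The sampler variant looks roughly like this (simplified sketch, again going through the ctypes wrappers; the sampler chain API is the one from llama.h):

```python
sparams = lib.llama_sampler_chain_default_params()
chain = lib.llama_sampler_chain_init(sparams)
lib.llama_sampler_chain_add(chain, lib.llama_sampler_init_greedy())

# -1 = sample from the last position that had its logits flag set
tok_id = lib.llama_sampler_sample(chain, ctx, -1)
print(tok_id)  # 1839 (' href')
```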
My question
Why is this happening? I think I'm setting up the batch correctly. `llama_decode` returns 0. No errors in the terminal. Several different methods all yield the same nonsensical result. I've been trying to fix this for like 2 days straight, and I'm just not sure what else to try at this point.

I'm hoping someone smarter than me will be able to chime in and point me in the right direction. To that end, please excuse the following behaviour: @compilade @slaren @ngxson @JohannesGaessler @bartowski1182. I think you all know the codebase better than I do, and if you could spare a few minutes to look over this issue, I'd be really grateful. If not, feel free to ignore. In either case, thank you for reading, and have a nice day. :)