What does the prompt context mean? #1838
-
Hello, I would like to understand what the prompt context means.
Questions:
Sorry if the questions are basic for experts. I am trying to learn the field.
-
You'll probably get an error. Your prompt plus any output the LLM generates has to fit in the context size.
There's a separate parameter for batch size (used for feeding the LLM your prompt, not generating tokens).
This part doesn't really have anything to do with the model. The prompt and output are more like a shared editor; the chat or Q&A part is basically an illusion. So there isn't a definite answer here, it will depend on what options you are using.
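To make the distinction concrete, here is a minimal sketch of the loop structure. It assumes nothing about the real llama.cpp API: evaluate_batch and sample_next_token are hypothetical stand-ins. The point is only that n_batch controls how the prompt is fed in, while n_ctx bounds the prompt plus everything generated after it.
```
// Minimal sketch: n_batch only chunks the prompt; n_ctx caps prompt + output.
// evaluate_batch() and sample_next_token() are hypothetical stand-ins, not
// the real llama.cpp API.
#include <algorithm>
#include <cstdio>
#include <vector>

static void evaluate_batch(const std::vector<int> & tokens, int n_past) {
    (void) tokens; (void) n_past; // a real implementation would run the model here
}

static int sample_next_token() { return 42; } // dummy "generated" token

int main() {
    const int n_ctx   = 512; // the whole window the model can attend to
    const int n_batch = 8;   // prompt tokens evaluated per step (speed/memory only)

    std::vector<int> prompt = {1, 2, 3, 4}; // pretend this is the tokenized prompt
    int n_past = 0;

    // feed the prompt in chunks of n_batch
    for (size_t i = 0; i < prompt.size(); i += n_batch) {
        const size_t end = std::min(prompt.size(), i + n_batch);
        evaluate_batch(std::vector<int>(prompt.begin() + i, prompt.begin() + end), n_past);
        n_past += (int) (end - i);
    }

    // generation adds one token at a time, and it still has to fit in n_ctx
    while (n_past < n_ctx) {
        evaluate_batch({sample_next_token()}, n_past);
        n_past += 1;
    }

    printf("out of context after %d tokens\n", n_past);
    return 0;
}
```
Once n_past reaches n_ctx something has to give, which is what the context-swapping code discussed below is for.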
-
i don't get it. what's the purpose of this code:
```
// infinite text generation via context swapping
// if we run out of context:
// - take the n_keep first tokens from the original prompt (via n_past)
// - take half of the last (n_ctx - n_keep) tokens and recompute the logits in batches
if (n_past + (int) embd.size() > n_ctx) {
    const int n_left = n_past - params.n_keep;

    // always keep the first token - BOS
    n_past = std::max(1, params.n_keep);

    // insert n_left/2 tokens at the start of embd from last_n_tokens
    embd.insert(embd.begin(),
                last_n_tokens.begin() + n_ctx - n_left/2 - embd.size(),
                last_n_tokens.end() - embd.size());

    // stop saving session if we run out of context
    path_session.clear();

    printf("\n---\n");
    printf("resetting: '");
    for (int i = 0; i < (int) embd.size(); i++) {
        printf("%s", llama_token_to_str(ctx, embd[i]));
    }
    printf("'\n");
    printf("\n---\n");
}
```
why do we need to do this?
…On Wed, Jun 14, 2023 at 8:20 AM Kerfuffle wrote:
> I have definitely seen chat sessions where the length of chat goes beyond 512 words.

Sorry I wasn't clear. 512 is the default context size, but you can set it higher. Most LLaMA models are trained at 2048. You *can* set it higher than that, but generally that's where the output gets completely incoherent. Also note that setting it higher can use more memory and affect performance.
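If you are driving the library directly rather than the main example, the context size is just a field in the context parameters. This is a rough sketch against the C API as it looked around the time of this thread; check llama.h in your version for the exact names, and the model path is only a placeholder.
```
// Sketch: raising n_ctx above the 512 default via the C API of this era.
// Check llama.h for your version; "models/7B/ggml-model.bin" is a placeholder.
#include <cstdio>
#include "llama.h"

int main() {
    llama_context_params params = llama_context_default_params();
    params.n_ctx = 2048; // LLaMA models are trained with a 2048-token context

    llama_context * ctx = llama_init_from_file("models/7B/ggml-model.bin", params);
    if (ctx == NULL) {
        fprintf(stderr, "failed to load model\n");
        return 1;
    }

    printf("context size: %d tokens\n", llama_n_ctx(ctx));
    llama_free(ctx);
    return 0;
}
```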
> The LLM is not aware that we are doing a Q&A?

That's not an easy question to answer, because what does "the LLM is/isn't aware" even mean?
Depending on the model, it may have been trained on a specific format like:

`### Instruction: Some question.`
`### Response: Some response.`

You may be able to have more than one of those instruction/response pairs in the context to simulate a conversation. However, the LLM will happily generate both sides of the exchange if you let it. That's what I meant about there not being a *real* distinction.
Software like llama.cpp (if configured) can watch for the LLM writing `### Instruction:` and return control to the user at that point, so you can have a conversation, but that's not really part of the model itself, if that makes any sense.
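A toy illustration of that reverse-prompt idea, with a hard-coded stream of fake output pieces standing in for the model: the program just completes text, and the "conversation" comes from scanning the output for a marker string and handing control back when it shows up.
```
// Toy sketch of the reverse-prompt trick: watch the generated text for a
// marker string and stop so the user can type the next turn. The "generated"
// pieces below are hard-coded stand-ins for model output.
#include <cstdio>
#include <string>

int main() {
    const std::string reverse_prompt = "### Instruction:";
    std::string output;

    const char * generated[] = {" Sure", ",", " here", " you", " go", ".",
                                "\n", "###", " Instruction", ":"};

    for (const char * piece : generated) {
        output += piece;
        printf("%s", piece);
        // the model will happily start writing the user's side; cut it off here
        if (output.find(reverse_prompt) != std::string::npos) {
            printf("\n[reverse prompt seen, control returned to user]\n");
            break;
        }
    }
    return 0;
}
```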
> E.g., if I set prompt size to 512 tokens then the LLM uses the last 512 tokens to decide what next word to predict?

Generally it uses the whole context. There are some special tokens that can change how the LLM behaves, though. For LLaMA models there's a start of document token, end of document token, etc. If you let the model generate the EOD token then it may ignore stuff above that point.
The main example also has a `--keep N` argument which lets you generate more tokens than the context size. It will copy some tokens from the prompt back into the context and let generation continue. Of course this is just taking the full buffer and slapping the prompt on top of it: then the LLM continues from that state. It may or may not produce something coherent - partially because a chunk of the context is missing, and partly because it chopped at a pretty arbitrary place which may be in the middle of a word or sentence.
-
Thanks. Going back to the code: don't we also need to clear out ctx?
Doesn't the result depend on the state? Where is the state stored? Is the state the weights? Do they change as the program runs? Could you clear this up for me? Thanks.
-
thanks. this is helpful.
…On Thu, Jun 15, 2023 at 2:31 PM Kerfuffle wrote:
The main example/common part is complicated code designed to handle a
bunch of features at the same time. I can't really tell specifically what
it does, or what every variable is for. (Not because I don't want to, but I
didn't write it and haven't interacted with it much). You'll have to figure
that out yourself.
"State" exists in a number of places. Something like keeping track of how
many tokens have been evaluated is a type of state. There are also tensors
(in the context) for holding the model's actual state (you could call that
its short term memory).
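To make "where does the state live" concrete: the weights are read-only while you generate; the mutable part sits inside the llama_context, and the C API of this era let you copy it out and put it back. A rough sketch (function names as they were around mid-2023, so check llama.h for your version):
```
// Sketch: the model weights never change during generation; the mutable state
// (KV cache etc.) lives in the llama_context and can be snapshotted/restored.
// API names are from roughly this era of llama.cpp -- check llama.h.
#include <cstdint>
#include <vector>
#include "llama.h"

std::vector<uint8_t> snapshot_state(llama_context * ctx) {
    std::vector<uint8_t> buf(llama_get_state_size(ctx)); // size of the mutable state
    llama_copy_state_data(ctx, buf.data());              // copy it out of the context
    return buf;
}

void restore_state(llama_context * ctx, std::vector<uint8_t> & buf) {
    llama_set_state_data(ctx, buf.data()); // rewind the context to that snapshot
}
```
This is roughly what the session-file feature builds on; clearing ctx entirely would throw that short-term memory away, which is exactly what the --keep code is trying not to do.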
> Don't we also need to clear out `ctx`?

The point of that code isn't to just completely reset the state, as if you had restarted the program. It's to allow you to generate past the context length with at least a chance of producing some kind of meaningful output.
Let me show you a more concrete example. First: this isn't meant to show
*exactly* what happens (I don't know enough to explain that) but just the
general idea. I'll use numbers to indicate prompt tokens and letters to
indicate tokens generated by the model.
Let's say you have a prompt `1 2 3 4` and your context length is 16. You could say the context looks like this initially:
`|................|`
Then the prompt gets fed to the model:
`|1234............|`
Then the model starts generating tokens:
`|1234ABCD........|`
until it reaches the end of available context space:
`|1234ABCDEFGHIJKL|`
Now what? Well, one way to handle running out of context space is to just stop and exit. Another is to just reset *everything*, possibly feed in the prompt and start another version of what could be generated from that prompt. (There are ways to do both those things with the main example.)
However, `--keep N` takes a different approach. Let's say you used `--keep 4`:
`|1234IJKL^.......|`
Again, I just want to be clear I'm trying to show the very, very general
idea. Does the prompt get copied to the beginning? The middle? Do the
existing tokens get copied to a different place? Maybe. The layout may not
be like that diagram.
Anyway, just as an example, you could imagine the next token now gets generated at the `^`. So the model has the initial prompt + some of the last tokens it already wrote in its "memory". Generating a token at that point has a reasonable chance of being related to the prompt + previous stuff, but it's cut at an arbitrary point. A "token" isn't a word; it may be part of a word, it may be punctuation.
Is this better than just giving up and exiting or clearing the context to
start over? You'll have to play with it to see. I get pretty decent results
sometimes.
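If it helps, here is a standalone sketch of the arithmetic behind that diagram and the code block earlier in the thread, using n_ctx = 16 and n_keep = 4. With those numbers the actual code keeps half of the 12 non-kept tokens, i.e. the last 6 generated ones, so slightly more than the diagram shows. No model is involved; it only prints which tokens survive the swap.
```
// Standalone sketch of the --keep context-swap arithmetic (no model involved),
// mirroring the embd.insert(...) logic from the main example with n_ctx = 16
// and n_keep = 4.
#include <algorithm>
#include <cstdio>
#include <string>
#include <vector>

int main() {
    const int n_ctx  = 16;
    const int n_keep = 4;

    // the full window: prompt "1234" plus generated "ABCDEFGHIJKL"
    std::vector<std::string> last_n = {"1","2","3","4","A","B","C","D",
                                       "E","F","G","H","I","J","K","L"};

    int n_past = n_ctx;                 // the window is full
    const int n_left = n_past - n_keep; // 12 tokens beyond the kept prompt part

    n_past = std::max(1, n_keep);       // keep the first n_keep tokens (at least BOS)
    printf("n_past reset to %d\n", n_past);

    // keep the last n_left/2 generated tokens, like the main example does
    std::vector<std::string> kept(last_n.begin(), last_n.begin() + n_keep);
    kept.insert(kept.end(), last_n.end() - n_left/2, last_n.end());

    printf("window after the swap: ");
    for (const auto & t : kept) printf("%s", t.c_str());
    printf(" (%zu of %d slots used, generation continues here)\n", kept.size(), n_ctx);
    // prints: 1234GHIJKL (10 of 16 slots used, generation continues here)
    return 0;
}
```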
-
Managing state is absolutely fundamental and part of nearly every sub-part of every use case... how are these instructions so unclear? An LLM system with no way to manage state is virtually useless, even paradoxical as a concept... why... how??? It's like a radio made with no on/off switch where customers will just have to figure that out... what does that even mean?
-
I am revisiting this issue after a break. Are there any updates to llama.cpp in this area? Back when I used it, this was the blocker that kept me from using it in a real-world application. I think it would be good to add a method to