What does the prompt context mean? #1838
-
Hello, I would like to understand what the prompt context means.
Questions:
Sorry if the questions are basic for experts. I am trying to learn the field.
-
You'll probably get an error. Your prompt plus any output the LLM generates has to fit in the context size.
There's a separate parameter for batch size (used for feeding the LLM your prompt, not generating tokens).
This part doesn't really have anything to do with the model. The prompt and output are more like a shared editor; the chat or Q&A part is basically an illusion. So there isn't a definite answer here, it will depend on what options you are using.
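To make the distinction concrete, here is a minimal sketch of the loop structure. It assumes nothing about the real llama.cpp API: evaluate_batch and sample_next_token are hypothetical stand-ins. The point is only that n_batch controls how the prompt is fed in, while n_ctx bounds the prompt plus everything generated after it.
```
// Minimal sketch: n_batch only chunks the prompt; n_ctx caps prompt + output.
// evaluate_batch() and sample_next_token() are hypothetical stand-ins, not
// the real llama.cpp API.
#include <algorithm>
#include <cstdio>
#include <vector>

static void evaluate_batch(const std::vector<int> & tokens, int n_past) {
    (void) tokens; (void) n_past; // a real implementation would run the model here
}

static int sample_next_token() { return 42; } // dummy "generated" token

int main() {
    const int n_ctx   = 512; // the whole window the model can attend to
    const int n_batch = 8;   // prompt tokens evaluated per step (speed/memory only)

    std::vector<int> prompt = {1, 2, 3, 4}; // pretend this is the tokenized prompt
    int n_past = 0;

    // feed the prompt in chunks of n_batch
    for (size_t i = 0; i < prompt.size(); i += n_batch) {
        const size_t end = std::min(prompt.size(), i + n_batch);
        evaluate_batch(std::vector<int>(prompt.begin() + i, prompt.begin() + end), n_past);
        n_past += (int) (end - i);
    }

    // generation adds one token at a time, and it still has to fit in n_ctx
    while (n_past < n_ctx) {
        evaluate_batch({sample_next_token()}, n_past);
        n_past += 1;
    }

    printf("out of context after %d tokens\n", n_past);
    return 0;
}
```
Once n_past reaches n_ctx something has to give, which is what the context-swapping code discussed below is for.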
-
i don't get it. what's the purpose of this code:
```
// infinite text generation via context swapping
// if we run out of context:
// - take the n_keep first tokens from the original prompt (via n_past)
// - take half of the last (n_ctx - n_keep) tokens and recompute the logits in batches
if (n_past + (int) embd.size() > n_ctx) {
    const int n_left = n_past - params.n_keep;

    // always keep the first token - BOS
    n_past = std::max(1, params.n_keep);

    // insert n_left/2 tokens at the start of embd from last_n_tokens
    embd.insert(embd.begin(),
                last_n_tokens.begin() + n_ctx - n_left/2 - embd.size(),
                last_n_tokens.end() - embd.size());

    // stop saving session if we run out of context
    path_session.clear();

    printf("\n---\n");
    printf("resetting: '");
    for (int i = 0; i < (int) embd.size(); i++) {
        printf("%s", llama_token_to_str(ctx, embd[i]));
    }
    printf("'\n");
    printf("\n---\n");
}
```
why do we need to do this?
…On Wed, Jun 14, 2023 at 8:20 AM Kerfuffle wrote:
> I have definitely seen chat sessions where the length of chat goes beyond 512 words.

Sorry I wasn't clear. 512 is the default context size, but you can set it higher. Most LLaMA models are trained at 2048. You *can* set it higher than that, but generally that's where the output gets completely incoherent. Also note that setting it higher can use more memory and affect performance.
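If you are driving the library directly rather than the main example, the context size is just a field in the context parameters. This is a rough sketch against the C API as it looked around the time of this thread; check llama.h in your version for the exact names, and the model path is only a placeholder.
```
// Sketch: raising n_ctx above the 512 default via the C API of this era.
// Check llama.h for your version; "models/7B/ggml-model.bin" is a placeholder.
#include <cstdio>
#include "llama.h"

int main() {
    llama_context_params params = llama_context_default_params();
    params.n_ctx = 2048; // LLaMA models are trained with a 2048-token context

    llama_context * ctx = llama_init_from_file("models/7B/ggml-model.bin", params);
    if (ctx == NULL) {
        fprintf(stderr, "failed to load model\n");
        return 1;
    }

    printf("context size: %d tokens\n", llama_n_ctx(ctx));
    llama_free(ctx);
    return 0;
}
```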
> The LLM is not aware that we are doing a Q&A?

That's not an easy question to answer, because what does "the LLM is/isn't aware" even mean?
Depending on the model, it may have been trained on a specific format like:

`### Instruction: Some question.`
`### Response: Some response.`

You may be able to have more than one of those instruction/response pairs in the context to simulate a conversation. However, the LLM will happily generate both sides of the exchange if you let it. That's what I meant about there not being a *real* distinction.
Software like llama.cpp (if configured) can watch for the LLM writing `### Instruction:` and return control to the user at that point, so you can have a conversation, but that's not really part of the model itself, if that makes any sense.
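A toy illustration of that reverse-prompt idea, with a hard-coded stream of fake output pieces standing in for the model: the program just completes text, and the "conversation" comes from scanning the output for a marker string and handing control back when it shows up.
```
// Toy sketch of the reverse-prompt trick: watch the generated text for a
// marker string and stop so the user can type the next turn. The "generated"
// pieces below are hard-coded stand-ins for model output.
#include <cstdio>
#include <string>

int main() {
    const std::string reverse_prompt = "### Instruction:";
    std::string output;

    const char * generated[] = {" Sure", ",", " here", " you", " go", ".",
                                "\n", "###", " Instruction", ":"};

    for (const char * piece : generated) {
        output += piece;
        printf("%s", piece);
        // the model will happily start writing the user's side; cut it off here
        if (output.find(reverse_prompt) != std::string::npos) {
            printf("\n[reverse prompt seen, control returned to user]\n");
            break;
        }
    }
    return 0;
}
```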
> E.g., if I set prompt size to 512 tokens then the LLM uses the last 512 tokens to decide what next word to predict?

Generally it uses the whole context. There are some special tokens that can change how the LLM behaves, though. For LLaMA models there's a start of document token, end of document token, etc. If you let the model generate the EOD token then it may ignore stuff above that point.
The main example also has a `--keep N` argument which lets you generate more tokens than the context size. It will copy some tokens from the prompt back into the context and let generation continue. Of course this is just taking the full buffer and slapping the prompt on top of it: then the LLM continues from that state. It may or may not produce something coherent - partially because a chunk of the context is missing, and partly because it chopped at a pretty arbitrary place which may be in the middle of a word or sentence.
-
Thanks. Going back to the code: don't we also need to clear out ctx?
Doesn't the result depend on the state? Where is the state stored? Is the state the weights? Do they change as the program runs? Could you clear this up for me? Thanks.
-
thanks. this is helpful.
…On Thu, Jun 15, 2023 at 2:31 PM Kerfuffle wrote:
The main example/common part is complicated code designed to handle a
bunch of features at the same time. I can't really tell specifically what
it does, or what every variable is for. (Not because I don't want to, but I
didn't write it and haven't interacted with it much). You'll have to figure
that out yourself.
"State" exists in a number of places. Something like keeping track of how
many tokens have been evaluated is a type of state. There are also tensors
(in the context) for holding the model's actual state (you could call that
its short term memory).
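To make "where does the state live" concrete: the weights are read-only while you generate; the mutable part sits inside the llama_context, and the C API of this era let you copy it out and put it back. A rough sketch (function names as they were around mid-2023, so check llama.h for your version):
```
// Sketch: the model weights never change during generation; the mutable state
// (KV cache etc.) lives in the llama_context and can be snapshotted/restored.
// API names are from roughly this era of llama.cpp -- check llama.h.
#include <cstdint>
#include <vector>
#include "llama.h"

std::vector<uint8_t> snapshot_state(llama_context * ctx) {
    std::vector<uint8_t> buf(llama_get_state_size(ctx)); // size of the mutable state
    llama_copy_state_data(ctx, buf.data());              // copy it out of the context
    return buf;
}

void restore_state(llama_context * ctx, std::vector<uint8_t> & buf) {
    llama_set_state_data(ctx, buf.data()); // rewind the context to that snapshot
}
```
This is roughly what the session-file feature builds on; clearing ctx entirely would throw that short-term memory away, which is exactly what the --keep code is trying not to do.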
> Don't we also need to clear out `ctx`?

The point of that code isn't to just completely reset the state, as if you had restarted the program. It's to allow you to generate past the context length with at least a chance of producing some kind of meaningful output.
Let me show you a more concrete example. First: this isn't meant to show
*exactly* what happens (I don't know enough to explain that) but just the
general idea. I'll use numbers to indicate prompt tokens and letters to
indicate tokens generated by the model.
Let's say you have a prompt `1 2 3 4` and your context length is 16. You could say the context looks like this initially:
`|................|`
Then the prompt gets fed to the model:
`|1234............|`
Then the model starts generating tokens:
`|1234ABCD........|`
until it reaches the end of available context space:
`|1234ABCDEFGHIJKL|`
Now what? Well, one way to handle running out of context space is to just stop and exit. Another is to just reset *everything*, possibly feed in the prompt and start another version of what could be generated from that prompt. (There are ways to do both those things with the main example.)
However, `--keep N` takes a different approach. Let's say you used `--keep 4`:
`|1234IJKL^.......|`
Again, I just want to be clear I'm trying to show the very, very general
idea. Does the prompt get copied to the beginning? The middle? Do the
existing tokens get copied to a different place? Maybe. The layout may not
be like that diagram.
Anyway, just as an example, you could imagine the next token now gets generated at the `^`. So the model has the initial prompt + some of the last tokens it already wrote in its "memory". Generating a token at that point has a reasonable chance of being related to the prompt + previous stuff, but it's cut at an arbitrary point. A "token" isn't a word; it may be part of a word, it may be punctuation.
Is this better than just giving up and exiting or clearing the context to
start over? You'll have to play with it to see. I get pretty decent results
sometimes.
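If it helps, here is a standalone sketch of the arithmetic behind that diagram and the code block earlier in the thread, using n_ctx = 16 and n_keep = 4. With those numbers the actual code keeps half of the 12 non-kept tokens, i.e. the last 6 generated ones, so slightly more than the diagram shows. No model is involved; it only prints which tokens survive the swap.
```
// Standalone sketch of the --keep context-swap arithmetic (no model involved),
// mirroring the embd.insert(...) logic from the main example with n_ctx = 16
// and n_keep = 4.
#include <algorithm>
#include <cstdio>
#include <string>
#include <vector>

int main() {
    const int n_ctx  = 16;
    const int n_keep = 4;

    // the full window: prompt "1234" plus generated "ABCDEFGHIJKL"
    std::vector<std::string> last_n = {"1","2","3","4","A","B","C","D",
                                       "E","F","G","H","I","J","K","L"};

    int n_past = n_ctx;                 // the window is full
    const int n_left = n_past - n_keep; // 12 tokens beyond the kept prompt part

    n_past = std::max(1, n_keep);       // keep the first n_keep tokens (at least BOS)
    printf("n_past reset to %d\n", n_past);

    // keep the last n_left/2 generated tokens, like the main example does
    std::vector<std::string> kept(last_n.begin(), last_n.begin() + n_keep);
    kept.insert(kept.end(), last_n.end() - n_left/2, last_n.end());

    printf("window after the swap: ");
    for (const auto & t : kept) printf("%s", t.c_str());
    printf(" (%zu of %d slots used, generation continues here)\n", kept.size(), n_ctx);
    // prints: 1234GHIJKL (10 of 16 slots used, generation continues here)
    return 0;
}
```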
-
Managing state is absolutely fundamental and part of nearly every sub-part of every use case... how are these instructions so unclear? An LLM system with no way to manage state is virtually useless, even paradoxical as a concept... why... how??? It's like a radio made with no on/off switch where customers will just have to figure that out... what does that even mean?
-
I am revisiting this issue after a break. Are there any updates to llama.cpp in this area? Back when I used it, this was the blocker that kept me from using it in a real-world application. I think it would be good to add a method to