Streaming output using OpenAI Python library with llama.cpp HTTP Server in Jupyter Notebook #6264

qymab · 2024-03-23T22:11:32Z

qymab
Mar 23, 2024

Hi,

I'm using the OpenAI Python library with the llama.cpp HTTP Server and Jupyter Notebook. I'm trying to stream the output from the API response instead of receiving the full output at once. However, I'm having trouble getting the streaming to work properly.

Here's the code I'm using:

import openai

client = openai.OpenAI(
    base_url="http://127.0.0.1:8080/v1",
    api_key="sk-no-key-required"
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "Your task is to summarize"},
        {"role": "user", "content": "In Northern Chile, a pick-up truck bumps along dusty old mining roads toward the Atacama Desert. A team of scientists is driving from the coastal town of Antofagasta, and they occasionally pass other vehicles on the road -mostly prospectors searching for metals and minerals. After an hour, they arrive at a lonely meteorological station situated in the driest part of a very dry desert"}
    ],
    temperature=0.7,
    stream=True
)

for chunk in response:
    content = chunk.choices[0].delta.get("content", "")
    print(content, end="", flush=True)

print()

I've set stream=True in the client.chat.completions.create() method, but the output is not being streamed as expected. Instead, it seems to wait for the entire response before printing it.

I'm looking for any references, examples, or guidance on how to properly implement streaming with the OpenAI Python library when using the llama.cpp HTTP Server. I want to be able to display the generated text in real-time as it is being produced by the API.

Any help or insights would be greatly appreciated. Thank you!

phymbert · 2024-03-23T22:33:24Z

phymbert
Mar 23, 2024
Collaborator

Hi,

there are both asyncio and openai examples in : https://github.com/ggerganov/llama.cpp/tree/master/examples/server/tests

Generated tokens will be received after all prompt tokens are processed. Please check the figures in the usage response field.

Also, if you need additional help, please share the command and the model you use to start the server.

6 replies

phymbert Mar 24, 2024
Collaborator

llama.cpp/examples/server/tests/features/steps/steps.py

Line 843 in 94d1b3b

openai.api_key = user_api_key

qymab Mar 24, 2024
Author

Still I'm not sure what I'm seeing.

phymbert Mar 24, 2024
Collaborator

You are not sure about what ? This is how to request the server with the openai client. How can we help you further?

adfnekc Dec 14, 2024

The test case you provided only demonstrates that the output of llama.cpp conforms to the OpenAI stream format ~~, but the returned data is delivered all at once, rather than being returned token by token.~~

adfnekc Dec 16, 2024

I conducted an in-depth study of this issue and found that the problem lies in my use of the gateway proxy llama.cpp's HTTP server. The gateway automatically applied HTTP gzip compression, which required the Python client to fully receive the entire HTTP payload before it could decompress it. By simply adding "accept-encoding: identity" to the HTTP headers in the Python client, I can achieve streaming output. Maybe @qymab should check your HTTP request using Wireshark or tcpdump to see what happens.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Streaming output using OpenAI Python library with llama.cpp HTTP Server in Jupyter Notebook #6264

{{title}}

Replies: 1 comment 6 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

Select a reply

Streaming output using OpenAI Python library with llama.cpp HTTP Server in Jupyter Notebook #6264

qymab Mar 23, 2024

Replies: 1 comment · 6 replies

phymbert Mar 23, 2024 Collaborator

phymbert Mar 24, 2024 Collaborator

qymab Mar 24, 2024 Author

phymbert Mar 24, 2024 Collaborator

adfnekc Dec 14, 2024

adfnekc Dec 16, 2024

qymab
Mar 23, 2024

Replies: 1 comment 6 replies

phymbert
Mar 23, 2024
Collaborator

phymbert Mar 24, 2024
Collaborator

qymab Mar 24, 2024
Author

phymbert Mar 24, 2024
Collaborator