How to use onnxruntime with CUDAExecutionProvider in a multithreaded Python server
Describe the issue
How do I use onnxruntime with CUDAExecutionProvider in a multithreaded Python server?
To reproduce
from waitress import serve
from flask import Flask
import onnxruntime
import time

app = Flask(__name__)
session = onnxruntime.InferenceSession("best.onnx", providers=['CUDAExecutionProvider'])

@app.route('/')
def infer_model():
    ...  # request preprocessing elided
    t1 = time.time()
    outputs = session.run(["output"], {"input": img})[0]
    t2 = time.time()
    ts = t2 - t1
    print(ts)
    ...  # postprocessing and response elided

if __name__ == '__main__':
    serve(app, host='0.0.0.0', port=8080)  # case 1
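For reference, a minimal load-generation client of the kind described below might look like the sketch that follows. It is not part of the original report: the URL, thread count, and request count are illustrative assumptions, and although the report mentions POST requests, the route shown above accepts the default GET, so the sketch uses GET.

# Hypothetical load-test client (illustrative only, not from the original report).
import concurrent.futures
import requests

URL = "http://127.0.0.1:8080/"

def hit_endpoint(_):
    # The route above accepts the default GET; switch to requests.post(...)
    # if the real endpoint expects an image payload via POST.
    return requests.get(URL, timeout=30).status_code

if __name__ == "__main__":
    # Fire many requests from several client threads at once.
    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
        statuses = list(pool.map(hit_endpoint, range(64)))
    print(statuses)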
I send requests from multiple client threads (roughly as in the sketch above); in case 1, ts ranges between 0.02 s and 0.4 s. But if I configure serve as follows:
if __name__ == '__main__':
    serve(app, host='0.0.0.0', port=8080, threads=1)  # case 2
then ts is approximately 0.025 s and very consistent.
How can I get results in case 1 that are as smooth and fast as in case 2?
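One possible mitigation to try (an assumption on my part, not an official ONNX Runtime recommendation): keep several waitress worker threads for request handling, but serialize the CUDA call with a lock, since multiple threads calling session.run on the same CUDAExecutionProvider session end up contending for the GPU, which is the likely source of the jitter in case 1. A minimal sketch, reusing the names from the report (best.onnx, "input", "output") and leaving the img preprocessing elided as in the original:

# Sketch of serializing GPU inference while keeping multiple server threads.
import threading
import time

import onnxruntime
from flask import Flask
from waitress import serve

app = Flask(__name__)
session = onnxruntime.InferenceSession("best.onnx", providers=["CUDAExecutionProvider"])
gpu_lock = threading.Lock()

@app.route("/")
def infer_model():
    img = ...  # request preprocessing elided, as in the original report
    t1 = time.time()
    with gpu_lock:  # only one thread touches the CUDA session at a time
        outputs = session.run(["output"], {"input": img})[0]
    ts = time.time() - t1
    print(ts)
    return {"latency_s": ts}  # illustrative response; elided in the original

if __name__ == "__main__":
    serve(app, host="0.0.0.0", port=8080, threads=4)

With the lock, GPU work is effectively serialized much like threads=1, but the remaining waitress threads can still overlap request parsing and I/O; whether that helps overall throughput depends on how much non-GPU work each request does.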
Urgency
No response
Platform
Windows
OS Version
win10
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
1.16
ONNX Runtime API
Python
Architecture
X64
Execution Provider
CUDA
Execution Provider Library Version
No response
Model File
No response
Is this a quantized model?
Yes