CPU fp32 to CUDA fp16/bf16 Cast Op Best Practices #21372
-
Hello! Java CUDA ORT model graph surgery intern here. My goal is to find the best way to adapt any half-precision (fp16, bf16) ONNX CUDA graph for execution within a Java environment that doesn't support half computations at all (or only via ugly ShortBuffer/ByteBuffer hacks). I want the GPU to have float (fp32) I/O but retain half-precision internal processing. I know there is a script for that, but it currently produces corrupt protos, so I'm not going to use it. I have an SDXL UNet whose inputs feed many nodes' inputs. Since the …
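For anyone wondering what those hacks look like, here is a minimal sketch of the pre-Java-20 approach (the class and method names are mine, and the conversion is deliberately simplified: it truncates instead of rounding to nearest-even and flushes subnormals to zero):

```java
import java.nio.ShortBuffer;

/**
 * Sketch of the "ShortBuffer hack": hand-rolled fp32 -> fp16 narrowing so
 * fp16 tensor data can be staged in a ShortBuffer. On Java 20+ this whole
 * class is replaced by Float.floatToFloat16 / Float.float16ToFloat.
 */
public final class HalfPacking {
    static short floatToHalfBits(float f) {
        int bits = Float.floatToIntBits(f);
        int sign = (bits >>> 16) & 0x8000;     // sign bit moved to fp16 position
        int exp32 = (bits >>> 23) & 0xFF;      // biased fp32 exponent
        int mant = (bits >>> 13) & 0x3FF;      // top 10 mantissa bits (truncated)
        if (exp32 == 0xFF) {                   // fp32 Inf or NaN
            if ((bits & 0x7FFFFF) == 0) return (short) (sign | 0x7C00); // Inf
            return (short) (sign | 0x7E00);    // collapse to a quiet NaN
        }
        int exp16 = exp32 - 127 + 15;          // re-bias exponent for fp16
        if (exp16 >= 0x1F) return (short) (sign | 0x7C00); // overflow -> Inf
        if (exp16 <= 0) return (short) sign;   // underflow -> signed zero
        return (short) (sign | (exp16 << 10) | mant);
    }

    public static ShortBuffer packToHalf(float[] values) {
        ShortBuffer out = ShortBuffer.allocate(values.length);
        for (float v : values) {
            out.put(floatToHalfBits(v));
        }
        out.flip();                            // ready for reading
        return out;
    }
}
```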
-
Can you modify the consumers so they accept the output from the Cast op? The same output can be reused by many different ops as an input. Also, in Java 20 there are efficient fp32 <-> fp16 conversions which have been incorporated into ONNX Runtime, so if you want to work in FloatBuffer and have ONNX use fp16 tensors you can do that.
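Roughly, in code (an untested sketch; the FloatBuffer overload of OnnxTensor.createTensor is in recent ONNX Runtime releases, 1.16+ if I remember right, so check the ai.onnxruntime javadoc):

```java
import ai.onnxruntime.OnnxJavaType;
import ai.onnxruntime.OnnxTensor;
import ai.onnxruntime.OrtEnvironment;
import java.nio.FloatBuffer;

public final class HalfTensorExample {
    public static void main(String[] args) throws Exception {
        OrtEnvironment env = OrtEnvironment.getEnvironment();
        // Application code stays in fp32...
        FloatBuffer data = FloatBuffer.wrap(new float[] {1.0f, 0.5f, -2.0f, 3.14159f});
        long[] shape = {1, 4};
        // ...but the tensor handed to ORT is fp16; the narrowing conversion
        // happens inside the runtime (via Float.floatToFloat16 on Java 20+).
        try (OnnxTensor fp16Tensor =
                 OnnxTensor.createTensor(env, data, shape, OnnxJavaType.FLOAT16)) {
            System.out.println(fp16Tensor.getInfo());
        }
    }
}
```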
-
Hi @Craigacp, thanks for your help! I tried the Cast op with many outputs and will now try reusing the same output as the input to many nodes. I will report back on how it goes.
Hi! I'm using Java 22 and I have tried to …
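Here is roughly what I have so far, as a sketch (the model path and tensor names are placeholders, and I'm assuming getFloatBuffer() widens fp16 outputs back to fp32 — the method names are from the ai.onnxruntime javadoc as I remember them):

```java
import ai.onnxruntime.*;
import java.nio.FloatBuffer;
import java.util.Map;

public final class HalfRoundTrip {
    public static void main(String[] args) throws OrtException {
        OrtEnvironment env = OrtEnvironment.getEnvironment();
        try (OrtSession.SessionOptions opts = new OrtSession.SessionOptions()) {
            opts.addCUDA(0); // CUDA execution provider, device 0
            try (OrtSession session = env.createSession("model.onnx", opts);
                 OnnxTensor input = OnnxTensor.createTensor(
                     env, FloatBuffer.wrap(new float[] {1f, 2f, 3f, 4f}),
                     new long[] {1, 4}, OnnxJavaType.FLOAT16);
                 OrtSession.Result result = session.run(Map.of("input", input))) {
                // For an fp16/bf16 output, getFloatBuffer() widens to fp32,
                // so no ShortBuffer handling is needed on the caller's side.
                FloatBuffer out = ((OnnxTensor) result.get(0)).getFloatBuffer();
                System.out.println(out.get(0));
            }
        }
    }
}
```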
-
I can confirm that it seems to work, although I'm not sure if I'm doing it correctly, as when I load the model I get these warnings:
I'm not sure if these are due to my … P.S.: onnxruntime is a joy to work with in a JVM context. Along with the Lucene and Cassandra ecosystems, it's the perfect infrastructure for a highly-available IR pipeline. Keep up the good work, whoever is maintaining this.
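In case the warnings are just optimizer chatter, I'm experimenting with raising the session log threshold, something like this sketch (setSessionLogLevel is from memory, so worth checking against the SessionOptions javadoc):

```java
import ai.onnxruntime.OrtEnvironment;
import ai.onnxruntime.OrtLoggingLevel;
import ai.onnxruntime.OrtSession;

public final class QuietSession {
    public static void main(String[] args) throws Exception {
        OrtEnvironment env = OrtEnvironment.getEnvironment();
        try (OrtSession.SessionOptions opts = new OrtSession.SessionOptions()) {
            // Only surface errors and above; load-time warnings are suppressed.
            opts.setSessionLogLevel(OrtLoggingLevel.ORT_LOGGING_LEVEL_ERROR);
            try (OrtSession session = env.createSession("model.onnx", opts)) {
                System.out.println(session.getInputInfo());
            }
        }
    }
}
```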