CPU fp32 to CUDA fp16/bf16 Cast Op Best Practices #21372
-
Hello! Java CUDA ORT model graph surgery intern here. My goal is to find the best way to adapt any half-precision (fp16, bf16) ONNX CUDA graph for execution within a Java environment that doesn't support half computations at all (or only via ugly ShortBuffer/ByteBuffer hacks). I want the GPU to have float (fp32) I/O but retain half-precision internal processing. I know there is a script for that, but it currently produces corrupt protos, so I'm not going to use it. I have an SDXL UNet whose inputs feed many nodes' inputs. Since the …
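For anyone wondering what those hacks look like, here is a minimal sketch of the pre-Java-20 approach (the class and method names are mine, and the conversion is deliberately simplified: it truncates instead of rounding to nearest-even and flushes subnormals to zero):

```java
import java.nio.ShortBuffer;

/**
 * Sketch of the "ShortBuffer hack": hand-rolled fp32 -> fp16 narrowing so
 * fp16 tensor data can be staged in a ShortBuffer. On Java 20+ this whole
 * class is replaced by Float.floatToFloat16 / Float.float16ToFloat.
 */
public final class HalfPacking {
    static short floatToHalfBits(float f) {
        int bits = Float.floatToIntBits(f);
        int sign = (bits >>> 16) & 0x8000;     // sign bit moved to fp16 position
        int exp32 = (bits >>> 23) & 0xFF;      // biased fp32 exponent
        int mant = (bits >>> 13) & 0x3FF;      // top 10 mantissa bits (truncated)
        if (exp32 == 0xFF) {                   // fp32 Inf or NaN
            if ((bits & 0x7FFFFF) == 0) return (short) (sign | 0x7C00); // Inf
            return (short) (sign | 0x7E00);    // collapse to a quiet NaN
        }
        int exp16 = exp32 - 127 + 15;          // re-bias exponent for fp16
        if (exp16 >= 0x1F) return (short) (sign | 0x7C00); // overflow -> Inf
        if (exp16 <= 0) return (short) sign;   // underflow -> signed zero
        return (short) (sign | (exp16 << 10) | mant);
    }

    public static ShortBuffer packToHalf(float[] values) {
        ShortBuffer out = ShortBuffer.allocate(values.length);
        for (float v : values) {
            out.put(floatToHalfBits(v));
        }
        out.flip();                            // ready for reading
        return out;
    }
}
```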
-
Can you modify the consumers so they accept the output from the Cast op? The same output can be reused by many different ops as an input. Also, in Java 20 there are efficient fp32 <-> fp16 conversions which have been incorporated into ONNX Runtime, so if you want to work in FloatBuffer and have ONNX use fp16 tensors you can do that.
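Roughly, in code (an untested sketch; the FloatBuffer overload of OnnxTensor.createTensor is in recent ONNX Runtime releases, 1.16+ if I remember right, so check the ai.onnxruntime javadoc):

```java
import ai.onnxruntime.OnnxJavaType;
import ai.onnxruntime.OnnxTensor;
import ai.onnxruntime.OrtEnvironment;
import java.nio.FloatBuffer;

public final class HalfTensorExample {
    public static void main(String[] args) throws Exception {
        OrtEnvironment env = OrtEnvironment.getEnvironment();
        // Application code stays in fp32...
        FloatBuffer data = FloatBuffer.wrap(new float[] {1.0f, 0.5f, -2.0f, 3.14159f});
        long[] shape = {1, 4};
        // ...but the tensor handed to ORT is fp16; the narrowing conversion
        // happens inside the runtime (via Float.floatToFloat16 on Java 20+).
        try (OnnxTensor fp16Tensor =
                 OnnxTensor.createTensor(env, data, shape, OnnxJavaType.FLOAT16)) {
            System.out.println(fp16Tensor.getInfo());
        }
    }
}
```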
-
Hi @Craigacp, thanks for your help! I tried the Cast op with many outputs and will now try reusing the same output as the input to many nodes. I will report back on how it goes.
Hi! I'm using Java 22 and I have tried to …
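Here is roughly what I have so far, as a sketch (the model path and tensor names are placeholders, and I'm assuming getFloatBuffer() widens fp16 outputs back to fp32 — the method names are from the ai.onnxruntime javadoc as I remember them):

```java
import ai.onnxruntime.*;
import java.nio.FloatBuffer;
import java.util.Map;

public final class HalfRoundTrip {
    public static void main(String[] args) throws OrtException {
        OrtEnvironment env = OrtEnvironment.getEnvironment();
        try (OrtSession.SessionOptions opts = new OrtSession.SessionOptions()) {
            opts.addCUDA(0); // CUDA execution provider, device 0
            try (OrtSession session = env.createSession("model.onnx", opts);
                 OnnxTensor input = OnnxTensor.createTensor(
                     env, FloatBuffer.wrap(new float[] {1f, 2f, 3f, 4f}),
                     new long[] {1, 4}, OnnxJavaType.FLOAT16);
                 OrtSession.Result result = session.run(Map.of("input", input))) {
                // For an fp16/bf16 output, getFloatBuffer() widens to fp32,
                // so no ShortBuffer handling is needed on the caller's side.
                FloatBuffer out = ((OnnxTensor) result.get(0)).getFloatBuffer();
                System.out.println(out.get(0));
            }
        }
    }
}
```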
-
I can confirm that it seems to work, although I'm not sure if I'm doing it correctly, as when I load the model I get these warnings:
I'm not sure if these are due to my … P.S.: onnxruntime is a joy to work with in a JVM context. Along with the Lucene and Cassandra ecosystems, it's the perfect infrastructure for a highly-available IR pipeline. Keep up the good work, whoever is maintaining this.
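In case the warnings are just optimizer chatter, I'm experimenting with raising the session log threshold, something like this sketch (setSessionLogLevel is from memory, so worth checking against the SessionOptions javadoc):

```java
import ai.onnxruntime.OrtEnvironment;
import ai.onnxruntime.OrtLoggingLevel;
import ai.onnxruntime.OrtSession;

public final class QuietSession {
    public static void main(String[] args) throws Exception {
        OrtEnvironment env = OrtEnvironment.getEnvironment();
        try (OrtSession.SessionOptions opts = new OrtSession.SessionOptions()) {
            // Only surface errors and above; load-time warnings are suppressed.
            opts.setSessionLogLevel(OrtLoggingLevel.ORT_LOGGING_LEVEL_ERROR);
            try (OrtSession session = env.createSession("model.onnx", opts)) {
                System.out.println(session.getInputInfo());
            }
        }
    }
}
```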