Performance roadmap #2632
Comments
I've removed the prototype (as we already have developed https://github.com/CliMA/MultiBroadcastFusion.jl, which has performance tests) to reduce the noise in this issue. I'm pleasantly surprised that the generic/recursive pattern appears (somehow) more performant than the hard-coded one, but I'll take it!

Really nice and helpful. Thank you!
Description

This issue is a continuation of #635, but I'm excluding some items (some already addressed, and others that I've explained in #635) to reduce the noise.
Memory access patterns
We should make sure that we inline all kernels, use shared/local memory where possible, and ensure that reads and writes are coalesced.
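As a minimal, generic CUDA.jl sketch of the coalescing point (not one of our actual kernels; `coalesced_copy!` is a hypothetical name), consecutive threads should touch consecutive elements so a warp's loads and stores collapse into a few memory transactions:

```julia
using CUDA

# Each thread handles index i; threads in a warp get consecutive i, so the
# reads of `src` and writes of `dst` are stride-1 across the warp -> coalesced.
# A stride of blockDim().x per thread, by contrast, would scatter each warp's
# accesses over many cache lines.
function coalesced_copy!(dst, src)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(src)
        @inbounds dst[i] = 2f0 * src[i]
    end
    return nothing
end

src = CUDA.rand(Float32, 10^6)
dst = similar(src)
@cuda threads=256 blocks=cld(length(src), 256) coalesced_copy!(dst, src)
```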
Reducing loads and stores
The primary way to improve performance beyond our current state is to reduce the number of memory loads and stores. One way to do that is to fuse operations, which can allow the compiler to hoist (and eliminate) memory loads/stores. Another is to explicitly pass less data through broadcast expressions (where possible).
There are a few different options / paths to capturing some of this performance that we've left on the table, and each approach has its limitations, pros, and cons. One is fusing multiple broadcast statements into a single kernel, e.g.:

@fuse begin @. a = b; @. c = d end

Note that combining these optimizations is not necessarily cumulative: we could end up with the same number of loads and stores as if we had performed only one of the optimizations alone. A concrete sketch of the load/store accounting is below.
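Here is a hand-written sketch of how fusion eliminates a load (MultiBroadcastFusion.jl automates this for broadcast statements; the array names and the load/store counts are illustrative only):

```julia
n = 10^6
b, c, d = rand(n), rand(n), rand(n)
a, e = zeros(n), zeros(n)

# Unfused: `a` is stored by the first broadcast and loaded back by the second,
# so the intermediate value makes a round trip through memory.
@. a = b + c   # 2n loads (b, c), n stores (a)
@. e = a * d   # 2n loads (a, d), n stores (e)  => 6n loads+stores total

# Fused: the freshly computed value stays in a register, eliminating the
# re-load of `a` (and the store too, if `a` were only a temporary).
for i in eachindex(a, e)
    @inbounds begin
        ai = b[i] + c[i]
        a[i] = ai          # still stored, since `a` is an output
        e[i] = ai * d[i]   # no re-load of a[i]
    end
end
# 3n loads (b, c, d), 2n stores (a, e) => 5n total, versus 6n unfused.
```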
Removing unnecessary work
We can remove unnecessary work, e.g., in precomputed quantities, or by using a caching system; a minimal sketch follows.
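A toy version of the precomputed-quantities idea, where `State`, `Cache`, and `ρe_tot` are hypothetical stand-ins (not ClimaAtmos's actual types, and only loosely mirroring its `set_precomputed_quantities!`):

```julia
struct State{T}
    ρ::T
    e_tot::T
end

struct Cache{T}
    ρe_tot::T   # derived quantity, computed once per step
end

# Run once per timestep, before the tendency functions.
function set_precomputed_quantities!(cache::Cache, state::State)
    @. cache.ρe_tot = state.ρ * state.e_tot
    return nothing
end

n = 100
state = State(rand(n), rand(n))
cache = Cache(zeros(n))
set_precomputed_quantities!(cache, state)
# Every tendency now reads cache.ρe_tot instead of recomputing ρ * e_tot
# on each call, removing redundant work.
```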
Parallelism
There are other optimizations we can perform that can also have a notable impact: for example, parallelizing work, reducing allocations to lower the frequency of GC, reducing MPI communication, and emitting more efficient low-level code. A small sketch of the allocation and threading points follows; a list of specific items is further below.
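A generic sketch of two of these items (the function and variable names are hypothetical, and real hot loops are more involved): preallocating buffers avoids per-step allocations that feed the GC, and threading parallelizes the pointwise work.

```julia
using Base.Threads

# Allocating version: creates a fresh array on every call, feeding the GC.
tendency(ρ, u) = ρ .* u

# Non-allocating, threaded version: writes into a preallocated buffer.
function tendency!(out, ρ, u)
    @threads for i in eachindex(out, ρ, u)
        @inbounds out[i] = ρ[i] * u[i]
    end
    return nothing
end

ρ, u = rand(10^6), rand(10^6)
out = similar(ρ)
tendency!(out, ρ, u)  # reuse `out` each step: zero allocations in steady state
```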
Scaling
Minimize the number of dss calls and GC calls (e.g., batching fields into a single dss application where the API allows).
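For the GC half of this item, one hedged sketch (not our actual driver code; `maybe_gc!` and `gc_every` are made-up names): trigger collections deterministically every N steps so that, in an MPI run, all ranks pause at the same step instead of stalling each other at random times.

```julia
# Assumption: called once per timestep on every rank; gc_every is tunable.
function maybe_gc!(step; gc_every = 100)
    if step % gc_every == 0
        GC.gc(false)   # incremental collection, at a step all ranks agree on
    end
    return nothing
end

for step in 1:1000
    # step_model!(...)  # hypothetical model step
    maybe_gc!(step)
end
```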
Misc
There are other miscellaneous items, specified in the task list below.
Tasks

- f! and j! computation (ClimaTimeSteppers.jl#233)
- T_exp! and T_lim! (ClimaTimeSteppers.jl#247)
- set_precomputed_quantities! (ClimaTimeSteppers.jl#270)