-
Notifications
You must be signed in to change notification settings - Fork 91
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
High CPU usage on Windows #153
Comments
Breaking in a debugger stops at https://github.com/Tom94/tev/blob/master/src/main.cpp#L363 if that's of any help. Unfortunately I'm not being able to build v1.17 myself as it complains about:
So git bisect grinds to a halt :/ |
Ouch, I must have force pushed onto the submodule at some point… good lesson to never do that again in the future. Sorry about breaking bisect. In any case: thanks for reporting this! I’m currently abroad without access to a Windows machine, so it’ll be a couple of days until I can have a closer look. Breaking on the sleep call seems correct FWIW (and should mean that that thread isn’t responsible for the 100% CPU usage… unless something is seriously wrong with sleep_for) |
For future reference: I did manage to get ProfilingProfiling the CPU usage on multiple commits always returns this as being the hot path: It seems to point to code that hasn't changed in tev for 4 years, and also the underlying nanogui replicates that behaviour with the old and new version. I see some locking calls in the constructor of istringstream, possibly connected to locales, just like here or here. Still puzzled what's the cause for this and why it would occur so sporadically. |
Wow, thanks a bunch for the detailed investigation! My best guess is a regression in the MSVC runtime and/or STL that wasn't there yet when I built the v1.17 release. In any case -- I'm not sure we need to figure this out to solve the problem. Does it go away when you replace the
If so, I wouldn't mind updating tev's nanogui fork with this change/fix. The latext tev already requires C++20 to compile, after all. |
Hmmm... I'm not sure whether ditching The timing issue with |
Unfortunately I'm unable to reproduce the problem on my Windows machine. Regardless, in the hopes that rounding is to be blamed I experimentally changed the sleep interval to be based on an integer amount of microseconds, rather than floating point seconds. Could you try the above branch ( |
I'm experiencing the exact same issue, and unfortunately For me the issue only happened over IPC when the process that sends data over the TCP socket is the same one that started tev. If I start tev myself and then connect to it over TCP everything works fine. Adding a delay between spawning tev and sending the CreateImage command also turns out to fix the issue, so there is probably a race condition there. |
That's a good datapoint; thanks for sharing! Would it be possible to get a minimal reproducing example (e.g. single cpp source file w/o dependencies) that I can compile locally and debug with? |
@KarelPeeters I'm somewhat happy to hear that the problem did non only affect me. This bug was driving me crazy and I started to question many things. Thanks for also looking into it. @Tom94 On my end, building the Unfortunately I can't provide a minimal example from my end (way to many dependencies). But if you eg. build something with Python I can offer to test if it does trigger the bug on my system. |
I can reproduce the issue from both python and rust by starting tev and then immediately connecting to it, for example in Python: import subprocess
import time
import numpy as np
from ipc import TevIpc
print("Starting tev")
subprocess.Popen("cmake-build-release\\tev.exe")
# Adding this solves the issue
# time.sleep(1)
print("Connecting to tev")
with TevIpc() as tev:
channels = ["R", "G", "B"]
tev.create_image("test", 32, 32, channels)
tev.update_image("test", np.random.rand(8, 8, 3), channels)
# keep this process alive
while True:
time.sleep(1000) This reliably causes tev to get stuck with a full thread worth of CPU usage, with not even the initially empty image shown. I'm using the release build of I can investigate some more later today if necessary, but first I have to figure out how to get this to happen inside of a debugger or profiler. |
Did some quick tests with @KarelPeeters script and summarize my results with the table below. For me, adding the delay does not change any behaviour, which is consistent with my earlier findings that it is unrelated to IPC (at least on my machine).
Interestingly enough, I redownloaded the @KarelPeeters When testing your script, I did notice sometimes that a svhost.exe (responsible for inter-machine network communication) was causing a one core CPU load instead of tev.exe. But I couldn't pinpoint when exactly it happened and when not. Are you using Resource Monitor to monitor CPU usage and looking at the exact process list? If you were to just look at Task Manager CPU graph, you might be able to mistake 'your' (IPC) bug with 'my' 'sleep_for' bug. We will know more once you can also reproduce it inside a debugger. |
I can also not reproduce the issue, both on I hit the merge button in any case, but will re-open the issue since things don't seem to be resolved on @KarelPeeters end. |
I (was) experiencing the same issue as @KarelPeeters, but it seems to not happen anymore on version 1.25 (Windows release binary). Specifically, as long as a tev process was started, which was immediately followed by establishing of TCP connection and sending of CreateImage command, it resulted in the observed issue (no command(s) received, one core being stuck running at 100% utilization). If a delay was applied after establishing of TCP connection and before sending of CreateImage command, everything worked fine. Therefore the observable issue is that it was possible to establish a TCP connection with tev and send a CreateImage command before tev was "ready" to receive it. That is also why the delay applied in @KarelPeeters example above (that is, delay after startup of the tev before establishing of TCP connection) has the same effect - it results in additional time for tev to initialize whatever it needs to receive commands via TCP. I was able to replicate this issue on Windows release executable versions 1.18 to 1.24. Versions 1.17 and 1.25 work fine. After a bit of digging with my colleague, we think that the cause may be Windows SDK, as the Ipc implementation did not fundamentally change over time (and it - among other things - interacts with the filesystem). The version of Windows SDK was changed in the CI explicitly between versions 1.17 and 1.18, and it could? have changed again between 1.24 and 1.25 under the hood of the current windows-2022-based CI build. |
Errr... I can't even participate in the discussion because Tev refuses to start on my end. $ tev spectrogram.png
SUCCESS Initialized IPC, listening on 127.0.0.1:14158
INFO Launching with 8 bits of color and LDR display support.
SUCCESS Loaded C:\spectrogram.png via STBI after 0.078 seconds.
GLFW error 65543: WGL: A forward compatible OpenGL context requested but WGL_ARB_create_context is unavailable
ERROR Uncaught exception: Could not create an OpenGL 3.2 context! It seems you should mention hardware requirements to meet. |
It would be helpful to know which hardware and OS / driver versions you experience this issue on. |
@Tom94, Core 2 Duo T7250 2GHz, 4 GB RAM (with 384 MB reserved for Intel 965 graphics, including OpenGL 2.0), Windows 7 x64. |
I recently updated my tev binary (Windows, pre-built) from
v1.17
tov1.19
and experience a consistent high CPU load of tev (1 thread, 100%). This happens even if I just start the application in idle with no images loaded yet.I think it is connected to the IPC communication as I observed some interesting behaviour there:
Background
I have an app that starts the tev process, connects to it over TCP, sends a
CreateImage
, then anUpdateImage
packet. I can also repeat this process on the click of a button.Steps
When I open an image the first time, the tev instance opens correctly, but doesn't show the image. Tev runs with 100% CPU load on one thread. Even if I reopen the image (different name, so new
CreateImage
,UpdateImage
combo) from my GUI it doesn't show either. If I close the tev window, the process stays there, with 100% CPU load on one thread.If I kill the tev process, and click on the "reopen image" button again, a second instance opens, and shows the image correctly on the first go.
Thoughts
I'm inclined to rule out my side of the IPC implementation, as it works fine with
v1.17
(every time) andv1.19
(the second time). Even if this is unrelated to IPC, the high CPU load is still indicating a blocking call (possibly some mutex) somewhere.PS: It doesn't matter if the instance is the primary or a secondary. The CPU load is high in both cases.
The text was updated successfully, but these errors were encountered: