Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement zero copy writes for the WebSocket writer #9634

Draft
wants to merge 17 commits into
base: master
Choose a base branch
from

Conversation

bdraco
Copy link
Member

@bdraco bdraco commented Nov 3, 2024

Implement zero copy writes for the WebSocket writer
closes #9633

python/cpython#91166 made writelines zero copy for Python 3.12+. For older versions it will still join the bytes

Would like another set of 👀 on this to make sure it makes sense.

@bdraco bdraco added backport-3.10 backport-3.11 Trigger automatic backporting to the 3.11 release branch by Patchback robot labels Nov 3, 2024
Copy link

codspeed-hq bot commented Nov 3, 2024

CodSpeed Performance Report

Merging #9634 will degrade performances by 13.08%

Comparing zeroconf_writes (6fbb4ff) with master (27e23fb)

Summary

⚡ 1 improvements
❌ 1 regressions
✅ 40 untouched benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

Benchmark master zeroconf_writes Change
test_one_thousand_large_round_trip_websocket_text_messages[pyloop] 23.5 ms 27 ms -13.08%
test_send_one_hundred_websocket_text_messages_with_mask[pyloop] 807.3 µs 746.2 µs +8.18%

@psf-chronographer psf-chronographer bot added the bot:chronographer:provided There is a change note present in this PR label Nov 3, 2024
Copy link

codecov bot commented Nov 3, 2024

❌ 6 Tests Failed:

Tests completed Failed Passed Skipped
3447 6 3441 16
View the top 3 failed tests by shortest run time
tests.test_client_ws_functional test_heartbeat_connection_closed[pyloop]
Stack Traces | 0.53s run time
No failure message available
tests.test_client_ws_functional test_heartbeat[pyloop]
Stack Traces | 10.1s run time
No failure message available
tests.test_client_ws_functional test_heartbeat_no_pong_concurrent_receive[pyloop]
Stack Traces | 10.2s run time
No failure message available

To view more test analytics, go to the Test Analytics Dashboard
Got feedback? Let us know on Github

@bdraco bdraco changed the title Implement zerocopy writes for the WebSocket writer Implement zero copy writes for the WebSocket writer Nov 3, 2024
@bdraco
Copy link
Member Author

bdraco commented Nov 3, 2024

Benchmarks barely show any improvement. I think the message payload would have to be rather large for it to matter.

We don't have any benchmarks for that

@bdraco bdraco marked this pull request as ready for review November 3, 2024 04:07
@bdraco bdraco marked this pull request as draft November 3, 2024 06:13
@bdraco
Copy link
Member Author

bdraco commented Nov 3, 2024

Still probably not a good benchmark. Will need to extend the round trip WebSocket benchmark to have another one that sends large messages to really get an idea of how it will perform

@bdraco
Copy link
Member Author

bdraco commented Nov 3, 2024

I think part of the reason it’s slower is the protocol gets paused. Everything backs up against the drain waiter in the writer anyways so maybe we need a queue that doesn’t pause writes in two places

Nevermind. I was misremembering. It’s reads that get paused which makes more sense

@bdraco
Copy link
Member Author

bdraco commented Nov 3, 2024

Anyways. Let’s see what the profile looks like with now with all the memcpy reduction optimizations

@bdraco
Copy link
Member Author

bdraco commented Nov 3, 2024

We still end up with more fragmentation somehow.

Probably need to work on the reader some more

@bdraco bdraco closed this Nov 3, 2024
@bdraco bdraco deleted the zeroconf_writes branch November 3, 2024 16:35
@bdraco bdraco restored the zeroconf_writes branch November 5, 2024 05:24
@bdraco bdraco reopened this Nov 5, 2024
@bdraco
Copy link
Member Author

bdraco commented Nov 5, 2024

need to look at this with syscalls

@bdraco bdraco closed this Nov 5, 2024
@bdraco bdraco deleted the zeroconf_writes branch November 11, 2024 16:49
@bdraco bdraco restored the zeroconf_writes branch November 18, 2024 16:56
@bdraco bdraco reopened this Nov 18, 2024
@bdraco
Copy link
Member Author

bdraco commented Nov 18, 2024

I got this working so it was a net win for the StreamWriter in #9839

I think we need better logic to handle larger payloads only with writelines.

Running it again now so I can look at the flame graphs once codspeed finishes as it will likely give a better idea on where the breakpoint is

@bdraco bdraco added backport-3.12 Trigger automatic backporting to the 3.12 release branch by Patchback robot and removed backport-3.10 labels Nov 18, 2024
@bdraco
Copy link
Member Author

bdraco commented Nov 18, 2024

The websocket data queue is very small 65k so as soon as we get a large message, we always pause the reader

@asvetlov
Copy link
Member

For now, +-10% difference doesn't look very impressive.

@bdraco
Copy link
Member Author

bdraco commented Nov 18, 2024

For now, +-10% difference doesn't look very impressive.

no it definitely doesn't

@bdraco
Copy link
Member Author

bdraco commented Nov 18, 2024

also codspeed doesn't show syscalls since they can't be consistently instrumented so anything that does syscalls has to be checked manually in the codspeed UI since it can be a significant improvement or significant degradation.

@asvetlov
Copy link
Member

@bdraco if you are looking for possible optimizations, replacing https://docs.python.org/3/library/asyncio-protocol.html#asyncio.Protocol with https://docs.python.org/3/library/asyncio-protocol.html#asyncio.BufferedProtocol could be an interesting experiment.
In nutshell, the idea is:
The buffered protocol could use at least two preallocated buffers.
Read from socket into buffer 1, parse buffer 2. After parsing switch them.
It removes data allocation on reading and at least in theory could work well with zero-copy on writing.

@bdraco
Copy link
Member Author

bdraco commented Nov 18, 2024

That could help quite a bit if they are small. Not sure if it will help if they are not small though since we have to combine them at some point to wrap it in WSMessage.

Still most messages are small so it should save one memcpy on most messages

@bdraco
Copy link
Member Author

bdraco commented Nov 18, 2024

Oh but we still have to do one slice to split of the headers and message body.....hmmm

@asvetlov
Copy link
Member

No, I'm talking about a little different situation. Memcpy stays as is, but preallocated membuffers could eliminate alloc/free calls.
Now data_received() allocates bytes, files it with data from the socket, and returns bytes back.
Aiohttp handles it, copies and de-allocates given bytestring.
Buffered protocol could help to avoid these alloc/free.
I don't expect a very big difference though.

@bdraco
Copy link
Member Author

bdraco commented Nov 18, 2024

No, I'm talking about a little different situation. Memcpy stays as is, but preallocated membuffers could eliminate alloc/free calls. Now data_received() allocates bytes, files it with data from the socket, and returns bytes back. Aiohttp handles it, copies and de-allocates given bytestring. Buffered protocol could help to avoid these alloc/free.
Got it. I was so focused on the memcpy in the profile that wasn't think about this case.

I don't expect a very big difference though.

Yeah I expect it won't make much difference.

Any idea on how to avoid the copy at

cdef bytes body = at[:length]
?

@asvetlov
Copy link
Member

I'm far from my laptop right now, I cannot watch at generated Cython code.
But maybe memoryview could help? Memoryview slicing doesn't perform memcpy but creates a new view which references the same buffer but with different offsets.

@bdraco
Copy link
Member Author

bdraco commented Nov 27, 2024

diff --git a/aiohttp/_http_parser.pyx b/aiohttp/_http_parser.pyx
index 30767dae6..804f3f8b5 100644
--- a/aiohttp/_http_parser.pyx
+++ b/aiohttp/_http_parser.pyx
@@ -11,6 +11,7 @@ from cpython cimport (
     PyBytes_AsStringAndSize,
     PyObject_GetBuffer,
 )
+from cpython.memoryview cimport PyMemoryView_FromMemory
 from cpython.mem cimport PyMem_Free, PyMem_Malloc
 from libc.limits cimport ULLONG_MAX
 from libc.string cimport memcpy
@@ -763,7 +764,7 @@ cdef int cb_on_headers_complete(cparser.llhttp_t* parser) except -1:
 cdef int cb_on_body(cparser.llhttp_t* parser,
                     const char *at, size_t length) except -1:
     cdef HttpParser pyparser = <HttpParser>parser.data
-    cdef bytes body = at[:length]
+    body = PyMemoryView_FromMemory(<char*>at, length, 0);
     try:
         pyparser._payload.feed_data(body)
     except BaseException as underlying_exc:

This kinda works but only if we don't release the buffer.

Ideally we find a way for this to defer releasing the buffer until all the memory views are consumed...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-3.11 Trigger automatic backporting to the 3.11 release branch by Patchback robot backport-3.12 Trigger automatic backporting to the 3.12 release branch by Patchback robot bot:chronographer:provided There is a change note present in this PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Look into zero copy writes for the websocket writer
2 participants