Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

safekeeper: optimize WAL decoding #10097

Open
Tracked by #9624
erikgrinaker opened this issue Dec 11, 2024 · 3 comments
Open
Tracked by #9624

safekeeper: optimize WAL decoding #10097

erikgrinaker opened this issue Dec 11, 2024 · 3 comments
Assignees
Labels
a/performance Area: relates to performance of the system c/storage/safekeeper Component: storage: safekeeper

Comments

@erikgrinaker
Copy link
Contributor

After moving WAL decoding to the Safekeeper side in #9746, we're seeing Safekeeper CPUs pinned at 100%, effectively bottlenecked on WAL decoding. CPU profiles have shown this to primarily be checksums/hashing and (de)serialization costs. We should try to optimize and/or parallelize this.

@erikgrinaker erikgrinaker added a/performance Area: relates to performance of the system c/storage/safekeeper Component: storage: safekeeper labels Dec 11, 2024
@erikgrinaker erikgrinaker self-assigned this Dec 11, 2024
@erikgrinaker
Copy link
Contributor Author

erikgrinaker commented Dec 16, 2024

WAL record CRC32 verification takes 18% of CPU time:

crc = crc32c_append(crc, &recordbuf[XLOG_RECORD_CRC_OFFS + 4..]);
crc = crc32c_append(crc, &recordbuf[0..XLOG_RECORD_CRC_OFFS]);

This already uses a hardware-accelerated SIMD implementation (SSE 4.2). However, we only get parallelism with chunks of 3x8192 bytes or 3x256 bytes. We're making two calls here: the 20-byte header (not parallelized) and the actual WAL record. We can try to append or parallelize these.

Furthermore, only about 2% of this happens when receiving WAL, while 16% happens when sending -- we're likely sending across an 8-shard tenant in this case (ingest benchmark), so we're verifying the same record 8 times. This will be addressed by #9337.

@erikgrinaker
Copy link
Contributor Author

We can try to append or parallelize these.

Unsurprisingly, copying the data into a combined slice and running crc32 on that ends up being more expensive except for trivially small records:

complete_record/size=64 time:   [17.642 ns 17.687 ns 17.735 ns]
                        thrpt:  [3.3608 GiB/s 3.3700 GiB/s 3.3786 GiB/s]
                 change:
                        time:   [-3.5556% -3.1691% -2.7656%] (p = 0.00 < 0.05)
                        thrpt:  [+2.8442% +3.2728% +3.6867%]

complete_record/size=1024
                        time:   [74.571 ns 74.945 ns 75.451 ns]
                        thrpt:  [12.640 GiB/s 12.725 GiB/s 12.789 GiB/s]
                 change:
                        time:   [+17.029% +18.603% +20.786%] (p = 0.00 < 0.05)
                        thrpt:  [-17.209% -15.685% -14.551%]

complete_record/size=8192
                        time:   [495.81 ns 499.42 ns 503.51 ns]
                        thrpt:  [15.152 GiB/s 15.276 GiB/s 15.388 GiB/s]
                 change:
                        time:   [+28.747% +29.461% +30.268%] (p = 0.00 < 0.05)
                        thrpt:  [-23.235% -22.757% -22.329%]

Another option might be to just do CRC32 on the entire AppendRequest at 128 KB, instead of individual records, and defer record checks to decoding. But it's not clear that we'll gain that much, since we only see 2% CPU usage on CRC32 in this case. Will put CRC32 aside for now.

@erikgrinaker
Copy link
Contributor Author

erikgrinaker commented Dec 16, 2024

In ToWireFormat::to_wire() we allocate new encode and compress buffers for every InterpretedWalRecords batch. These buffers are then only immutably borrowed by PostgresBackend::write_message() and discarded. We could reuse these buffers and avoid at least 2 allocations per batch.

We decided to punt this until we've implemented the cursor in #9337. We should make sure we do this then.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
a/performance Area: relates to performance of the system c/storage/safekeeper Component: storage: safekeeper
Projects
None yet
Development

No branches or pull requests

1 participant