safekeeper: optimize WAL decoding #10097

erikgrinaker · 2024-12-11T17:00:53Z

After moving WAL decoding to the Safekeeper side in #9746, we're seeing Safekeeper CPUs pinned at 100%, effectively bottlenecked on WAL decoding. CPU profiles have shown this to primarily be checksums/hashing and (de)serialization costs. We should try to optimize and/or parallelize this.

erikgrinaker · 2024-12-16T12:23:01Z

WAL record CRC32 verification takes 18% of CPU time:

neon/libs/postgres_ffi/src/waldecoder_handler.rs

Lines 227 to 228 in a4397d4

    
           crc = crc32c_append(crc, &recordbuf[XLOG_RECORD_CRC_OFFS + 4..]); 
        
           crc = crc32c_append(crc, &recordbuf[0..XLOG_RECORD_CRC_OFFS]);

This already uses a hardware-accelerated SIMD implementation (SSE 4.2). However, we only get parallelism with chunks of 3x8192 bytes or 3x256 bytes. We're making two calls here: the 20-byte header (not parallelized) and the actual WAL record. We can try to append or parallelize these.

Furthermore, only about 2% of this happens when receiving WAL, while 16% happens when sending -- we're likely sending across an 8-shard tenant in this case (ingest benchmark), so we're verifying the same record 8 times. This will be addressed by #9337.

erikgrinaker · 2024-12-16T14:28:45Z

We can try to append or parallelize these.

Unsurprisingly, copying the data into a combined slice and running crc32 on that ends up being more expensive except for trivially small records:

complete_record/size=64 time:   [17.642 ns 17.687 ns 17.735 ns]
                        thrpt:  [3.3608 GiB/s 3.3700 GiB/s 3.3786 GiB/s]
                 change:
                        time:   [-3.5556% -3.1691% -2.7656%] (p = 0.00 < 0.05)
                        thrpt:  [+2.8442% +3.2728% +3.6867%]

complete_record/size=1024
                        time:   [74.571 ns 74.945 ns 75.451 ns]
                        thrpt:  [12.640 GiB/s 12.725 GiB/s 12.789 GiB/s]
                 change:
                        time:   [+17.029% +18.603% +20.786%] (p = 0.00 < 0.05)
                        thrpt:  [-17.209% -15.685% -14.551%]

complete_record/size=8192
                        time:   [495.81 ns 499.42 ns 503.51 ns]
                        thrpt:  [15.152 GiB/s 15.276 GiB/s 15.388 GiB/s]
                 change:
                        time:   [+28.747% +29.461% +30.268%] (p = 0.00 < 0.05)
                        thrpt:  [-23.235% -22.757% -22.329%]

Another option might be to just do CRC32 on the entire AppendRequest at 128 KB, instead of individual records, and defer record checks to decoding. But it's not clear that we'll gain that much, since we only see 2% CPU usage on CRC32 in this case. Will put CRC32 aside for now.

erikgrinaker · 2024-12-16T15:52:25Z

In ToWireFormat::to_wire() we allocate new encode and compress buffers for every InterpretedWalRecords batch. These buffers are then only immutably borrowed by PostgresBackend::write_message() and discarded. We could reuse these buffers and avoid at least 2 allocations per batch.

We decided to punt this until we've implemented the cursor in #9337. We should make sure we do this then.

…0158) Touches #10097.

erikgrinaker added a/performance Area: relates to performance of the system c/storage/safekeeper Component: storage: safekeeper labels Dec 11, 2024

erikgrinaker self-assigned this Dec 11, 2024

erikgrinaker mentioned this issue Dec 11, 2024

Epic: optimize WAL ingest pipeline #9624

Open

erikgrinaker mentioned this issue Dec 16, 2024

postgres_ffi: add WalStreamDecoder::complete_record() benchmark #10158

Merged

github-merge-queue bot pushed a commit that referenced this issue Dec 17, 2024

postgres_ffi: add WalStreamDecoder::complete_record() benchmark (#1…

b0e43c2

…0158) Touches #10097.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

safekeeper: optimize WAL decoding #10097

safekeeper: optimize WAL decoding #10097

erikgrinaker commented Dec 11, 2024

erikgrinaker commented Dec 16, 2024 •

edited

Loading

erikgrinaker commented Dec 16, 2024

erikgrinaker commented Dec 16, 2024 •

edited

Loading

safekeeper: optimize WAL decoding #10097

safekeeper: optimize WAL decoding #10097

Comments

erikgrinaker commented Dec 11, 2024

erikgrinaker commented Dec 16, 2024 • edited Loading

erikgrinaker commented Dec 16, 2024

erikgrinaker commented Dec 16, 2024 • edited Loading

erikgrinaker commented Dec 16, 2024 •

edited

Loading

erikgrinaker commented Dec 16, 2024 •

edited

Loading