Suggestion: use AVX2 in FastParseCommand #813

Open
mgravell opened this issue Nov 18, 2024 · 8 comments
Labels
enhancement New feature or request

Comments


mgravell commented Nov 18, 2024

Feature request type

Performance enhancement in command identification

Is your feature request related to a problem? Please describe

The current implementation (`private RespCommand FastParseCommand(out int count)`) splits by length, then matches the last 8 bytes against a set of known values. This works, but it involves a lot of branches; effectively it is a chain of:

if (lastChunk == SOME_NUMBER) return foo;
if (lastChunk == SOME_OTHER_NUMBER) return bar;
// ...

Describe the solution you'd like

It turns out that AVX2 is good at this, at least if we limit ourselves to 4-byte chunks, which is a pretty good filter, especially if we ignore the trailing \r\n. The basic approach (discussed here) is fundamentally:

  • load the known-values vector (8 x 32-bit)
  • load and broadcast the test operand (_mm256_set1_epi32)
  • compare for equality (_mm256_cmpeq_epi32)
  • pull the MSB of each 32-bit result into a single value with 8 effective bits (_mm256_movemask_ps)
  • find the index of the set bit, if any (tzcnt)

It would still need a final "were we right?" check, as the existing code does when the command doesn't fit in just 8 bytes.
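The five steps above can be modeled in scalar code. The following is a minimal Python sketch (the command list and function name are hypothetical, chosen just for illustration); a real implementation would use the intrinsics listed, with each lane handled in parallel rather than in a loop:

```python
# Scalar model of the AVX2 lookup described above. Each step mirrors one
# intrinsic: broadcast (_mm256_set1_epi32), compare (_mm256_cmpeq_epi32),
# movemask (_mm256_movemask_ps), and trailing-zero count (tzcnt).
import struct

# "Known values vector": first 4 bytes of 8 commands, packed as uint32.
# This sample list is hypothetical, not Garnet's actual table.
COMMANDS = [b"PING", b"GET ", b"SET ", b"DEL ", b"INCR", b"DECR", b"LPUS", b"RPUS"]
KNOWN = [struct.unpack("<I", c)[0] for c in COMMANDS]

def find_candidate(first4: bytes) -> int:
    """Return the index of the matching known value, or -1 if none match."""
    probe = struct.unpack("<I", first4)[0]                  # set1: broadcast operand
    eq = [0xFFFFFFFF if k == probe else 0 for k in KNOWN]   # cmpeq: per-lane equality
    mask = sum(((lane >> 31) & 1) << i                      # movemask: MSB of each lane
               for i, lane in enumerate(eq))
    if mask == 0:
        return -1                                           # no candidate; slow path
    return (mask & -mask).bit_length() - 1                  # tzcnt: lowest set bit
```

A hit only identifies a candidate; as noted above, a final "were we right?" validation against the full command bytes would still follow.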

I did a quick demo (here) which so far only handles 8 commands. Obviously it is very incomplete; it is just meant to indicate what is available. I also tested using static readonly uint values instead of Read<uint> each time, to cover all options (that didn't move the needle, or is possibly worse somehow, or is just jitter).

Tests here include:

  • one value from the start of the test list (PING)
  • one value from the end of the test list (INCR)
  • one value that isn't in the test list (JUNK)

Results:

| Method            | Command | Mean      | Error     | StdDev    | Ratio | RatioSD |
|------------------ |-------- |----------:|----------:|----------:|------:|--------:|
| ComputePerCall    | INCR    | 1.0559 ns | 0.2056 ns | 0.0113 ns |  1.00 |    0.01 |
| PreComputedSwitch | INCR    | 1.0576 ns | 0.1848 ns | 0.0101 ns |  1.00 |    0.01 |
| PreComputedAvx2   | INCR    | 0.2417 ns | 0.0299 ns | 0.0016 ns |  0.23 |    0.00 |
|                   |         |           |           |           |       |         |
| ComputePerCall    | JUNK    | 1.0449 ns | 0.1534 ns | 0.0084 ns |  1.00 |    0.01 |
| PreComputedSwitch | JUNK    | 1.0495 ns | 0.5420 ns | 0.0297 ns |  1.00 |    0.03 |
| PreComputedAvx2   | JUNK    | 0.2480 ns | 0.0986 ns | 0.0054 ns |  0.24 |    0.00 |
|                   |         |           |           |           |       |         |
| ComputePerCall    | PING    | 0.3143 ns | 0.0095 ns | 0.0005 ns |  1.00 |    0.00 |
| PreComputedSwitch | PING    | 0.2359 ns | 0.0422 ns | 0.0023 ns |  0.75 |    0.01 |
| PreComputedAvx2   | PING    | 0.2422 ns | 0.0975 ns | 0.0053 ns |  0.77 |    0.01 |

The data shows that the existing switch code takes time that grows with the number of tests, whereas the AVX2 approach is consistently fast, giving essentially constant time rather than time proportional to the number of missed cases. I would expect this to grow in proportion to the number of AVX2 tests performed (each test covers 8 candidates), but it should still be faster.

Describe alternatives you've considered

I have not considered AVX-512; in theory it would allow either 16 32-bit tests or 8 64-bit tests at a time, but AVX-512 hardware support is still pretty patchy. It may not be worth the additional effort.

Additional context

The biggest question I can think of here is: which 4 bytes to test? The first 4, or the last 4? Someone could probably do a collision analysis to see which narrows it down more. There are collisions either way (LPUSH vs BRPOPLPUSH, INCR vs INCRBY, FLUSHDB vs FLUSHALL, SSCAN vs SCAN vs HSCAN, etc.). My guess is that the first four bytes give more entropy, but heck: if you switch on length first, you could always do 2 AVX2 tests for collision scenarios, i.e. test the first four and the last four. Or something.
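The collision analysis mentioned above is cheap to sketch. Here is a rough Python version over a small hypothetical sample (not Garnet's real command list), counting how many commands fail to get a unique 4-byte key under each choice:

```python
# Sketch of the suggested collision analysis: for a sample of command names,
# count the commands that do NOT get a unique key when keyed by the first
# four bytes vs the last four bytes. The sample list is hypothetical.
from collections import Counter

COMMANDS = ["LPUSH", "BRPOPLPUSH", "INCR", "INCRBY", "FLUSHDB", "FLUSHALL",
            "SSCAN", "SCAN", "HSCAN", "GET", "SET", "PING"]

def collisions(key):
    buckets = Counter(key(c) for c in COMMANDS)
    # commands sharing a bucket with at least one other command
    return sum(n for n in buckets.values() if n > 1)

first4 = collisions(lambda c: c[:4])   # INCR/INCRBY and FLUSHDB/FLUSHALL collide
last4  = collisions(lambda c: c[-4:])  # LPUSH/BRPOPLPUSH and SSCAN/SCAN/HSCAN collide
print(first4, last4)                   # prints: 4 5
```

Note that commands shorter than 4 bytes (GET, SET) would need padding in real code, as the `"\nGET"u8` trick in the later comment suggests; Python's slice clamping papers over that here.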

@mgravell (Contributor Author)

(borked the test; closing while I fix that, in case this is a nothing-burger)

@mgravell (Contributor Author)

@badrishc I closed this while I fixed the test/data (now done); feel free to reopen if it is of interest (I don't have access for that).

@badrishc badrishc reopened this Nov 18, 2024

vazois commented Nov 19, 2024

This is a very cool idea! We have actually been thinking about doing this for a while, but we never got around to it. The numbers look promising; I assume the speedup will be much greater for commands at the bottom of the switch statement. The bigger question: what is the expected end-to-end gain? For example, what is the speedup if we process a batch of commands with parameters, instead of individual command headers? We can use BDN for this.


badrishc commented Nov 19, 2024

@mgravell - this is indeed interesting but we need to see if this helps end-to-end. There are BDN benchmarks for Garnet operations at https://github.com/microsoft/garnet/tree/main/benchmark/BDN.benchmark/Operations -- you might want to prototype against that to check that there is an opportunity here.


mgravell commented Nov 19, 2024 via email


mgravell commented Nov 20, 2024

I did some more digging this morning; it turns out that if you're testing multiple values, ReadOnlySpan<uint>.IndexOf etc. already use AVX-based searching when possible. So you have the option of simply pre-computing a static readonly uint[] (or similar) of the partial matches (I would suggest first-four or last-four) and doing a very effective search on that. If you've already identified either the command length or the argument count, you could further split on that, i.e. have a separate uint[] for commands of each length. This would then allow, for example:

ReadOnlySpan<uint> candidates = commandLength switch {
    3 => __CommandPrecomputed_3, // note: for 3, include trailing \n from preamble, so GET is "\nGET"u8
    4 => __CommandPrecomputed_4,
    5 => __CommandPrecomputed_5,
    6 => __CommandPrecomputed_6,
    _ => [],
};
var index = candidates.IndexOf(ComputeUInt32FromLastFourBytesOfCommand());
if (index >= 0) // recognized pattern, fast-path
{
    switch (commandLength)
    {
        // ...
        case 4:
            switch (index) // fully validated
            {
                case 0: // TIME ....
                case 1: // SREM....
                case 2: // SADD....
                case 3: // MGET ....
                case 4: // SCAN ....
                // ...
            }
        case 5:
            switch (index) // need to check leading 1 byte
            {
                // 0 == ?CARD
                case 0 when Starts("S"u8): // SCARD...
                case 0 when Starts("Z"u8): // ZCARD...
                // 1 == ?PUSH
                case 1 when Starts("L"u8): // LPUSH...
                case 1 when Starts("R"u8): // RPUSH...
                // ...
            }
    }
}

The key thing here is that a switch over 0, 1, 2, 3, ... etc. becomes a simple jump operation, which is very fast (with final validation in the when clauses); by contrast, a switch over large precomputed constants, or one where all the "real" testing happens in the when clauses, is basically a giant list of if branches.
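For a language-neutral view, the two-level dispatch in the C# sketch above can be modeled like this (the tables and command subset are hypothetical; `list.index` stands in for the vectorized ReadOnlySpan<uint>.IndexOf):

```python
# Model of the two-level dispatch: a searchable table of precomputed 4-byte
# keys per command length, then a jump on the small dense index, with a
# leading-byte check for ambiguous entries. Tables are a hypothetical subset.
import struct
from typing import Optional

def key(s: bytes) -> int:
    return struct.unpack("<I", s)[0]

# Per-length tables of the *last four* command bytes, as in the sketch above.
TABLES = {
    4: [key(b"TIME"), key(b"SREM"), key(b"SADD"), key(b"MGET"), key(b"SCAN")],
    5: [key(b"CARD"), key(b"PUSH")],  # ?CARD and ?PUSH need a leading-byte check
}

def parse(cmd: bytes) -> Optional[str]:
    table = TABLES.get(len(cmd), [])
    k = key(cmd[-4:])
    index = table.index(k) if k in table else -1
    if index < 0:
        return None                   # not recognized; fall back to slow path
    if len(cmd) == 4:                 # key covers the whole command: validated
        return ["TIME", "SREM", "SADD", "MGET", "SCAN"][index]
    if len(cmd) == 5:                 # disambiguate on the leading byte
        if index == 0:
            return {b"S": "SCARD", b"Z": "ZCARD"}.get(cmd[:1])
        if index == 1:
            return {b"L": "LPUSH", b"R": "RPUSH"}.get(cmd[:1])
    return None
```

The point of the model is the shape, not the code: the search over keys is the vectorizable part, and everything after it is a dense-index jump plus cheap residual validation.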


badrishc commented Nov 21, 2024

All of this sounds good and potentially is a huge win. But it does seem like a non-trivial amount of dev effort to translate the idea into Garnet's parser. Would this be of interest to you or anyone else in the community to actually try out in Garnet code? The starting point would likely be https://github.com/microsoft/garnet/blob/main/libs/server/Resp/Parser/RespCommand.cs#L594

@darrenge darrenge added the enhancement New feature or request label Nov 25, 2024
@mgravell (Contributor Author)

I will see what I can do with this shortly - hopefully I should have some useful numbers based on the real code (where "useful" can include "no meaningful impact, let's do nothing") this side of 2025.
