designing and testing throttles #651

Open
arnoFleming opened this issue Jan 23, 2024 · 8 comments

@arnoFleming

I have been (unknowingly) running into the issue described in this comment. The problem is that I am trying to implement (and test the correctness of) a layered set of throttles.

The test produces many seemingly random failures, and after reading the explanation linked above, I now clearly understand why.

My current situation is that I have a 'baseline' throttle (250 requests per 10 minutes) and a 'bursty' throttle (let's say 100 requests per 10 seconds).

Given that "exactly X requests in Y period" is hard, if not impossible, to guarantee, what approach would you take to designing and testing a system that has these hard limits?
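
For concreteness, the two throttles look roughly like this (simplified; throttling by IP here just for illustration):

Rack::Attack.throttle("api/baseline", limit: 250, period: 10.minutes) do |request|
  # Every request counts against the 10-minute baseline budget
  request.ip
end

Rack::Attack.throttle("api/burst", limit: 100, period: 10.seconds) do |request|
  # The same request also counts against the short burst budget
  request.ip
end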

@arnoFleming
Author

arnoFleming commented Jan 23, 2024

After a few hours I see this may come across as 'what will you do to solve my problem'. That is definitely not what I was aiming for.

Let me rephrase:

I have an honest question about how I can build a system that throttles X requests in Y timeframe (literally no more, no less), and I am genuinely interested in what approach I should take to accomplish that with Rack::Attack in my application.

The docs don't mention how to do this.

If it's impossible, I think the community would benefit from the docs clearly stating that the 'boxes' of size 'duration' usually do not coincide with the first throttle entry landing. Stating that this interval's start time is out of anyone's control may even be the best approach.
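
To illustrate what I mean by the interval start being out of anyone's control, a small worked example (hypothetical numbers):

# With a 60-second period, windows are aligned to the epoch, so a
# first request at 12:00:45 UTC lands in the window starting at
# 12:00:00, which resets only 15 seconds later.
period        = 60
first_request = Time.utc(2024, 1, 23, 12, 0, 45)
window_start  = Time.at((first_request.to_i / period) * period)
seconds_left  = window_start.to_i + period - first_request.to_i # => 15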

@arnoFleming
Author

arnoFleming commented Jan 23, 2024

Implementation:

# config/initializers/rack_attack.rb

# some uninteresting parts omitted 

Rack::Attack.throttle("pdf/baseline/throttle/user_id", limit: 80, period: 8.hours) do |request|
  request.user_id if request.pdf_download_route?
end

the integration test:

# test/system/pdf_downloads_are_throttled_test.rb

  # some uninteresting parts omitted 

  test "pfd downloads are throttled to 80 requests per 8 hours" do
    start_time = Time.now
    manuscript = create(:manuscript, :with_pdf)

    80.times do
      travel 6.minutes 

      get("/manuscripts/#{manuscript.id}/download")
      assert_response 200
    end

    get("/manuscripts/#{manuscript.id}/download")
    # this assertion almost never passes
    assert_response 429

    travel_to start_time + 8.hours - 1.minute

    get("/manuscripts/#{manuscript.id}/download")
    # this assertion almost never passes
    assert_response 429

    travel_to start_time + 8.hours + 1.second
    get("/manuscripts/#{manuscript.id}/download")
    assert_response 200
  end

@santib
Collaborator

santib commented Jan 28, 2024

Hi @arnoFleming, you are right about the behavior of Rack::Attack. It's all about how the cache key is formed. Some time ago, when I dug into the history to understand why it was like this (I was also surprised by it), I realized it has always been this way, right since the gem was created.
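
For context, the counter key is derived roughly like this (a simplified sketch of the gem's internals, with a hypothetical discriminator):

# The current epoch time is integer-divided by the period, so windows
# align to multiples of the period since the Unix epoch, not to each
# client's first request.
period        = 600 # seconds
discriminator = "user:42"
window        = Time.now.to_i / period
key           = "rack::attack:#{window}:api/baseline:#{discriminator}"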

I think the community would benefit from the docs clearly stating that the 'boxes' of size 'duration' usually do not coincide with the first throttle entry landing. Stating that this interval's start time is out of anyone's control may even be the best approach.

I agree with this. Would you like to make a PR improving the docs?

EDIT: regarding your tests, you can make them work by freezing the start time to the beginning of a "block"/period.
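
Something along these lines, assuming the 8-hour period from your test (travel_to comes from ActiveSupport::Testing::TimeHelpers):

# Pin the test clock to the start of the current 8-hour window
period = 8.hours.to_i
travel_to Time.at((Time.now.to_i / period) * period)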


On a side note, I think it might be possible to change this by removing the period from the cache key and depending only on the expires_in (set only the first time the key is added; sketched at the end of this comment). But a change like this would be extremely risky, so:

  1. It shouldn't change the default behavior
  2. It should be a configuration
  3. It would require lots of testing
  4. We'd need to evaluate the tradeoff between the benefits it adds against the added complexity to the codebase

Not sure if it was discussed/tried in the past.
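
To make the idea concrete, a rough sketch (not Rack::Attack's actual code), assuming a cache store whose increment honours expires_in only when the key is first created:

# The window number is no longer part of the key; each discriminator's
# window starts at its own first request and expires `period` later.
# "user:42" is a hypothetical discriminator.
limit  = 250
period = 10.minutes
key    = "rack::attack:api/baseline:user:42"

count     = Rails.cache.increment(key, 1, expires_in: period)
throttled = count > limit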

@jamiemccarthy

Hi 👋, just stopped in to say that this approach is how Rails 8.0 will be implementing throttling:

https://github.com/rails/rails/blob/main/actionpack/lib/action_controller/metal/rate_limiting.rb

Yes, the `expires_in: within` on the increment sets the cache expiration time if and only if the entry is being created. If the cache entry already exists, the store does not change the `expires_at` time. I checked this by reading the code for the dalli, redis, and memory stores. Rails merged in a test for this behavior two weeks ago, so it's official now. It's not yet described in the RDoc, but I've just pushed a PR to add it.
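
For reference, using it in a controller looks like this (the controller name is just for illustration):

class DownloadsController < ApplicationController
  # Allow at most 80 requests per client IP (the default discriminator)
  # in an 8-hour window; the window's TTL is set when the counter is
  # first created.
  rate_limit to: 80, within: 8.hours, only: :show

  def show
    # ...
  end
end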

when I dug into the history to understand why it was like this (I was also surprised by it), I realized it has always been this way, right since the gem was created.

I suspect because in years past this set-only-on-cache-entry-creation behavior was not guaranteed (and maybe one or more of the underlying cache stores didn't in fact support it), Rack::Attack didn't want to rely on it. Now that Rails' cache stores support it and there's a test for it, relying on it is a valid choice.

I do think my #578 refactor still somewhat improves the mitigation that throttling offers, both over existing Rack::Attack and over the new RateLimiting feature. I'm thinking of an attacker who coordinates attacks from multiple IPs and tries to nearly-double their effectiveness by sending the maximum number of requests allowed shortly before the end of one window and then shortly after the start of another. This can be done both when windows are shared by many IPs (existing Rack::Attack), and when triggering the start of a window can be done by sending one request (RateLimiting). But randomly-staggered windows cuts the effective power in half.
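
To sketch what staggered windows mean (my own illustration of the idea, not the actual #578 code): shift each discriminator's window boundaries by a stable, hard-to-predict offset, so coordinated clients can't line their bursts up on a shared boundary.

require "digest"

def window_number(discriminator, period, now = Time.now.to_i)
  # Stable per-discriminator offset in [0, period) that moves this
  # client's window boundaries off the shared epoch grid
  offset = Digest::SHA256.hexdigest(discriminator).to_i(16) % period
  (now - offset) / period
end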

But, relying on expires_in is a much simpler approach than baking the expiration time into the key. There's something to be said for that.

@arnoFleming
Author

I've been thinking about this over the last week. Will probably create an addition to the docs.

Also, I'm interested to see whether an additional or replacement implementation could work for the maintainers and users of this library.
I didn't find time for either yet.

@arnoFleming
Author

regarding your tests, you can make them work by freezing the start time to the beginning of a "block"/period.

I probably can. However, that is testing in a way that makes the implementation happy. What I'm aiming for is an implementation that makes the business happy. Freezing the time in the test only proves that, in certain cases, it works as intended. I'd like the test to express that it works as intended, irrespective of the timing of requests.

@SpamapS

SpamapS commented Nov 22, 2024

Curious what you ended up doing, @arnoFleming? I ran into this when our unit tests for our Rack::Attack throttles were reported as "flaky", and now I've figured out that this is actually one of those times where the actual implementation is flaky!

We use Timecop in our unit tests, and this has stopped the test from flaking:

# Find the beginning of a "block": advance the frozen clock in 1ms
# steps until we cross into the next 60-second period
Timecop.freeze(Time.now)
nao = Time.now.to_f
block = nao.to_i / 60
newblock = block
while newblock == block
  nao += 0.001
  newblock = nao.to_i / 60
end
Timecop.freeze(Time.at(nao))
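
For what it's worth, the same thing can be done without the loop by jumping straight to the next block boundary (assuming the same 60-second period):

period = 60
Timecop.freeze(Time.at((Time.now.to_i / period + 1) * period))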

But… I sort of share your frustration with this approach. It makes the tests pass, but papers over the vulnerability to aggressive abuse.

@arnoFleming
Author

Curious what you ended up doing

TL;DR: We settled for something that is good for the business (proper throttles) but can't be tested the way we'd like (freezing time is not to my liking, as it gives false assurances).

I fixed it differently.
If memory serves:

  • I discussed this with the client. They agreed some sloppiness was acceptable, as long as the upper limit (the maximum number of allowed requests) was honoured, whatever the sloppiness.
  • We halved the allowed requests in the throttles to meet the expected deliverable, even if the windows do not match up (see the sketch after this list).
  • Then we updated the tests so that at least the expected amount was coming through, and double that amount was being blocked (I'm most unsure about this point, so take it with a pinch of salt).
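
Roughly (reconstructing from memory, so the details may be off), the halved throttle looked something like this. Since any sliding 8-hour span overlaps at most two epoch-aligned 8-hour windows, capping each window at 40 guarantees no more than 80 requests in any 8-hour span:

Rack::Attack.throttle("pdf/baseline/throttle/user_id", limit: 40, period: 8.hours) do |request|
  request.user_id if request.pdf_download_route?
end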

The other alternative would've been the following, but it wasn't chosen, as we don't like to add daemons if we can keep using the ones we have:
