Distributed cache concurrency issue #83
Hi @Caskia, your observation is correct. How would you implement the Token Bucket algorithm in this particular case? Please feel free to contribute with details, pseudo-code, or open a new pull request. Thanks
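The Token Bucket algorithm asked about here could be sketched roughly like this (an illustrative, in-process-only sketch; the class and member names are mine, not the library's, and a distributed version would have to keep the token count and refill timestamp in Redis and update them atomically):

```csharp
using System;

// Illustrative token bucket: a bucket holds up to `capacity` tokens and is
// refilled continuously at `refillRatePerSecond`; each request consumes one
// token and is rejected when the bucket is empty.
public class TokenBucket
{
    private readonly double _capacity;
    private readonly double _refillRatePerSecond;
    private double _tokens;
    private DateTime _lastRefill;
    private readonly object _sync = new object();

    public TokenBucket(double capacity, double refillRatePerSecond)
    {
        _capacity = capacity;
        _refillRatePerSecond = refillRatePerSecond;
        _tokens = capacity;
        _lastRefill = DateTime.UtcNow;
    }

    public bool TryConsume(double tokens = 1)
    {
        lock (_sync)
        {
            // Lazily refill based on the time elapsed since the last call.
            var now = DateTime.UtcNow;
            var elapsed = (now - _lastRefill).TotalSeconds;
            _tokens = Math.Min(_capacity, _tokens + elapsed * _refillRatePerSecond);
            _lastRefill = now;

            if (_tokens < tokens)
                return false;

            _tokens -= tokens;
            return true;
        }
    }
}
```

A burst of up to `capacity` requests is allowed, after which requests are admitted at the refill rate; this avoids the hard reset at period boundaries that fixed-window counters have.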
the error: [image not captured]

…from it? Thanks for the report.

@cristipufu hi, could you check the error message please?
There's a workaround for the distributed cache issue, but it requires a little bit of coding (be aware that this is only a concept and it's not tested).

Override the rate limiting middleware part, `DistributedClientRateLimitMiddleware.cs`:

```csharp
public static IApplicationBuilder UseDistributedClientRateLimiting(this IApplicationBuilder builder)
{
    return builder.UseMiddleware<DistributedClientRateLimitMiddleware>();
}

public class DistributedClientRateLimitMiddleware : RateLimitMiddleware<DistributedClientRateLimitProcessor>
{
    private readonly ILogger<ClientRateLimitMiddleware> _logger;

    public DistributedClientRateLimitMiddleware(RequestDelegate next,
        IOptions<ClientRateLimitOptions> options,
        IRateLimitCounterStore counterStore,
        IClientPolicyStore policyStore,
        IRateLimitConfiguration config,
        RedLockFactory redLockFactory, // injected here so it can be handed to the processor
        ILogger<ClientRateLimitMiddleware> logger)
        : base(next, options?.Value, new DistributedClientRateLimitProcessor(options?.Value, counterStore, policyStore, config, redLockFactory), config)
    {
        _logger = logger;
    }

    protected override void LogBlockedRequest(HttpContext httpContext, ClientRequestIdentity identity, RateLimitCounter counter, RateLimitRule rule)
    {
        _logger.LogInformation($"Request {identity.HttpVerb}:{identity.Path} from ClientId {identity.ClientId} has been blocked, quota {rule.Limit}/{rule.Period} exceeded by {counter.Count}. Blocked by rule {rule.Endpoint}, TraceIdentifier {httpContext.TraceIdentifier}.");
    }
}
```

Now, we just have to override the processor, `DistributedClientRateLimitProcessor.cs`:

```csharp
public class DistributedClientRateLimitProcessor : ClientRateLimitProcessor
{
    private readonly IRateLimitCounterStore _counterStore;
    private readonly IRateLimitConfiguration _config;
    private readonly RedLockFactory _redLockFactory;

    public DistributedClientRateLimitProcessor(ClientRateLimitOptions options,
        IRateLimitCounterStore counterStore,
        IClientPolicyStore policyStore,
        IRateLimitConfiguration config,
        RedLockFactory redLockFactory)
        : base(options, counterStore, policyStore, config)
    {
        _counterStore = counterStore;
        _config = config;
        _redLockFactory = redLockFactory; // this should be injected as a singleton
    }

    public override async Task<RateLimitCounter> ProcessRequestAsync(ClientRequestIdentity requestIdentity, RateLimitRule rule, CancellationToken cancellationToken = default)
    {
        var counter = new RateLimitCounter
        {
            Timestamp = DateTime.UtcNow,
            Count = 1
        };

        var counterId = BuildCounterKey(requestIdentity, rule);
        var expiry = TimeSpan.FromSeconds(2);

        // serialize reads and writes on the same key across all nodes
        using (var redLock = await _redLockFactory.CreateLockAsync(counterId, expiry)) // there are also non-async Create() methods
        {
            // make sure we got the lock
            if (redLock.IsAcquired)
            {
                var entry = await _counterStore.GetAsync(counterId, cancellationToken);
                if (entry.HasValue)
                {
                    // entry has not expired
                    if (entry.Value.Timestamp + rule.PeriodTimespan.Value >= DateTime.UtcNow)
                    {
                        // increment request count (parentheses matter: without them
                        // the ?? would apply to the whole sum)
                        var totalCount = entry.Value.Count + (_config.RateIncrementer?.Invoke() ?? 1);
                        // deep copy
                        counter = new RateLimitCounter
                        {
                            Timestamp = entry.Value.Timestamp,
                            Count = totalCount
                        };
                    }
                }
                // stores: id (string) - timestamp (datetime) - total_requests (long)
                await _counterStore.SetAsync(counterId, counter, rule.PeriodTimespan.Value, cancellationToken);
            }
        }
        // the lock is automatically released at the end of the using block

        return counter;
    }
}
```

One can implement a distributed lock mechanism in .NET with a library such as RedLock.net (which provides the `RedLockFactory` used above).
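As a sketch of the "singleton injected" note above (assuming RedLock.net and StackExchange.Redis; the extension-method name and Redis endpoint are illustrative, not part of this library):

```csharp
using System.Collections.Generic;
using Microsoft.Extensions.DependencyInjection;
using RedLockNet.SERedis;
using RedLockNet.SERedis.Configuration;
using StackExchange.Redis;

public static class RedLockRegistration
{
    public static IServiceCollection AddRedLockFactory(this IServiceCollection services, string redisConfiguration)
    {
        // RedLockFactory wraps one or more Redis connections; register it as a
        // singleton so all requests coordinate locks through the same multiplexer.
        services.AddSingleton(_ =>
        {
            var multiplexer = ConnectionMultiplexer.Connect(redisConfiguration);
            return RedLockFactory.Create(new List<RedLockMultiplexer>
            {
                new RedLockMultiplexer(multiplexer)
            });
        });
        return services;
    }
}
```

With this in place, the `RedLockFactory` constructor parameter of the processor resolves automatically from the container.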
Then we just have to register the distributed counter store and the Redis cache:

```csharp
services.AddSingleton<IRateLimitCounterStore, DistributedCacheRateLimitCounterStore>();

services.AddStackExchangeRedisCache(options =>
{
    options.Configuration = "localhost";
    options.InstanceName = "SampleInstance";
});
```
```csharp
services.AddDistributedRedisCache(options =>
{
    // e.g. contoso5.redis.cache.windows.net,abortConnect=false,ssl=true,password=...
    options.Configuration = Configuration["IpRateLimitingRedisString"];
    options.InstanceName = "ip_rate_limit_";
});

// needed to store rate limit counters and ip rules
services.AddMemoryCache();

// load general configuration from appsettings.json
services.Configure<IpRateLimitOptions>(Configuration.GetSection("IpRateLimiting"));

// load ip rules from appsettings.json
services.Configure<IpRateLimitPolicies>(Configuration.GetSection("IpRateLimitPolicies"));

// inject counter and rules stores
services.AddSingleton<IIpPolicyStore, DistributedCacheIpPolicyStore>();
services.AddSingleton<IRateLimitCounterStore, DistributedCacheRateLimitCounterStore>();
```

@cristipufu like this? But it does not work: the distributed cache concurrency problem is still there.
There's also another possible solution (also overriding the …)
Just wanted to throw in some more information. I run this plugin on an API that was getting about 1 million requests per 4 hours. After getting complaints about sluggishness during high peak times, I removed this limiter and response times went down from about 0.6 sec per request to 0.15 sec per request. I used a Redis backend for the cache as well. So there is definitely something that needs to be done to help performance under high loads. Otherwise, for small projects this plugin totally rocks.
@STRATZ-Ken were you using the 3+ version? (The previous versions had a simple lock, which means that each request would have throttled all the others.) The issue probably is this block of code:

```csharp
using (await AsyncLock.WriterLockAsync(counterId).ConfigureAwait(false))
{
    var entry = await _counterStore.GetAsync(counterId, cancellationToken);
    if (entry.HasValue)
    {
        // entry has not expired
        if (entry.Value.Timestamp + rule.PeriodTimespan.Value >= DateTime.UtcNow)
        {
            // increment request count
            var totalCount = entry.Value.Count + _config.RateIncrementer?.Invoke() ?? 1;
            // deep copy
            counter = new RateLimitCounter
            {
                Timestamp = entry.Value.Timestamp,
                Count = totalCount
            };
        }
    }
    // stores: id (string) - timestamp (datetime) - total_requests (long)
    await _counterStore.SetAsync(counterId, counter, rule.PeriodTimespan.Value, cancellationToken);
}
```

As you can see, this holds a lock per ip/client/etc. (depending on the plugin's configuration) that makes 2 Redis calls inside it. So if one would implement this atomic operation directly in Redis, replacing the lock + the 2 calls with a single Redis call, I think that the performance would improve a lot.
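That single atomic Redis call could be sketched like this (a hedged sketch using StackExchange.Redis; the class name and key handling are illustrative): a Lua script increments the counter and sets its expiry in one server-side step, so no application lock is needed.

```csharp
using System;
using System.Threading.Tasks;
using StackExchange.Redis;

public class AtomicRedisCounter
{
    // INCR the counter and set its expiry atomically on the server,
    // replacing the in-process lock plus the two separate cache calls.
    private const string IncrementScript = @"
        local count = redis.call('INCR', KEYS[1])
        if count == 1 then
            redis.call('EXPIRE', KEYS[1], ARGV[1])
        end
        return count";

    private readonly IDatabase _db;

    public AtomicRedisCounter(IConnectionMultiplexer connection)
    {
        _db = connection.GetDatabase();
    }

    public async Task<long> IncrementAsync(string counterId, TimeSpan period)
    {
        var result = await _db.ScriptEvaluateAsync(
            IncrementScript,
            new RedisKey[] { counterId },
            new RedisValue[] { (long)period.TotalSeconds });
        return (long)result;
    }
}
```

Because the whole script runs atomically inside Redis, concurrent requests from any number of nodes see a strictly increasing count, and each request costs one round trip instead of two plus a lock.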
Any update on this?

Any plans to fix this?
Any plans to fix this?
We resolved this issue in a fork by refactoring the counter store. For each rate limit rule, time is divided into intervals that are the length of the rule's period. Requests are resolved to an interval, and the interval's counter is incremented. This is a slight change in semantics from the original implementation, but it works in our use case. This approach requires a direct dependency on Redis.
As far as I can see, there is still a race condition there, because the call to the Redis INCR command and the call to set the expiration are separate calls, and the counter may change in between. It looks like the only way to make it work correctly is to make it atomic using a Lua script, exactly as explained in the Redis documentation for the INCR command.
No race condition: it does not matter what the counter value is when the key expiry is set. The expiry only needs to be set once (but can be set any number of times), and the check for a counter value of 1 ensures this. This is why the algorithm divides time into intervals of the rule's period. The expiry time will be the same for each interval, and any client may set it, as long as it is set at least once.
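The interval scheme described here could look roughly like this (a sketch with StackExchange.Redis; the key format and method names are assumptions): the key embeds the interval index, every client INCRs it, and whichever client sees the count reach 1 sets the expiry.

```csharp
using System;
using System.Threading.Tasks;
using StackExchange.Redis;

public static class FixedWindowCounter
{
    // Returns the request count within the current interval of the rule's period.
    public static async Task<long> IncrementAsync(IDatabase db, string ruleKey, TimeSpan period)
    {
        // Resolve "now" to an interval index so every node derives the same key.
        var interval = DateTimeOffset.UtcNow.ToUnixTimeSeconds() / (long)period.TotalSeconds;
        var key = $"{ruleKey}:{interval}";

        var count = await db.StringIncrementAsync(key);
        if (count == 1)
        {
            // Only needs to succeed once per interval; any client may set it,
            // and setting it again is harmless.
            await db.KeyExpireAsync(key, period);
        }
        return count;
    }
}
```

The remaining failure mode is a node that performs the INCR and then crashes before the EXPIRE, leaving a key with no expiry; the Lua variant discussed in this thread closes that gap by making the two steps one atomic operation.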
Whatever is non-atomic (and your code is non-atomic) will produce race conditions, and so on. That's why the Redis documentation says it needs to be done with a Lua script, which is atomic; the Redis developers wrote that for a reason.
@olegkv the right place to discuss the fork is over at its repo, so I've created an issue there for this.

I've only recently found this repo and was wondering if there are known instances of using it in a production environment and how successful that has been. Presumably the side effect of this issue is an incorrect request count; I think we can get by with that within a certain range. Are there any other consequences we should know about?

@nick-cromwell I maintain a fork whose master branch is used in production. It has a known issue in a distributed deployment where one node may set a Redis key and then leave the network (e.g. crash) before it sets the key to expire. In this case, access to the rate-limited resource is eventually blocked. This is an extremely rare case, but it is serious enough that it has been addressed in the atomic-lua branch, which is being tested and has not yet been used in production.

Thanks for that, @simonhaines. I'll keep a lookout for a pull request from the atomic-lua branch. I'm not familiar with Lua scripting, but based on your discussion here with @olegkv it seems that will resolve this issue once and for all.
@simonhaines Are you good to make a PR from your …

@nick-cromwell I don't think so. The fork changes the …
Hey @stefanprodan, what do you think of the changes proposed by @simonhaines? I want to use this package but am holding off until the concurrency issues are sorted.
I was reading a recent article about how GitHub solved their rate-limiting issues (link), with the actual Lua script at the end. Does anyone consider this a valid approach worth looking into, and possibly creating a pull request?
@validide there is a pull request #180 that incorporates distributed rate-limiting and it looks like a good way forward. However, the proposed changes require a major version bump, and it looks like the project structure is going to be refactored, with the distributed rate limiting split out into a separate project.
Wish it would get merged and the version bumped already, though... it's been sitting a while.

This should have high priority, since for high-load projects this issue is a blocker...

I realize this is an open-source project and everyone is contributing on their own time, so is there any way I could help?
We've pushed a new version, 4.0.0 (https://www.nuget.org/packages/AspNetCoreRateLimit/4.0.0), compatible with …
@cristipufu, sorry to revive this, but I wanted to ask something. I've read through all of this thread, the related PRs, and the general documentation, but something is still unclear to me: is the concurrency issue identified originally here still a problem if you're using SQL Server for the distributed store, since it still uses the same in-process locking?

If it is still an issue for SQL Server cases, is the expectation basically that as you scale up, you eventually need to switch over from SQL Server to Redis (and the new implementation that uses direct Redis calls)? Apologies again for reviving this just to ask, but I'm trying to understand the risk of using SQL Server so that I can make a recommendation to my team.
Yes.

I think you can follow the Redis processing strategy example and implement something similar in SQL.
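A hedged sketch of that idea for SQL Server (table and column names are hypothetical, and `Microsoft.Data.SqlClient` is an assumption): push the read-modify-write into a single atomic statement so the database, not the application, serializes concurrent increments.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Data.SqlClient;

public static class SqlRateLimitCounter
{
    // One MERGE statement upserts and increments the counter atomically:
    // matched and unexpired -> increment; matched but expired -> reset the
    // window; no row -> insert a fresh one. HOLDLOCK prevents two concurrent
    // MERGEs from both taking the insert branch.
    private const string UpsertSql = @"
        MERGE RateLimitCounters WITH (HOLDLOCK) AS target
        USING (SELECT @id AS Id) AS source
        ON target.Id = source.Id
        WHEN MATCHED AND target.ExpiresAt > SYSUTCDATETIME() THEN
            UPDATE SET target.Count = target.Count + 1
        WHEN MATCHED THEN
            UPDATE SET target.Count = 1,
                       target.ExpiresAt = DATEADD(SECOND, @periodSeconds, SYSUTCDATETIME())
        WHEN NOT MATCHED THEN
            INSERT (Id, Count, ExpiresAt)
            VALUES (@id, 1, DATEADD(SECOND, @periodSeconds, SYSUTCDATETIME()))
        OUTPUT inserted.Count;";

    public static async Task<long> IncrementAsync(string connectionString, string counterId, TimeSpan period)
    {
        await using var connection = new SqlConnection(connectionString);
        await connection.OpenAsync();
        await using var command = new SqlCommand(UpsertSql, connection);
        command.Parameters.AddWithValue("@id", counterId);
        command.Parameters.AddWithValue("@periodSeconds", (int)period.TotalSeconds);
        var result = await command.ExecuteScalarAsync();
        return Convert.ToInt64(result);
    }
}
```

This mirrors the Redis Lua approach: one round trip, no application-side lock, with the database's own locking providing the atomicity.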
Can the docs be updated to provide an example of the in-memory fallback for the "New Version"? Thanks.
Hi @stefanprodan,

When deploying a lot of API services (using a distributed cache to store counters) behind a load balancer, every instance needs to fetch the counter and use an in-process lock to increment the count, as in the code below. I think this has concurrency issues: the counter will not work correctly when a lot of requests come in, because every instance counts by itself and then saves to the cache store, so the last write overwrites the previous ones.

Why not use the Token Bucket algorithm to implement this feature?