Worker process stuck in infinite loop when keepalive is enabled #18
Comments
@rohitjoshi can you provide a minimal configuration file and the nginx -V output that make this error reproducible?
Please see the config below. Unfortunately, we could not reproduce it in our dev/QA environment, but it happened twice in production. nginx -V:
The upstream configuration contains a total of 39 entries, each with one server as below and keepalive set to 2.
Nginx Config:
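For reference (the actual production config is not reproduced here), a hypothetical sketch of one such upstream entry, with placeholder host names, could look like this:

```nginx
# Hypothetical sketch only; the host name is a placeholder.
# A resolver directive is also required at the http level, e.g. "resolver 10.0.0.2;".
upstream service_1 {
    server service-1.example.com resolve;   # re-resolved at runtime by the dynamic-servers module
    keepalive 2;
}
```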
We were having the same issue with nginx 1.9.15 and the nginx-upstream-dynamic-servers resolve module version 0.4.0.
Output of nginx -V:
The config has 8 upstreams, each configured with one server and a keepalive of 32.
Updating from 0.4.0 to master seems to have fixed the issue for us.
@rohitjoshi and @pgokul |
@wandenberg thanks for the patch. Unfortunately, I could not reproduce this issue earlier in our dev/QA environment. We observed this issue in prod both times and had to disable this feature. I will try to apply this patch and push it to prod, but that might take a week. Should we be using
The |
I'm not sure if it's related, but we've observed a similar problem in our production environment: after some time, one Nginx worker always took 100% of one CPU core and never went back down until a server restart. This also happens with the latest master (patch 29e05c5 applied) and never happens without this plugin. (It's kind of hard to debug/replicate this issue since it often happened only after a couple of hours and we don't know the cause.)
@wandenberg We are able to reproduce the issue in our QA environment consistently, and applying the patch seems to fix the issue for us. We are still testing different scenarios, though. As of now it seems fine.
@JanJakes we have a similar problem. As you can see in my post above, the end of the linked list gets overwritten by one of the node pointers, so it goes into an infinite loop.
@pgokul Can you share with us how to replicate the problem?
@wandenberg - We were able to replicate the issue by hitting the nginx server continuously with requests for an upstream endpoint which took around 20-30 seconds to complete.
Upstream block:
Nginx conf:
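The original upstream block and nginx conf are not reproduced here; a hypothetical sketch of the kind of setup described (placeholder names, ports, and timeouts) might look like this:

```nginx
# Hypothetical sketch only; events block and unrelated directives omitted.
http {
    resolver 127.0.0.1 valid=1s;              # DNS server whose answers change frequently

    upstream slow_backend {
        server backend.example.com resolve;   # re-resolved by the dynamic-servers module
        keepalive 32;
    }

    server {
        listen 8080;

        location / {
            proxy_pass http://slow_backend;   # backend takes ~20-30s to respond
            proxy_http_version 1.1;           # needed for upstream keepalive
            proxy_set_header Connection "";
            proxy_read_timeout 60s;
        }
    }
}
```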
Just a note - we are not using the |
Any update on this? Thx
Not yet. I need someone's help to reproduce the problem, or at least a working configuration similar to the environment where you are having the problem, including the backend application, since the code you point to as the loop problem is only used when nginx caches the connection to the backend.
I could not reproduce it in our QA environment, but if we disable keepalive in the upstream, we do not see any issue. This module works fine, but it increased latency by 100ms because it creates a connection per request. So it is nginx caching the connection and this module combined that causes this issue.
@rohitjoshi just to clarify, the module does not create the connections. You would have the same 100ms latency if not using it. By default nginx creates a connection to the backend each time it has to forward a request.
Hey all, we are experiencing the same issue, on version 29e05c5
@kipras I was able to force nginx to execute this piece of code. This is the first step towards reproducing and fixing the problem. Give me some more days.
@wandenberg did you manage to reproduce this issue consistently? If so - how?
The issue itself, not yet. I was only able to make nginx execute the code @rohitjoshi pointed out.
Hi @wandenberg, any updates? I just want to make sure we're not too far behind in our troubleshooting efforts if you have any new information. Thanks so much for your work so far.
@jsoverson I was not able to reproduce the infinite loop reported by @rohitjoshi, but I was able to reproduce another situation related to the keepalive module, where a segfault happens after some changes to the DNS address while Nginx receives one request per second. Questions for those who are having this infinite-loop issue:
You can send these directly to me if you don't want to share them with everybody, but I need more information to reproduce and fix the issue.
@wandenberg here are some more details. We modified a fake DNS server implemented in Python to randomly choose between two addresses (localhost and a private IP, in our case) and set a very low TTL (1 second). Also:
The request stream was several hundred requests per second, with keepalive set. In this reproduction environment, we were able to see the problem reliably within a few minutes. Some speculation:
Let me know if you need more information.
We have 15 upstream entries with keepalive set to 1. Each server has two IP addresses, and both IPs are rotated in a round-robin fashion every 30 seconds. Three of these upstream entries have two servers, an F5 and an AWS ELB, with round-robin-based routing. Both the F5 and the AWS ELB return two or more IPs.
@dgmoen this was the scenario that I reproduced, the segfault due to the previous_pool cleanup. @rohitjoshi can you modify your configuration a little bit to try to isolate the problem?
We are not able to reproduce it in our dev/QA environment, but looking at the backtrace of a process which was stuck in the infinite loop, it was using an F5 with multiple IPs and keepalive was enabled.
I think we fixed this issue in our environment by patching the keepalive module (scroll down for the patch). The thing is that this dynamic servers module reinitializes an upstream after the DNS resolution of one of the upstream servers changes. That calls
Even though we couldn't reproduce the issue, I suspect the problem that causes all of the possible issues (infinite loops, memory consumption, segfaults - we experienced all of these) is here, in the
My guess is that issues occur when the keepalive reinitialisation happens in between when
Needless to say, this can cause (and does cause) all kinds of issues. It breaks the new
When looking into some broken nginx instances with gdb I saw proof of this: cache and free queues that together contained 27 items, even though the config specified that there should be 16 items. Also, they clearly came from 3 allocations, when they should all have come from the same one. So it's clear that the issue is caused by mixing queue data from different allocations.
I also think that the issue is probably more likely to happen when there are frequent DNS changes for some upstream server (or when several servers in the same upstream have DNS changes in the span of several seconds) and at the same time some requests take a longer time to respond (say a minute or two). But again, these are just guesses, as I tried many things and situations and could not reproduce this issue properly.
Anyway, after patching the keepalive module, all the previously observed issues in our production (when dynamic DNS resolution was enabled) went away, so I think that fixed it. This is not a perfect solution; it would be better to patch this dynamic servers module, but I'm not sure how to fix it :/ Maybe some of you guys will have an idea.
The idea behind the patch is to add a safeguard in the keepalive module - upstream config versioning - by adding an ID property to the upstream server config that is increased with every call to
Here's the patch for OpenResty nginx 1.9.7.4 in a GitHub gist:
Not sure if it will work for your nginx version, and you'll definitely need to change the nginx directory path there if you're not using OpenResty or are using some other version of nginx. If anyone uses it, let me know if it worked for you. And maybe this will give you guys some idea as to how to fix the nginx-upstream-dynamic-servers module instead (as that would be a better option).
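To make the idea above concrete, here is a minimal standalone sketch of the versioning safeguard (simplified, hypothetical types; this is not the patch from the gist and not actual nginx code): every cached keepalive item is stamped with the generation of the upstream config it was created under, and items from an older generation are never reused.

```c
#include <stdio.h>

typedef struct {
    unsigned generation;        /* bumped on every re-init of the upstream */
} upstream_conf_t;

typedef struct cached_conn_s {
    unsigned generation;        /* generation the connection was cached under */
    int fd;                     /* placeholder for the real connection state */
    struct cached_conn_s *next;
} cached_conn_t;

/* Called each time a DNS change forces the upstream to be re-initialized. */
static void upstream_reinit(upstream_conf_t *conf) {
    conf->generation++;
}

/* Only hand back a cached connection that belongs to the current generation. */
static cached_conn_t *get_cached(upstream_conf_t *conf, cached_conn_t *head) {
    for (cached_conn_t *c = head; c != NULL; c = c->next) {
        if (c->generation == conf->generation) {
            return c;
        }
        /* stale item: in a real patch it would be closed and unlinked here */
    }
    return NULL;
}

int main(void) {
    upstream_conf_t conf = { .generation = 1 };
    cached_conn_t conn = { .generation = 1, .fd = 42, .next = NULL };

    printf("before re-init: %p\n", (void *) get_cached(&conf, &conn));
    upstream_reinit(&conf);   /* DNS changed, upstream re-initialized */
    printf("after re-init:  %p\n", (void *) get_cached(&conf, &conn));
    return 0;
}
```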
@kipras Nice analysis.
Thanks for finding the issue. Does this impact only the keepalive module, or do we need to patch the round-robin/hash modules as well?
Rather than patching the keepalive, round-robin, hash, and other modules, IMO the real fix is to ensure that this module does not mess with the memory allocations. See pull request #21.
@rohitjoshi, we've tested the fix from @ariya, and it does resolve the 100% CPU issue as well as the segmentation fault we've been experiencing. However, as @wandenberg noted, it does also leak memory. For the time being, we're managing this by periodically reloading the nginx workers. Not an ideal solution, but much better than the currently available alternatives.
@dgmoen thanks for validating. I have switched to |
I believe line 528 is problematic.
I can come up with a contrived scenario: suppose that at the moment statement 528 is called, there is at least one active request, say request r1, while the pointer p =
By the time r1 finishes (assuming that for some reason the proxy/upstream connection cannot be kept alive, and an RR load balancer), function
I don't see the keep-alive module directly suffering from the invalid memory access by statically reading the code, but it may be an indirect victim.
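A simplified, standalone illustration of the ordering described in this scenario (hypothetical names; the real objects are nginx pools and upstream peer data, and the real code does not call free() directly like this):

```c
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    char peer_name[64];   /* stands in for data the request keeps a pointer to */
} pool_t;

int main(void) {
    /* t0: upstream/peer data allocated from previous_pool */
    pool_t *previous_pool = malloc(sizeof(pool_t));
    char *r1_peer = previous_pool->peer_name;   /* request r1 holds this pointer */

    /* t1: DNS changes, the module re-initializes the upstream and destroys
     * previous_pool even though r1 has not finished yet. */
    free(previous_pool);
    printf("previous_pool destroyed while r1 is still in flight\n");

    /* t2: r1 finishes; its free_peer / load-balancer callback would now read
     * through r1_peer, i.e. freed memory -> undefined behaviour (a segfault in
     * the lucky case, silent corruption otherwise). Not dereferenced here. */
    (void) r1_peer;
    return 0;
}
```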
I spot another potential problem over here (it is unlikely to take place, though):
If an error takes place, we proceed as if nothing has happened.
I realized a new problem after I came up with a fix for the segfault problem: keepalive is initialized again and again; each time it is initialized it simply clobbers its cached connections without disconnecting them; over time it will accumulate far more connections than "keepalive" specifies. Now I understand why keepalive suffers from memory access problems as well; it looks like people have analyzed this before (apologies for being an impatient reader). Reproducing the segfault problem is pretty easy - it does not need a special DNS server setup. Just change the source code so that it calls init_upstream
Typo at line 561? I think it makes more sense to set the timer interval to
As a workaround, I have fixed the issue by running an external Python script that periodically resolves DNS for the upstream hosts and, if an IP address changes, reloads nginx.
Hi all, I have what I think is the same issue here. I can reproduce the situation where NGINX uses 100% CPU every time, so I'd be happy to try to provide whatever debug info you need. My setup:
My observations:
I'm going to try to change some config and see if I can make any progress. Let me know if I can be of any help, and I'll feed back if I find anything useful. Cheers
Just adding some output from
I just found that updating to latest |
I've been using the (currently) latest 29e05c5 commit on master and hit this same issue with an upstream that had a very low DNS TTL and changed IPs regularly. CPU usage went very high, systemtap showed time was spent in the same call stack shared by others above, and occasional core dumps from segmentation faults also shared the call stack with those above. I was able to somewhat frequently reproduce the issue in isolation by using
Building the module with the b3ded6c commit on the
I think I have solved this problem.
@dgmoen how about solution #21 in your project - did you find any other problems? How often do you reload nginx for this solution to work well?
When keepalive is enabled along with the nginx-upstream-dynamic-servers resolve module, the worker process goes into an infinite loop in the ngx_http_upstream_keepalive_module.
If I disable nginx-upstream-dynamic-servers, there are no issues.
nginx version: 1.9.15
Here is the code from which it is not able to come out of this for loop:
As you can see below, q->next and cache->next are never null.
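The gdb output is not reproduced here. As a standalone illustration of why such a traversal can never terminate, here is a small sketch (simplified stand-in types, not the actual ngx_queue/keepalive source): the loop only stops when it gets back to the sentinel it is compared against, so if the cached nodes form a ring that does not pass through that sentinel, it spins forever.

```c
#include <stdio.h>

/* Simplified stand-in for ngx_queue_t: a sentinel-based circular doubly linked list. */
typedef struct node_s node_t;
struct node_s { node_t *prev; node_t *next; };

static void queue_init(node_t *sentinel) {
    sentinel->prev = sentinel;
    sentinel->next = sentinel;
}

static void queue_insert_head(node_t *sentinel, node_t *n) {
    n->next = sentinel->next;
    n->next->prev = n;
    n->prev = sentinel;
    sentinel->next = n;
}

int main(void) {
    node_t old_cache, new_cache;      /* sentinels of two config generations */
    node_t a, b;                      /* items that belong to the old generation */

    queue_init(&old_cache);
    queue_insert_head(&old_cache, &a);
    queue_insert_head(&old_cache, &b);

    /* Simulate a re-init of the upstream: a new sentinel is created, but the
     * old items are still reachable, so the ring they form never passes
     * through &new_cache. */
    queue_init(&new_cache);

    /* The keepalive lookup loop, paraphrased:
     *   for (q = cache->next; q != cache; q = q->next) { ... }
     * Walking the old ring while comparing against the new sentinel never
     * terminates; it is bounded here so the demo halts. */
    int steps = 0;
    for (node_t *q = old_cache.next; q != &new_cache && steps < 10; q = q->next) {
        steps++;
    }
    printf("gave up after %d steps: sentinel never reached -> infinite loop\n", steps);
    return 0;
}
```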