Worker process stuck in infinite loop when keepalive is enabled #18
Comments
@rohitjoshi can you provide a minimal configuration file and the nginx -V output that make this error reproducible?
Please see the config below. Unfortunately, we could not reproduce it in our dev/QA environment, but it happened twice in production. nginx -V:
The upstream configuration contains a total of 39 entries, each with one server as below and keepalive set to 2.
Nginx Config:
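For reference (the actual production config is not reproduced here), a hypothetical sketch of one such upstream entry, with placeholder host names, could look like this:

```nginx
# Hypothetical sketch only; the host name is a placeholder.
# A resolver directive is also required at the http level, e.g. "resolver 10.0.0.2;".
upstream service_1 {
    server service-1.example.com resolve;   # re-resolved at runtime by the dynamic-servers module
    keepalive 2;
}
```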
We were having the same issue with nginx 1.9.15 and the nginx-upstream-dynamic-servers resolve module version 0.4.0.
Output of nginx -V:
The config has 8 upstreams, each configured with one server and a keepalive of 32.
Updating from 0.4.0 to master seems to have fixed the issue for us.
@rohitjoshi and @pgokul |
@wandenberg thanks for the patch. Unfortunately, I could not reproduce this issue earlier in our dev/QA environment. We observed this issue in prod both times and had to disable this feature. I will try to apply this patch and push it to prod, but that might take a week. Should we be using
The |
I'm not sure if it's related, but we've observed a similar problem in our production environment: after some time, one Nginx worker always took 100% of one CPU core and never went back down until a server restart. This also happens with the latest master (patch 29e05c5 applied) and never happens without this plugin. (It's kind of hard to debug/replicate this issue since it often happened only after a couple of hours and we don't know the cause.)
@wandenberg We are able to reproduce the issue in our QA environment consistently, and applying the patch seems to fix the issue for us. We are still testing different scenarios, though. As of now it seems fine.
@JanJakes we have a similar problem. As you can see in my post above, the end of the linked list gets overwritten by one of the node pointers, so it goes into an infinite loop.
@pgokul Can you share with us how to replicate the problem?
@wandenberg - We were able to replicate the issue by hitting the nginx server continuously with requests for an upstream endpoint which took around 20-30 seconds to complete.
Upstream block:
Nginx conf:
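The original upstream block and nginx conf are not reproduced here; a hypothetical sketch of the kind of setup described (placeholder names, ports, and timeouts) might look like this:

```nginx
# Hypothetical sketch only; events block and unrelated directives omitted.
http {
    resolver 127.0.0.1 valid=1s;              # DNS server whose answers change frequently

    upstream slow_backend {
        server backend.example.com resolve;   # re-resolved by the dynamic-servers module
        keepalive 32;
    }

    server {
        listen 8080;

        location / {
            proxy_pass http://slow_backend;   # backend takes ~20-30s to respond
            proxy_http_version 1.1;           # needed for upstream keepalive
            proxy_set_header Connection "";
            proxy_read_timeout 60s;
        }
    }
}
```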
Just a note - we are not using the |
Any update on this? Thx
Not yet. I need someone's help to reproduce the problem, or at least a working configuration similar to the environment where you are having the problem, including the backend application, since the code you point to as the loop problem is only used when nginx caches the connection to the backend.
I could not reproduce it in our QA environment, but if we disable keepalive in the upstream, we do not see any issue. This module works fine, but it increased latency by 100ms because it creates a connection per request. So it is nginx caching the connection and this module combined that causes this issue.
@rohitjoshi just to clarify, the module does not create the connections. You would have the same 100ms latency if not using it. By default nginx creates a connection to the backend each time it has to forward a request.
Hey all, we are experiencing the same issue, on version 29e05c5
@kipras I was able to force nginx to execute this piece of code. This is the first step towards reproducing and fixing the problem. Give me some more days.
@wandenberg did you manage to reproduce this issue consistently? If so - how?
The issue itself, not yet. I was only able to make nginx execute the code @rohitjoshi pointed out.
Hi @wandenberg, any updates? I just want to make sure we're not too far behind in our troubleshooting efforts if you have any new information. Thanks so much for your work so far.
@jsoverson I was not able to reproduce the infinite loop reported by @rohitjoshi, but I was able to reproduce another situation related to the keepalive module, where a segfault happens after some changes to the DNS address while Nginx receives one request per second. Questions for those who are having this infinite-loop issue:
You can send these directly to me if you don't want to share them with everybody, but I need more information to reproduce and fix the issue.
@wandenberg here are some more details. We modified a fake DNS server implemented in Python to randomly choose between two addresses (localhost and a private IP, in our case) and set a very low TTL (1 second). Also:
The request stream was several hundred requests per second, with keepalive set. In this reproduction environment, we were able to see the problem reliably within a few minutes. Some speculation:
Let me know if you need more information.
We have 15 upstream entries with keepalive set to 1. Each server has two IP addresses, and both IPs are rotated in a round-robin fashion every 30 seconds. Three of these upstream entries have two servers, an F5 and an AWS ELB, with round-robin-based routing. Both the F5 and the AWS ELB return two or more IPs.
@dgmoen this was the scenario that I reproduced, the segfault due to the previous_pool cleanup. @rohitjoshi can you modify your configuration a little bit to try to isolate the problem?
We are not able to reproduce it in our dev/QA environment, but looking at the backtrace of a process which was stuck in the infinite loop, it was using an F5 with multiple IPs and keepalive was enabled.
I think we fixed this issue in our environment by patching the keepalive module (scroll down for the patch). The thing is that this dynamic servers module reinitializes an upstream after the DNS resolution of one of the upstream servers changes. That calls
Even though we couldn't reproduce the issue, I suspect the problem that causes all of the possible issues (infinite loops, memory consumption, segfaults - we experienced all of these) is here, in the
My guess is that issues occur when the keepalive reinitialisation happens in between when
Needless to say, this can cause (and does cause) all kinds of issues. It breaks the new
When looking into some broken nginx instances with gdb I saw proof of this: cache and free queues that together contained 27 items, even though the config specified that there should be 16 items. Also, they clearly came from 3 allocations, when they should all have come from the same one. So it's clear that the issue is caused by mixing queue data from different allocations.
I also think that the issue is probably more likely to happen when there are frequent DNS changes for some upstream server (or when several servers in the same upstream have DNS changes in the span of several seconds) and at the same time some requests take a longer time to respond (say a minute or two). But again, these are just guesses, as I tried many things and situations and could not reproduce this issue properly.
Anyway, after patching the keepalive module, all the previously observed issues in our production (when dynamic DNS resolution was enabled) went away, so I think that fixed it. This is not a perfect solution; it would be better to patch this dynamic servers module, but I'm not sure how to fix it :/ Maybe some of you guys will have an idea.
The idea behind the patch is to add a safeguard in the keepalive module - upstream config versioning - by adding an ID property to the upstream server config that is increased with every call to
Here's the patch for OpenResty nginx 1.9.7.4 in a GitHub gist:
Not sure if it will work for your nginx version, and you'll definitely need to change the nginx directory path there if you're not using OpenResty or are using some other version of nginx. If anyone uses it, let me know if it worked for you. And maybe this will give you guys some idea as to how to fix the nginx-upstream-dynamic-servers module instead (as that would be a better option).
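To make the idea above concrete, here is a minimal standalone sketch of the versioning safeguard (simplified, hypothetical types; this is not the patch from the gist and not actual nginx code): every cached keepalive item is stamped with the generation of the upstream config it was created under, and items from an older generation are never reused.

```c
#include <stdio.h>

typedef struct {
    unsigned generation;        /* bumped on every re-init of the upstream */
} upstream_conf_t;

typedef struct cached_conn_s {
    unsigned generation;        /* generation the connection was cached under */
    int fd;                     /* placeholder for the real connection state */
    struct cached_conn_s *next;
} cached_conn_t;

/* Called each time a DNS change forces the upstream to be re-initialized. */
static void upstream_reinit(upstream_conf_t *conf) {
    conf->generation++;
}

/* Only hand back a cached connection that belongs to the current generation. */
static cached_conn_t *get_cached(upstream_conf_t *conf, cached_conn_t *head) {
    for (cached_conn_t *c = head; c != NULL; c = c->next) {
        if (c->generation == conf->generation) {
            return c;
        }
        /* stale item: in a real patch it would be closed and unlinked here */
    }
    return NULL;
}

int main(void) {
    upstream_conf_t conf = { .generation = 1 };
    cached_conn_t conn = { .generation = 1, .fd = 42, .next = NULL };

    printf("before re-init: %p\n", (void *) get_cached(&conf, &conn));
    upstream_reinit(&conf);   /* DNS changed, upstream re-initialized */
    printf("after re-init:  %p\n", (void *) get_cached(&conf, &conn));
    return 0;
}
```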
@kipras Nice analysis.
Thanks for finding the issue. Does this impact only the keepalive module, or do we need to patch the round-robin/hash modules as well?
Rather than patching the keepalive, round-robin, hash, and other modules, IMO the real fix is to ensure that this module does not mess with the memory allocations. See pull request #21.
@rohitjoshi, we've tested the fix from @ariya, and it does resolve the 100% CPU issue as well as the segmentation fault we've been experiencing. However, as @wandenberg noted, it does also leak memory. For the time being, we're managing this by periodically reloading the nginx workers. Not an ideal solution, but much better than the currently available alternatives.
@dgmoen thanks for validating. I have switched to |
I believe line 528 is problematic.
I can come up with a contrived scenario: suppose that at the moment statement 528 is called, there is at least one active request, say request r1, while the pointer p =
By the time r1 finishes (assuming that for some reason the proxy/upstream connection cannot be kept alive, and an RR load balancer), function
I don't see the keep-alive module directly suffering from the invalid memory access by statically reading the code, but it may be an indirect victim.
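A simplified, standalone illustration of the ordering described in this scenario (hypothetical names; the real objects are nginx pools and upstream peer data, and the real code does not call free() directly like this):

```c
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    char peer_name[64];   /* stands in for data the request keeps a pointer to */
} pool_t;

int main(void) {
    /* t0: upstream/peer data allocated from previous_pool */
    pool_t *previous_pool = malloc(sizeof(pool_t));
    char *r1_peer = previous_pool->peer_name;   /* request r1 holds this pointer */

    /* t1: DNS changes, the module re-initializes the upstream and destroys
     * previous_pool even though r1 has not finished yet. */
    free(previous_pool);
    printf("previous_pool destroyed while r1 is still in flight\n");

    /* t2: r1 finishes; its free_peer / load-balancer callback would now read
     * through r1_peer, i.e. freed memory -> undefined behaviour (a segfault in
     * the lucky case, silent corruption otherwise). Not dereferenced here. */
    (void) r1_peer;
    return 0;
}
```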
I spot another potential problem over here (it is unlikely to take place, though):
If an error takes place, we proceed as if nothing has happened.
I realized a new problem after I came up with a fix for the segfault problem: keepalive is initialized again and again; each time it is initialized it simply clobbers its cached connections without disconnecting them; over time it will accumulate far more connections than "keepalive" specifies. Now I understand why keepalive suffers from memory access problems as well; it looks like people have analyzed this before (apologies for being an impatient reader). Reproducing the segfault problem is pretty easy - it does not need a special DNS server setup. Just change the source code so that it calls init_upstream
Typo at line 561? I think it makes more sense to set the timer interval to
As a workaround, I have fixed the issue by running an external Python script that periodically resolves DNS for the upstream hosts and, if an IP address changes, reloads nginx.
Hi all, I have what I think is the same issue here. I can reproduce the situation where NGINX uses 100% CPU every time, so I'd be happy to try to provide whatever debug info you need. My setup:
My observations:
I'm going to try to change some config and see if I can make any progress. Let me know if I can be of any help, and I'll feed back if I find anything useful. Cheers
Just adding some output from
I just found that updating to latest |
I've been using the (currently) latest 29e05c5 commit on master and hit this same issue with an upstream that had a very low DNS TTL and changed IPs regularly. CPU usage went very high, systemtap showed time was spent in the same call stack shared by others above, and occasional core dumps from segmentation faults also shared the call stack with those above. I was able to somewhat frequently reproduce the issue in isolation by using
Building the module with the b3ded6c commit on the
I think I have solved this problem.
@dgmoen how about solution #21 in your project - did you find any other problems? How often do you reload nginx for this solution to work well?
When keepalive is enabled along with the nginx-upstream-dynamic-servers resolve module, the worker process goes into an infinite loop in the ngx_http_upstream_keepalive_module.
If I disable nginx-upstream-dynamic-servers, there are no issues.
nginx version: 1.9.15
Here is the code from which it is not able to come out of this for loop:
As you can see below, q->next and cache->next are never null.
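The gdb output is not reproduced here. As a standalone illustration of why such a traversal can never terminate, here is a small sketch (simplified stand-in types, not the actual ngx_queue/keepalive source): the loop only stops when it gets back to the sentinel it is compared against, so if the cached nodes form a ring that does not pass through that sentinel, it spins forever.

```c
#include <stdio.h>

/* Simplified stand-in for ngx_queue_t: a sentinel-based circular doubly linked list. */
typedef struct node_s node_t;
struct node_s { node_t *prev; node_t *next; };

static void queue_init(node_t *sentinel) {
    sentinel->prev = sentinel;
    sentinel->next = sentinel;
}

static void queue_insert_head(node_t *sentinel, node_t *n) {
    n->next = sentinel->next;
    n->next->prev = n;
    n->prev = sentinel;
    sentinel->next = n;
}

int main(void) {
    node_t old_cache, new_cache;      /* sentinels of two config generations */
    node_t a, b;                      /* items that belong to the old generation */

    queue_init(&old_cache);
    queue_insert_head(&old_cache, &a);
    queue_insert_head(&old_cache, &b);

    /* Simulate a re-init of the upstream: a new sentinel is created, but the
     * old items are still reachable, so the ring they form never passes
     * through &new_cache. */
    queue_init(&new_cache);

    /* The keepalive lookup loop, paraphrased:
     *   for (q = cache->next; q != cache; q = q->next) { ... }
     * Walking the old ring while comparing against the new sentinel never
     * terminates; it is bounded here so the demo halts. */
    int steps = 0;
    for (node_t *q = old_cache.next; q != &new_cache && steps < 10; q = q->next) {
        steps++;
    }
    printf("gave up after %d steps: sentinel never reached -> infinite loop\n", steps);
    return 0;
}
```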