resolve tracing span sometimes lingers even after response received #2435

SorenHolstHansen · 2024-09-26T11:45:32Z

We have noticed a weird situation in the production traces for our axum server: sometimes, even when an endpoint responds very quickly, the traces say that the endpoint took a very long time. This doesn't happen every time, but when it does, it always involves reqwest calls (i.e. the reqwest crate).

A trace could look something like this (this is how it looks in our prod metrics platform in Azure; you will see below that another platform shows the endpoint as shorter):

endpoint:     -------------
something:    --
reqwest_call:   -----------
stuff_after:       --
response:           --

That is, the response is done quite quickly, but because the reqwest call seems to hang, the whole endpoint shows as taking a long time. The reqwest calls themselves all finish very quickly: you can see that "stuff_after" starts shortly after the call begins, and we have confirmed by other means that the calls return fast. So something inside the reqwest client keeps a span open and makes the call look slow.

We even added a field to the root request span (using tower tracing) that shows that the endpoint is done quickly. So from the user's point of view nothing is taking a long time, but for us it means we can't trust our tracing.

I made a reproducible case here: https://github.com/SorenHolstHansen/reqwest_tracing_bug

To reproduce, I started a local OpenTelemetry platform (SigNoz or Jaeger), started the server in the repo I linked, and then ran: seq 1 5000 | xargs -P0 -I{} curl http://127.0.0.1:8008

In signoz, you should then be able to see cases like these:

You can see in the bottom right, that the endpoint took 0 seconds, but resolve took 35 seconds.

I created a similar issue earlier in the tracing crate, which led me to believe that reqwest has long lived spans.

seanmonstar · 2024-09-26T13:16:52Z

I haven't followed the code to verify this, but a quick guess is this: when a request is started, the pool that reqwest uses will create two futures and race them: 1) ask the pool for a newly idle connection, 2) create a new connection. If the second option is started but the first option then wins the race (especially likely here, because resolving looks like it's being slow), the second future will be spawned as a background task to finish up and insert the new connection into the pool (this reduces socket thrashing).

Perhaps that means the span around resolving outlives the original request span?
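
The guess above can be sketched with plain threads. This is a toy model, not hyper-util's pool code: threads stand in for futures, and `ToySpan`, `Winner`, `RaceOutcome` and `race_checkout_and_connect` are hypothetical names invented for illustration. It only shows how the losing attempt can keep a "resolve" span open after the winner has already served the request.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc;
use std::sync::Arc;
use std::thread::{self, JoinHandle};

/// Which of the two raced attempts supplied the connection.
#[derive(Debug, PartialEq)]
pub enum Winner {
    IdleFromPool,
    NewConnection,
}

/// Hypothetical stand-in for a tracing span: it counts as closed once dropped.
pub struct ToySpan {
    closed: Arc<AtomicBool>,
}

impl Drop for ToySpan {
    fn drop(&mut self) {
        self.closed.store(true, Ordering::SeqCst);
    }
}

pub struct RaceOutcome {
    pub winner: Winner,
    /// True once the toy "resolve" span has been dropped.
    pub resolve_closed: Arc<AtomicBool>,
    /// Sending on this lets the slow connect attempt finish.
    pub release_connect: mpsc::Sender<()>,
    pub background: JoinHandle<()>,
}

pub fn race_checkout_and_connect() -> RaceOutcome {
    let (result_tx, result_rx) = mpsc::channel();
    let (release_tx, release_rx) = mpsc::channel::<()>();
    let closed = Arc::new(AtomicBool::new(false));
    let span = ToySpan { closed: Arc::clone(&closed) };

    // Attempt 2: open a new connection. It owns the "resolve" span and
    // blocks (simulating slow DNS) until the caller releases it.
    let connect_tx = result_tx.clone();
    let background = thread::spawn(move || {
        let _resolve = span;
        let _ = release_rx.recv();
        // The race may already be over, so a failed send is ignored.
        let _ = connect_tx.send(Winner::NewConnection);
    });

    // Attempt 1: an idle pooled connection is available immediately.
    result_tx.send(Winner::IdleFromPool).unwrap();

    // The first result wins; the loser keeps running in the background.
    let winner = result_rx.recv().unwrap();
    RaceOutcome {
        winner,
        resolve_closed: closed,
        release_connect: release_tx,
        background,
    }
}
```

In this model the caller gets `Winner::IdleFromPool` right away, while `resolve_closed` stays false until the background attempt is released and joined, which mirrors a fast response paired with a long-lived resolve span.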

SorenHolstHansen · 2024-09-26T14:15:30Z

Yeah, the guy that answered the issue in the tracing repo thought the same thing about the span outliving the request span

SorenHolstHansen · 2024-09-27T07:44:05Z

Is it something we could convince you to take a look at?

seanmonstar · 2024-09-27T14:37:07Z

It's not something I can dig into myself, but others are welcome. The relevant span is here: https://github.com/hyperium/hyper-util/blob/fcb8565bb8090d90349b04b2caa9c8128a8ae59f/src/client/legacy/connect/dns.rs#L121

I don't know if there's a mechanism to disconnect a span from its parent after creation, or some other solution.

mladedav · 2024-09-27T19:11:57Z

Not after creation. Parent-child relationships are forever, and all parents live as long as their children.
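
A minimal sketch of that rule, using reference counting as an assumed model of span lifetimes. `ToySpan`, `CloseLog` and `close_order` are hypothetical and are not part of the tracing crate; the point is only that a child holding a strong reference to its parent delays the parent's close.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Shared log of span names, in the order the spans closed.
type CloseLog = Rc<RefCell<Vec<&'static str>>>;

/// Hypothetical span: it "closes" (logs its name) when dropped.
pub struct ToySpan {
    name: &'static str,
    // A child owns a strong reference to its parent, so the parent
    // cannot be dropped while the child is still alive.
    _parent: Option<Rc<ToySpan>>,
    log: CloseLog,
}

impl Drop for ToySpan {
    fn drop(&mut self) {
        self.log.borrow_mut().push(self.name);
    }
}

/// Returns how many spans had closed when the handler returned,
/// plus the final close order.
pub fn close_order() -> (usize, Vec<&'static str>) {
    let log: CloseLog = Rc::new(RefCell::new(Vec::new()));
    let request = Rc::new(ToySpan {
        name: "http_request",
        _parent: None,
        log: Rc::clone(&log),
    });
    let resolve = ToySpan {
        name: "resolve",
        _parent: Some(Rc::clone(&request)),
        log: Rc::clone(&log),
    };

    // The handler returns and lets go of its own handle...
    drop(request);
    // ...but nothing has closed yet, because `resolve` still holds the parent.
    let closed_after_handler = log.borrow().len();

    // Only when the child goes away does the parent close, right after it.
    drop(resolve);
    let order: Vec<&'static str> = log.borrow().clone();
    (closed_after_handler, order)
}
```

Here `close_order()` reports zero closed spans at the moment the handler finishes, and the request span closes only after the resolve span does.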

SorenHolstHansen · 2024-09-28T07:01:55Z

I thought that maybe a simple tracing subscriber filter could filter out that span? At least that was my next idea

SorenHolstHansen · 2024-09-30T07:31:07Z

Nvm, using a subscriber filter won't work: the span is created with .or_current() (see here), so when the filter disables it, the current span is used in its place and I can't get rid of it that way.
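
To illustrate why the filter does not help, here is a toy model of the fallback. It follows the documented behaviour of Span::or_current (return the span if it is enabled, otherwise the current span), but the `ToySpan` enum and the `resolve_span` helper are hypothetical and do not use the tracing crate.

```rust
/// Hypothetical span handle: either an enabled, named span or a disabled one.
#[derive(Debug, Clone, PartialEq)]
pub enum ToySpan {
    Enabled(&'static str),
    Disabled,
}

impl ToySpan {
    pub fn is_disabled(&self) -> bool {
        matches!(self, ToySpan::Disabled)
    }

    /// Keep this span if it is enabled; otherwise fall back to a handle
    /// on the current span.
    pub fn or_current(self, current: &ToySpan) -> ToySpan {
        if self.is_disabled() {
            current.clone()
        } else {
            self
        }
    }
}

/// Builds the span that the background resolve work would carry.
/// `filter_allows_resolve` models whether a subscriber filter enables it.
pub fn resolve_span(filter_allows_resolve: bool, current: &ToySpan) -> ToySpan {
    let span = if filter_allows_resolve {
        ToySpan::Enabled("resolve")
    } else {
        ToySpan::Disabled
    };
    span.or_current(current)
}
```

With the filter off, the background work ends up holding the request span itself rather than nothing, so in this model filtering out "resolve" cannot shorten the request span.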

seanmonstar · 2024-09-30T13:55:01Z

Perhaps the .or_current() could be removed. I'm not sure why it's there; I don't usually see that when declaring spans.

SorenHolstHansen · 2024-09-30T17:14:08Z

Looking at the docs for .or_current(), it looks like it is the recommended way to handle that kind of situation. Perhaps it would be better if the span had no parent but used .follows_from() to link to the request span instead. I'm not an expert though, so perhaps @mladedav has some insights.
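
A sketch of the difference this suggestion is after, again as a toy model rather than real tracing code. It assumes a follows-from link is a non-owning reference (stored here as just the other span's name); whether a given subscriber treats it that way is an assumption. `ToySpan`, `Link` and `request_closes_when_handler_returns` are hypothetical names.

```rust
use std::rc::{Rc, Weak};

/// Hypothetical span with an owning parent link and non-owning follows-from links.
pub struct ToySpan {
    pub name: &'static str,
    /// Owning link: keeps the parent alive.
    pub parent: Option<Rc<ToySpan>>,
    /// Non-owning links: only record which spans this one follows from.
    pub follows_from: Vec<&'static str>,
}

/// How the resolve span is attached to the request span.
pub enum Link {
    Child,
    FollowsFrom,
}

/// Returns true if the request span is gone as soon as the handler drops it,
/// even though the resolve span is still alive.
pub fn request_closes_when_handler_returns(link: Link) -> bool {
    let request = Rc::new(ToySpan {
        name: "http_request",
        parent: None,
        follows_from: Vec::new(),
    });
    // A weak handle lets us observe the request span without keeping it alive.
    let observer: Weak<ToySpan> = Rc::downgrade(&request);

    let _resolve = match link {
        Link::Child => ToySpan {
            name: "resolve",
            parent: Some(Rc::clone(&request)),
            follows_from: Vec::new(),
        },
        Link::FollowsFrom => ToySpan {
            name: "resolve",
            parent: None,
            follows_from: vec![request.name],
        },
    };

    // The handler returns while `_resolve` is still in scope.
    drop(request);
    observer.upgrade().is_none()
}
```

In this model the request span closes on time only for `Link::FollowsFrom`; with `Link::Child` it stays alive for as long as the resolve span does.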

seanmonstar · 2024-10-01T16:48:39Z

This seems like a possible solution: hyperium/hyper-util#153

SorenHolstHansen · 2024-10-01T17:20:21Z

Looks good to me!

svix-jplatte · 2024-11-01T12:12:07Z

The new hyper-util v0.1.10 is out, but we're still seeing panics from tracing that get resolved when downgrading it to v0.1.7. Previously I was under the impression that this behavior was a combination of tokio-rs/tracing#2870 and this bug (since the version it started happening with matches exactly), now I'm not so certain anymore.

@SorenHolstHansen Does the original bug still reproduce with hyper-util v0.1.10?

SorenHolstHansen · 2024-11-04T08:49:05Z

Hmmm, well the behavior is different now, but there still seems to be something off.

When I run the repro case now, I no longer get cases where http_request takes a short time but resolve takes a long time. Instead, http_request also shows as taking a long time whenever resolve does; see for instance this image

I tried to then filter out the resolve by setting an env-filter of trace,hyper-util=off,hyper_util=off, but then I just got this

i.e. I can't filter resolve away properly, which seems like a bug too, though I don't know where.

I don't know if that is related to your panics @svix-jplatte, or if it's just an oversight on my part.

seanmonstar · 2024-11-04T11:38:44Z

If this is still causing panics, we can yank it out. It was meant to be helpful, not crash 🙈

svix-jplatte · 2024-11-04T12:50:07Z

Well I'm pretty sure it's not (entirely) hyper-util's fault. It seems to only happen with some async tasks using a non-default tracing subscriber / dispatcher, so I think the tracing bug I linked is also involved.

seanmonstar · 2024-11-04T22:26:58Z

Yea, I'm certain it's a problem in tracing; hyper is doing something reasonable.

But, the benefit that change was supposed to provide is not worth causing crashes. Let's revert.

Andrey36652 · 2024-11-24T23:28:34Z

@SorenHolstHansen I skimmed through your description. Maybe this somehow relates to your problem? #2381

seanmonstar mentioned this issue Oct 1, 2024

refactor: allow resolve span to be disabled hyperium/hyper-util#153

Merged

seanmonstar changed the title ~~reqwest seems to hang around after instrumented function is done~~ resolve tracing span sometimes lingers even after response received Oct 1, 2024
