Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize Scannable#name() and related logic #3901

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

Stephan202
Copy link
Contributor

@Stephan202 Stephan202 commented Oct 3, 2024

Performance is improved in two ways:

  • By invoking Scannable#stepName() only when Attr.NAME is not
    explicitly set.
  • By optimizing Traces#extractOperatorAssemblyInformationParts, to
    which several Scannable#stepName() implementations delegate.

The Scannable#name() logic may be executed many times, e.g. if a hot
code path uses {Mono,Flux}#log or Micrometer instrumentation. The
added benchmark shows that for large stack traces, the new Traces
implementation is several orders of magnitude more efficient in terms of
compute and memory resource utilization.

Deferral of invocation of Scannable#stepName() assumes that said
method does not have side-effects. This is true for all built-in
implementations.

While there, improve two existing benchmarks by utilizing the black hole
to which benchmark method return values are implicitly sent.

@Stephan202 Stephan202 requested a review from a team as a code owner October 3, 2024 10:34
Copy link
Contributor Author

@Stephan202 Stephan202 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some comments with context. This PR relates to #3900.

Comment on lines +52 to +53
public Boolean measureThroughput() {
return Flux.range(0, rangeSize)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here and in MonoCallableBenchmark: as stated, these are unrelated improvements. Can be pulled into a separate PR if desired.

Comment on lines +33 to +39
@BenchmarkMode({Mode.AverageTime})
@Warmup(iterations = 5, time = 5, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 5, timeUnit = TimeUnit.SECONDS)
@Fork(value = 1)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
public class TracesBenchmark {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The configuration and setup for this benchmark match those of the existing benchmarks, including explicit specification of default parameters. Can be cleaned up a bit more if desired.

As for the impact of this PR, the following shows the before- and after output of ./gradlew jmh --include="TracesBenchmark" --profilers="gc":

Before:

Benchmark                                             (reactorLeadingLines)  (trailingLines)  Mode  Cnt       Score      Error   Units
TracesBenchmark.measureThroughput                                         0                0  avgt    5      93.082 ±    5.671   ns/op
TracesBenchmark.measureThroughput:gc.alloc.rate                           0                0  avgt    5    8607.367 ±  515.275  MB/sec
TracesBenchmark.measureThroughput:gc.alloc.rate.norm                      0                0  avgt    5     840.000 ±    0.001    B/op
TracesBenchmark.measureThroughput:gc.count                                0                0  avgt    5     117.000             counts
TracesBenchmark.measureThroughput:gc.time                                 0                0  avgt    5     153.000                 ms
TracesBenchmark.measureThroughput                                         0               10  avgt    5     398.660 ±   16.663   ns/op
TracesBenchmark.measureThroughput:gc.alloc.rate                           0               10  avgt    5    7081.158 ±  294.949  MB/sec
TracesBenchmark.measureThroughput:gc.alloc.rate.norm                      0               10  avgt    5    2960.001 ±    0.001    B/op
TracesBenchmark.measureThroughput:gc.count                                0               10  avgt    5      96.000             counts
TracesBenchmark.measureThroughput:gc.time                                 0               10  avgt    5     124.000                 ms
TracesBenchmark.measureThroughput                                         0              100  avgt    5    2999.075 ±   89.655   ns/op
TracesBenchmark.measureThroughput:gc.alloc.rate                           0              100  avgt    5    7364.662 ±  220.010  MB/sec
TracesBenchmark.measureThroughput:gc.alloc.rate.norm                      0              100  avgt    5   23160.019 ±    0.001    B/op
TracesBenchmark.measureThroughput:gc.count                                0              100  avgt    5     100.000             counts
TracesBenchmark.measureThroughput:gc.time                                 0              100  avgt    5     131.000                 ms
TracesBenchmark.measureThroughput                                         0             1000  avgt    5   27916.269 ± 1601.038   ns/op
TracesBenchmark.measureThroughput:gc.alloc.rate                           0             1000  avgt    5    7751.034 ±  442.501  MB/sec
TracesBenchmark.measureThroughput:gc.alloc.rate.norm                      0             1000  avgt    5  226865.228 ±    0.070    B/op
TracesBenchmark.measureThroughput:gc.count                                0             1000  avgt    5     105.000             counts
TracesBenchmark.measureThroughput:gc.time                                 0             1000  avgt    5     133.000                 ms
TracesBenchmark.measureThroughput                                        10                0  avgt    5     402.076 ±   11.693   ns/op
TracesBenchmark.measureThroughput:gc.alloc.rate                          10                0  avgt    5    7476.144 ±  216.947  MB/sec
TracesBenchmark.measureThroughput:gc.alloc.rate.norm                     10                0  avgt    5    3152.001 ±    0.001    B/op
TracesBenchmark.measureThroughput:gc.count                               10                0  avgt    5     102.000             counts
TracesBenchmark.measureThroughput:gc.time                                10                0  avgt    5     131.000                 ms
TracesBenchmark.measureThroughput                                        10               10  avgt    5     626.244 ±   31.574   ns/op
TracesBenchmark.measureThroughput:gc.alloc.rate                          10               10  avgt    5    8053.582 ±  403.074  MB/sec
TracesBenchmark.measureThroughput:gc.alloc.rate.norm                     10               10  avgt    5    5288.002 ±    0.001    B/op
TracesBenchmark.measureThroughput:gc.count                               10               10  avgt    5     109.000             counts
TracesBenchmark.measureThroughput:gc.time                                10               10  avgt    5     140.000                 ms
TracesBenchmark.measureThroughput                                        10              100  avgt    5    3331.927 ±  253.038   ns/op
TracesBenchmark.measureThroughput:gc.alloc.rate                          10              100  avgt    5    7608.535 ±  564.012  MB/sec
TracesBenchmark.measureThroughput:gc.alloc.rate.norm                     10              100  avgt    5   26576.019 ±    0.001    B/op
TracesBenchmark.measureThroughput:gc.count                               10              100  avgt    5     104.000             counts
TracesBenchmark.measureThroughput:gc.time                                10              100  avgt    5     137.000                 ms
TracesBenchmark.measureThroughput                                        10             1000  avgt    5   27321.302 ± 1180.729   ns/op
TracesBenchmark.measureThroughput:gc.alloc.rate                          10             1000  avgt    5    7992.724 ±  345.188  MB/sec
TracesBenchmark.measureThroughput:gc.alloc.rate.norm                     10             1000  avgt    5  228969.037 ±    0.044    B/op
TracesBenchmark.measureThroughput:gc.count                               10             1000  avgt    5     109.000             counts
TracesBenchmark.measureThroughput:gc.time                                10             1000  avgt    5     140.000                 ms
TracesBenchmark.measureThroughput                                       100                0  avgt    5    3119.034 ±   70.765   ns/op
TracesBenchmark.measureThroughput:gc.alloc.rate                         100                0  avgt    5    7555.751 ±  170.859  MB/sec
TracesBenchmark.measureThroughput:gc.alloc.rate.norm                    100                0  avgt    5   24712.018 ±    0.001    B/op
TracesBenchmark.measureThroughput:gc.count                              100                0  avgt    5     102.000             counts
TracesBenchmark.measureThroughput:gc.time                               100                0  avgt    5     131.000                 ms
TracesBenchmark.measureThroughput                                       100               10  avgt    5    3384.598 ±  134.383   ns/op
TracesBenchmark.measureThroughput:gc.alloc.rate                         100               10  avgt    5    7894.374 ±  310.075  MB/sec
TracesBenchmark.measureThroughput:gc.alloc.rate.norm                    100               10  avgt    5   28016.020 ±    0.001    B/op
TracesBenchmark.measureThroughput:gc.count                              100               10  avgt    5     107.000             counts
TracesBenchmark.measureThroughput:gc.time                               100               10  avgt    5     141.000                 ms
TracesBenchmark.measureThroughput                                       100              100  avgt    5    5920.906 ±  192.896   ns/op
TracesBenchmark.measureThroughput:gc.alloc.rate                         100              100  avgt    5    7678.427 ±  250.022  MB/sec
TracesBenchmark.measureThroughput:gc.alloc.rate.norm                    100              100  avgt    5   47672.060 ±    0.002    B/op
TracesBenchmark.measureThroughput:gc.count                              100              100  avgt    5     104.000             counts
TracesBenchmark.measureThroughput:gc.time                               100              100  avgt    5     141.000                 ms
TracesBenchmark.measureThroughput                                       100             1000  avgt    5   28805.156 ± 1329.846   ns/op
TracesBenchmark.measureThroughput:gc.alloc.rate                         100             1000  avgt    5    8211.734 ±  378.784  MB/sec
TracesBenchmark.measureThroughput:gc.alloc.rate.norm                    100             1000  avgt    5  248017.124 ±    0.052    B/op
TracesBenchmark.measureThroughput:gc.count                              100             1000  avgt    5     111.000             counts
TracesBenchmark.measureThroughput:gc.time                               100             1000  avgt    5     147.000                 ms
TracesBenchmark.measureThroughput                                      1000                0  avgt    5   27765.904 ± 1098.007   ns/op
TracesBenchmark.measureThroughput:gc.alloc.rate                        1000                0  avgt    5    8341.364 ±  333.465  MB/sec
TracesBenchmark.measureThroughput:gc.alloc.rate.norm                   1000                0  avgt    5  242849.078 ±    0.042    B/op
TracesBenchmark.measureThroughput:gc.count                             1000                0  avgt    5     113.000             counts
TracesBenchmark.measureThroughput:gc.time                              1000                0  avgt    5     145.000                 ms
TracesBenchmark.measureThroughput                                      1000               10  avgt    5   28772.385 ± 1965.006   ns/op
TracesBenchmark.measureThroughput:gc.alloc.rate                        1000               10  avgt    5    8115.825 ±  550.707  MB/sec
TracesBenchmark.measureThroughput:gc.alloc.rate.norm                   1000               10  avgt    5  244809.121 ±    0.076    B/op
TracesBenchmark.measureThroughput:gc.count                             1000               10  avgt    5     111.000             counts
TracesBenchmark.measureThroughput:gc.time                              1000               10  avgt    5     145.000                 ms
TracesBenchmark.measureThroughput                                      1000              100  avgt    5   31094.765 ± 1220.979   ns/op
TracesBenchmark.measureThroughput:gc.alloc.rate                        1000              100  avgt    5    8049.444 ±  315.635  MB/sec
TracesBenchmark.measureThroughput:gc.alloc.rate.norm                   1000              100  avgt    5  262450.032 ±    0.080    B/op
TracesBenchmark.measureThroughput:gc.count                             1000              100  avgt    5     109.000             counts
TracesBenchmark.measureThroughput:gc.time                              1000              100  avgt    5     142.000                 ms
TracesBenchmark.measureThroughput                                      1000             1000  avgt    5   55008.078 ± 3723.012   ns/op
TracesBenchmark.measureThroughput:gc.alloc.rate                        1000             1000  avgt    5    8252.861 ±  557.526  MB/sec
TracesBenchmark.measureThroughput:gc.alloc.rate.norm                   1000             1000  avgt    5  475940.138 ±    0.280    B/op
TracesBenchmark.measureThroughput:gc.count                             1000             1000  avgt    5     112.000             counts
TracesBenchmark.measureThroughput:gc.time                              1000             1000  avgt    5     148.000                 ms

After

Benchmark                                             (reactorLeadingLines)  (trailingLines)  Mode  Cnt      Score      Error   Units
TracesBenchmark.measureThroughput                                         0                0  avgt    5     28.283 ±    2.319   ns/op
TracesBenchmark.measureThroughput:gc.alloc.rate                           0                0  avgt    5  12142.383 ±  996.110  MB/sec
TracesBenchmark.measureThroughput:gc.alloc.rate.norm                      0                0  avgt    5    360.000 ±    0.001    B/op
TracesBenchmark.measureThroughput:gc.count                                0                0  avgt    5    165.000             counts
TracesBenchmark.measureThroughput:gc.time                                 0                0  avgt    5    218.000                 ms
TracesBenchmark.measureThroughput                                         0               10  avgt    5     48.606 ±    4.176   ns/op
TracesBenchmark.measureThroughput:gc.alloc.rate                           0               10  avgt    5  11305.217 ±  943.994  MB/sec
TracesBenchmark.measureThroughput:gc.alloc.rate.norm                      0               10  avgt    5    576.000 ±    0.001    B/op
TracesBenchmark.measureThroughput:gc.count                                0               10  avgt    5    154.000             counts
TracesBenchmark.measureThroughput:gc.time                                 0               10  avgt    5    211.000                 ms
TracesBenchmark.measureThroughput                                         0              100  avgt    5     47.248 ±    2.156   ns/op
TracesBenchmark.measureThroughput:gc.alloc.rate                           0              100  avgt    5  11626.888 ±  524.566  MB/sec
TracesBenchmark.measureThroughput:gc.alloc.rate.norm                      0              100  avgt    5    576.000 ±    0.001    B/op
TracesBenchmark.measureThroughput:gc.count                                0              100  avgt    5    158.000             counts
TracesBenchmark.measureThroughput:gc.time                                 0              100  avgt    5    204.000                 ms
TracesBenchmark.measureThroughput                                         0             1000  avgt    5     47.641 ±    4.503   ns/op
TracesBenchmark.measureThroughput:gc.alloc.rate                           0             1000  avgt    5  11535.133 ± 1068.897  MB/sec
TracesBenchmark.measureThroughput:gc.alloc.rate.norm                      0             1000  avgt    5    576.002 ±    0.001    B/op
TracesBenchmark.measureThroughput:gc.count                                0             1000  avgt    5    157.000             counts
TracesBenchmark.measureThroughput:gc.time                                 0             1000  avgt    5    199.000                 ms
TracesBenchmark.measureThroughput                                        10                0  avgt    5     47.941 ±    1.134   ns/op
TracesBenchmark.measureThroughput:gc.alloc.rate                          10                0  avgt    5  10821.406 ±  255.277  MB/sec
TracesBenchmark.measureThroughput:gc.alloc.rate.norm                     10                0  avgt    5    544.000 ±    0.001    B/op
TracesBenchmark.measureThroughput:gc.count                               10                0  avgt    5    147.000             counts
TracesBenchmark.measureThroughput:gc.time                                10                0  avgt    5    182.000                 ms
TracesBenchmark.measureThroughput                                        10               10  avgt    5     48.681 ±    3.065   ns/op
TracesBenchmark.measureThroughput:gc.alloc.rate                          10               10  avgt    5  10658.850 ±  672.502  MB/sec
TracesBenchmark.measureThroughput:gc.alloc.rate.norm                     10               10  avgt    5    544.000 ±    0.001    B/op
TracesBenchmark.measureThroughput:gc.count                               10               10  avgt    5    145.000             counts
TracesBenchmark.measureThroughput:gc.time                                10               10  avgt    5    189.000                 ms
TracesBenchmark.measureThroughput                                        10              100  avgt    5     48.309 ±    3.576   ns/op
TracesBenchmark.measureThroughput:gc.alloc.rate                          10              100  avgt    5  10741.652 ±  778.802  MB/sec
TracesBenchmark.measureThroughput:gc.alloc.rate.norm                     10              100  avgt    5    544.000 ±    0.001    B/op
TracesBenchmark.measureThroughput:gc.count                               10              100  avgt    5    147.000             counts
TracesBenchmark.measureThroughput:gc.time                                10              100  avgt    5    202.000                 ms
TracesBenchmark.measureThroughput                                        10             1000  avgt    5     46.582 ±    2.587   ns/op
TracesBenchmark.measureThroughput:gc.alloc.rate                          10             1000  avgt    5  11138.595 ±  611.256  MB/sec
TracesBenchmark.measureThroughput:gc.alloc.rate.norm                     10             1000  avgt    5    544.002 ±    0.001    B/op
TracesBenchmark.measureThroughput:gc.count                               10             1000  avgt    5    152.000             counts
TracesBenchmark.measureThroughput:gc.time                                10             1000  avgt    5    202.000                 ms
TracesBenchmark.measureThroughput                                       100                0  avgt    5     46.127 ±    2.642   ns/op
TracesBenchmark.measureThroughput:gc.alloc.rate                         100                0  avgt    5  11248.700 ±  638.495  MB/sec
TracesBenchmark.measureThroughput:gc.alloc.rate.norm                    100                0  avgt    5    544.000 ±    0.001    B/op
TracesBenchmark.measureThroughput:gc.count                              100                0  avgt    5    153.000             counts
TracesBenchmark.measureThroughput:gc.time                               100                0  avgt    5    201.000                 ms
TracesBenchmark.measureThroughput                                       100               10  avgt    5     46.186 ±    2.518   ns/op
TracesBenchmark.measureThroughput:gc.alloc.rate                         100               10  avgt    5  11234.035 ±  610.123  MB/sec
TracesBenchmark.measureThroughput:gc.alloc.rate.norm                    100               10  avgt    5    544.000 ±    0.001    B/op
TracesBenchmark.measureThroughput:gc.count                              100               10  avgt    5    153.000             counts
TracesBenchmark.measureThroughput:gc.time                               100               10  avgt    5    199.000                 ms
TracesBenchmark.measureThroughput                                       100              100  avgt    5     48.736 ±    1.891   ns/op
TracesBenchmark.measureThroughput:gc.alloc.rate                         100              100  avgt    5  10645.285 ±  416.594  MB/sec
TracesBenchmark.measureThroughput:gc.alloc.rate.norm                    100              100  avgt    5    544.000 ±    0.001    B/op
TracesBenchmark.measureThroughput:gc.count                              100              100  avgt    5    145.000             counts
TracesBenchmark.measureThroughput:gc.time                               100              100  avgt    5    185.000                 ms
TracesBenchmark.measureThroughput                                       100             1000  avgt    5     46.312 ±    2.365   ns/op
TracesBenchmark.measureThroughput:gc.alloc.rate                         100             1000  avgt    5  11203.368 ±  573.569  MB/sec
TracesBenchmark.measureThroughput:gc.alloc.rate.norm                    100             1000  avgt    5    544.002 ±    0.001    B/op
TracesBenchmark.measureThroughput:gc.count                              100             1000  avgt    5    152.000             counts
TracesBenchmark.measureThroughput:gc.time                               100             1000  avgt    5    198.000                 ms
TracesBenchmark.measureThroughput                                      1000                0  avgt    5     46.600 ±    2.220   ns/op
TracesBenchmark.measureThroughput:gc.alloc.rate                        1000                0  avgt    5  11133.818 ±  527.892  MB/sec
TracesBenchmark.measureThroughput:gc.alloc.rate.norm                   1000                0  avgt    5    544.002 ±    0.001    B/op
TracesBenchmark.measureThroughput:gc.count                             1000                0  avgt    5    152.000             counts
TracesBenchmark.measureThroughput:gc.time                              1000                0  avgt    5    201.000                 ms
TracesBenchmark.measureThroughput                                      1000               10  avgt    5     45.805 ±    1.806   ns/op
TracesBenchmark.measureThroughput:gc.alloc.rate                        1000               10  avgt    5  11326.617 ±  446.037  MB/sec
TracesBenchmark.measureThroughput:gc.alloc.rate.norm                   1000               10  avgt    5    544.002 ±    0.001    B/op
TracesBenchmark.measureThroughput:gc.count                             1000               10  avgt    5    154.000             counts
TracesBenchmark.measureThroughput:gc.time                              1000               10  avgt    5    206.000                 ms
TracesBenchmark.measureThroughput                                      1000              100  avgt    5     48.450 ±    2.778   ns/op
TracesBenchmark.measureThroughput:gc.alloc.rate                        1000              100  avgt    5  10709.195 ±  616.791  MB/sec
TracesBenchmark.measureThroughput:gc.alloc.rate.norm                   1000              100  avgt    5    544.003 ±    0.001    B/op
TracesBenchmark.measureThroughput:gc.count                             1000              100  avgt    5    145.000             counts
TracesBenchmark.measureThroughput:gc.time                              1000              100  avgt    5    191.000                 ms
TracesBenchmark.measureThroughput                                      1000             1000  avgt    5     51.759 ±    2.299   ns/op
TracesBenchmark.measureThroughput:gc.alloc.rate                        1000             1000  avgt    5  10023.884 ±  443.024  MB/sec
TracesBenchmark.measureThroughput:gc.alloc.rate.norm                   1000             1000  avgt    5    544.004 ±    0.001    B/op
TracesBenchmark.measureThroughput:gc.count                             1000             1000  avgt    5    136.000             counts
TracesBenchmark.measureThroughput:gc.time                              1000             1000  avgt    5    174.000                 ms

Note how performance of the new implementation is almost independent of the input, while the old implementation scales fairly poorly.

Comment on lines 59 to 60
|| stackTraceRow.startsWith("reactor.core.publisher.Mono.onAssembly")
|| stackTraceRow.equals("reactor.core.publisher.Mono.onAssembly")
|| stackTraceRow.equals("reactor.core.publisher.Flux.onAssembly")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line removal is unrelated to the rest of the PR. The dropped line is redundant, as it is preceded by a predicate that matches in strictly more contexts. Question: should the Flux.onAssembly line (and some others below) be updated to also use startsWith?

@@ -103,7 +102,7 @@ static String extractOperatorAssemblyInformation(String source) {
}

static boolean isUserCode(String line) {
return !line.startsWith("reactor.core.publisher") || line.contains("Test");
return !line.startsWith(PUBLISHER_PACKAGE_PREFIX) || line.contains("Test");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The constant has a trailing dot. Afaik that doesn't matter, as each line should include a class name, too.

Comment on lines 130 to +131
static String[] extractOperatorAssemblyInformationParts(String source) {
String[] uncleanTraces = source.split("\n");
final List<String> traces = Stream.of(uncleanTraces)
.map(String::trim)
.filter(s -> !s.isEmpty())
.collect(Collectors.toList());
Iterator<String> traces = trimmedNonemptyLines(source);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The largest improvements in this PR come from these changes.

Key contributors to performance of the old implementation:

  • String#split accepts a regular expression. We're no longer performing the comparatively expensive operation of compiling regular expressions.
  • String#split allocates an array and substrings proportional to the provided input, covering a potentially large part of the input that does not at all influence the result of this method.
  • The Stream operation likewise processes irrelevant lines, and allocates a potentially large list.

The new implementation instead lazily iterates over the input, processing only relevant lines, and tracking only the two most-recently-seen lines.

Comment on lines +140 to 143
if (isUserCode(currentLine)) {
// No line is a Reactor API line.
return new String[]{currentLine};
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some logic was moved around, but existing comments were relocated with it. This should aid review. It's nice that there was already good test coverage.

else if (i == traces.size()) {
//we skipped ALL lines, meaning they're all reactor API lines. We'll fully display the last one
apiLine = "";
userCodeLine = traces.get(i-1).replaceFirst("reactor.core.publisher.", "");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here and below: replaceFirst also accepts a regex. The new code instead performs a cheaper startsWith check.

Comment on lines +172 to +178
/**
* Returns an iterator over all trimmed non-empty lines in the given source string.
*
* @implNote This implementation attempts to minimize allocations.
*/
private static Iterator<String> trimmedNonemptyLines(String source) {
return new Iterator<String>() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This manually-crafted iterator feels a bit like "coding like it's 1999", but I didn't find a less-verbose alternative that doesn't impact over-all code readability. (Had Guava been on the classpath, then I'd have opted to extend AbstractIterator.) Open to suggestions!

@chemicL
Copy link
Member

chemicL commented Oct 4, 2024

Thanks for the PR. I will have a closer look. However, bear in mind this change won't make it into 3.7.0-RC1 but would target directly 3.7.0. I think that's ok as it's not an API change. Together with #3900 it is a behaviour change, but unless there were side effects in code they would also not be observed aside from the performance gains. I think it's worth noting some sort of warning for the stepName lazy evaluation when we do release notes.

@chemicL chemicL added the area/performance This belongs to the performance theme label Oct 4, 2024
@Stephan202 Stephan202 changed the title Optimize Traces#extractOperatorAssemblyInformationParts Optimize Scannable#name() and related logic Oct 4, 2024
@Stephan202
Copy link
Contributor Author

Thanks for the feedback @chemicL! I cherry-picked the commit from #3900 into this branch and rewrote the PR title and summary.

@Stephan202
Copy link
Contributor Author

@chemicL I just realized that the impact of this PR is likely over-stated, as in practice the input stacktrace appears to always be generated by a CallSiteSupplierFactory implementation, both of which output at most two lines. (In my defense: the unit tests of the modified code also seem to indicate that more lines may be expected.)

So perhaps we can optimize the code further by skipping the intermediate single-string representation. I might have time for a closer look into that this weekend.

@Stephan202
Copy link
Contributor Author

Stephan202 commented Oct 7, 2024

as in practice the input stacktrace appears to always be generated by a CallSiteSupplierFactory implementation, both of which output at most two lines.

There's one other code path: CallSiteInfoAddingMethodVisitor passes a manually constructed two-line stack trace to Hooks#addCallSiteInfo.

So perhaps we can optimize the code further by skipping the intermediate single-string representation. I might have time for a closer look into that this weekend.

I now have a POC for this locally. Against a (ReactorDebugAgent-using) benchmark of local code it isn't yet faster that the current PR (surprisingly; requires more investigation), but for the cleanest and likely fastest code, it'd be better if we can have CallSiteInfoAddingMethodVisitor pass the two constructed stack frames separately. Doing this requires adding (or modifying) a public Hooks method, which causes :reactor-core:japicmp to report an API compatibility failure. Is that acceptable, and if so, how can I make that change without failing the build?

@chemicL
Copy link
Member

chemicL commented Nov 21, 2024

Hey, @Stephan202. I've been a bit busy lately but would like to revisit your PRs. Can you give an update on the above considerations? How impactful is this change or an alternative change from your PoC? I assume we would be able to alter the Hooks methods that are marked as deprecated with a huge warning they're for internal use. You only need to find japicmp configuration in build.gradle and add an entry to the methodExcludes = [] array.

@Stephan202
Copy link
Contributor Author

Thanks for the ping on this PR @chemicL; I meant to report back here. :shame:

I did try the alternative approach mentioned (including the Hooks customization), but testing it against an internal benchmark (one that contains some Picnic-specific code, but mostly causes Reactor logic to be executed, with Reactor Debug Agent enabled), I consistently found the alternative approach be perform slightly worse. I lost quite some time over that, as it really defied (and still defies) my intuitions.

I can look into polishing that code and pushing it to an alternative branch for a second opinion; will try to find some time. That said, based on the above, my tentative suggestion would be to proceed with this PR as-is. (Except perhaps for trimming the JMH benchmark inputs, because as mentioned, in practice the code will generally parse at most two stack frames, rather than 1000.) If the alternative approach can be made more performant after all, that can be tackled in a follow-up PR.

@Stephan202 Stephan202 force-pushed the sschroevers/traces-performance-improvement branch from f7ebe1c to f470638 Compare November 21, 2024 17:52
@Stephan202
Copy link
Contributor Author

Rebased branch; applied 100% cleanly.

@Stephan202
Copy link
Contributor Author

The experiments I tried are on this messy branch. If desired I can clean it up, though it's a bit TBD when I'll have time to dive back into this topic.

@chemicL
Copy link
Member

chemicL commented Nov 25, 2024

@Stephan202 I am just trying to understand whether the change is actually needed. As I understand you discovered that this will only be triggered for processing two stack frames (can you point to where it's limited to only 2?), therefore the benchmark is not relevant to the expected usage of this API, yes? And also, there is the risk of touching and changing a stable code base for not much benefit and potential regressions. Is there a possibility that these optimizations will have an actual effect on real world applications?

@Stephan202
Copy link
Contributor Author

@chemicL fair question! Based on earlier testing the answer is "yes, this is an improvement", but let me get back to you in the coming days with some more hard data. (Exact timing TBD.)

This logic may be executed many times, e.g. if a hot code path uses
`{Mono,Flux}#log` or Micrometer instrumentation. The added benchmark
shows that for large stack traces the new implementation is several
orders of magnitude more efficient in terms of compute and memory
resource utilization.

While there, improve two existing benchmarks by utilizing the black hole
to which benchmark method return values are implicitly sent.
(cherry picked from commit 009ec89)
@Stephan202 Stephan202 force-pushed the sschroevers/traces-performance-improvement branch from f470638 to c3e519a Compare November 27, 2024 21:49
@Stephan202
Copy link
Contributor Author

@chemicL alright, I rebased the branch on main and added a small commit to make the benchmark more realistic. In the remainder of this post I'm comparing this branch to the current HEAD of main (7cc701c), such that improvements of #3902 apply in each case / don't bias the result.

For the TracesBenchmark in this PR, I locally get the following results:

Benchmark before the changes
Benchmark                                             (reactorLeadingLines)  (trailingLines)  Mode  Cnt     Score     Error   Units
TracesBenchmark.measureThroughput                                         0                0  avgt    5    93.857 ±   2.071   ns/op
TracesBenchmark.measureThroughput:gc.alloc.rate                           0                0  avgt    5  8859.998 ± 195.529  MB/sec
TracesBenchmark.measureThroughput:gc.alloc.rate.norm                      0                0  avgt    5   872.000 ±   0.001    B/op
TracesBenchmark.measureThroughput:gc.count                                0                0  avgt    5   121.000            counts
TracesBenchmark.measureThroughput:gc.time                                 0                0  avgt    5   155.000                ms
TracesBenchmark.measureThroughput                                         0                1  avgt    5   122.244 ±   3.167   ns/op
TracesBenchmark.measureThroughput:gc.alloc.rate                           0                1  avgt    5  8050.975 ± 209.703  MB/sec
TracesBenchmark.measureThroughput:gc.alloc.rate.norm                      0                1  avgt    5  1032.000 ±   0.001    B/op
TracesBenchmark.measureThroughput:gc.count                                0                1  avgt    5   109.000            counts
TracesBenchmark.measureThroughput:gc.time                                 0                1  avgt    5   150.000                ms
TracesBenchmark.measureThroughput                                         0                2  avgt    5   143.698 ±   1.597   ns/op
TracesBenchmark.measureThroughput:gc.alloc.rate                           0                2  avgt    5  8175.946 ±  91.208  MB/sec
TracesBenchmark.measureThroughput:gc.alloc.rate.norm                      0                2  avgt    5  1232.000 ±   0.001    B/op
TracesBenchmark.measureThroughput:gc.count                                0                2  avgt    5   111.000            counts
TracesBenchmark.measureThroughput:gc.time                                 0                2  avgt    5   169.000                ms
TracesBenchmark.measureThroughput                                         1                0  avgt    5   121.232 ±   2.857   ns/op
TracesBenchmark.measureThroughput:gc.alloc.rate                           1                0  avgt    5  8117.946 ± 190.510  MB/sec
TracesBenchmark.measureThroughput:gc.alloc.rate.norm                      1                0  avgt    5  1032.000 ±   0.001    B/op
TracesBenchmark.measureThroughput:gc.count                                1                0  avgt    5   110.000            counts
TracesBenchmark.measureThroughput:gc.time                                 1                0  avgt    5   162.000                ms
TracesBenchmark.measureThroughput                                         1                1  avgt    5   155.369 ±   3.385   ns/op
TracesBenchmark.measureThroughput:gc.alloc.rate                           1                1  avgt    5  7561.889 ± 165.250  MB/sec
TracesBenchmark.measureThroughput:gc.alloc.rate.norm                      1                1  avgt    5  1232.000 ±   0.001    B/op
TracesBenchmark.measureThroughput:gc.count                                1                1  avgt    5   103.000            counts
TracesBenchmark.measureThroughput:gc.time                                 1                1  avgt    5   160.000                ms
TracesBenchmark.measureThroughput                                         1                2  avgt    5   166.787 ±   3.910   ns/op
TracesBenchmark.measureThroughput:gc.alloc.rate                           1                2  avgt    5  8142.041 ± 191.441  MB/sec
TracesBenchmark.measureThroughput:gc.alloc.rate.norm                      1                2  avgt    5  1424.000 ±   0.001    B/op
TracesBenchmark.measureThroughput:gc.count                                1                2  avgt    5   111.000            counts
TracesBenchmark.measureThroughput:gc.time                                 1                2  avgt    5   166.000                ms
TracesBenchmark.measureThroughput                                         2                0  avgt    5   144.471 ±   2.355   ns/op
TracesBenchmark.measureThroughput:gc.alloc.rate                           2                0  avgt    5  8237.969 ± 134.557  MB/sec
TracesBenchmark.measureThroughput:gc.alloc.rate.norm                      2                0  avgt    5  1248.000 ±   0.001    B/op
TracesBenchmark.measureThroughput:gc.count                                2                0  avgt    5   112.000            counts
TracesBenchmark.measureThroughput:gc.time                                 2                0  avgt    5   164.000                ms
TracesBenchmark.measureThroughput                                         2                1  avgt    5   166.155 ±   1.974   ns/op
TracesBenchmark.measureThroughput:gc.alloc.rate                           2                1  avgt    5  8264.689 ±  98.186  MB/sec
TracesBenchmark.measureThroughput:gc.alloc.rate.norm                      2                1  avgt    5  1440.000 ±   0.001    B/op
TracesBenchmark.measureThroughput:gc.count                                2                1  avgt    5   112.000            counts
TracesBenchmark.measureThroughput:gc.time                                 2                1  avgt    5   147.000                ms
TracesBenchmark.measureThroughput                                         2                2  avgt    5   197.339 ±   3.674   ns/op
TracesBenchmark.measureThroughput:gc.alloc.rate                           2                2  avgt    5  7925.318 ± 147.817  MB/sec
TracesBenchmark.measureThroughput:gc.alloc.rate.norm                      2                2  avgt    5  1640.000 ±   0.001    B/op
TracesBenchmark.measureThroughput:gc.count                                2                2  avgt    5   108.000            counts
TracesBenchmark.measureThroughput:gc.time                                 2                2  avgt    5   146.000                ms
Benchmark after the changes
Benchmark                                             (reactorLeadingLines)  (trailingLines)  Mode  Cnt      Score     Error   Units
TracesBenchmark.measureThroughput                                         0                0  avgt    5     27.853 ±   1.429   ns/op
TracesBenchmark.measureThroughput:gc.alloc.rate                           0                0  avgt    5  12327.398 ± 629.891  MB/sec
TracesBenchmark.measureThroughput:gc.alloc.rate.norm                      0                0  avgt    5    360.000 ±   0.001    B/op
TracesBenchmark.measureThroughput:gc.count                                0                0  avgt    5    168.000            counts
TracesBenchmark.measureThroughput:gc.time                                 0                0  avgt    5    251.000                ms
TracesBenchmark.measureThroughput                                         0                1  avgt    5     47.004 ±   3.899   ns/op
TracesBenchmark.measureThroughput:gc.alloc.rate                           0                1  avgt    5  11690.405 ± 971.380  MB/sec
TracesBenchmark.measureThroughput:gc.alloc.rate.norm                      0                1  avgt    5    576.000 ±   0.001    B/op
TracesBenchmark.measureThroughput:gc.count                                0                1  avgt    5    159.000            counts
TracesBenchmark.measureThroughput:gc.time                                 0                1  avgt    5    216.000                ms
TracesBenchmark.measureThroughput                                         0                2  avgt    5     46.116 ±   3.718   ns/op
TracesBenchmark.measureThroughput:gc.alloc.rate                           0                2  avgt    5  11915.112 ± 943.275  MB/sec
TracesBenchmark.measureThroughput:gc.alloc.rate.norm                      0                2  avgt    5    576.000 ±   0.001    B/op
TracesBenchmark.measureThroughput:gc.count                                0                2  avgt    5    162.000            counts
TracesBenchmark.measureThroughput:gc.time                                 0                2  avgt    5    209.000                ms
TracesBenchmark.measureThroughput                                         1                0  avgt    5     56.545 ±   1.064   ns/op
TracesBenchmark.measureThroughput:gc.alloc.rate                           1                0  avgt    5   9714.213 ± 182.537  MB/sec
TracesBenchmark.measureThroughput:gc.alloc.rate.norm                      1                0  avgt    5    576.000 ±   0.001    B/op
TracesBenchmark.measureThroughput:gc.count                                1                0  avgt    5    133.000            counts
TracesBenchmark.measureThroughput:gc.time                                 1                0  avgt    5    196.000                ms
TracesBenchmark.measureThroughput                                         1                1  avgt    5     45.934 ±   2.003   ns/op
TracesBenchmark.measureThroughput:gc.alloc.rate                           1                1  avgt    5  11959.417 ± 520.615  MB/sec
TracesBenchmark.measureThroughput:gc.alloc.rate.norm                      1                1  avgt    5    576.000 ±   0.001    B/op
TracesBenchmark.measureThroughput:gc.count                                1                1  avgt    5    163.000            counts
TracesBenchmark.measureThroughput:gc.time                                 1                1  avgt    5    221.000                ms
TracesBenchmark.measureThroughput                                         1                2  avgt    5     49.027 ±   3.465   ns/op
TracesBenchmark.measureThroughput:gc.alloc.rate                           1                2  avgt    5  11206.856 ± 785.266  MB/sec
TracesBenchmark.measureThroughput:gc.alloc.rate.norm                      1                2  avgt    5    576.000 ±   0.001    B/op
TracesBenchmark.measureThroughput:gc.count                                1                2  avgt    5    153.000            counts
TracesBenchmark.measureThroughput:gc.time                                 1                2  avgt    5    214.000                ms
TracesBenchmark.measureThroughput                                         2                0  avgt    5     47.588 ±   2.095   ns/op
TracesBenchmark.measureThroughput:gc.alloc.rate                           2                0  avgt    5  10902.505 ± 477.988  MB/sec
TracesBenchmark.measureThroughput:gc.alloc.rate.norm                      2                0  avgt    5    544.000 ±   0.001    B/op
TracesBenchmark.measureThroughput:gc.count                                2                0  avgt    5    148.000            counts
TracesBenchmark.measureThroughput:gc.time                                 2                0  avgt    5    200.000                ms
TracesBenchmark.measureThroughput                                         2                1  avgt    5     47.097 ±   2.317   ns/op
TracesBenchmark.measureThroughput:gc.alloc.rate                           2                1  avgt    5  11016.262 ± 539.726  MB/sec
TracesBenchmark.measureThroughput:gc.alloc.rate.norm                      2                1  avgt    5    544.000 ±   0.001    B/op
TracesBenchmark.measureThroughput:gc.count                                2                1  avgt    5    149.000            counts
TracesBenchmark.measureThroughput:gc.time                                 2                1  avgt    5    182.000                ms
TracesBenchmark.measureThroughput                                         2                2  avgt    5     45.345 ±   2.225   ns/op
TracesBenchmark.measureThroughput:gc.alloc.rate                           2                2  avgt    5  11441.877 ± 557.287  MB/sec
TracesBenchmark.measureThroughput:gc.alloc.rate.norm                      2                2  avgt    5    544.000 ±   0.001    B/op
TracesBenchmark.measureThroughput:gc.count                                2                2  avgt    5    156.000            counts
TracesBenchmark.measureThroughput:gc.time                                 2                2  avgt    5    214.000                ms

In short, this means a >3x speedup and more than halving of allocated memory for the most common 2-line case.

I also had another look at a "more representative" Picnic-internal benchmark, where some longer reactive chains are subscribed to. I can't easily share this code, but with the Reactor Debug Agent enabled (as we do in production) there's an 18-22% speedup and ~40% reduction in allocated memory:

Internal benchmark before the changes
Benchmark                                             (orderDepth)  (transformationStepCount)  Mode  Cnt       Score      Error   Units
TransformationBenchmark.transform                                5                         20  avgt    5     150.550 ±    2.812   us/op
TransformationBenchmark.transform:gc.alloc.rate                  5                         20  avgt    5    1830.314 ±   26.565  MB/sec
TransformationBenchmark.transform:gc.alloc.rate.norm             5                         20  avgt    5  288942.054 ± 1734.845    B/op
TransformationBenchmark.transform:gc.count                       5                         20  avgt    5     150.000             counts
TransformationBenchmark.transform:gc.time                        5                         20  avgt    5     184.000                 ms
Internal benchmark after the changes
Benchmark                                             (orderDepth)  (transformationStepCount)  Mode  Cnt       Score    Error   Units
TransformationBenchmark.transform                                5                         20  avgt    5     120.198 ±  0.804   us/op
TransformationBenchmark.transform:gc.alloc.rate                  5                         20  avgt    5    1358.328 ±  9.011  MB/sec
TransformationBenchmark.transform:gc.alloc.rate.norm             5                         20  avgt    5  171202.277 ± 18.866    B/op
TransformationBenchmark.transform:gc.count                       5                         20  avgt    5     111.000           counts
TransformationBenchmark.transform:gc.time                        5                         20  avgt    5     141.000               ms

Our main store application is a modular monolith that makes very heavy use of Reactor, with some key request flows creating very long reactive chains. I'm reasonably confident we'll see a noticeable latency improvement with this change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/performance This belongs to the performance theme
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants