Controlplane part of forwarder-vpp leaks #1129

Open
10 tasks done
NikitaSkrynnik opened this issue Jun 25, 2024 · 6 comments
Labels: bug (Something isn't working), performance (The problem related to system effectivity), stability (The problem is related to system stability)

Comments

@NikitaSkrynnik
Contributor

NikitaSkrynnik commented Jun 25, 2024

Description

forwarder-vpp has two interface leaks:

  1. The timeout chain element doesn't call Close for expired connections, so their VPP interfaces are never deleted.
  2. Even when Close is called in forwarder-vpp, it doesn't delete VXLAN interfaces.

Tasks

  • Investigate why timeout doesn't call Close - 8h
    • Make timeout chain element close connections faster - 1h
    • Add logs to timeout chain element - 1h
    • Run tests with the modified timeout - 3h
    • Check collected logs - 3h
  • Fix timeout chain element - 24h
  • Investigate why forwarder-vpp doesn't close vxlan interfaces - 6h
    • Run scaling tests - 3h
    • Check collected logs - 3h
  • Fix vxlan problem - 16h

Total: 54h

@NikitaSkrynnik
Contributor Author

All changes that fix tap interface leaks are in these PRs:

  1. Leak fixes sdk#1643
  2. Some changes that fix interface leaks sdk-vpp#835

@NikitaSkrynnik NikitaSkrynnik moved this from In Progress to Under review in Release v1.14.0 Jul 5, 2024
@Ex4amp1e Ex4amp1e moved this from Under review to In Progress in Release v1.14.0 Jul 24, 2024
@Ex4amp1e Ex4amp1e moved this from In Progress to Done in Release v1.14.0 Jul 26, 2024
@Ex4amp1e Ex4amp1e moved this from Done to Under review in Release v1.14.0 Jul 26, 2024
@Ex4amp1e Ex4amp1e moved this from Under review to Blocked in Release v1.14.0 Jul 26, 2024
@denis-tingaikin
Member

denis-tingaikin commented Sep 24, 2024

It seems the problem still exists: the forwarder is still leaking.

[Image: forwarder-vpp memory consumption]

The picture shows memory consumption of forwarder-vpp.

forwarder-vpp_goroutineprofiles_20240920082027.tar.gz
forwarder-vpp_memprofiles_20240920082112.tar.gz

@denis-tingaikin denis-tingaikin added bug Something isn't working stability The problem is related to system stability performance The problem related to system effectivity labels Sep 24, 2024
@denis-tingaikin denis-tingaikin moved this from Blocked to Moved to next release in Release v1.14.0 Sep 24, 2024
@NikitaSkrynnik
Contributor Author

NikitaSkrynnik commented Oct 3, 2024

Current plan

  • Test 20 clients and 20 endpoints on an Azure cluster with 2 nodes for 24 hours. Collect memory and goroutine profiles, VPP memory profiles, and kubectl top output after 10 minutes and after 24 hours of testing.
  • Test 40 clients and 1 endpoint on an Azure cluster with 2 nodes for 24 hours, scaling clients between 0 and 40 every 60 seconds. Collect memory and goroutine profiles, VPP memory profiles, and kubectl top output at the beginning, after 10 minutes, and after 24 hours of testing.

@NikitaSkrynnik
Contributor Author

NikitaSkrynnik commented Oct 14, 2024

After fixing context leaks in nsmonitor and the timeout chain element, the forwarder shows the following results with 30 NSCs and 10 NSEs:

3h of testing:

NAMESPACE     NAME                                  CPU(cores)   MEMORY(bytes)
nsm-system    forwarder-vpp-gr729                   135m         363Mi
nsm-system    forwarder-vpp-zjd5x                   200m         390Mi
nsm-system    nsmgr-25vvt                           167m         92Mi
nsm-system    nsmgr-f6rjf                           92m          71Mi

18h of testing:

NAMESPACE     NAME                                  CPU(cores)   MEMORY(bytes)
nsm-system    forwarder-vpp-gr729                   132m         401Mi
nsm-system    forwarder-vpp-zjd5x                   199m         438Mi
nsm-system    nsmgr-25vvt                           175m         97Mi
nsm-system    nsmgr-f6rjf                           92m          74Mi

3d20h of testing:

NAMESPACE     NAME                                  CPU(cores)   MEMORY(bytes)
nsm-system    forwarder-vpp-gr729                   272m         421Mi
nsm-system    forwarder-vpp-zjd5x                   310m         443Mi
nsm-system    nsmgr-25vvt                           197m         103Mi
nsm-system    nsmgr-f6rjf                           101m         78Mi

4d5h of testing:

NAMESPACE     NAME                                  CPU(cores)   MEMORY(bytes)
nsm-system    forwarder-vpp-gr729                   228m         417Mi
nsm-system    forwarder-vpp-zjd5x                   203m         448Mi
nsm-system    nsmgr-25vvt                           213m         101Mi
nsm-system    nsmgr-f6rjf                           110m         83Mi

30m after testing:

NAMESPACE     NAME                                  CPU(cores)   MEMORY(bytes)
nsm-system    forwarder-vpp-gr729                   66m          414Mi
nsm-system    forwarder-vpp-zjd5x                   63m          428Mi
nsm-system    nsmgr-25vvt                           49m          87Mi
nsm-system    nsmgr-f6rjf                           46m          73Mi

Memory and goroutine profiles taken 30m after testing don't show any leaks. All profiles: traces.zip
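To quantify the growth in the tables above, a small Go program can compute per-pod deltas. The values are copied from the kubectl top output; the linear Mi/day rate is an illustrative assumption, not a claim about how memory actually grows, and podMem and growthPerDay are hypothetical names.

```go
package main

import "fmt"

// podMem holds MEMORY(bytes) values in Mi for one pod, taken from the
// kubectl top output above.
type podMem struct {
	Pod      string
	At3h     int // after 3h of testing
	At4d5h   int // after 4d5h of testing
	After30m int // 30m after testing stopped
}

var samples = []podMem{
	{"forwarder-vpp-gr729", 363, 417, 414},
	{"forwarder-vpp-zjd5x", 390, 448, 428},
	{"nsmgr-25vvt", 92, 101, 87},
	{"nsmgr-f6rjf", 71, 83, 73},
}

// growthPerDay returns the average growth in Mi/day between the 3h and 4d5h
// snapshots, assuming (for illustration only) that growth is roughly linear.
func growthPerDay(m podMem) float64 {
	const hours = 4*24 + 5 - 3 // 4d5h minus 3h = 98h
	return float64(m.At4d5h-m.At3h) / hours * 24
}

func main() {
	for _, m := range samples {
		fmt.Printf("%-20s +%2d Mi over 98h (%.1f Mi/day), %2d Mi released after testing\n",
			m.Pod, m.At4d5h-m.At3h, growthPerDay(m), m.At4d5h-m.After30m)
	}
}
```

With these numbers the two forwarder pods grow by roughly 13 and 14 Mi/day between the 3h and 4d5h snapshots.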

@denis-tingaikin
Member

@NikitaSkrynnik the attached archive is empty, could you resend it?

@denis-tingaikin denis-tingaikin moved this to In Progress in Release v1.15.0 Nov 5, 2024
@NikitaSkrynnik NikitaSkrynnik changed the title Controlplane part of forwarder-vpp leaks Controlplane part of forwarder-vpp and nsmgr leaks Nov 11, 2024
@NikitaSkrynnik
Contributor Author

Forwarder VPP testing

After 150h of high-load testing with 30 clients and 10 endpoints on an Azure cluster, forwarder memory growth looks like this:
[Chart: forwarder-vpp memory consumption over 150h of testing]

Forwarder memory consumption is growing slowly. After 150h, Go memory profiles show 6 MB of allocated memory, and the goroutine count is 288.

VPP memory usage for forwarder-vpp-qhn7q:
[Chart: VPP memory usage for forwarder-vpp-qhn7q]

@NikitaSkrynnik NikitaSkrynnik changed the title Controlplane part of forwarder-vpp and nsmgr leaks Controlplane part of forwarder-vpp leaks Nov 14, 2024
@denis-tingaikin denis-tingaikin moved this from In Progress to Under review in Release v1.15.0 Nov 21, 2024
Projects
Status: No status
Status: Moved to next release
Status: In Progress
Status: Under review
Development

No branches or pull requests

2 participants