Investigate why forwarder restart could take too long #664
Most likely, this behavior can be explained and there are actually no outliers here. The forwarder is in a ready state only when it can serve gRPC clients; we have the forwarder readinessProbe for this. On page 11 of the presentation we can see that the upgrade of NSM happens gradually, including the spire-agents.
|
Coming back to this topic after a long time... Since the presentation there have been several changes in our configuration that made the upgrade faster. But the question about the forwarder startup time, or rather the traffic outage period, is still valid. In the worker node reboot and forwarder-vpp pod restart scenarios we observe quite big differences in how long the traffic does not work. We also measured it in a bigger environment, where we got figures varying between 60 and 120 seconds or even more. Is there a theory that could explain these huge differences? |
I managed to reproduce the issue where, after a forwarder restart, the traffic is not recovered until a few minutes later. After some investigation it turned out that when the forwarder restarted, only one of the NSM interfaces was re-created in the NSC pod, therefore the traffic failed for the affected network service. The setup is based on the NSM v1.11.1 basic kernel-to-ethernet-to-kernel example. I used a kind cluster with 4 worker nodes. Then I started two
On the NSEs I ran
These clients send traffic to the addressed server via 100 connections at a rate of 50 KB/s for 10 minutes. (It can happen that sometimes the …) When all the instances are running and connected properly, and the traffic is steady (no failing or connecting state, and no drops), the tool shows output similar to this:
Then I started deleting the forwarder pods one by one in a cycle: I always wait until the traffic is steady again after the new forwarder has started up, then delete the next one and check whether the traffic recovered, like below:
After some iterations the traffic to one of the NSCs did not get reconnected until a few minutes later. I tried to reproduce it with fewer nodes and fewer NSEs, also with … I collected some logs, but unfortunately the others were already rotated. @denis-tingaikin: Can you please check whether there could be a fault in the healing mechanism that causes this behavior? |
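For reference, a minimal sketch of such a restart cycle (the namespace and the forwarder label are assumptions based on a default NSM deployment, and the traffic check is only indicated, not automated):

```bash
#!/usr/bin/env bash
# Delete the forwarder-vpp pods one by one, waiting for the replacement pod and steady traffic in between.
set -euo pipefail

NS=nsm-system   # assumption: namespace of the NSM system components

for pod in $(kubectl -n "$NS" get pods -l app=forwarder-vpp -o name); do
    echo "Deleting $pod"
    kubectl -n "$NS" delete "$pod" --wait=true

    # Wait until the replacement forwarder reports Ready again.
    kubectl -n "$NS" wait --for=condition=Ready pod -l app=forwarder-vpp --timeout=300s

    # Placeholder: here the traffic tool's output was checked manually until it was steady again.
    sleep 60
done
```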
@szvincze |
No, it is disabled in this case. |
@szvincze Most likely, for the problem you described, this will be enough. |
@glazychev-art: Our test results show that the fix is working fine. |
Cool, thanks! |
Could there be a backward compatibility issue if the forwarder-vpp and nsmgr contain this update but the NSC does not? |
I quickly checked and found that we missed the commit with the fix in the release candidate. The good news is that it was just a release candidate. I'll create v1.11.2-rc.2 in a few hours with the missed fix. |
Now we have the missed commit 530e439255e773c2b92bc7fb8ccedba1a38e188a: networkservicemesh/sdk@release/v1.11.1...release/v1.11.2-rc.2
Release v1.11.2-rc.2 is ready: https://github.com/networkservicemesh/deployments-k8s/tree/release/v1.11.2-rc.2
Testing is in progress: https://github.com/networkservicemesh/integration-k8s-kind/actions/runs/7078369907 |
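For anyone who wants to double-check, here is one way to verify that the missed commit is now an ancestor of the release branch (a sketch assuming a local clone of networkservicemesh/sdk):

```bash
# Clone the SDK repository and fetch the release branch.
git clone https://github.com/networkservicemesh/sdk.git
cd sdk
git fetch origin release/v1.11.2-rc.2

# Exits with 0 only if the fix commit is contained in the release branch.
if git merge-base --is-ancestor 530e439255e773c2b92bc7fb8ccedba1a38e188a origin/release/v1.11.2-rc.2; then
    echo "fix commit is included"
else
    echo "fix commit is missing"
fi
```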
It was really quick, thanks! My original question is still valid. |
There will be no backward compatibility issues, but the original problem will not be fixed: the connection will still take some time to recover. |
Thanks. It was clear that to fix the original problem we need the new NSC, but it is important to keep backward compatibility. |
Hi @glazychev-art & @denis-tingaikin, it seems that the problem still occurs with rc.2. However, the previous test image (…) Can you please double-check what could cause this behavior with rc.2? |
Hi @szvincze, |
Hello @szvincze If ghcr.io/networkservicemesh/ci/cmd-nsc:a99d3e4 solves the problem, then
Do you have results for |
As I can see from the logs, we have a problem with the registry related to dial
I'd like to suggest re-testing rc.3 with two options:
In parallel, I'll try to reproduce the problem and make sure that we don't have other issues that may cause the dial problem. |
FYI: Also, if you are restarting the node that contains the spire-server, please use this patch to avoid problems with dialing the registry: networkservicemesh/deployments-k8s#10287 |
Current status:
/cc @edwarnicke, @szvincze |
Hi @denis-tingaikin,
Please find the logs for v1.12.0-rc.1. Some background about what happened in this test: traffic started on the problematic connection at 2024-01-11T09:26:37.541, then forwarder-vpp-cpqz7 was deleted at 2024-01-11T09:26:40.600. The new forwarder was ready at 2024-01-11T09:26:49. After the forwarder was deleted, this connection did not get packets for almost 10 minutes. Additional information on the problematic NSE.
The problematic connection in the NSC goes to the traffic server address [100:100::1]:5003. |
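For illustration, roughly how such a stuck connection can be inspected from the NSC side (the pod name, namespace, and the availability of ping in the image are assumptions, not taken from the logs above):

```bash
# Check whether the NSM kernel interface was re-created in the NSC pod after the forwarder restart.
kubectl -n my-namespace exec my-nsc-pod -- ip -6 addr show

# See which interface/route the traffic towards the server address would use.
kubectl -n my-namespace exec my-nsc-pod -- ip -6 route get 100:100::1

# Probe the server address directly.
kubectl -n my-namespace exec my-nsc-pod -- ping -6 -c 3 100:100::1
```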
Hi @szvincze |
The problem in this case was with IPv6 and most probably I got the IPv4 output by mistake. |
Hi,
1., In the NSE/NSC POD there's a Linux bridge (with an IPv4/IPv6 address) to which the nsm interface(s) will be attached. Ping between the NSM endpoint IPs belonging to the same NSM connection might not work, or more like won't work in our case:
IPv4: requires arp_ignore tweaks on the interfaces involved.
IPv6: requires proxy_ndp enabled, but is more complicated than IPv4, since proxy NDP must be handled on a per-address basis (for example by using an additional chain element to add/remove NDP entries to/from the ND table upon NSM Request/Close).
These "adjustments" for IPv4 and IPv6 could be skipped for example by using the bridge IP instead for the datapath monitoring. (Either as src or dst depending which link we consider.) 2., In order to send an ICMP echo request, you either need a RAW socket or a Datagram socket (UDP Ping, often referred to as unprivileged ping). To create RAW socket NET_RAW privilege is required.
Unfortunately, we are not allowed to have requirements towards the application POD involving capabilities or setting the ping_group_range sysctl. |
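For reference, a rough sketch of the tweaks discussed above (interface names, addresses and the group range are illustrative assumptions, not the actual configuration):

```bash
# IPv4: adjust arp_ignore so that ARP requests for the slave-port IPs are answered
# (the right value and interface depend on the actual bridge setup).
sysctl -w net.ipv4.conf.bridge0.arp_ignore=0

# IPv6: enable proxy NDP on the bridge and add a proxy entry per remote NSM address
# (per-address handling, e.g. by a chain element on NSM Request/Close).
sysctl -w net.ipv6.conf.bridge0.proxy_ndp=1
ip -6 neigh add proxy 100:100::1 dev bridge0

# Unprivileged (datagram-socket) ping without NET_RAW needs ping_group_range to cover the
# caller's GID, which is exactly the kind of requirement not allowed towards the application POD.
sysctl -w net.ipv4.ping_group_range="0 2147483647"
```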
Thanks @zolug,
|
1., They are 2 separate connections. I mostly highlighted the setup to emphasize that in our case there could be 2 "types" of datapath monitoring use cases: when the bridge is in the NSE and when the bridge is in the NSC. 2., Hmm, I wasn't aware of this chain element enabling unprivileged ping for all group IDs. |
@zolug If you have multiple NSCs and NSEs, do you use the same bridge, or is a unique one created for each pair? |
@glazychev-art |
@zolug |
@zolug |
@glazychev-art The reason why the bridge has an IP address is mostly historical and is related to the "traffic" issues affecting L2 address resolution explained in the 1st point of this comment: #664 (comment). |
I don't think they are related. |
Thanks @zolug |
@zolug |
@glazychev-art The NSC --- NSE/NSC --- NSE setup is a simplified version of our architecture. Here's a (slightly outdated) picture better depicting our architecture. Also, IMHO all the delicate details are not that important; my goal was mainly to illustrate that a bridge could cause pings involving its slave port IPs to fail without proper arp_ignore and proxy NDP tweaks. Btw, I'm not expecting any changes in NSM to address this use case, I just figured it could be worth sharing with you. |
@zolug |
I'm a bit puzzled about what logs you might be after. The details around the bridge are not (or only loosely) connected to the original issue. I merely added this information here because I was asked to do so. |
@zolug Based on this picture, there are still a few questions. Also, the question is at what point the client receives the IP 20.0.0.1. Does this mean that immediately after the Request the connection is unavailable, because 172.16.0.3/24 is not manually configured? Perhaps I misunderstood you. Enabling datapath healing would allow us to solve the healing problem more reliably, I think. |
The reason is described in #664 (comment): The IPs in the example (
That's correct. (There can be and normally should be multiple paths available in each Proxy POD to send a packet towards a load-balancer POD.)
The bridge is still involved, but multi-path routing will decide which slave interface the packet should be sent out on towards a load-balancer.
It's received separately by our left NSC (Target) through an independent "communication channel", upon which the NSC will update the already established connection by sending out an updated Request with a modified IPContext ( Example logs:
And here's how it looks once an established NSM connection is updated with VIP addresses (
|
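To illustrate the multi-path routing mentioned above, here is a generic sketch of a Linux ECMP route towards a VIP behind two NSM interfaces (addresses and interface names are made up for the example, not the actual configuration):

```bash
# ECMP route: the VIP is reachable via either of the two NSM interfaces;
# the kernel picks one path per flow.
ip route add 20.0.0.1/32 \
    nexthop via 172.16.0.1 dev nsm-1 weight 1 \
    nexthop via 172.16.1.1 dev nsm-2 weight 1

# Show which path a given flow towards the VIP would take.
ip route get 20.0.0.1
```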
Steps to reproduce
Repeat steps from https://drive.google.com/file/d/1Pwd7T6iYPItXEqjx_Yd5NfPE-T0zgUy5