Have a metric that introspects why pods failed in the cluster #725
We already have this metric: it reports the percentage of justifications with ERROR on the failed advisers, including failures due to OOM or exceeded CPU. Is that enough, or do we need to look at the individual exit codes? wdyt @fridex ?
Are these computed by the reporter based on documents stored on Ceph?
Yes, they are analyzed every morning for the previous day by the reporter.
Daily sounds reasonable. 👍🏻
So back to this one. An example to motivate the $SUBJ metric: as of now, our prod environment fails to give any recommendations as it is in an inconsistent state (thoth-station/thoth-application#1766) - database queries expect
With metrics reported by the reporter, we will know about this issue one day later, not in real time - that will not give us insight into how the system works right now and what actions should be taken to recover from the error state. If an inconsistent system state accidentally occurs again someday in the future, we should be alerted: "the recommender system is giving too many errors in adviser pods with these exit codes, the system operator should have a look at it". That way we keep the system up and make sure that any misbehavior gets the system operator's attention immediately, based on the alert (before users start to complain). Inspecting exit codes is one thing; having info about failed workflows (e.g. the platform fails to bring a pod up) is another thing to consider in this case.
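As a hedged sketch of what such real-time alerting could look like: assuming kube-state-metrics exposes the last-terminated exit code of containers (via a metric like `kube_pod_container_status_last_terminated_exitcode`) and that adviser pods can be matched by a name prefix, a Prometheus alerting rule might be (metric name, label matcher, and thresholds are all illustrative assumptions, not the team's agreed design):

```yaml
# Illustrative PrometheusRule: alert when several adviser pods have recently
# terminated with a non-zero exit code. All names/thresholds are placeholders.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: adviser-exit-codes
spec:
  groups:
    - name: thoth-adviser
      rules:
        - alert: AdviserPodsFailing
          expr: |
            count(
              kube_pod_container_status_last_terminated_exitcode{pod=~"adviser-.*"} > 0
            ) > 5
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: >-
              Adviser pods are terminating with non-zero exit codes;
              the system operator should have a look.
```

Grouping by the metric's value (the exit code itself) would additionally let the alert distinguish OOM kills from, say, invalid-input failures.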
I see your point - in that case, what justification is reported by adviser? So we need to find a way to read the exit codes of the pods and report them immediately (at the moment we only have the percentage of adviser failures, and then we asynchronously analyze the reason from the documents on Ceph). Here errors are decreasing but workflow failures are increasing (ocp4-stage), while succeeded ones are not changing much.

We have another metric on the number of requests vs the number of reports created on Ceph (also evaluated asynchronously, once per day from the Ceph analysis); if they do not match for a long time, something wrong is happening in the system (e.g. Kafka is off (another metric is available for that), or the database is off).
There is no justification created, as the pod errored. adviser reports the following error information:
Yes, the metric discussed on earlier calls is not applicable to this case - here the system produces documents, but does not satisfy user requests. The metric you brought up introspects whether the system produces any documents at all (and should alert as well if it does not).
/priority important-soon
@goern: The label(s) In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/triage accepted
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /lifecycle rotten |
/remove-lifecycle rotten
Rotten issues close after 30d of inactivity. /close
@sesheta: Closing this issue. In response to this:
/priority important-longterm
/sig observability
Potentially relevant metrics. From kube-state-metrics:
From the Argo workflow controller:
+ all the metrics documented at https://argoproj.github.io/argo-workflows/metrics/#default-controller-metrics, probably. Beyond that, custom workflow metrics (metrics defined in the Workflow spec, from what I gather) look relevant.
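To make the "custom workflow metrics" option concrete: Argo Workflows lets a template emit Prometheus metrics via a `metrics.prometheus` stanza in the Workflow spec. A minimal sketch, in which the template name, metric name, and image are made-up placeholders rather than actual Thoth components:

```yaml
# Illustrative only: an adviser-like template emitting a counter labeled
# with the template's final status. Names and image are hypothetical.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: adviser-
spec:
  entrypoint: adviser
  templates:
    - name: adviser
      metrics:
        prometheus:
          - name: thoth_adviser_result_total   # hypothetical metric name
            help: "Adviser template results by status"
            labels:
              - key: status
                value: "{{status}}"            # Succeeded / Failed / Error
            counter:
              value: "1"
      container:
        image: example/adviser:latest          # placeholder image
```

This reports per-status counts, but (unlike the kube-state-metrics approach) it does not expose the container's exit code by itself.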
Relevant: kubernetes/kube-state-metrics#1481 (the issue is only closed because it is old, not because it was refused).
Some opinions.
Since we control the exit codes (most of them), we can use them to map to any reason we like.
(I'll unassign myself; I don't think we have a clear enough view of what we want to do with this yet.)
I think we should use the kube-state-metrics feature once the previously linked issue is addressed. Unless someone has a different opinion, I propose we keep this frozen until then.
The kube-state-metrics PR got merged. I'll keep an eye on this when they release a new version.
kube-state-metrics does releases something like every 2-4 months, judging from their release history. Do we have an idea of what the timeline is for: new kube-state-metrics release -> gets into OpenShift -> gets onto the clusters we run? Also, if we decide to go that route (= using kube-state-metrics) - do we? Suggestion:
Description: Use |
Sounds good to me. Which of the parts is on op1st and which on us?
They have the producer items (upgrade kube-state-metrics); we have the consumer ones (create the dashboard + alerts), assuming those are handled as application components in thoth-station.
@VannTen did you open an issue to update kube-state-metrics?
There isn't a release of kube-state-metrics with the merged PR yet, so I was thinking we should wait for it before opening an issue.
ACK |
It looks like we should monitor https://github.com/openshift/cluster-monitoring-operator and/or https://github.com/openshift/kube-state-metrics . I'll check the git history later to see if the exit_code PR is there, and in which release branch. |
Is your feature request related to a problem? Please describe.
As Thoth operator, I would like to know why solvers failed in the cluster (e.g. if they failed due to OOM).
As Thoth operator, I would like to know why advisers failed in the cluster (e.g. wrong user inputs, ...).
Describe the solution you'd like
Have a metric that exposes information about exit code returned by the corresponding container in a workflow.
We can sync how these components return the exit code and the semantics behind these exit codes.
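One way to sketch "syncing the semantics behind the exit codes": each component exits with a documented code, and the monitoring side maps that code back to a human-readable reason. The specific codes below (other than the standard 128+signal convention for 137) are hypothetical placeholders, not an agreed Thoth contract:

```python
import subprocess
import sys

# Hypothetical exit-code contract (illustrative only; the real codes would
# have to be agreed on across solver/adviser and documented).
EXIT_CODE_REASONS = {
    0: "success",
    1: "generic error",
    4: "invalid user input",               # hypothetical component-defined code
    137: "killed by SIGKILL (often OOM)",  # 128 + 9, standard shell convention
}

def interpret_exit_code(returncode: int) -> str:
    """Map a container/process exit code to a human-readable reason."""
    return EXIT_CODE_REASONS.get(returncode, f"unknown exit code {returncode}")

# Simulate a component that exits with a documented code on bad input.
proc = subprocess.run([sys.executable, "-c", "import sys; sys.exit(4)"])
print(interpret_exit_code(proc.returncode))
```

A metric exposing the raw exit code (as discussed above) plus a shared table like this would let dashboards and alerts show reasons instead of bare numbers.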
Acceptance criteria