feat: critical alerts by modules #263

Open
wants to merge 1 commit into
base: develop
16 changes: 12 additions & 4 deletions README.md
@@ -303,19 +303,27 @@ Holesky) this value should be omitted.
`CRITICAL_ALERTS_ALERTMANAGER_URL` - If passed, application sends additional critical alerts about validators performance to Alertmanager.
* **Required:** false
---
`CRITICAL_ALERTS_MIN_VAL_COUNT` - Critical alerts will be sent for Node Operators with validators count greater this value.
`CRITICAL_ALERTS_MIN_VAL_COUNT` - Critical alerts will be sent for Node Operators whose validator count is greater than or equal to this value.
* **Required:** false
* **Default:** 100
---
`CRITICAL_ALERTS_MIN_VAL_CSM_ABSOLUTE_COUNT` - If the number of validators in the CSM module affected by a specific critical event is greater than or equal to this value, a critical alert will be sent.
* **Required:** false
* **Default:** 1
---
`CRITICAL_ALERTS_ALERTMANAGER_LABELS` - Additional labels for critical alerts.
Must be in JSON string format. Example - '{"a":"valueA","b":"valueB"}'.
* **Required:** false
* **Default:** {}
---
`CSM_MODULE_ID` - ID of the CSM module in the Staking Router. If the CSM module doesn't exist, any value greater than the total number of Staking Router modules is accepted.
* **Required:** false
* **Default:** 3
---
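
The variables above could be read and validated along the following lines. This is a minimal sketch, not the application's actual `ConfigService`: the `CriticalAlertsSettings` shape and the helper names `readNonNegativeInt`, `readLabels`, and `parseCriticalAlertsSettings` are hypothetical, and only the variable names and defaults come from the README text.

```typescript
// Hypothetical settings shape; field names are illustrative only.
interface CriticalAlertsSettings {
  minValCount: number;
  minValCsmAbsoluteCount: number;
  csmModuleId: number;
  alertmanagerLabels: Record<string, string>;
}

type Env = Record<string, string | undefined>;

// Reads an optional non-negative integer, falling back to the documented default.
function readNonNegativeInt(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  if (raw === undefined || raw.trim() === '') return fallback;
  const value = Number(raw);
  if (!isFinite(value) || Math.floor(value) !== value || value < 0) {
    throw new Error(`${name} must be a non-negative integer, got "${raw}"`);
  }
  return value;
}

// Parses the labels variable, which the README requires to be a JSON string.
function readLabels(env: Env): Record<string, string> {
  const raw = env['CRITICAL_ALERTS_ALERTMANAGER_LABELS'];
  if (raw === undefined || raw.trim() === '') return {};
  const parsed: unknown = JSON.parse(raw);
  if (parsed === null || typeof parsed !== 'object' || Array.isArray(parsed)) {
    throw new Error('CRITICAL_ALERTS_ALERTMANAGER_LABELS must be a JSON object');
  }
  return parsed as Record<string, string>;
}

function parseCriticalAlertsSettings(env: Env): CriticalAlertsSettings {
  return {
    minValCount: readNonNegativeInt(env, 'CRITICAL_ALERTS_MIN_VAL_COUNT', 100),
    minValCsmAbsoluteCount: readNonNegativeInt(env, 'CRITICAL_ALERTS_MIN_VAL_CSM_ABSOLUTE_COUNT', 1),
    csmModuleId: readNonNegativeInt(env, 'CSM_MODULE_ID', 3),
    alertmanagerLabels: readLabels(env),
  };
}
```

Passing the environment in as a plain object keeps the sketch independent of any runtime; in a Node.js process the caller would presumably hand over `process.env`.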

## Application critical alerts (via Alertmanager)

In addition to alerts based on Prometheus metrics you can receive special critical alerts based on beaconchain aggregates from app.
In addition to alerts based on Prometheus metrics, you can receive special critical alerts from the app based on Beacon Chain aggregates.

You should pass env var `CRITICAL_ALERTS_ALERTMANAGER_URL=http://<alertmanager_host>:<alertmanager_port>`.

@@ -325,8 +333,8 @@ And if `ethereum_validators_monitoring_data_actuality < 1h` it allows you to rec
|----------------------------|-----------------------------------------------------------------------------------------------------------------|-----------------|---------------------------|
| CriticalSlashing | At least one validator was slashed | instant | - |
| CriticalMissedProposes | More than 1/3 blocks from Node Operator duties was missed in the last 12 hours | every 6h | - |
| CriticalNegativeDelta | More than 1/3 or more than 1000 Node Operator validators with negative balance delta (between current and 6 epochs ago) | every 6h | every 1h |
| CriticalMissedAttestations | More than 1/3 or more than 1000 Node Operator validators with missed attestations in the last {{ BAD_ATTESTATION_EPOCHS }} epochs | every 6h | every 1h |
| CriticalNegativeDelta | More than 1/3 or more than 1000 Node Operator validators in curated modules with negative balance delta (between current and 6 epochs ago). At least `{{CRITICAL_ALERTS_MIN_VAL_CSM_ABSOLUTE_COUNT}}` Node Operator validators in the CSM module with negative balance delta. | every 6h | every 1h |
| CriticalMissedAttestations | More than 1/3 or more than 1000 Node Operator validators in curated modules with missed attestations in the last `{{BAD_ATTESTATION_EPOCHS}}` epochs. At least `{{CRITICAL_ALERTS_MIN_VAL_CSM_ABSOLUTE_COUNT}}` Node Operator validators in the CSM module with missed attestations. | every 6h | every 1h |
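
The split between curated modules and the CSM module can be sketched as a single predicate. This is an illustration under assumptions: `OperatorStats`, `Thresholds`, and `isCritical` are hypothetical names, and the comparisons follow the `>=` checks in the `CriticalMissedAttestations` diff further down, so the boundary may differ for other alerts.

```typescript
// Hypothetical per-operator input; "affected" is the number of validators hit by the event.
interface OperatorStats {
  moduleId: number;
  activeOngoing: number;
  affected: number;
}

// Hypothetical thresholds, mirroring the three environment variables in the README.
interface Thresholds {
  csmModuleId: number;
  minValCount: number;
  minValCsmAbsoluteCount: number;
}

function isCritical(stats: OperatorStats, t: Thresholds): boolean {
  if (stats.moduleId === t.csmModuleId) {
    // CSM operators are small, so an absolute count is used instead of a share.
    return stats.activeOngoing >= t.minValCsmAbsoluteCount && stats.affected >= t.minValCsmAbsoluteCount;
  }
  // Curated operators below the minimum size are skipped entirely.
  if (stats.activeOngoing < t.minValCount) return false;
  // One third of the operator's validators, capped at 1000.
  return stats.affected >= Math.min(stats.activeOngoing / 3, 1000);
}
```

With the documented defaults, a curated operator running 300 validators would cross the line at 100 affected validators, while a CSM operator would cross it at a single one.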


## Application metrics
Expand Down
38 changes: 31 additions & 7 deletions src/common/alertmanager/alerts/BasicAlert.ts
@@ -16,32 +16,56 @@ export interface PreparedToSendAlert {
ruleResult: AlertRuleResult;
}

export interface PreparedToSendAlerts {
[moduleId: string]: PreparedToSendAlert;
}

export interface AlertRuleResult {
[operator: string]: any;
}

export interface AlertRulesResult {
[moduleId: string]: AlertRuleResult;
}

export abstract class Alert {
public readonly alertname: string;
protected sendTimestamp = 0;
protected sendTimestamp: {
[moduleId: string]: number
};
protected readonly config: ConfigService;
protected readonly storage: ClickhouseService;
protected readonly operators: RegistrySourceOperator[];

protected constructor(name: string, config: ConfigService, storage: ClickhouseService, operators: RegistrySourceOperator[]) {
this.alertname = name;
this.sendTimestamp = {};
this.config = config;
this.storage = storage;
this.operators = operators;
}

abstract alertRule(bySlot: number): Promise<AlertRuleResult>;
abstract alertRules(bySlot: number): Promise<AlertRulesResult>;

abstract sendRule(moduleId: string, ruleResult: AlertRuleResult): boolean;

abstract alertBody(moduleId: string, ruleResult: AlertRuleResult): AlertRequestBody;

abstract sendRule(ruleResult?: AlertRuleResult): boolean;
async toSend(epoch: Epoch): Promise<PreparedToSendAlerts | {}> {
const rulesResult = await this.alertRules(epoch);
const moduleIds = Object.keys(rulesResult);
const result = {};

abstract alertBody(ruleResult: AlertRuleResult): AlertRequestBody;
for (const moduleId of moduleIds) {
if (this.sendRule(moduleId, rulesResult[moduleId])) {
result[moduleId] = {
timestamp: this.sendTimestamp[moduleId],
body: this.alertBody(moduleId, rulesResult[moduleId]),
ruleResult: rulesResult[moduleId],
};
}
}

async toSend(epoch: Epoch): Promise<PreparedToSendAlert | undefined> {
const ruleResult = await this.alertRule(epoch);
if (this.sendRule(ruleResult)) return { timestamp: this.sendTimestamp, body: this.alertBody(ruleResult), ruleResult };
return result;
}
}
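
The per-module fan-out in `toSend` can be looked at in isolation. The sketch below is a simplified stand-in, not the class from this diff: `prepareByModule` and its `shouldSend` callback are hypothetical, the alert body is omitted, and the timestamp is passed in rather than read from instance state.

```typescript
type RuleResult = Record<string, unknown>;
type RulesResult = Record<string, RuleResult>;

interface Prepared {
  timestamp: number;
  ruleResult: RuleResult;
}

// Keeps only the modules whose rule result passes the send check,
// producing one prepared alert per module id.
function prepareByModule(
  rules: RulesResult,
  shouldSend: (moduleId: string, ruleResult: RuleResult) => boolean,
  now: number,
): Record<string, Prepared> {
  const prepared: Record<string, Prepared> = {};
  for (const moduleId of Object.keys(rules)) {
    const ruleResult = rules[moduleId];
    if (shouldSend(moduleId, ruleResult)) {
      prepared[moduleId] = { timestamp: now, ruleResult };
    }
  }
  return prepared;
}
```

Typing the accumulator as `Record<string, Prepared>` rather than a bare `{}` lets the compiler check each assignment, which may be worth considering for `result` in `toSend` as well.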
67 changes: 45 additions & 22 deletions src/common/alertmanager/alerts/CriticalMissedAttestations.ts
@@ -6,7 +6,7 @@ import { Epoch } from 'common/consensus-provider/types';
import { ClickhouseService } from 'storage';
import { RegistrySourceOperator } from 'validators-registry';

import { Alert, AlertRequestBody, AlertRuleResult } from './BasicAlert';
import { Alert, AlertRequestBody, AlertRuleResult, AlertRulesResult } from './BasicAlert';

const validatorsWithMissedAttestationCountThreshold = (quantity: number) => {
return Math.min(quantity / 3, 1000);
@@ -17,57 +17,80 @@ export class CriticalMissedAttestations extends Alert {
super(CriticalMissedAttestations.name, config, storage, operators);
}

async alertRule(epoch: Epoch): Promise<AlertRuleResult> {
const result: AlertRuleResult = {};
async alertRules(epoch: Epoch): Promise<AlertRulesResult> {
const criticalAlertsMinValCount = this.config.get('CRITICAL_ALERTS_MIN_VAL_COUNT');
const csmModuleId = this.config.get('CSM_MODULE_ID');
const criticalAlertsMinValCSMAbsoluteCount = this.config.get('CRITICAL_ALERTS_MIN_VAL_CSM_ABSOLUTE_COUNT');

const result: AlertRulesResult = {};
const nosStats = await this.storage.getUserNodeOperatorsStats(epoch);
const missedAttValidatorsCount = await this.storage.getValidatorCountWithMissedAttestationsLastNEpoch(epoch);
for (const noStats of nosStats.filter((o) => o.active_ongoing > this.config.get('CRITICAL_ALERTS_MIN_VAL_COUNT'))) {
const operator = this.operators.find((o) => +noStats.val_nos_module_id == o.module && +noStats.val_nos_id == o.index);
const filteredNosStats = nosStats.filter((o) => (+o.val_nos_module_id === csmModuleId && o.active_ongoing >= criticalAlertsMinValCSMAbsoluteCount) || (+o.val_nos_module_id !== csmModuleId && o.active_ongoing >= criticalAlertsMinValCount));

for (const noStats of filteredNosStats) {
const operator = this.operators.find((o) => +noStats.val_nos_module_id === o.module && +noStats.val_nos_id === o.index);
const missedAtt = missedAttValidatorsCount.find(
(a) => a.val_nos_id != null && +a.val_nos_module_id == operator.module && +a.val_nos_id == operator.index,
(a) => a.val_nos_id != null && +a.val_nos_module_id === operator.module && +a.val_nos_id === operator.index,
);
if (!missedAtt) continue;
if (missedAtt.amount > validatorsWithMissedAttestationCountThreshold(noStats.active_ongoing)) {
result[operator.name] = { ongoing: noStats.active_ongoing, missedAtt: missedAtt.amount };

if (missedAtt == null) continue;

if (
(+noStats.val_nos_module_id === csmModuleId && missedAtt.amount >= criticalAlertsMinValCSMAbsoluteCount) ||
(+noStats.val_nos_module_id !== csmModuleId &&
missedAtt.amount >= validatorsWithMissedAttestationCountThreshold(noStats.active_ongoing))
) {
if (result[noStats.val_nos_module_id] == null) {
result[noStats.val_nos_module_id] = {};
}
result[noStats.val_nos_module_id][operator.name] = { ongoing: noStats.active_ongoing, missedAtt: missedAtt.amount };
}
}

return result;
}

sendRule(ruleResult: AlertRuleResult): boolean {
sendRule(moduleId: string, ruleResult: AlertRuleResult): boolean {
const defaultInterval = 6 * 60 * 60 * 1000; // 6h
const ifIncreasedInterval = 60 * 60 * 1000; // 1h
this.sendTimestamp = Date.now();
this.sendTimestamp[moduleId] = Date.now();

if (Object.values(ruleResult).length > 0) {
const prevSendTimestamp = sentAlerts[this.alertname]?.timestamp ?? 0;
if (this.sendTimestamp - prevSendTimestamp > defaultInterval) return true;
const sentAlertsForModule = sentAlerts[this.alertname] != null ? sentAlerts[this.alertname][moduleId] : null;
const prevSendTimestamp = sentAlertsForModule?.timestamp ?? 0;

if (this.sendTimestamp[moduleId] - prevSendTimestamp > defaultInterval) return true;

for (const [operator, operatorResult] of Object.entries(ruleResult)) {
// optional chaining on the operator entry: a newly added bad operator has no previous result
const missedAtt = sentAlertsForModule?.ruleResult[operator]?.missedAtt ?? 0;

// if any operator has increased bad validators count or another bad operator has been added
if (
operatorResult.missedAtt > (sentAlerts[this.alertname]?.ruleResult[operator]?.missedAtt ?? 0) &&
this.sendTimestamp - prevSendTimestamp > ifIncreasedInterval
)
return true;
if (operatorResult.missedAtt > missedAtt && (this.sendTimestamp[moduleId] - prevSendTimestamp > ifIncreasedInterval)) return true;
}
}

return false;
}

alertBody(ruleResult: AlertRuleResult): AlertRequestBody {
alertBody(moduleId: string, ruleResult: AlertRuleResult): AlertRequestBody {
const timestampDate = new Date(this.sendTimestamp[moduleId]);
const timestampDatePlusTwoMins = new Date(this.sendTimestamp[moduleId]).setMinutes(timestampDate.getMinutes() + 2);

return {
startsAt: new Date(this.sendTimestamp).toISOString(),
endsAt: new Date(new Date(this.sendTimestamp).setMinutes(new Date(this.sendTimestamp).getMinutes() + 2)).toISOString(),
startsAt: timestampDate.toISOString(),
endsAt: new Date(timestampDatePlusTwoMins).toISOString(),
labels: {
alertname: this.alertname,
severity: 'critical',
nos_module_id: moduleId,
...this.config.get('CRITICAL_ALERTS_ALERTMANAGER_LABELS'),
},
annotations: {
summary: `${
Object.values(ruleResult).length
} Node Operators with CRITICAL count of validators with missed attestations in the last ${this.config.get(
'BAD_ATTESTATION_EPOCHS',
)} epoch`,
)} epoch in module ${moduleId}`,
description: join(
Object.entries(ruleResult).map(([o, r]) => `${o}: ${r.missedAtt} of ${r.ongoing}`),
'\n',
Expand Down
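
The resend throttling in `sendRule` (every 6h by default, every 1h if the situation got worse) can be sketched as a pure function. The names `shouldSend`, `SentAlert`, and the two interval constants are hypothetical, and the previous alert is passed in explicitly instead of being read from the shared `sentAlerts` object.

```typescript
const DEFAULT_INTERVAL_MS = 6 * 60 * 60 * 1000; // 6h
const INCREASED_INTERVAL_MS = 60 * 60 * 1000; // 1h

interface OperatorResult {
  missedAtt: number;
}
type RuleResult = Record<string, OperatorResult>;

interface SentAlert {
  timestamp: number;
  ruleResult: RuleResult;
}

function shouldSend(now: number, current: RuleResult, previous?: SentAlert): boolean {
  const operators = Object.keys(current);
  if (operators.length === 0) return false; // nothing to report for this module

  const elapsed = now - (previous ? previous.timestamp : 0);
  if (elapsed > DEFAULT_INTERVAL_MS) return true; // regular resend
  if (elapsed <= INCREASED_INTERVAL_MS) return false; // too soon even if it got worse

  for (const operator of operators) {
    // An operator missing from the previous alert counts as 0, so a new bad operator triggers a resend.
    const prev = previous ? previous.ruleResult[operator] : undefined;
    const prevMissed = prev ? prev.missedAtt : 0;
    if (current[operator].missedAtt > prevMissed) return true;
  }
  return false;
}
```

Treating a missing previous entry as zero is what makes the "another bad operator has been added" case work, which is why the lookup of the previous per-operator result needs to tolerate `undefined`.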
65 changes: 45 additions & 20 deletions src/common/alertmanager/alerts/CriticalMissedProposes.ts
@@ -6,7 +6,7 @@ import { Epoch } from 'common/consensus-provider/types';
import { ClickhouseService } from 'storage';
import { RegistrySourceOperator } from 'validators-registry';

import { Alert, AlertRequestBody, AlertRuleResult } from './BasicAlert';
import { Alert, AlertRequestBody, AlertRuleResult, AlertRulesResult } from './BasicAlert';

const VALIDATORS_WITH_MISSED_PROPOSALS_COUNT_THRESHOLD = 1 / 3;

@@ -15,48 +15,73 @@ export class CriticalMissedProposes extends Alert {
super(CriticalMissedProposes.name, config, storage, operators);
}

async alertRule(epoch: Epoch): Promise<AlertRuleResult> {
const result: AlertRuleResult = {};
async alertRules(epoch: Epoch): Promise<AlertRulesResult> {
const criticalAlertsMinValCount = this.config.get('CRITICAL_ALERTS_MIN_VAL_COUNT');
const csmModuleId = this.config.get('CSM_MODULE_ID');

const result: AlertRulesResult = {};
const nosStats = await this.storage.getUserNodeOperatorsStats(epoch);
const proposes = await this.storage.getUserNodeOperatorsProposesStats(epoch); // ~12h range
for (const noStats of nosStats.filter((o) => o.active_ongoing > this.config.get('CRITICAL_ALERTS_MIN_VAL_COUNT'))) {
const operator = this.operators.find((o) => +noStats.val_nos_module_id == o.module && +noStats.val_nos_id == o.index);
const filteredNosStats = nosStats.filter((o) => +o.val_nos_module_id === csmModuleId || o.active_ongoing >= criticalAlertsMinValCount);

for (const noStats of filteredNosStats) {
const operator = this.operators.find((o) => +noStats.val_nos_module_id === o.module && +noStats.val_nos_id === o.index);
const proposeStats = proposes.find(
(a) => a.val_nos_id != null && +a.val_nos_module_id == operator.module && +a.val_nos_id == operator.index,
(a) => a.val_nos_id != null && +a.val_nos_module_id === operator.module && +a.val_nos_id === operator.index,
);
if (!proposeStats) continue;
if (proposeStats.missed > proposeStats.all * VALIDATORS_WITH_MISSED_PROPOSALS_COUNT_THRESHOLD) {
result[operator.name] = { all: proposeStats.all, missed: proposeStats.missed };

if (proposeStats == null) continue;

if (proposeStats.missed >= proposeStats.all * VALIDATORS_WITH_MISSED_PROPOSALS_COUNT_THRESHOLD) {
if (result[noStats.val_nos_module_id] == null) {
result[noStats.val_nos_module_id] = {};
}
result[noStats.val_nos_module_id][operator.name] = { all: proposeStats.all, missed: proposeStats.missed };
}
}

return result;
}

sendRule(ruleResult: AlertRuleResult): boolean {
sendRule(moduleId: string, ruleResult: AlertRuleResult): boolean {
const defaultInterval = 6 * 60 * 60 * 1000; // 6h
this.sendTimestamp = Date.now();
this.sendTimestamp[moduleId] = Date.now();

if (Object.values(ruleResult).length > 0) {
const prevSendTimestamp = sentAlerts[this.alertname]?.timestamp ?? 0;
const sentAlertsForModule = sentAlerts[this.alertname] != null ? sentAlerts[this.alertname][moduleId] : null;
const prevSendTimestamp = sentAlertsForModule?.timestamp ?? 0;

for (const [operator, operatorResult] of Object.entries(ruleResult)) {
const prevAll = sentAlerts[this.alertname]?.ruleResult[operator]?.all ?? 0;
const prevMissed = sentAlerts[this.alertname]?.ruleResult[operator]?.missed ?? 0;
const prevAll = sentAlertsForModule?.ruleResult[operator]?.all ?? 0;
const prevMissed = sentAlertsForModule?.ruleResult[operator]?.missed ?? 0;
const prevMissedShare = prevAll === 0 ? 0 : prevMissed / prevAll;

// if math relation of missed to all increased
if (operatorResult.missed / operatorResult.all > prevMissedShare && this.sendTimestamp - prevSendTimestamp > defaultInterval)
if ((operatorResult.missed / operatorResult.all > prevMissedShare) && (this.sendTimestamp[moduleId] - prevSendTimestamp > defaultInterval))
return true;
}
}

return false;
}

alertBody(ruleResult: AlertRuleResult): AlertRequestBody {
alertBody(moduleId: string, ruleResult: AlertRuleResult): AlertRequestBody {
const timestampDate = new Date(this.sendTimestamp[moduleId]);
const timestampDatePlusTwoMins = new Date(this.sendTimestamp[moduleId]).setMinutes(timestampDate.getMinutes() + 2);

return {
startsAt: new Date(this.sendTimestamp).toISOString(),
endsAt: new Date(new Date(this.sendTimestamp).setMinutes(new Date(this.sendTimestamp).getMinutes() + 2)).toISOString(),
labels: { alertname: this.alertname, severity: 'critical', ...this.config.get('CRITICAL_ALERTS_ALERTMANAGER_LABELS') },
startsAt: timestampDate.toISOString(),
endsAt: new Date(timestampDatePlusTwoMins).toISOString(),
labels: {
alertname: this.alertname,
severity: 'critical',
nos_module_id: moduleId,
...this.config.get('CRITICAL_ALERTS_ALERTMANAGER_LABELS'),
},
annotations: {
summary: `${Object.values(ruleResult).length} Node Operators with CRITICAL count of missed proposes in the last 12 hours`,
summary: `${
Object.values(ruleResult).length
} Node Operators with CRITICAL count of missed proposes in the last 12 hours in module ${moduleId}`,
description: join(
Object.entries(ruleResult).map(([o, r]) => `${o}: ${r.missed} of ${r.all} proposes`),
'\n',
Expand Down
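
The `startsAt`/`endsAt` pair built in `alertBody` describes a two-minute window starting at the send timestamp. A minimal sketch of the same idea, using a hypothetical helper `alertWindow` and plain millisecond arithmetic in place of `setMinutes` (the result should match outside of local clock changes, since `setMinutes` works in local time):

```typescript
interface AlertWindow {
  startsAt: string;
  endsAt: string;
}

// Builds ISO 8601 timestamps for an alert that lasts durationMinutes from the send time.
function alertWindow(sendTimestampMs: number, durationMinutes: number = 2): AlertWindow {
  const startsAt = new Date(sendTimestampMs);
  const endsAt = new Date(sendTimestampMs + durationMinutes * 60 * 1000);
  return { startsAt: startsAt.toISOString(), endsAt: endsAt.toISOString() };
}
```

Since both alert classes in this diff repeat the same two lines, a shared helper of this kind on the base `Alert` class might reduce the duplication.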