# Alert Processing Engine
*Functional Specification*

## 1. Overview & Purpose

This document defines the functional requirements for the CommandIT Alert Processing Engine. The engine is the central component responsible for receiving alerts from various sources within the CommandIT platform (including parsed emails from the AI Email Triage Agent, direct RMM agent events, SNMP traps, and API integrations), processing those alerts against defined policies and rules, managing alert state (including frequency, duration, and flapping thresholds), and executing automated actions such as ticket creation, remediation scripts, or notifications, while respecting maintenance windows.

The primary goals are to:

- Standardize alert handling across different monitoring sources.
- Reduce alert noise by suppressing redundant or flapping alerts using stateful processing and thresholds.
- Automate responses and remediation actions based on configurable policies.
- Ensure actions respect maintenance windows and defined policies, with override capabilities.
- Provide consistent logging and context for processed alerts.

## 2. Core Principles

- **Stateful** — processing tracks the state of unique alert conditions over time using the `Alerts` table.
- **Policy driven** — actions, thresholds, and overrides are determined by configurable, hierarchical policies (`AlertProcessingPolicies`).
- **Extensible** — designed to handle diverse alert inputs and trigger various defined actions.
- **Context aware** — considers CI relationships, organizational structure, and maintenance schedules.
- **Integrated** — works closely with CommandIT's PSA (`Tickets`), RMM (`Devices`, `AgentCommandQueue`), automation (`Scripts`), notification (`NotificationProfiles`), and policy modules.
- **Efficient** — the architecture favors asynchronous processing and optimized state management where feasible.

## 3. Inputs & Alert Sources

- **Input format** — the engine expects alert data in a standardized JSON format upon invocation; adherence is mandatory (see Appendix A for the full structure).
- **Input validation** — reject alerts missing mandatory fields (`alertTimestampUtc`, `severity`, `ciIdentifiers.orgId`, `message`); log an error and return `'ProcessingFailed: Invalid Input'`.
- **Potential sources** — AI Email Triage Agent, RMM agent events, SNMP trap listener, API integrations, internal checks. Sources must format data correctly.
- **Allowed condition-logic fields** — the field names usable in `AlertProcessingRules.condition_logic` are derived directly from the standardized input JSON structure (Section 3 / Appendix A). Final validation of the list is required during implementation.

## 4. Core Processing Workflow

- **Receive & validate alert** — ingest the standardized alert data and perform schema validation; log and reject if invalid.
- **Parse alert signature** — generate the unique alert signature string based on the applicable rule definition (Section 5); log the failure and reject if a signature cannot be generated.
- **Policy lookup** — determine the effective `AlertProcessingPolicy` via the hierarchy (Section 6); use the default if none is found.
- **Rule evaluation loop** — iterate through the policy's `AlertProcessingPolicyRules` by priority.
- **Condition match** — evaluate the alert data against `AlertProcessingRules.condition_logic` (Section 6); if no match, continue to the next rule.
- **If a rule matches:**
  - **Check clearing condition** — evaluate the alert against the rule's `clear_condition_logic`; on match, execute the clearing logic (Section 10) and stop processing.
  - **Check flapping state** — evaluate recent state changes against the flapping thresholds (Section 7); if flapping is detected, update `Alerts.status` to `'Suppressed_Flapping'`, log the suppression, and stop processing.
  - **Check maintenance window** — evaluate against applicable `MaintenanceWindows` (Section 8), honoring the rule's `ignore_maintenance_windows` flag; if suppression applies, update `Alerts.status` to `'Suppressed_Maint'`, log the suppression, and stop processing.
  - **Process active alert state & thresholds** — query/update `Alerts` table state (Section 5) and check the frequency/duration thresholds (Section 7).
  - **Execute action (if threshold met & ready)** — if the thresholds are met and `Alerts.action_taken_flag` is false, execute the action (Section 9), then set `Alerts.action_taken_flag = true` and potentially update `Alerts.status`.
  - **Update alert state** — update the `Alerts` record (timestamps, count, final status).
  - **Log & return** — log the processing details (`AuditLog`, `AlertActionsLog`), return the outcome status, break the rule loop, and stop.
- **No matching rule / fallback** — if the loop completes with no match, apply the default behavior (e.g., log the alert and optionally create a low-priority informational ticket), log the outcome, and return the status.

## 5. Alert Signature & State Management

- **Alert signature definition** (`AlertProcessingRules.alert_signature_definition`) — a JSON array defining the source components to concatenate into the alert signature string; see Appendix A for schema details.
- **Signature generation (implementation)** — concatenate the extracted, non-empty strings using a pipe (`|`). Recommended: apply a SHA-256 hash to the final string and store the hex digest in `Alerts.alert_signature` (TEXT/VARCHAR(64)).
- **State tracking** (`Alerts` table) — uses `alert_signature` as the key to track state. Key fields: `status`, `first_occurred_at`, `last_occurred_at`, `occurrence_count`, `last_status_change_timestamp`, `action_taken_flag`, `related_ticket_id`.
- **State update logic** — query `Alerts` for an active match on `alert_signature`; if found, update the timestamps and count; if not found, create a new `Alerts` record. Update `status` based on outcomes, and reset `action_taken_flag` when the status becomes `'Resolved'`.
- **State transitions** (`Alerts.status`) — defined statuses: `'New'`, `'Acknowledged'`, `'TicketCreated'`, `'ActionAttempted'`, `'ActionSucceeded'`, `'ActionFailed'`, `'Suppressed_Maint'`, `'Suppressed_Flapping'`, `'ProcessingError'`, `'Resolved'`, `'Closed'`, `'Unknown'`. Transitions follow the workflow logic (the schema constraint needs updating to include `'ProcessingError'`).
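The recommended pipe-join-and-hash scheme for signature generation can be sketched as follows. This is a minimal illustration: the function name is hypothetical, and extracting the components per the `alert_signature_definition` is assumed to happen upstream.

```python
import hashlib

def generate_alert_signature(components: list[str]) -> str:
    """Concatenate non-empty signature components with '|' and return the
    SHA-256 hex digest, suitable for a VARCHAR(64) alert_signature column."""
    joined = "|".join(c for c in components if c)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()
```

Hashing keeps the stored key a fixed 64 characters regardless of how many (or how long) the rule's signature components are, which simplifies indexing on `Alerts.alert_signature`.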
### 5.1 Alert Record Retention & Purging/Archiving

- **Purpose** — prevent the primary `Alerts` table from growing indefinitely with inactive records, ensuring optimal query performance for state tracking and flapping detection.
- **Target records** — records in the `Alerts` table with a final inactive status (primarily `'Resolved'`, `'Closed'`). Aged records in error states (`'Unknown'`, `'ProcessingError'`) should also be considered.
- **Retention period requirement** — a configurable retention period (a system setting, likely at the MSP/platform level) must define how long inactive alerts remain in the primary `Alerts` table before becoming eligible for purging or archiving. Dependency: this period must be longer than the maximum possible sum of `flapping_threshold_window_seconds` + `flapping_clear_delay_seconds` used in any active policy, plus a safety buffer (e.g., 24 hours).
- **Recommendation (Resolved/Closed)** — a default retention period of 90 days after `resolved_at` or `closed_at`; this balances retaining recent history against managing table size.
- **Recommendation (error states)** — a separate, likely longer, configurable default retention period (e.g., 180 days based on `updated_at`) for purging alerts stuck in `'Unknown'` or `'ProcessingError'` status.
- **Mechanism** — recommended: a scheduled database job managed by the CommandIT platform. Implementation tool: the database's native scheduler (e.g., `pg_cron`) or a platform-level scheduler. Frequency: daily execution during defined off-peak hours.
- **Process (purge vs. archive)** — recommendation: implement purging (`DELETE`) as the default initial strategy due to its lower complexity; implement archiving (to an `Alerts_Archive` table or data warehouse) only if long-term raw alert history is a defined requirement.
- **Purge implementation** — the job executes `DELETE` statements in batches (configurable size, e.g., 10,000) using appropriate `WHERE` clauses based on status and timestamp against the configured retention periods, handling error statuses separately. The job repeats until no more eligible rows are found in a pass or a time limit is reached.
- **Archive implementation (if chosen)** — requires an `Alerts_Archive` table; the job uses a transactional `INSERT ... SELECT; DELETE;` pattern in batches and requires archive lifecycle management.
- **Error handling & monitoring** — the cleanup job must log its own errors, and failures must trigger administrative alerts. Monitor job duration, rows processed, and `Alerts` table size.
- **Configuration** — the retention periods (Resolved/Closed, errors), batch size, and purge/archive choice must be configurable in CommandIT system settings.
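The batched purge described above might look like the following sketch. It uses SQLite purely for self-containment; a production job would run against the platform database (e.g., scheduled via `pg_cron`), and the table and column names assume the `Alerts` schema described in Section 5.

```python
import sqlite3

RETENTION_DAYS = 90   # configurable system setting (Resolved/Closed)
BATCH_SIZE = 10_000   # configurable batch size

def purge_inactive_alerts(conn: sqlite3.Connection) -> int:
    """Delete Resolved/Closed alerts older than the retention period in
    batches, repeating until no eligible rows remain. Returns rows deleted."""
    total = 0
    while True:
        cur = conn.execute(
            """
            DELETE FROM alerts
            WHERE id IN (
                SELECT id FROM alerts
                WHERE status IN ('Resolved', 'Closed')
                  AND resolved_at < datetime('now', ?)
                LIMIT ?
            )
            """,
            (f"-{RETENTION_DAYS} days", BATCH_SIZE),
        )
        conn.commit()
        if cur.rowcount == 0:
            return total
        total += cur.rowcount
```

Deleting through a `LIMIT`-ed subquery keeps each transaction short, which is the point of batching: the job never holds locks on the hot `Alerts` table for long. Error-state rows would be purged by a parallel loop with their own (longer) retention period.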
## 6. Policy & Rule Evaluation

- **Policy hierarchy** — the effective `AlertProcessingPolicy` is determined by lookup: alert endpoint > location > organization > global default, using direct FK links.
- **Rule matching** — incoming alert data is evaluated against `AlertProcessingRules.condition_logic` (see Appendix A).
- **Priority & conflict resolution** — `AlertProcessingPolicyRules.priority` dictates the evaluation order; the first matching rule determines the action and the maintenance-override behavior.

## 7. Threshold & Flapping Evaluation

- **Flapping detection**
  - *Trigger* — before threshold evaluation, for `'New'` alerts.
  - *Input* — reads the policy thresholds (`flapping_threshold_count`, `flapping_threshold_window_seconds`, `flapping_clear_delay_seconds`).
  - *History query strategy* — query the `Alerts` table itself for recent status changes associated with the `alert_signature` within the time window (using `last_status_change_timestamp`); this requires effective indexing and data retention (Section 5.1).
  - *Application logic* — analyze the resulting sequence to count the relevant Resolved→Active transitions (a dedicated history table is a future optimization).
  - *Action* — if the transition count ≥ `flapping_threshold_count`, set `Alerts.status` to `'Suppressed_Flapping'`, log the suppression (`AuditLog`), and stop action processing.
- **Clearing flap check** — performed when the next `'New'` alert arrives for a signature in `'Resolved'` status. Compare `now() - Alerts.resolved_at` against `flapping_clear_delay_seconds`; if the delay is met, process as new; otherwise, keep `'Resolved'` and suppress the incoming alert.
- **Threshold evaluation**
  - *Trigger* — after a rule match and passed flapping/maintenance checks.
  - *Input* — the rule's `action_parameters` (`trigger_after_occurrences`, etc.).
  - *Logic* — compare the parameters against the `Alerts` record state.
  - *Outcome* — determine whether action execution should proceed.
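The transition-counting step of flapping detection can be sketched as below. The data shape (an ordered list of `(timestamp, status)` tuples, as might come back from the history query) and the function name are illustrative assumptions.

```python
from datetime import datetime, timedelta

def is_flapping(status_changes, threshold_count: int, window_seconds: int,
                now: datetime) -> bool:
    """Count Resolved -> New transitions inside the look-back window.
    `status_changes` is an ordered list of (timestamp, status) tuples."""
    window_start = now - timedelta(seconds=window_seconds)
    transitions = 0
    previous = None
    for ts, status in status_changes:
        if previous == "Resolved" and status == "New" and ts >= window_start:
            transitions += 1
        previous = status
    return transitions >= threshold_count
```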
## 8. Maintenance Window Checks

- **Trigger** — before executing actions (`'CreateTicket'`, `'RunScript'`, etc.).
- **Logic** — query active `MaintenanceWindows`; if an active window applies, check the `ignore_maintenance_windows` flag on the matched rule.
- **Action** — if suppression applies, set `Alerts.status` to `'Suppressed_Maint'`, log the suppression (`AuditLog`), and skip the action. If an override applies, log the override to `AuditLog` and proceed to the action.
- **Permissions** — setting `ignore_maintenance_windows = true` requires a specific RBAC permission.

## 9. Action Execution

- **Trigger** — a rule matched, the thresholds are met, the alert is not suppressed, and `Alerts.action_taken_flag` is false.
- **Process** — execute the action using `action_parameters` (see Appendix A), calling CommandIT tools/APIs (see Appendix B); log the attempt and outcome (`AlertActionsLog`); set `Alerts.action_taken_flag = true`; update `Alerts.status`.
- **Supported actions** — `'Ignore'`, `'CreateTicket'`, `'RouteToBoard'`, `'UpdateExistingCI'`, `'RunScript'`, `'SendNotification'`, `'CallWebhook'`.

## 10. Clearing Condition Handling

- **Trigger** — an incoming alert matches a rule's `clear_condition_logic`.
- **Action** — find the active `Alerts` record; update `Alerts.status` to `'Resolved'`, set `resolved_at`, and reset `action_taken_flag`; optionally add a note to the linked ticket; log `'Processed: AlertCleared'`.

## 11. Error Handling & Retry Strategy

- **Logging** — log all internal engine errors comprehensively (tag: `'AlertEngineError'`).
- **Alert state** — update `Alerts.status` to `'ProcessingError'` on an unrecoverable error, if possible.
- **Return status** — return `'ProcessingFailed: {error}'` to the caller.
- **Retry mechanism** — implement limited (max 3), exponential-backoff retries for defined transient errors; log each retry. If the retries fail, log definitively, set the error status, and return failure.
- **Retryable examples** — DB timeout/deadlock, temporary network error to internal APIs, HTTP 503/504/429 from internal APIs, temporary object-storage unavailability.
- **Non-retryable examples** — invalid input schema, signature generation failure, configuration errors (invalid IDs), permanent API errors (4xx except 429), script execution failure (non-zero exit), policy check failure.
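The limited exponential-backoff retry policy could be sketched like this; the retryable exception tuple is a stand-in for the transient-error classes listed above, and the injectable `sleep` parameter exists only to make the sketch testable.

```python
import time

RETRYABLE = (TimeoutError, ConnectionError)  # stand-ins for transient errors

def call_with_retries(fn, max_attempts: int = 3, base_delay: float = 0.5,
                      sleep=time.sleep):
    """Invoke fn(), retrying transient errors with exponential backoff
    (base_delay, 2x, 4x, ...). Non-retryable errors propagate immediately,
    and the last transient error is re-raised once attempts are exhausted."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except RETRYABLE:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```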
- **Administrative alerting strategy**
  - *Primary* — configure a CommandIT `MonitoringRule` checking for engine errors (`AuditLog` or `Alerts.status`). Threshold: e.g., more than 5 errors in 15 minutes (configurable). Action: `SendNotification` to a critical-admin `NotificationProfile` / `DistributionGroup`. Content recommendation: include the engine name, timestamp, error, and signature/rule ID.
  - *Fallback (optional)* — a direct critical-failure push to an external alerting system.
  - *Maintenance* — a quarterly review of engine error logs and alerting thresholds is recommended.

## 12. Logging

- **Alert state** — `Alerts` table lifecycle tracking.
- **Action execution** — `AlertActionsLog` table for initiated actions.
- **Overall processing outcome** — the engine provides a disposition status to the caller (logged in `EmailProcessingLog`).
- **Engine diagnostics** — detailed internal step logging (configurable level).
- **Key events** — maintenance overrides and flapping suppressions are logged to `AuditLog`.

**Audit log format for engine events:**

- *Maintenance override* — `action_type='Maintenance_Override'`, `target_entity_type='Alert'`, `target_entity_id=<alert id/signature>`, `change_details={rule_id, policy_id, window_id, suppressed_action}`.
- *Flapping suppression* — `action_type='Flapping_Suppression'`, `target_entity_type='Alert'`, `target_entity_id=<alert id/signature>`, `change_details={policy_id, threshold_count, threshold_window}`.
- *Rule action executed* — `action_type='Rule_Action_Executed'`, `target_entity_type='Alert'`, `target_entity_id=<alert id>`, `change_details={rule_id, policy_id, action_name, action_parameters, action_log_id}`.
- *Alert status change* — `action_type='Alert_Status_Change'`, `target_entity_type='Alert'`, `target_entity_id=<alert id>`, `change_details={old_status, new_status, reason}`.
- *Engine error* — `action_type='Alert_Engine_Error'`, `target_entity_type='AlertProcessingEngine'`, `target_entity_id=<alert signature/input ref>`, `change_details={error_message, processing_step}`.
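An engine event could be assembled into an `AuditLog` payload as sketched below. The `logged_at` key is an assumed column name, and serializing `change_details` to a JSON string is an illustrative storage choice, not something the spec mandates.

```python
import json
from datetime import datetime, timezone

def build_audit_event(action_type: str, target_entity_type: str,
                      target_entity_id: str, change_details: dict) -> dict:
    """Assemble an AuditLog row payload in the engine-event format above;
    change_details is stored as a deterministic JSON string."""
    return {
        "action_type": action_type,
        "target_entity_type": target_entity_type,
        "target_entity_id": target_entity_id,
        "change_details": json.dumps(change_details, sort_keys=True),
        "logged_at": datetime.now(timezone.utc).isoformat(),
    }
```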
## 13. User Interface Requirements

### 13.1 Alert Rule Definitions UI
List view; add/edit modal with builders for `condition_logic`, `alert_signature_definition`, and `clear_condition_logic`.

### 13.2 Alert Policy Definitions UI
List view; add/edit modal with fields for policy details and an ordered list of linked rules; link/edit rule modal with a rule selector and a dynamic `action_parameters` builder (including thresholds, flapping settings, action fields, and `ignore_maintenance_windows`).

### 13.3 Alert Policy Assignment UI
Integrated into the organization/location/alert-endpoint screens via a dropdown selector.

### 13.4 Monitoring Engine Activity UI
Log view (`EmailProcessingLog`, engine logs / `AuditLog`) with filtering/searching and links to alerts/tickets.

### 13.5 Main Alerts Monitoring Screen
Dashboard/list of active alerts with key columns, filtering, sorting, and actions.

### 13.6 Individual Alert Detail View
Modal/screen showing all `Alerts` fields, context, history, notes, related items, and actions.

### 13.7 UI Builder Functional Requirements
The condition/signature/action builders must be intuitive and dynamic, provide selectors, include validation, and clearly represent the logic.

## 14. Scalability Considerations

- **Asynchronous input** — use a message queue for alert ingestion.
- **Stateless instances (recommended)** — design engine instances to rely on the `Alerts` table for state; address concurrency.
- **Database optimization** — index `Alerts` effectively; partition if needed; purge/archive old Resolved/Closed alerts regularly (per Section 5.1).
- **Asynchronous actions** — offload slow actions (`RunScript`, `CallWebhook`, `SendNotification`) to separate worker queues.

## 15. Deployment Strategy

- **Phased rollout** — the CommandIT platform must support enabling new or modified `AlertProcessingPolicies` or engine versions for specific test organizations/locations before global activation.
- **Versioning** — consider implementing versioning for `AlertProcessingPolicies` and `AlertProcessingRules` to track history and facilitate rollback; changes must be auditable in `AuditLog`.
- **Testing environment** — a dedicated staging environment closely mirroring production must be used for validation.
- **"Dry run" / "report only" mode** — the engine must support a dry-run mode (configurable per policy or globally) in which it logs the actions it would have taken without executing them.
- **Engine service updates** — use standard blue/green or canary releases for the engine service; monitor health and error rates closely post-deployment.
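The dry-run mode described above can be sketched as a thin gate around action dispatch. The function and parameter names here are hypothetical, and the real dispatch to action handlers is elided.

```python
def execute_action(action: str, parameters: dict, *, dry_run: bool,
                   log) -> bool:
    """Gate action execution on a dry-run flag: in dry-run mode the engine
    only records what it would have done. Returns True if executed."""
    if dry_run:
        log(f"[DRY RUN] would execute {action} with {parameters}")
        return False
    log(f"executing {action} with {parameters}")
    # ... dispatch to the real action handler here ...
    return True
```

Keeping the gate at the single dispatch choke point means every supported action honors dry-run uniformly, rather than each handler re-implementing the check.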
## Appendix A: Detailed JSON Schemas for Rules

### A.1 `AlertProcessingRules.condition_logic` Structure

```jsonc
// Top-level object
{
  "match_operator": "AND | OR",   // required
  "criteria": [                   // required array of criterion objects
    {
      "field": "<standardized input field path>",  // required, e.g. "severity", "ciIdentifiers.hostname", "details.parsedParameters.instanceName"
      "operator": "<comparison operator>",         // required, e.g. "equals", "contains", "regex_match", "greater_than"
      "value": "string | number | boolean | array<string>",  // required value for comparison
      "case_sensitive": "boolean (optional, default false)"
    }
    // ... more criterion objects
  ]
}
```

### A.2 `AlertProcessingRules.alert_signature_definition` Structure

```jsonc
[
  // array of signature component objects
  {
    "source": "'static' | 'ciIdentifiers' | 'message' | 'details'",  // required
    "value": "<string literal (if source is 'static')>",
    "field": "<field name within source object (if source != 'static' and not regex)>",
    "regex_capture": "<regex pattern with one capture group (if source is 'message' or 'details')>",
    "required": "boolean (default true)",
    "transform": "'lowercase' | 'uppercase' | 'none' (default 'none')"
  }
  // ... more component objects
]
```

### A.3 `AlertProcessingPolicyRules.action_parameters` Structure

```jsonc
// Base structure includes optional thresholds & clear condition
{
  // Thresholds (optional)
  "trigger_after_occurrences": "<integer >= 1, optional>",         // default 1
  "trigger_occurrence_window_seconds": "<integer >= 0, optional>", // default 0 (no window)
  "trigger_after_duration_seconds": "<integer >= 0, optional>",    // default 0 (immediate)
  "clear_condition_logic": { /* optional, same structure as condition_logic */ },

  // Action-specific parameters (only relevant keys included, based on 'action')

  // If action = 'CreateTicket':
  "ticket_template_id": "<string uuid, required>",
  "priority_id": "<integer, optional>",
  "status_id": "<string uuid, optional>",
  "assigned_user_id": "<string uuid, optional>",
  "target_board_id": "<string uuid, optional>",
```
```jsonc
  // If action = 'RunScript':
  "script_id": "<string uuid, required>",
  "script_parameters": { /* JSON object, optional */ },
  "target_device_context": "'AlertDevice' | 'SpecificDevice' | 'ProbeDevice', optional, default 'AlertDevice'",
  "specific_device_id": "<string uuid, optional>",  // required if target_device_context='SpecificDevice'

  // If action = 'SendNotification':
  "notification_profile_id": "<string uuid, required>",
  "notification_template_id_override": "<string uuid, optional>",

  // If action = 'CallWebhook':
  "webhook_url": "<string, required>",
  "webhook_method": "'POST' | 'GET' | 'PUT', optional, default 'POST'",
  "webhook_payload_template": { /* JSON object/template, optional */ },
  "webhook_headers": { /* JSON object, optional */ },

  // If action = 'RouteToBoard':
  "target_board_id": "<string uuid, required>",

  // If action = 'UpdateExistingCI':
  "ci_identifier_field": "<string, required>",
  "ci_identifier_regex": "<string, optional>",
  "ci_type": "<string, required>",
  "max_ticket_age_days": "<integer, optional, default 30>",
  "update_status_to": "<string uuid, optional>",
  "add_internal_note": "<boolean, optional, default true>"
}
```

Note: the `'Ignore'` action has null or empty `action_parameters` beyond the optional thresholds/clear condition.

## Appendix B: Conceptual API Contracts for Engine Actions

(High-level examples; formal specifications such as OpenAPI/gRPC are needed for implementation.)

- `createTicket(ticketData: dict) -> {ticket_id: int, ticket_number: str}`
- `updateTicket(ticketId: int | ticketNumber: str, updates: dict) -> {success: bool}`
- `executeRmmScript(deviceId: uuid, scriptId: uuid, ...) -> {command_queue_id: int, initial_status: str}` (the engine needs a way to get the final result from the queue)
- `triggerNotification(profileId: uuid, ...) -> {success: bool, ...}`
- `callWebhook(url: str, ...) -> {status_code: int, ..., success: bool}`
- `getAlertState(alertSignature: str) -> dict | None` (queries `Alerts` for the active alert)
- `updateAlertState(alertId: int | alertSignature: str, updates: dict) -> {success: bool}` (updates fields in `Alerts`)
- `createAlertRecord(alertData: dict) -> {alert_id: int}` (inserts a new record into `Alerts`)
- `checkMaintenanceWindow(...) -> {is_active: bool, ...}`
- `logAuditEvent(auditData: dict) -> {success: bool}` (writes to `AuditLog`)
- `queryAlertHistory(alertSignature: str, ...) -> list[dict]` (used for the flapping check)
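To illustrate Appendix A.1, a minimal evaluator for a `condition_logic` object might look like the sketch below. Only a subset of the operators is implemented, and failing closed on unknown operators is an assumption, not a spec requirement.

```python
import re

def get_field(alert: dict, path: str):
    """Resolve a dot-separated field path like 'ciIdentifiers.hostname'."""
    value = alert
    for part in path.split("."):
        if not isinstance(value, dict):
            return None
        value = value.get(part)
    return value

def matches(condition: dict, alert: dict) -> bool:
    """Evaluate an A.1-style condition_logic object against alert data."""
    results = []
    for criterion in condition["criteria"]:
        actual = get_field(alert, criterion["field"])
        expected = criterion["value"]
        op = criterion["operator"]
        ci = not criterion.get("case_sensitive", False)
        # Case-insensitive by default for string comparisons
        if (ci and isinstance(actual, str) and isinstance(expected, str)
                and op in ("equals", "contains")):
            actual, expected = actual.lower(), expected.lower()
        if op == "equals":
            ok = actual == expected
        elif op == "contains":
            ok = isinstance(actual, str) and expected in actual
        elif op == "regex_match":
            flags = re.IGNORECASE if ci else 0
            ok = (isinstance(actual, str)
                  and re.search(expected, actual, flags) is not None)
        elif op == "greater_than":
            ok = actual is not None and actual > expected
        else:
            ok = False  # unknown operator fails closed
        results.append(ok)
    combine = all if condition["match_operator"].upper() == "AND" else any
    return combine(results)
```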