MS SQL [Tab]

36 min

0\) big picture on each managed device (windows/linux/k8s node) agent (service/daemon) collectors sql collectors instance inventory, configuration, storage, security, error log, databases/files, jobs, metrics, ag topology, backups, db health service collectors generalized service inventory + os/k8s/cloud status, endpoints, and service level metrics for sql engine/agent/browser and any other product (iis, redis, nginx, kafka, etc ) synthetic checks (optional) tcp/http probes for exposed endpoints scheduler cadence tiers (discovery / snapshots / metrics) local sqlite cache mirrors of selected server tables (trimmed types) lookup catalogs mirror (ref ) with code→id mapping identity map (natural key→uuid), outbox (deltas), collection ledger (when/what ran) sync client batches idempotent ops to server server (postgresql + api) catalog endpoints to deliver ref lookup rows to agents (by code) ingest api (idempotent upserts; resolves lookup codes→ids ) merge/workers to compute rollups, findings, kpis, and alerts security mutual tls or signed bearer; store sql creds in os keystore (dpapi/libsecret), never plaintext 1\) lookup catalogs (how agents use them) source of truth server ref tables (service kind/product/provider/platform/environment, endpoint protocol/type, metric name, health status, service status, start mode) agent boot call get /v1/catalog?since=… to download active ref cache them in sqlite (ref mirrors) index by code and store the server id during collection always emit lookup values as code in payloads (e g , product='mssql', service status='running', metric name='cpupct') server resolves code→id if a new code appears, server auto creates an active ref row (or rejects if locked down, your choice) and returns its id benefits agents don’t break when catalogs evolve; tenant overrides still work (server maps to correct per org row) 2\) schedules & cadences tier interval (default) what runs discovery on start & hourly sql instance discovery, databases list, service discovery (engine/agent/browser/other services), endpoints, ag membership snapshots 15–60 min sql config/storage/security/error log/backup coverage/db health; service status & endpoints metrics 1–5 min sql perf metrics (cpu/ple/bufferhit/batchreq); service metrics (status, cpu, memory, restarts, port listening, latency) heavy/nightly 24h (off peak) fragmentation sampling, extended security audits, long error log scans 3\) data collection (what to run and where it goes) 3 1 service discovery & status (generic) windows (scm/wmi/powershell) enumerate get service 'mssql ','sql ','sqlbrowser','ssrs ','ssis ','ssas ' (+ any extra services you support) details win32 service → name (service name), displayname, pathname (binary path), startmode (start mode), state→ runtime status process map service → pid; get process id for cpu %, workingset mb, handles, threads ports get nettcpconnection or netstat ano; join by pid to list endpoints restarts eventlog (7031/7034/7036) in last 24h → crash count 24h linux (systemd) systemctl show \<unit> activestate, execmainpid, execmainstarttimestamp, fragmentpath /proc/$pid or ps o pcpu,rss for cpu/mem ss ltnup for port listeners; join by pid kubernetes list deployments/statefulsets/pods; compute service runtime status from pod phases & container restarts (crash count 24h); capture ports from container specs; store platform='kubernetes', cluster name, namespace cloud paas (e g , aws rds/azure sql) use cloud apis (describedbinstances/etc ) to map status and metadata; start mode='notapplicable'; no local pid persist → monitored service natural key (agent) host bound device id|service name cloud organization id|provider|region|service name columns taxonomy via lookup ids (resolved server side from code), identity fields, runtime service status id, start mode id, pid, uptime seconds, last start time, crash count 24h, listen endpoints json (quick view), health rollup (see §5), detail json normalize endpoints (optional) → monitored service endpoint for each port upsert (service id, protocol id, port, ip, name, endpoint type id, tls enabled, path) keep the array in listen endpoints too for quick reads; use the table when you need filtering/alerting by port/protocol service metrics → service health metric emit per metric servicestatus (1/0) cpupct, memoryrssmb, threads, handles restarts24h portlistening (per endpoint; store port/protocol in extra) optional synthetic http 2xxrate, latencyms, tlsexpirydays evaluate status via thresholds (see §5) and store status id + threshold snapshot link to sql domain for sql engine/agent/browser, one of set monitored service instance id (if discovered sql instance exists), and/or add a service binding row with bound entity type='sql server instance' & bound entity id=… 3 2 sql instance inventory → sql server instance use t sql (minimal round trips) \ version, edition, collation, hadr flags select serverproperty('servername') as server name, serverproperty('machinename') as machine name, serverproperty('productversion') as product version, serverproperty('productlevel') as product level, serverproperty('edition') as edition, serverproperty('engineedition') as engine edition, serverproperty('collation') as collation server, serverproperty('isclustered') as is clustered, serverproperty('ishadrenabled') as is hadr enabled; uptime select sqlserver start time from sys dm os sys info; cpu/numa/hardware snapshot select cpu count, hyperthread ratio, scheduler count, numa node count from sys dm os sys info; configuration highlights select name, value in use from sys configurations where name in ( 'min server memory (mb)','max server memory (mb)', 'max degree of parallelism','cost threshold for parallelism', 'backup compression default','optimize for ad hoc workloads', 'clr enabled','xp cmdshell','remote admin connections', 'filestream access level','contained database authentication' ); ports windows registry supersocketnetlib\tcp\ipall\tcpport linux /var/opt/mssql/mssql conf or select local tcp port from sys dm exec connections where session id = @@spid; persist upsert (device id, instance name); fill columns; put extras in discovery detail also populate sql instance config snapshot (full sys configurations into config map + default paths), sql instance endpoint (tds/dac/hadr), sql instance storage summary , sql instance security summary , sql instance errorlog summary 3 3 databases & files sql database name/status/owner/size/recovery/collation/tde/created at + latest backup times from msdb backupset (join by db name) sql database file per db file file id, logical/physical, type, size/max, autogrowth settings 3 4 agent jobs sql agent job metadata + last/next run (from msdb sysjobs, sysjobhistory, sysjobschedules), steps/schedule in json (optional) sql agent job run one row per new outcome (step id=0) with start/finish/outcome; keep excerpt of messages 3 5 sql metrics → sql instance metric sample every 1–5 min cpu %, memory %, ple (sec), buffer cache hit ratio %, batch requests/sec evaluate with thresholds (see §5) → status (if you later migrate to lookup based metrics here, send metric name codes that match ref metric name code ) 3 6 ag topology sql availability group, sql availability replica from sys availability groups, sys availability replicas, sys dm hadr views 3 7 backup coverage snapshot sql backup status snapshot per instance rollup of full/diff/log backups in last 24h, max full age, lists of dbs missing coverage 3 8 db health snapshot checkdb recency (parse errorlog or maintenance tables), fragmentation (sampled dm db index physical stats), dbcc sqlperf(logspace) for log %, approximate data free % persist → sql database health snapshot 4\) local sqlite design (agent) pragmas pragma journal mode=wal; pragma synchronous=normal; pragma foreign keys=on; core meta identity map(entity type text, natural key text, id text primary key(entity type,natural key)) outbox(id integer pk autoincrement, entity type text, entity id text, op text check(op in ('upsert','delete')), payload text, created at text) collection ledger(collector text primary key, last started at text, last finished at text, last success at text, last error text) lookup mirrors tables mirroring ref with id text , code text, label, active, plus needed joins (e g , ref metric name with default unit, default comparator) index by code agent always uses code internally mirrors of domain tables subset of columns, json fields as text, booleans as 0/1, timestamps as iso 8601 text in utc add a hash text (sha256 over canonical json sans audit) to each mutable mirror to detect deltas natural keys (agent) monitored service host → device id|service name, cloud → organization id|provider|region|service name monitored service endpoint service id|protocol|port|ip|path service health metric append only (service id|collected at|metric name) sql server instance device id|instance name sql database instance id|name sql database file database id|file id sql agent job instance id|name snapshots & metrics append only (use (parent id, collected at\[, metric name]) uniqueness) delta algorithm build canonical json (sorted keys; drop id/created at/updated at/active) compute hash if new or hash changed → enqueue upsert to outbox if entity disappeared → enqueue delete (or set active=0 locally) retention metrics keep n (e g , 10k) rows snapshots 30 days job runs 30–90 days daily trim task 5\) thresholds, statuses, and health rollups catalog all metric names live in ref metric name(code, default unit, default comparator, …) agent emits metrics using code ; server resolves to id threshold table (server) use your existing sql metric threshold and treat metric name as a code that matches ref metric name code scope json can point to an instance or a service for sql metrics {"instance id" "…"} for service metrics {"service id" "…"} or by product/service kind (you can expand to accept {"product" "mssql"}) evaluation for each sample (sql or service), fetch applicable threshold (most specific wins), apply default comparator unless overridden in row produce status/status id and store threshold snapshot with the exact numbers used service health rollup (for monitored service health status id) rule of thumb (compute at ingest) if service status != running → rollup = critical else if any metric critical in last n minutes → critical else if any metric warning in last n minutes → warning else → good persist rollup on the service row for fast list views 6\) sync protocol (agents → server) batch envelope (example) { "device id" "uuid", "organization id" "uuid", "batch id" "uuid", "sent at" "2025 08 19t07 00 00z", "ops" \[ { "op id" "uuid", "op" "upsert", "entity" "monitored service", "id" "uuid", // agent stable "natural key" "device|mssqlserver", "payload" { "organization id" "uuid", "device id" "uuid", "service name" "mssqlserver", "product code" "mssql", // lookup codes "service kind code" "database", "provider code" "onprem", "platform code" "windows", "environment code" "prod", "service status code" "running", "start mode code" "auto", "pid" 1234, "uptime seconds" 86400, "listen endpoints" \[ {"protocol code" "tcp","ip" " ","port" 1433,"name" "tds"} ], "detail" { "svcaccount" "nt service\\\mssqlserver" } } }, { "op id" "uuid", "op" "upsert", "entity" "service health metric", "id" "uuid", "payload" { "service id" "uuid", "collected at" "2025 08 19t07 00 00z", "metric name code" "cpupct", "value num" 12 4, "unit" "%", "status code" "good", "threshold snapshot" { "comparator" ">", "warning" 80, "critical" 90 } } } ] } server ingest resolve codes→ids using ref caches; auto create unknown codes if allowed enforce natural key uniqueness with on conflict on server tables monitored service (organization id, coalesce(device id,uuid zero), service name) other entities per their unique constraints idempotency ignore duplicate batch id/op id delete ops → soft delete (active=false) and updated at=now() ack & cleanup return accepted op ids\[] + rejected\[] agent deletes accepted from outbox, logs rejected with reason, and retries after exponential backoff 7\) what table stores what (cheat sheet) generic service layer monitored service one row per os/cloud/k8s service (engine, agent, browser, iis, redis, etc ) holds taxonomy (via lookups), runtime service status , start mode, pid/uptime, endpoints (inline json), overall health rollup , and extended details monitored service endpoint normalized per endpoint rows (protocol/port/ip/path/tls) use when filtering, probing, or alerting by endpoint service health metric time series for service level kpis (status, cpu, memory, restarts, port listening, http latency/codes, etc ) service binding polymorphic links to domain objects (e g , sql server instance, rds resource, k8s deployment) sql domain sql server instance instance identity, version, edition, sp configure highlights, connectivity, hw/licensing, ag flags sql instance config snapshot full sys configurations + default paths sql instance storage summary tempdb, drive space summaries, growth policy sql instance security summary sa status, principals, tde, auditing sql instance errorlog summary 24h counts + last error, sample lines sql instance endpoint tds/dac/hadr endpoints (sql aware) sql database & sql database file database catalog and files sql agent job & sql agent job run agent jobs & sampled run history sql instance metric sample sql perf counters (cpu, ple, bufferhit, batch/s) sql availability group & sql availability replica ag topology sql backup status snapshot backup coverage rollup sql database health snapshot checkdb/fragmentation/capacity sql metric threshold thresholds for evaluating samples (both sql & service metrics via scope) (optional) sql instance daily kpi aggregated kpis per day lookup catalogs ref service taxonomy, providers, platforms, environments; endpoint protocol/type; metric names; status/start mode catalogs; (sql specific recovery model/db status/hadr roles if you choose to normalize them later) 8\) mapping rules (agent → server) from raw values to lookup codes service service kind code ‘database’ for sql engine/agent/browser; other products accordingly product code ‘mssql’, ‘redis’, ‘iis’, ‘nginx’, etc provider code ‘onprem’ | ‘aws’ | ‘azure’ | … platform code ‘windows’ | ‘linux’ | ‘kubernetes’ | ‘cloudservice’ environment code customer tagged (‘prod’/’dev’/…) service status code map from os states (running/stopped/…) start mode code map from scm/systemd modes (auto/manual/disabled/notapplicable) endpoint protocol code ‘tcp’ | ‘udp’ | ‘http’ | ‘https’ | ‘grpc’ | … endpoint type code ‘tsql’ | ‘dac’ | ‘hadr’ | ‘http’ | … metrics metric name code ‘cpupct’, ‘memoryrssmb’, ‘portlistening’, ‘http 5xxrate’, ‘servicestatus’, ‘batchrequestspersec’, etc status code ‘good’ | ‘warning’ | ‘critical’ | ‘unknown’ (after evaluation) joining service↔sql prefer setting monitored service instance id for mssql engine/agent; additionally create a service binding if you need many to one semantics (e g , a proxy service that “backs” multiple instances) 9\) error handling, resilience, and guardrails collectors must degrade gracefully (e g , missing permission to read errorlog) and write messages to collection errors on the snapshot table sql connectivity failures set sql server instance connectable=false, still populate service rows (engine/agent status) so you can alert that sql isn’t reachable while os services might be up/down backoff on high cpu or long running queries; enforce per collector time budgets (e g , 30–60s) protect secrets (never send creds in payloads) sign/verify batches (hmac device key) 10\) retention & rollups sqlite metrics 10k rows per entity; snapshots 30 days server raw metrics >90 days → downsample (hourly/daily) job runs & error logs 90–180 days (configurable) snapshots 180–365 days daily kpis 13 months 11\) testing checklist catalog sync unknown codes are created or rejected per policy; agent can continue identity stability instance rename, service rename, ag failover; natural keys still map to same uuid or intentionally create new rows per design delta hashing ensure stable canonicalization (sorted keys; normalized numbers/booleans) to avoid false positives load 1 host with 3 instances, 100+ dbs, 400 files, 200 jobs, metrics at 1 min cadence — outbox size & server ingest throughput ok chaos stop/start engine/agent; crash/restart spikes; block t sql — service layer continues to populate & alert cross product add a non sql service (redis) verify monitored service + endpoints + metrics flow without any sql dependencies 12\) example “first run” flow (sql + services) catalog pull → cache ref by code service discovery create/upsert monitored service rows for mssql engine (service name='mssqlserver', product='mssql', service kind='database', service status='running', start mode='auto', listeners \[tcp 1433]) create rows for sqlagent and sqlbrowser (link instance id if known) sql discovery connect to sql; write sql server instance and supporting snapshots/endpoints link engine service to instance id metrics (minutely) emit sql metrics (sql instance metric sample) and service metrics (service health metric) threshold evaluate → status codes sync flush outbox in batches; server resolves codes→ids, upserts, and computes health rollups nightly run fragmentation sampling + checkdb recency parse; write sql database health snapshot 13\) implementation tips & gotchas normalize what matters keep endpoints in both json (fast ui) and rows (alerting/probing) use views for legacy api if you later swap enum text → lookup ids in sql domain tables avoid duplicate service rows on windows named instances use precise service names (mssql$foo, sqlagent$foo) as the identity dac port often unset; handle null error log via xp readerrorlog may require elevated rights; fallback to filesystem read (windows service account permissions) or extended events kubernetes for multi pod services, you can either track a single logical monitored service for the deployment/statefulset (aggregate metrics), or track per pod services and bind them; pick one approach and stay consistent appendix a — minimal sql for key collectors (reference) instance uptime/version sys dm os sys info, serverproperty config snapshot sys configurations; default paths via serverproperty('instancedefault path') endpoints sys endpoints, sys tcp endpoints (tds/dac/hadr) storage sys dm os volume stats, sys master files, tempdb sys database files security sys server principals, sys server role members, sys sql logins, sys dm server audit status databases sys databases + backup info from msdb backupset files sys database files agent jobs msdb sysjobs, sysjobschedules, sysjobhistory metrics sys dm os performance counters, sys dm os ring buffers (or xe/perf counters as you prefer) ag sys availability groups, sys availability replicas, sys dm hadr log space dbcc sqlperf(logspace)