Incident Management
When a service crashes,
HostAtlas is already investigating.
Automatic crash detection with full log collection, system metric snapshots, and timeline reconstruction. The agent detects the problem, gathers the evidence, and creates a complete incident report — before you even open your laptop. From detection to resolution in one platform.
Detection
<30s
from crash to alert
Log window
7 min
5 before + 2 after
Metrics
Snapshot
CPU, RAM, disk at crash
Follow-up
Auto
+2 min post-crash logs
Automatic Detection
Five steps. Fully automatic. Zero human intervention.
When the agent detects that a monitored service's process has disappeared, it kicks off a structured incident-response pipeline. Every step happens on the server, in real time, before the data is sent to the platform.
Step 1
Detect
Agent detects service process disappearance within 30 seconds.
Step 2
Collect Logs
Gathers error logs, syslog, and dmesg from 5 min before to 2 min after the crash.
Step 3
Snapshot Metrics
Captures CPU, RAM, disk metrics at crash time. Highlights values above 90%.
Step 4
Create & Alert
Creates incident with all data correlated. Sends alerts to your notification channels.
Step 5
Follow-up
Collects follow-up logs 2 minutes later to capture restart attempts and recovery.
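The five steps above can be sketched as a single pipeline function. This is an illustrative sketch, not HostAtlas's actual agent code: every name here is hypothetical, and only the documented behavior (the 5-minute/2-minute log window and the 90% highlight threshold) comes from the product description.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

# Windows and threshold taken from the product description above.
LOG_WINDOW_BEFORE = timedelta(minutes=5)
LOG_WINDOW_AFTER = timedelta(minutes=2)
METRIC_THRESHOLD = 90.0  # percent; values above this get highlighted

@dataclass
class Incident:
    service: str
    detected_at: datetime
    logs: list = field(default_factory=list)
    metrics: dict = field(default_factory=dict)
    highlights: list = field(default_factory=list)

def run_pipeline(service, crash_time, read_logs, read_metrics):
    """Steps 2-3 of the pipeline; detection (step 1) already happened."""
    incident = Incident(service=service, detected_at=crash_time)
    # Step 2: collect logs from 5 min before to 2 min after the crash.
    incident.logs = read_logs(crash_time - LOG_WINDOW_BEFORE,
                              crash_time + LOG_WINDOW_AFTER)
    # Step 3: snapshot metrics; flag anything above the 90% threshold.
    incident.metrics = read_metrics()
    incident.highlights = [name for name, pct in incident.metrics.items()
                           if pct > METRIC_THRESHOLD]
    # Steps 4-5 (create & alert, follow-up collection) would follow here.
    return incident
```

The `read_logs` and `read_metrics` callables stand in for the agent's collectors, so the pipeline logic stays testable on its own.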
Log Collection
Service-specific log collection. Automatic and comprehensive.
When an incident is triggered, the agent knows exactly which logs to collect based on the service that crashed. Each service type has a tailored collection profile that gathers the most relevant diagnostic data.
MySQL
MySQL Error Log
Primary error log from the MySQL data directory. Captures crash traces, InnoDB errors, replication failures, and startup/shutdown messages.
Syslog
System log entries filtered to mysql-related messages. Captures systemd unit start/stop events and OOM killer activity.
dmesg (kernel)
Kernel ring buffer filtered for OOM kills, segfaults, and hardware errors that may have caused the crash.
Resource Metrics
CPU, RAM, disk, and I/O metrics snapshot at the moment of crash detection.
PostgreSQL
PostgreSQL Logs
Main PostgreSQL server log. Captures FATAL errors, PANIC messages, WAL corruption, connection-limit errors, and deadlocks.
Syslog
System log entries filtered to postgres-related messages. Tracks process manager events and automatic restart attempts.
dmesg (kernel)
Kernel-level events including OOM kills targeting postgres processes and shared memory errors.
Nginx
Nginx Error Log
Main error log capturing upstream connection failures, worker process crashes, configuration errors, and SSL handshake failures.
Syslog
System log entries for nginx master/worker process lifecycle events and signal handling.
dmesg (kernel)
Kernel events related to nginx worker memory limits and network subsystem errors.
Redis
Redis Server Log
Redis server log capturing memory limit warnings, RDB/AOF persistence errors, replication disconnects, and slow log entries near crash time.
Syslog
System log entries for Redis process management, including sentinel failover events and systemd restarts.
dmesg (kernel)
Kernel OOM events targeting Redis, transparent huge page warnings, and overcommit memory issues.
Apache
Apache Error Log
Main error log with module failures, segfaults from child processes, MaxRequestWorkers saturation, and PHP-FPM connection errors.
Syslog
System log entries for Apache process control, graceful restarts, and child process management events.
dmesg (kernel)
Kernel-level events for Apache child process memory pressure and network socket exhaustion.
General / Other
Syslog (filtered)
System log entries filtered to the specific service name. Captures process lifecycle events, restart attempts, and dependency failures.
dmesg (filtered)
Kernel ring buffer filtered for the service process. Catches OOM kills, segmentation faults, and hardware-related terminations.
Resource Metrics
Full system metric snapshot — CPU, RAM, disk, swap, load average, and network I/O at the moment of crash detection.
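The per-service profiles above amount to a lookup table: given the crashed service, the agent knows which sources to pull. A minimal sketch, with hypothetical source identifiers (the agent's real schema is not public):

```python
# Hypothetical representation of the per-service collection profiles
# described above. Source names are illustrative.
COLLECTION_PROFILES = {
    "mysql":      ["mysql_error_log", "syslog", "dmesg", "metrics"],
    "postgresql": ["postgresql_log", "syslog", "dmesg", "metrics"],
    "nginx":      ["nginx_error_log", "syslog", "dmesg", "metrics"],
    "redis":      ["redis_server_log", "syslog", "dmesg", "metrics"],
    "apache":     ["apache_error_log", "syslog", "dmesg", "metrics"],
}

# Unknown services fall back to the General / Other profile.
DEFAULT_PROFILE = ["syslog_filtered", "dmesg_filtered", "metrics"]

def sources_for(service: str) -> list:
    """Return the log sources to collect for a crashed service."""
    return COLLECTION_PROFILES.get(service.lower(), DEFAULT_PROFILE)
```

A custom daemon without a tailored profile still gets filtered syslog, filtered dmesg, and a metrics snapshot.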
Automatic log path discovery
Log paths are automatically discovered during the agent's service detection phase. The agent parses service configurations (e.g., my.cnf, nginx.conf, postgresql.conf) to find configured log file locations. No manual log path configuration is required — the agent already knows where to look.
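For MySQL, that discovery step might look like the sketch below: parse `my.cnf` and read the error-log setting from the `[mysqld]` section. The option spellings follow MySQL conventions; the fallback path and function name are assumptions, not HostAtlas's actual parser.

```python
import configparser

def discover_mysql_log_path(my_cnf_text,
                            default="/var/log/mysql/error.log"):
    """Find the configured MySQL error log path in my.cnf text.

    The default path is an illustrative assumption for when the
    option is absent.
    """
    parser = configparser.ConfigParser(allow_no_value=True, strict=False)
    parser.read_string(my_cnf_text)
    # MySQL accepts both spellings of the option.
    for key in ("log_error", "log-error"):
        if parser.has_option("mysqld", key):
            return parser.get("mysqld", key)
    return default
```

Equivalent parsers for `nginx.conf` (`error_log` directive) and `postgresql.conf` (`log_directory`/`log_filename`) would follow the same pattern with those formats' own syntaxes.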
Incident Detail
Everything you need to diagnose. On one page.
Every incident gets a dedicated detail page with the crash timeline, collected logs, metric snapshots, affected infrastructure, and team comments. No SSH required. No context switching.
MySQL service crashed on prod-db-01
Detected Mar 21, 2026 at 02:14:37 UTC · Duration: 4m 22s · Auto-detected
Metrics at Crash Time
CPU
94.2%
↑ Above 90% threshold
RAM
97.8%
↑ Above 90% threshold
Disk /
41.3%
Normal
Load Avg (1m)
12.4
4 cores · elevated
Event Timeline
02:09:37 UTC
CPU utilization exceeds 90% on prod-db-01
02:12:15 UTC
RAM utilization exceeds 95%. Swap usage increasing rapidly.
02:14:37 UTC
MySQL service process (pid 1842) no longer detected. Crash confirmed.
02:14:40 UTC
Incident INC-2847 created. Logs collected (387 lines). Metrics snapshot captured.
02:14:45 UTC
Alerts sent: Slack (#ops-alerts), Email (oncall@company.com), PagerDuty (Database Team)
02:15:02 UTC
systemd attempting restart of mysql.service (Restart=on-failure)
02:16:42 UTC
Follow-up collection: MySQL (pid 4201) detected — service recovered. InnoDB recovery complete.
Collected Logs
Affected Infrastructure
Server
prod-db-01
192.168.1.42 · Ubuntu 22.04
Service
MySQL 8.0.35
Port 3306 · /var/lib/mysql
Dependent Domains
3 domains
example.com, api.example.com, +1
Comments & Updates
Incident created automatically. MySQL service (pid 1842) disappeared on prod-db-01. CPU at 94.2%, RAM at 97.8% — both above 90% threshold. OOM killer likely involved. 387 log lines collected across 3 sources.
Looking at the logs — InnoDB buffer pool allocation failed before the OOM kill. Likely the analytics batch job running at 02:00 caused the memory spike. Increasing innodb_buffer_pool_size limit and adding a resource limit for the batch job.
Follow-up: MySQL service (pid 4201) detected. Service recovered via systemd restart. InnoDB recovery completed successfully.
Severity & Status
Structured severity levels. Clear status workflow.
Every incident has a severity level and a status that reflects where it is in the resolution lifecycle. Severity determines notification urgency. Status tracks your team's progress from detection to resolution.
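One way to model this split between urgency and lifecycle is a pair of enums plus a paging rule. The level and status names below are assumptions for illustration; the page doesn't enumerate HostAtlas's actual values.

```python
from enum import Enum

class Severity(Enum):
    # Hypothetical levels; ordering reflects notification urgency.
    CRITICAL = 1   # page on-call immediately
    HIGH = 2
    MEDIUM = 3
    LOW = 4        # notify only

class Status(Enum):
    # Hypothetical lifecycle states from detection to resolution.
    OPEN = "open"                    # set automatically at detection
    INVESTIGATING = "investigating"  # a human has picked it up
    RESOLVED = "resolved"            # service confirmed recovered

def pages_oncall(severity: Severity) -> bool:
    """Only the most urgent severities interrupt someone."""
    return severity in (Severity.CRITICAL, Severity.HIGH)
```

Keeping severity and status orthogonal means a resolved incident retains its original urgency for later review.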
Alerting Integration
Incidents trigger your entire notification pipeline.
When an incident is created — whether automatically or manually — it flows through the same alerting infrastructure as your metric alerts. Every configured notification channel receives the incident, respecting your routing rules, escalation policies, and quiet hours.
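The routing described above, channels filtered by quiet hours unless the incident is urgent enough to override them, can be sketched like this. The channel shape and the quiet-hours model are illustrative assumptions, not HostAtlas's API.

```python
from datetime import datetime, time

def in_quiet_hours(t, start, end):
    """True if t falls in [start, end), handling windows past midnight."""
    if start <= end:
        return start <= t < end
    return t >= start or t < end

def route_incident(severity, channels, now):
    """Return names of channels that should be notified right now."""
    delivered = []
    for ch in channels:
        window = ch.get("quiet_hours")
        quiet = window and in_quiet_hours(now.time(), *window)
        # Critical incidents override quiet hours; everything else waits.
        if quiet and severity != "critical":
            continue
        delivered.append(ch["name"])
    return delivered
```

A high-severity crash at 3 AM would reach channels without quiet hours immediately, while a channel quiet from 22:00 to 07:00 only gets critical incidents during that window.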
Get started
Stop grepping through logs at 3 AM.
Install the HostAtlas agent and let automatic incident detection do the investigation for you. When a service crashes, you get a complete incident report with logs, metrics, timeline, and affected infrastructure — delivered to your Slack, email, or PagerDuty within seconds. Start with the free plan. No credit card required.
Quick install
$ curl -sSL install.hostatlas.app | bash