Incident Management

When a service crashes,
HostAtlas is already investigating.

Automatic crash detection with full log collection, system metric snapshots, and timeline reconstruction. The agent detects the problem, gathers the evidence, and creates a complete incident report — before you even open your laptop. From detection to resolution in one platform.

Detection: <30s from crash to alert
Log window: 7 min (5 before + 2 after)
Metrics: snapshot of CPU, RAM, disk at crash
Follow-up: automatic +2 min post-crash logs

Automatic Detection

Five steps. Fully automatic. Zero human intervention.

When the agent detects a monitored service process has disappeared, it kicks off a structured incident response pipeline. Every step happens on the server, in real time, before the data is sent to the platform.


Step 1

Detect

Agent detects service process disappearance within 30 seconds.


Step 2

Collect Logs

Gathers error logs, syslog, dmesg from 5 min before to 2 min after crash.


Step 3

Snapshot Metrics

Captures CPU, RAM, disk metrics at crash time. Highlights values above 90%.


Step 4

Create & Alert

Creates incident with all data correlated. Sends alerts to your notification channels.


Step 5

Follow-up

Collects follow-up logs 2 minutes later to capture restart attempts and recovery.


Step 1

Crash Detection

The HostAtlas agent monitors every discovered service process using operating system process tables. Every 15 seconds, the agent checks whether each tracked process is still running. When a monitored service disappears — whether it crashes, is killed by the OOM killer, or terminates unexpectedly — the agent detects it within a single heartbeat cycle.

Detection is based on process presence, not port checks. This means the agent catches crashes even when the process exits cleanly but unexpectedly, and it is not misled when a crashed service's port still appears open because of a defunct child process or a lingering socket.

# Agent log output
[WARN] Service "mysql" (pid 1842) no longer detected
[ALERT] Initiating incident response pipeline for mysql
[INFO] Crash timestamp: 2026-03-21T02:14:37Z

Step 2

Log Collection

Immediately after detection, the agent collects logs from multiple sources. It gathers service-specific error logs (e.g., MySQL error log, nginx error log), the system syslog, and kernel dmesg output. The collection window spans from 5 minutes before the crash to 2 minutes after, capturing the full context of what happened leading up to and immediately following the failure.

Log paths are automatically discovered during the agent's service detection phase — no manual log path configuration is required. The agent knows where each service writes its logs based on the service type and operating system conventions.

# Log collection summary
[INFO] Collecting logs: -5m to +2m window
[OK]  mysql error.log: 47 lines captured
[OK]  /var/log/syslog: 312 lines (filtered)
[OK]  dmesg: 28 lines (filtered)
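The fixed collection window amounts to a timestamp filter over each log source. A minimal sketch, assuming lines that begin with an ISO-8601 timestamp (the function name and parameters are illustrative):

```python
from datetime import datetime, timedelta

def window_filter(lines, crash_time, before_min=5, after_min=2):
    """Keep lines whose leading ISO-8601 timestamp falls in [crash-5m, crash+2m]."""
    start = crash_time - timedelta(minutes=before_min)
    end = crash_time + timedelta(minutes=after_min)
    kept = []
    for line in lines:
        try:
            ts = datetime.fromisoformat(line.split()[0].replace("Z", "+00:00"))
        except (ValueError, IndexError):
            continue  # skip lines without a parseable timestamp
        if start <= ts <= end:
            kept.append(line)
    return kept
```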

Step 3

Metric Snapshot

The agent captures a point-in-time snapshot of all system metrics at the moment of the crash. This includes CPU utilization (per-core and aggregate), memory usage (used, cached, buffers, swap), disk usage per mount point, network I/O rates, and system load averages.

Any metric value that exceeds 90% of capacity is automatically flagged as a potential contributing factor. This helps you instantly identify whether the crash was caused by resource exhaustion — an OOM kill, a full disk, or CPU saturation.
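The 90% rule is a single threshold pass over the snapshot. A sketch with illustrative names:

```python
THRESHOLD = 90.0  # percent of capacity

def flag_contributors(snapshot: dict[str, float]) -> list[str]:
    """Return metrics flagged as potential contributing factors (above 90% of capacity)."""
    return sorted(name for name, pct in snapshot.items() if pct > THRESHOLD)
```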

Snapshot at crash time

CPU 94.2%
RAM 97.8%
Disk / 41.3%
Disk /var 67.9%

CPU and RAM above 90% threshold — flagged as potential cause


Step 4

Create & Alert

The agent sends the complete incident payload to the HostAtlas platform — crash timestamp, collected logs, metric snapshot, and affected infrastructure references. The platform creates a structured incident record, assigns severity based on the service type and metric thresholds, and triggers notifications through all configured channels.

Notifications include a direct link to the incident detail page where your team can immediately review all collected evidence. The incident is automatically linked to the affected server, service, and any domains that depend on that service.

Slack — #ops-alerts

MySQL crashed on prod-db-01. CPU: 94.2%, RAM: 97.8%. View incident →


Email — oncall@company.com

[Critical] MySQL service down on prod-db-01 — incident INC-2847


PagerDuty — Database Team

High urgency incident triggered. Escalation policy active.


Step 5

Follow-up Collection

Two minutes after the initial crash detection, the agent performs a follow-up collection. It gathers a second round of logs from the same sources to capture what happened after the crash — restart attempts by systemd, watchdog recovery, error messages from dependent services, or continued resource pressure.

The follow-up data is appended to the existing incident as a new timeline entry. If the service has recovered by the follow-up check, the incident is updated to reflect that the service is back online, and a recovery notification is sent.

# Follow-up at +2 minutes
[INFO] Follow-up collection initiated
[OK]  syslog: systemd restarted mysql.service
[OK]  mysql error.log: InnoDB recovery complete
[OK]  Service "mysql" (pid 4201) detected — recovered
[INFO] Incident updated: service recovered. Sending recovery notification.

End to end

Complete Pipeline Timeline

The entire pipeline — from detection to incident creation — completes in under 30 seconds. The follow-up collection adds additional context 2 minutes later. Your team receives a fully formed incident with all the evidence needed to diagnose the root cause, without ever SSHing into the server.

T+0s · Crash detected
T+3s · Logs collected (7-min window)
T+5s · Metrics snapshot captured
T+8s · Incident created & alerts sent
T+2m · Follow-up logs collected

Log Collection

Service-specific log collection. Automatic and comprehensive.

When an incident is triggered, the agent knows exactly which logs to collect based on the service that crashed. Each service type has a tailored collection profile that gathers the most relevant diagnostic data.


MySQL


MySQL Error Log

Primary error log from the MySQL data directory. Captures crash traces, InnoDB errors, replication failures, and startup/shutdown messages.


Syslog

System log entries filtered to mysql-related messages. Captures systemd unit start/stop events and OOM killer activity.


dmesg (kernel)

Kernel ring buffer filtered for OOM kills, segfaults, and hardware errors that may have caused the crash.


Resource Metrics

CPU, RAM, disk, and I/O metrics snapshot at the moment of crash detection.


PostgreSQL


PostgreSQL Logs

Main PostgreSQL server log. Captures FATAL errors, PANIC messages, WAL corruption, connection limit exceeded, and lock deadlocks.


Syslog

System log entries filtered to postgres-related messages. Tracks process manager events and automatic restart attempts.


dmesg (kernel)

Kernel-level events including OOM kills targeting postgres processes and shared memory errors.


Nginx


Nginx Error Log

Main error log capturing upstream connection failures, worker process crashes, configuration errors, and SSL handshake failures.


Syslog

System log entries for nginx master/worker process lifecycle events and signal handling.


dmesg (kernel)

Kernel events related to nginx worker memory limits and network subsystem errors.


Redis


Redis Server Log

Redis server log capturing memory limit warnings, RDB/AOF persistence errors, replication disconnects, and slow log entries near crash time.


Syslog

System log entries for Redis process management, including sentinel failover events and systemd restarts.


dmesg (kernel)

Kernel OOM events targeting Redis, transparent huge page warnings, and overcommit memory issues.


Apache


Apache Error Log

Main error log with module failures, segfaults from child processes, MaxRequestWorkers saturation, and PHP-FPM connection errors.


Syslog

System log entries for Apache process control, graceful restarts, and child process management events.


dmesg (kernel)

Kernel-level events for Apache child process memory pressure and network socket exhaustion.


General / Other


Syslog (filtered)

System log entries filtered to the specific service name. Captures process lifecycle events, restart attempts, and dependency failures.


dmesg (filtered)

Kernel ring buffer filtered for the service process. Catches OOM kills, segmentation faults, and hardware-related terminations.


Resource Metrics

Full system metric snapshot — CPU, RAM, disk, swap, load average, and network I/O at the moment of crash detection.


Automatic log path discovery

Log paths are automatically discovered during the agent's service detection phase. The agent parses service configurations (e.g., my.cnf, nginx.conf, postgresql.conf) to find configured log file locations. No manual log path configuration is required — the agent already knows where to look.
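For an ini-style config such as my.cnf, that discovery step can be approximated with the standard library. The function below is a sketch of the idea, not the agent's actual parser:

```python
import configparser
from typing import Optional

def mysql_error_log_path(my_cnf_text: str) -> Optional[str]:
    """Find the configured error-log path in my.cnf-style text, if any."""
    cp = configparser.ConfigParser(allow_no_value=True, strict=False)
    cp.read_string(my_cnf_text)
    for section in cp.sections():
        for key in ("log-error", "log_error"):  # MySQL accepts both spellings
            if cp.has_option(section, key):
                return cp.get(section, key)
    return None  # fall back to OS-convention defaults
```

When the option is absent, a real agent would fall back to distribution defaults such as the data directory or /var/log.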

Incident Detail

Everything you need to diagnose. On one page.

Every incident gets a dedicated detail page with the crash timeline, collected logs, metric snapshots, affected infrastructure, and team comments. No SSH required. No context switching.

INC-2847 · Critical · Investigating

MySQL service crashed on prod-db-01

Detected Mar 21, 2026 at 02:14:37 UTC · Duration: 4m 22s · Auto-detected

Metrics at Crash Time

CPU: 94.2% · above 90% threshold
RAM: 97.8% · above 90% threshold
Disk /: 41.3% · normal
Load Avg (1m): 12.4 · 4 cores, elevated

Event Timeline

02:09:37 UTC · CPU utilization exceeds 90% on prod-db-01
02:12:15 UTC · RAM utilization exceeds 95%. Swap usage increasing rapidly.
02:14:37 UTC · MySQL service process (pid 1842) no longer detected. Crash confirmed.
02:14:40 UTC · Incident INC-2847 created. Logs collected (387 lines). Metrics snapshot captured.
02:14:45 UTC · Alerts sent: Slack (#ops-alerts), Email (oncall@company.com), PagerDuty (Database Team)
02:15:02 UTC · systemd attempting restart of mysql.service (Restart=on-failure)
02:16:42 UTC · Follow-up collection: MySQL (pid 4201) detected — service recovered. InnoDB recovery complete.

Collected Logs

2026-03-21T02:09:41Z [Note] InnoDB: Buffer pool hit rate 812 / 1000
2026-03-21T02:10:15Z [Note] InnoDB: page_cleaner: 1000ms intended loop took 4523ms
2026-03-21T02:11:02Z [Warning] InnoDB: Over 67% of the buffer pool is occupied by lock heaps
2026-03-21T02:12:18Z [Warning] Aborted connection 4821 to db: 'production' (Got timeout reading communication)
2026-03-21T02:12:44Z [Warning] Aborted connection 4834 to db: 'production' (Got timeout reading communication)
2026-03-21T02:13:01Z [ERROR] InnoDB: Cannot allocate memory for the buffer pool
2026-03-21T02:13:22Z [ERROR] InnoDB: mmap(137363456 bytes) failed; errno 12
2026-03-21T02:14:01Z [ERROR] InnoDB: Unable to lock ./ibdata1 error 11
2026-03-21T02:14:37Z [ERROR] mysqld got signal 9 (SIGKILL)
2026-03-21T02:14:37Z [ERROR] Server shutdown complete (mysqld 8.0.35)
47 lines · 02:09:37 — 02:16:37 UTC

Affected Infrastructure

Server: prod-db-01 · 192.168.1.42 · Ubuntu 22.04
Service: MySQL 8.0.35 · Port 3306 · /var/lib/mysql
Dependent Domains: 3 domains · example.com, api.example.com, +1

Comments & Updates

HostAtlas Auto-generated · 02:14:45 UTC

Incident created automatically. MySQL service (pid 1842) disappeared on prod-db-01. CPU at 94.2%, RAM at 97.8% — both above 90% threshold. OOM killer likely involved. 387 log lines collected across 3 sources.

Jane Doe · 02:22:10 UTC

Looking at the logs — InnoDB buffer pool allocation failed before the OOM kill. Likely the analytics batch job running at 02:00 caused the memory spike. Increasing innodb_buffer_pool_size limit and adding a resource limit for the batch job.

HostAtlas Auto-generated · 02:16:42 UTC

Follow-up: MySQL service (pid 4201) detected. Service recovered via systemd restart. InnoDB recovery completed successfully.


Manual Incidents

Not every incident happens
on your servers.

DNS outages. CDN failures. Third-party API downtime. Cloud provider issues. Some incidents originate outside your infrastructure but still need to be tracked, communicated, and resolved. HostAtlas lets you create manual incidents for any external issue that affects your services.


Create from the dashboard

One-click incident creation with a title, description, severity level, and optional infrastructure links. Use it for DNS propagation issues, CDN purge failures, or upstream provider outages.


Create via the API

Use the REST API to create incidents programmatically from your own monitoring scripts, CI/CD pipelines, or third-party tools. Attach custom metadata and link to affected servers.
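A sketch of what such a call could look like. The endpoint URL, token, and field names here are assumptions for illustration — consult the actual API reference for the real schema:

```python
import json
import urllib.request

API_URL = "https://api.hostatlas.app/v1/incidents"  # hypothetical endpoint
API_TOKEN = "ha_xxxxxxxx"                            # hypothetical token

def build_incident(title, description, severity="warning", links=None):
    """Assemble a manual-incident payload; field names are illustrative."""
    return {
        "title": title,
        "description": description,
        "severity": severity,   # "critical" | "warning" | "info"
        "links": links or [],   # server/service/domain identifiers
    }

def create_incident(payload):
    """POST the payload (performs network I/O; shown for shape only)."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode(),
        headers={"Authorization": f"Bearer {API_TOKEN}",
                 "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```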


Link to infrastructure

Associate manual incidents with specific servers, services, or domains. This creates a complete incident history for each resource, regardless of whether the issue was detected automatically or reported manually.


Status page integration

Manual incidents can be published to your public status page, keeping customers and stakeholders informed about issues that affect service availability.

Example: Create Manual Incident

Title: Cloudflare DNS propagation delay — EU region
Description: Cloudflare reporting DNS resolution delays in EU-WEST-1 and EU-CENTRAL-1 regions. Our domains hosted behind Cloudflare may experience intermittent resolution failures.
Severity: Warning
Status: Investigating
Linked: example.com · api.example.com

Severity & Status

Structured severity levels. Clear status workflow.

Every incident has a severity level and a status that reflects where it is in the resolution lifecycle. Severity determines notification urgency. Status tracks your team's progress from detection to resolution.

Severity Levels

Critical

A core service has crashed or become completely unresponsive. Production traffic is affected. Immediate attention required.

All channels alerted · No cooldown · Escalation triggered
Warning

A service is degraded or a non-critical component has failed. Performance may be impacted but the system is still operational. Investigation recommended.

Primary channels alerted · Cooldown respected
Info

An informational incident that does not require immediate action. Service recovered automatically or the issue is cosmetic. Logged for record-keeping and post-mortem analysis.

Email notification only · Logged for review

Auto-detection severity: When incidents are created automatically, severity is assigned based on the service type and metric thresholds. Core database services default to Critical. Web servers default to Warning. Services with metric values above 90% are escalated to Critical regardless of type.
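That assignment policy reduces to a small decision function. A sketch — the service lists and the fallback to Warning for unlisted services are assumptions:

```python
CORE_DB = {"mysql", "postgresql", "redis"}   # default Critical
WEB = {"nginx", "apache"}                    # default Warning

def assign_severity(service_type: str, metrics: dict[str, float]) -> str:
    """Default severity by service class; escalate when any metric tops 90%."""
    if any(pct > 90.0 for pct in metrics.values()):
        return "critical"    # resource exhaustion overrides the default
    if service_type.lower() in CORE_DB:
        return "critical"
    return "warning"         # web servers and unlisted services (assumed)
```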

Status Workflow

Open

The incident has been created and no one has begun investigation. This is the initial state for all auto-detected incidents. Notifications have been sent but no team member has claimed the incident yet.

Investigating

A team member has acknowledged the incident and is actively investigating. The incident is assigned to a specific user. Status page is updated to show the issue is being looked at.

Identified

The root cause has been identified and a fix is being implemented. Your team knows what went wrong. Status page is updated with the identified cause and expected resolution time.

Monitoring

A fix has been applied and the team is monitoring to confirm the issue is resolved. The service is back online but the incident remains open while stability is confirmed.

Resolved

The incident is closed. The root cause has been addressed, the service is stable, and the team has confirmed recovery. A resolution notification is sent through all channels and the status page is updated.


Auto-recovery: When the follow-up collection detects that a crashed service has recovered, the incident status is automatically updated and a recovery notification is sent. Your team can still investigate the root cause to prevent recurrence.

Alerting Integration

Incidents trigger your entire notification pipeline.

When an incident is created — whether automatically or manually — it flows through the same alerting infrastructure as your metric alerts. Every configured notification channel receives the incident, respecting your routing rules, escalation policies, and quiet hours.

Notification Flow


Incident Created

An auto-detected or manual incident enters the alerting pipeline. The platform evaluates severity, affected resources, and your notification configuration.


Routing & Escalation

The alert is routed based on your team's configuration. Critical incidents bypass cooldown periods and trigger escalation policies. Warning-level incidents respect cooldown timers. Quiet hours can suppress non-critical notifications.
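The routing rules above reduce to a small predicate. A sketch with illustrative names:

```python
from dataclasses import dataclass

@dataclass
class RoutingConfig:
    cooldown_active: bool = False   # a recent alert already fired for this resource
    quiet_hours: bool = False       # suppress non-critical notifications

def should_notify(severity: str, cfg: RoutingConfig) -> bool:
    """Critical bypasses cooldown and quiet hours; lower severities respect both."""
    if severity == "critical":
        return True
    return not (cfg.cooldown_active or cfg.quiet_hours)
```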


Multi-channel Delivery

Notifications are dispatched simultaneously to all configured channels — Slack, email, PagerDuty, and custom webhooks. Each channel receives a formatted message with incident details and a direct link to the incident page.


Resolution & Follow-up

When the incident is resolved — either automatically via service recovery or manually by a team member — a resolution notification is sent through the same channels, closing the loop.

Notification Channels

Slack

Rich message formatting with severity color coding, incident title, affected server and service, metric highlights, and a direct link to the incident page. Messages update in place when incident status changes.

Critical — MySQL crashed on prod-db-01
CPU: 94.2% | RAM: 97.8% | Disk: 41.3%
47 log lines collected across 3 sources
View incident INC-2847 →

Email

Rich HTML emails via Mailgun with the full incident summary — title, severity, affected resources, metric snapshot, timeline excerpt, and a one-click link to the dashboard. Sent to individual team members or distribution lists.


PagerDuty

Native PagerDuty Events API v2 integration. HostAtlas incident severity maps to PagerDuty urgency — Critical becomes High urgency, Warning becomes Low urgency. Auto-resolved incidents in HostAtlas automatically resolve the PagerDuty incident.
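A sketch of the mapping onto a PagerDuty Events API v2 payload. The incident field names are illustrative assumptions; the `routing_key`/`event_action`/`dedup_key`/`payload` structure follows the public Events API v2 format:

```python
def to_pagerduty_event(incident: dict, routing_key: str) -> dict:
    """Map a HostAtlas-style incident to a PagerDuty Events API v2 trigger event."""
    severity = "critical" if incident["severity"] == "critical" else "warning"
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "dedup_key": incident["id"],  # lets a later "resolve" close the same alert
        "payload": {
            "summary": incident["title"],
            "source": incident["server"],
            "severity": severity,     # Events v2 accepts critical|error|warning|info
        },
    }
```

Urgency (High vs. Low) is then derived on the PagerDuty side from the service's urgency rules.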


Outgoing Webhooks

HMAC-SHA256 signed JSON payloads delivered to your endpoints. Subscribe to incident.created, incident.updated, and incident.resolved events. Automatic retry with exponential backoff. Delivery logs available in the dashboard.
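Verifying those signed payloads on the receiving end looks like this in outline — the exact header name and hex encoding are assumptions to confirm against the webhook documentation:

```python
import hashlib
import hmac

def verify_signature(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Recompute the HMAC-SHA256 of the raw request body and compare in constant time."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest avoids timing side channels on the comparison
    return hmac.compare_digest(expected, signature_hex)
```

Always verify against the raw bytes of the request body, before any JSON parsing or re-serialization, or the digest will not match.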

Get started

Stop grepping through logs at 3 AM.

Install the HostAtlas agent and let automatic incident detection do the investigation for you. When a service crashes, you get a complete incident report with logs, metrics, timeline, and affected infrastructure — delivered to your Slack, email, or PagerDuty within seconds. Start with the free plan. No credit card required.

Quick install

$ curl -sSL install.hostatlas.app | bash