# Alerts

> Monitor your clusters and resources with customizable alert rules

<Note>
  Ankra Alerts help you stay informed about your cluster health, resource issues, and operational events with configurable notifications.
</Note>

## What are Alerts?

Alerts in Ankra let you define rules that automatically monitor your infrastructure and notify you when specific conditions are met. You can:

* **Monitor Cluster Health:** Get notified when clusters go offline or agents disconnect
* **Track Resource Status:** Watch for issues with GitOps repositories, add-ons, manifests, and stacks
* **Configure Conditions:** Set up multiple conditions with AND/OR logic for precise alerting
* **Receive Notifications:** Send alerts to any webhook-enabled service (Slack, Teams, PagerDuty, etc.)

### How Alerts Work

<Frame>
  ```mermaid theme={null}
  flowchart LR
      A[Resource State Change] --> B[Condition Evaluation]
      B --> C{Conditions Met?}
      C -->|Yes| D[Alert Triggered]
      C -->|No| E[No Action]
      D --> F[Webhook Notifications]
      D --> G[AI Incident Created]
      F --> H[Slack / Teams / PagerDuty]
  ```
</Frame>
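
The flow above can be sketched in a few lines of Python. This is only an illustration of the AND/OR evaluation step, not Ankra's implementation: the `conditions_met` and `handle_state_change` functions, the state dictionary keys, and the action names are all hypothetical.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical type: a condition is any function that inspects a resource state.
Condition = Callable[[Dict[str, str]], bool]


def conditions_met(state: Dict[str, str], conditions: Iterable[Condition], logic: str = "AND") -> bool:
    """Combine individual condition results with AND or OR logic."""
    results = [condition(state) for condition in conditions]
    if not results:
        return False  # assumption: an alert with no conditions never fires
    if logic == "AND":
        return all(results)
    if logic == "OR":
        return any(results)
    raise ValueError(f"unknown logic: {logic!r}")


def handle_state_change(state: Dict[str, str], conditions: List[Condition], logic: str = "AND") -> List[str]:
    """Return the (hypothetical) actions taken for one resource state change."""
    if not conditions_met(state, conditions, logic):
        return []  # "No Action" branch of the diagram
    return ["send_webhook_notifications", "create_ai_incident"]
```

With a cluster that is offline while its agent is still online, an AND rule over both checks stays quiet, and an OR rule fires.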

***

## Automatic AI Analysis

Every time an alert triggers, Ankra automatically starts an AI-powered analysis to help you understand what went wrong.

<Frame>
  ```mermaid theme={null}
  flowchart LR
      A[Alert Triggers] --> B[AI Analysis Resource Created]
      B --> C[Analysis Job Scheduled]
      C --> D[Data Collection]
      D --> E[AI Analysis]
      E --> F[AI Incident Ready]
  ```
</Frame>

### What Happens Behind the Scenes

1. **AI Analysis Resource** - When conditions are met, Ankra creates an AI Analysis resource to track the investigation
2. **Analysis Job** - A background job is scheduled to collect and analyze data
3. **Data Collection** - The job gathers:
   * Pod status and container states
   * Kubernetes events (warnings, errors)
   * Container logs (recent output)
   * Job results and error messages
4. **AI Analysis** - Ankra's AI processes the data to identify:
   * Root cause of the issue
   * Severity assessment
   * Affected resources
   * Recommended actions
5. **AI Incident** - Results are saved as an AI Incident for review

<Note>
  View all AI Incidents in the **Alerts → AI Incidents** tab. Each incident includes the full analysis, affected resources, and an interactive checklist of recommended actions.
</Note>

***

## Alerts Dashboard

The Alerts page displays key metrics at a glance:

| Metric             | Description                                        |
| ------------------ | -------------------------------------------------- |
| **Total Alerts**   | Number of configured alerts                        |
| **Active**         | Alerts currently enabled and monitoring            |
| **Rules**          | Total number of individual rules across all alerts |
| **Triggers (24h)** | How many times alerts fired in the last 24 hours   |

You can filter alerts by status (Active/Disabled) using the quick filter pills on the dashboard.

***

## Alert Structure

An alert consists of rules, conditions, and webhook integrations:

<Frame>
  ```mermaid theme={null}
  flowchart TD
      subgraph alert [Alert]
          A[Name / Description / Enabled]
      end

      subgraph rules [Rules]
          R1[Resource Type / Clusters / Severity / Cooldown]
      end

      subgraph conditions [Conditions]
          C1[Condition 1]
          C2[Condition 2]
      end
      subgraph webhooks [Webhooks]
          W1[Slack]
          W2[PagerDuty]
      end
      alert --> rules
      rules --> conditions
      C1 -->|AND / OR| C2
      alert --> webhooks
  ```
</Frame>
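
One way to picture this hierarchy is as nested records. The sketch below mirrors the diagram with Python dataclasses. The class and field names are assumptions chosen for illustration and do not describe Ankra's API or storage schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Condition:
    kind: str   # hypothetical identifier, e.g. "resource_state" or "job_status"
    value: str  # e.g. "down" or "failed"


@dataclass
class Rule:
    resource_type: str     # e.g. "cluster" or "addon"
    clusters: List[str]    # cluster names; ["*"] stands in for "All Clusters" here
    severity: str          # "critical", "warning", or "info"
    cooldown_minutes: int
    conditions: List[Condition] = field(default_factory=list)
    logic: str = "AND"     # how the conditions are combined: "AND" or "OR"


@dataclass
class Alert:
    name: str
    description: str = ""
    enabled: bool = True
    rules: List[Rule] = field(default_factory=list)
    webhooks: List[str] = field(default_factory=list)  # webhook names, attached to the alert


# Example: one alert, one rule, two conditions joined with OR, two webhooks.
alert = Alert(
    name="Production Resources Down",
    rules=[
        Rule(
            resource_type="addon",
            clusters=["*"],
            severity="critical",
            cooldown_minutes=15,
            conditions=[Condition("resource_state", "down"), Condition("job_status", "failed")],
            logic="OR",
        )
    ],
    webhooks=["Slack", "PagerDuty"],
)
```

Note that webhooks hang off the alert itself, while severity, cooldown, and cluster scope belong to each rule, matching the arrows in the diagram.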

***

## Creating Alert Rules

<Steps>
  <Step title="Name & Details">
    Configure the basic alert settings:

    * **Alert Name:** A descriptive name (e.g., "Production Resources Down")
    * **Severity:** Choose Critical, Warning, or Info
    * **Cooldown:** Time in minutes before the alert can trigger again (prevents alert storms)
    * **Clusters:** Select "All Clusters" or choose specific clusters to monitor
  </Step>

  <Step title="What to Monitor">
    Select the resource type to monitor:

    <CardGroup cols={2}>
      <Card title="Cluster" icon="server">
        Monitor overall cluster health including:

        * Cluster connectivity state
        * Agent status and availability
      </Card>

      <Card title="Cluster Resource" icon="cubes">
        Monitor specific resource types:

        * **GitOps** - GitHub repository sync status
        * **Addon** - Add-on deployment health
        * **Manifest** - Raw manifest deployments
        * **Stack** - Stack deployment status
      </Card>
    </CardGroup>
  </Step>

  <Step title="When to Alert">
    Define one or more conditions that trigger the alert. Multiple conditions can be combined using **AND** or **OR** logic.

    **For Cluster monitoring:**

    * **Cluster State:** Alert when a cluster goes offline or comes back online
    * **Agent Status:** Alert on agent offline, online, upgrade available, or upgrading

    **For Cluster Resource monitoring:**

    * **Resource State:** Alert on resource states (all up, up, creating, updating, stopping, down)
    * **Job Status:** Alert on job outcomes (failed, timeout, blocked, etc.)
    * **Stuck Duration:** Alert when resources remain in a state too long
    * **Failed Job Count:** Alert when failed jobs exceed a threshold
  </Step>

  <Step title="Notifications">
    Select which webhooks should receive notifications when this alert triggers. You can configure webhooks in the [Webhooks](/guides/webhooks) section.
  </Step>
</Steps>
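
The cooldown setting from the first step can be thought of as a simple time check before each trigger. The following is a minimal sketch under assumptions: the `can_trigger` function is hypothetical, and whether Ankra treats the exact cooldown boundary as elapsed is not documented here (this sketch treats it as elapsed).

```python
from datetime import datetime, timedelta
from typing import Optional


def can_trigger(last_triggered: Optional[datetime], cooldown_minutes: int, now: datetime) -> bool:
    """Return True if the alert is outside its cooldown window."""
    if last_triggered is None:
        return True  # never triggered before, so no cooldown applies
    # Assumption: the alert may fire again once the full cooldown has passed.
    return now - last_triggered >= timedelta(minutes=cooldown_minutes)
```

For example, with a 15-minute cooldown, a condition that matches again 10 minutes after the last trigger would be suppressed, which is how a cooldown limits alert storms from a flapping resource.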

***

## Condition Types Reference

### Cluster Conditions

Use these when monitoring the **Cluster** resource type:

| Condition         | Description                  | Values                                                |
| ----------------- | ---------------------------- | ----------------------------------------------------- |
| **Cluster State** | Monitor cluster connectivity | `offline`, `online`                                   |
| **Agent Status**  | Monitor the cluster agent    | `offline`, `online`, `upgrade_available`, `upgrading` |

### Cluster Resource Conditions

Use these when monitoring **Cluster Resource** types (GitOps, Addon, Manifest, Stack):

| Condition            | Description                   | Values/Options                                                                             |
| -------------------- | ----------------------------- | ------------------------------------------------------------------------------------------ |
| **Resource State**   | Monitor resource health state | `all_up`, `up`, `creating`, `updating`, `stopping`, `down`                                 |
| **Job Status**       | Monitor job execution status  | `blocked`, `failed`, `pending`, `running`, `success`, `timeout`, `cancelling`, `cancelled` |
| **Stuck Duration**   | Alert when stuck in a state   | Duration in minutes (e.g., 30)                                                             |
| **Failed Job Count** | Alert on failure threshold    | Number of failed jobs (e.g., 5)                                                            |

### Condition Operators

For numeric conditions (Stuck Duration, Failed Job Count), you can use these operators:

| Operator | Description           |
| -------- | --------------------- |
| `eq`     | Equals                |
| `neq`    | Not equals            |
| `gt`     | Greater than          |
| `gte`    | Greater than or equal |
| `lt`     | Less than             |
| `lte`    | Less than or equal    |
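
To make the operator semantics concrete, the sketch below maps each operator name from the table to the matching function in Python's standard `operator` module. The `compare` helper and its argument names are hypothetical. Only the operator names come from the table above.

```python
import operator

# Operator names from the table, mapped to standard comparison functions.
OPERATORS = {
    "eq": operator.eq,
    "neq": operator.ne,
    "gt": operator.gt,
    "gte": operator.ge,
    "lt": operator.lt,
    "lte": operator.le,
}


def compare(observed: float, op: str, threshold: float) -> bool:
    """Evaluate `observed <op> threshold`, e.g. stuck minutes against a limit."""
    try:
        fn = OPERATORS[op]
    except KeyError:
        raise ValueError(f"unknown operator: {op!r}") from None
    return fn(observed, threshold)
```

A Stuck Duration condition of `gt` 30 would match a resource stuck for 45 minutes but not one stuck for exactly 30. Use `gte` if the boundary value should also match.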

***

## Alert Severity Levels

| Level        | Description               | Use Case                                |
| ------------ | ------------------------- | --------------------------------------- |
| **Critical** | Immediate action required | Production outages, service unavailable |
| **Warning**  | Attention needed soon     | Resource pressure, degraded performance |
| **Info**     | Informational updates     | Successful deployments, routine events  |

***

## Viewing Alert History

Each alert tracks its trigger history. From the alert detail page, you can:

* **View Statistics:** Total triggers, last check time, last triggered time
* **Review History:** See when alerts triggered and what conditions matched
* **Analyze Patterns:** Identify recurring issues across your infrastructure

Navigate to **Alerts → \[Alert Name] → View History** to access the full timeline.

***

## Related

* [AI Incidents](/platform/ai-incidents) - AI-powered root cause analysis for triggered alerts
* [Webhooks](/guides/webhooks) - Configure notification endpoints for alerts

***

Still have questions? [Join our Slack community](https://join.slack.com/t/ankra-community/shared_invite/zt-3a5rem8f8-cUho4epX2MoLT83bFf~VSA) and we'll help out.
