Observability Guide
Mantle provides three observability mechanisms: Prometheus metrics, an immutable audit trail, and structured JSON logging. This guide covers how to set up each one, what data is available, and how to integrate with external monitoring systems.
Prometheus Metrics
When running in server mode (mantle serve), Mantle exposes a /metrics endpoint in Prometheus exposition format. The endpoint is available on the same address as the API server (default :8080).
Scraping Metrics
Add Mantle to your Prometheus scrape configuration:
# prometheus.yml
scrape_configs:
- job_name: 'mantle'
scrape_interval: 15s
static_configs:
- targets: ['localhost:8080']
Verify the endpoint is working:
curl http://localhost:8080/metrics
Available Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
mantle_workflow_executions_total | Counter | workflow, status | Total number of workflow executions, labeled by workflow name and final status (completed, failed, cancelled). |
mantle_step_executions_total | Counter | workflow, step, status | Total number of step executions, labeled by workflow name, step name, and final status. |
mantle_step_duration_seconds | Histogram | workflow, step, action | Duration of step executions in seconds. Uses default Prometheus histogram buckets. |
mantle_connector_duration_seconds | Histogram | action | Duration of connector invocations in seconds, labeled by action name (e.g., http/request, ai/completion). |
mantle_active_executions | Gauge | — | Number of currently running workflow executions. |
Example PromQL Queries
Workflow execution rate (per minute):
rate(mantle_workflow_executions_total[5m]) * 60
Failure rate for a specific workflow:
rate(mantle_workflow_executions_total{workflow="daily-report", status="failed"}[1h])
/
rate(mantle_workflow_executions_total{workflow="daily-report"}[1h])
P95 step duration for a workflow:
histogram_quantile(0.95, rate(mantle_step_duration_seconds_bucket{workflow="daily-report"}[5m]))
Average connector latency by action type:
rate(mantle_connector_duration_seconds_sum[5m])
/
rate(mantle_connector_duration_seconds_count[5m])
Currently active executions:
mantle_active_executions
Slowest connectors (P99):
histogram_quantile(0.99, sum by (action, le) (rate(mantle_connector_duration_seconds_bucket[5m])))
Grafana Dashboard
A useful Grafana dashboard for Mantle includes these panels:
- Execution rate —
rate(mantle_workflow_executions_total[5m])stacked by workflow - Failure rate —
rate(mantle_workflow_executions_total{status="failed"}[5m])with alert threshold - Active executions —
mantle_active_executionsas a stat panel - Step duration heatmap —
mantle_step_duration_seconds_bucketas a heatmap - Connector latency —
mantle_connector_duration_secondsby action as a time series
Alerting Examples
Alert: workflow failure rate exceeds 10%:
# Prometheus alerting rule
groups:
- name: mantle
rules:
- alert: MantleHighFailureRate
expr: |
rate(mantle_workflow_executions_total{status="failed"}[15m])
/
rate(mantle_workflow_executions_total[15m])
> 0.1
for: 5m
labels:
severity: warning
annotations:
summary: "Workflow {{ $labels.workflow }} failure rate is {{ $value | humanizePercentage }}"
Alert: AI connector latency exceeds 30 seconds:
- alert: MantleSlowAIConnector
expr: |
histogram_quantile(0.95, rate(mantle_connector_duration_seconds_bucket{action="ai/completion"}[5m])) > 30
for: 10m
labels:
severity: warning
annotations:
summary: "AI connector P95 latency is {{ $value }}s"
Audit Trail
Every state-changing operation in Mantle emits an audit event to the audit_events table in Postgres. The table is append-only — events cannot be updated or deleted. This provides a tamper-resistant record of all activity.
What Gets Audited
| Action | Trigger | Description |
|---|---|---|
workflow.applied | mantle apply | A new workflow version was stored in the database. |
workflow.executed | mantle run, cron, webhook, API | A workflow execution started. |
step.started | Engine | A step began executing. |
step.completed | Engine | A step finished successfully. |
step.failed | Engine | A step failed (error is recorded). |
step.skipped | Engine | A step was skipped because its if condition evaluated to false. |
execution.cancelled | mantle cancel, API | An execution was cancelled. |
Audit Event Structure
Each audit event contains:
| Field | Description |
|---|---|
id | Unique event identifier. |
timestamp | When the event occurred (UTC). |
actor | Who performed the action: cli, engine, scheduler, or a user ID. |
action | The action type (e.g., workflow.applied). |
resource_type | The type of resource affected (e.g., workflow_definition, workflow_execution, step_execution). |
resource_id | The identifier of the affected resource. |
before_state | Optional JSON: the state of the resource before the change. |
after_state | Optional JSON: the state of the resource after the change. |
metadata | Optional key-value pairs with additional context. |
Querying Audit Events
Use the mantle audit command to query audit events from the command line:
# All recent events (default: last 50)
mantle audit
# Filter by action type
mantle audit --action workflow.applied
# Filter by actor
mantle audit --actor cli
# Filter by resource
mantle audit --resource workflow_definition/hello-world
# Filter by time range
mantle audit --since 24h
# Combine filters
mantle audit --action step.failed --since 7d --limit 100
Output format:
2026-03-18T14:30:00Z cli workflow.applied workflow_definition/hello-world
2026-03-18T14:30:01Z engine workflow.executed workflow_execution/a1b2c3d4
2026-03-18T14:30:01Z engine step.started step_execution/fetch
2026-03-18T14:30:02Z engine step.completed step_execution/fetch
Querying Audit Events via SQL
For advanced queries, you can query the audit_events table directly:
-- Find all failed steps in the last 24 hours
SELECT timestamp, actor, action, resource_type, resource_id
FROM audit_events
WHERE action = 'step.failed'
AND timestamp >= NOW() - INTERVAL '24 hours'
ORDER BY timestamp DESC;
-- Count events by action type
SELECT action, COUNT(*) as count
FROM audit_events
WHERE timestamp >= NOW() - INTERVAL '7 days'
GROUP BY action
ORDER BY count DESC;
Audit Retention
Mantle does not automatically delete old audit events. In a long-running production deployment, you may want to set up a retention policy. Example using a Postgres cron job (via pg_cron):
-- Delete audit events older than 90 days
DELETE FROM audit_events WHERE timestamp < NOW() - INTERVAL '90 days';
Run this as a scheduled task outside of Mantle.
Structured JSON Logging
In server mode, Mantle emits structured JSON logs to stdout via Go’s slog package. Each log line is a self-contained JSON object.
Log Format
{"time":"2026-03-18T14:30:00.000Z","level":"INFO","msg":"server listening","address":":8080"}
{"time":"2026-03-18T14:30:01.000Z","level":"INFO","msg":"cron scheduler started"}
{"time":"2026-03-18T14:30:02.000Z","level":"INFO","msg":"workflow execution started","workflow":"hello-world","execution_id":"abc123"}
{"time":"2026-03-18T14:30:03.000Z","level":"ERROR","msg":"step failed","workflow":"hello-world","step":"fetch","error":"connection refused"}
Log Levels
| Level | Description |
|---|---|
debug | Detailed diagnostic information. Includes connector request/response details and CEL expression evaluation. |
info | Normal operational events. Server startup, workflow executions, cron triggers. |
warn | Abnormal situations that do not prevent operation. Shutdown timeouts, deprecated features. |
error | Errors that caused a step or operation to fail. |
Configure the level with:
# CLI flag
mantle serve --log-level debug
# Environment variable
export MANTLE_LOG_LEVEL=warn
# Config file
# mantle.yaml
log:
level: info
Integration with Log Aggregation
The JSON log format integrates directly with standard log aggregation systems.
Datadog:
Datadog’s agent automatically parses JSON logs from stdout. If running Mantle in a container, configure the Datadog agent to collect container logs:
# datadog.yaml (agent config)
logs:
- type: docker
service: mantle
source: go
ELK Stack (Elasticsearch, Logstash, Kibana):
Configure Filebeat to read Mantle’s stdout (typically via container logs or a file redirect):
# filebeat.yml
filebeat.inputs:
- type: container
paths:
- /var/log/containers/mantle-*.log
json.keys_under_root: true
json.add_error_key: true
Grafana Loki:
Use Promtail to ship logs from Mantle containers to Loki:
# promtail.yaml
scrape_configs:
- job_name: mantle
static_configs:
- targets: [localhost]
labels:
job: mantle
__path__: /var/log/mantle/*.log
pipeline_stages:
- json:
expressions:
level: level
msg: msg
- labels:
level:
CloudWatch Logs:
When running on ECS or EKS, stdout logs are automatically captured by the awslogs log driver. No additional configuration is needed.
Correlating Logs with Metrics and Audit Events
All three observability signals share common identifiers:
| Identifier | In Logs | In Metrics | In Audit Events |
|---|---|---|---|
| Workflow name | workflow field | workflow label | resource_id (for workflow resources) |
| Execution ID | execution_id field | — | resource_id (for execution resources) |
| Step name | step field | step label | resource_id (for step resources) |
To trace a specific execution:
- Find the execution in
mantle logs <execution-id>orGET /api/v1/executions/{id} - Search structured logs for
execution_idto see detailed step-level events - Query audit events with
mantle audit --resource workflow_execution/<id>for the audit trail - Check Prometheus metrics for timing data with the
workflowlabel
REST API for Execution Queries
In addition to the CLI and Prometheus metrics, you can query execution data through the REST API. This is useful for building custom dashboards or integrating with external monitoring systems.
# List recent executions
curl -s "http://localhost:8080/api/v1/executions?limit=10" | jq .
# Get details for a specific execution
curl -s "http://localhost:8080/api/v1/executions/abc123-def456" | jq .
# Filter by workflow and status
curl -s "http://localhost:8080/api/v1/executions?workflow=daily-report&status=failed&since=24h" | jq .
See the CLI Reference for the full API specification.
Further Reading
- CLI Reference —
mantle auditcommand and flags - Server Commands — REST API endpoint documentation
- Server Guide — running
mantle servewith health probes and metrics - Concepts — architectural overview of the observability stack
- Configuration — log level and server address configuration