Error Monitoring & Alerting
Build workflows that triage errors, process alerts, and dispatch notifications across channels.
This guide covers building workflows whose purpose is to monitor external systems, classify errors, and route alerts. For handling errors that occur inside your own workflows, see Errors & Retrying.
Error monitoring is one of the most common workflow use cases. A typical pipeline receives error events from external systems, classifies them, deduplicates repeat occurrences, and dispatches notifications to the right channels. Workflows are a natural fit because they survive failures, retry flaky notification APIs, and maintain state across long-running monitoring loops.
Error Triage Workflow
The simplest error monitoring workflow receives an error event via webhook, classifies it by severity, and routes it to the appropriate handler.
import { createWebhook } from "workflow";
interface ErrorEvent {
source: string;
message: string;
stack?: string;
metadata?: Record<string, unknown>;
}
async function classifyError(event: ErrorEvent) {
"use step";
// Classify based on error patterns
if (event.message.includes("FATAL") || event.message.includes("OOM")) {
return "critical" as const;
}
if (event.message.includes("timeout") || event.message.includes("rate limit")) {
return "warning" as const;
}
return "info" as const;
}
async function handleCritical(event: ErrorEvent) {
"use step";
// Page on-call, create incident ticket, etc.
console.log(`CRITICAL: ${event.source} - ${event.message}`);
}
async function handleWarning(event: ErrorEvent) {
"use step";
// Post to team Slack channel
console.log(`WARNING: ${event.source} - ${event.message}`);
}
async function handleInfo(event: ErrorEvent) {
"use step";
// Log for later review
console.log(`INFO: ${event.source} - ${event.message}`);
}
export async function errorTriageWorkflow() {
"use workflow";
const webhook = createWebhook();
console.log("Listening for errors at:", webhook.url);
for await (const request of webhook) {
const event: ErrorEvent = await request.json();
const severity = await classifyError(event);
if (severity === "critical") {
await handleCritical(event);
} else if (severity === "warning") {
await handleWarning(event);
} else {
await handleInfo(event);
}
}
}

The workflow creates a persistent webhook endpoint. External systems POST error events to it. Each event is classified in a step (with full Node.js access for pattern matching, database lookups, or ML inference), then routed to the correct handler. Because the workflow consumes the webhook with for await...of, it stays alive and processes errors as they arrive.
Webhooks implement AsyncIterable, so a single workflow instance can process an unlimited stream of events over time. See Hooks & Webhooks for details on iteration and custom tokens.
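For reference, a producer delivers an event with a plain HTTP POST to the URL the workflow logs. A minimal sketch (the URL below is a placeholder; the event fields match the ErrorEvent interface above):

// Hypothetical producer: POST an ErrorEvent to the URL printed by
// errorTriageWorkflow(). Replace the placeholder URL with the real one.
const webhookUrl = "https://example.com/webhooks/wh_abc123"; // placeholder

await fetch(webhookUrl, {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    source: "payments-service",
    message: "FATAL: out of memory while processing batch",
    metadata: { region: "us-east-1" },
  }),
});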
Alert Processing Pipeline
Real alert pipelines need deduplication. When the same error fires hundreds of times in a minute, you want one alert, not hundreds. Use custom hook tokens to route duplicate events to the same workflow instance.
import { createHook, getStepMetadata } from "workflow";
interface Alert {
alertId: string;
source: string;
message: string;
timestamp: number;
}
interface EnrichedAlert extends Alert {
service: string;
owner: string;
runbook: string;
}
async function enrichAlert(alert: Alert): Promise<EnrichedAlert> {
"use step";
// Look up service metadata from your registry
const service = alert.source.split("/")[0];
return {
...alert,
service,
owner: `team-${service}`,
runbook: `https://runbooks.internal/${service}/${alert.alertId}`,
};
}
async function dispatchNotification(alert: EnrichedAlert) {
"use step";
const { stepId } = getStepMetadata();
await fetch("https://hooks.slack.com/services/...", {
method: "POST",
headers: { "Idempotency-Key": stepId },
body: JSON.stringify({
text: `[${alert.source}] ${alert.message}\nOwner: ${alert.owner}\nRunbook: ${alert.runbook}`,
}),
});
}
export async function alertPipelineWorkflow(alertId: string) {
"use workflow";
// Custom token ensures duplicate alerts route here
const hook = createHook<Alert>({ token: `alert:${alertId}` });
// Process the first alert
const alert = await hook;
const enriched = await enrichAlert(alert);
await dispatchNotification(enriched);
}

The key pattern here is the custom hook token. When your ingestion layer receives an alert, it uses resumeHook() to deliver the payload to the waiting workflow. The hook token alert:${alertId} is deterministic, so the sender does not need to know the workflow's run ID.
import { start, resumeHook } from "workflow/api";
declare function alertPipelineWorkflow(alertId: string): Promise<void>; // @setup
export async function POST(request: Request) {
const alert = await request.json();
const token = `alert:${alert.alertId}`;
// Start the workflow first, then deliver the alert data
await start(alertPipelineWorkflow, [alert.alertId]);
// The hook may not be registered yet — retry until it is
let delivered = false;
for (let i = 0; i < 5 && !delivered; i++) {
try {
await resumeHook(token, alert);
delivered = true;
} catch {
await new Promise((r) => setTimeout(r, 500 * (i + 1)));
}
}
return Response.json({ delivered });
}

The workflow needs time to register the hook after start() returns. The retry loop above handles this race. For high-throughput scenarios, consider using createWebhook() instead, which provides an HTTP endpoint that handles delivery automatically.
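To make deduplication real, the ingestion layer should start a workflow only for the first occurrence of an alertId and simply resume the existing hook for repeats. A minimal sketch, assuming a hypothetical in-memory Set (use a shared store such as Redis in production) and reusing the Alert interface from above:

import { start, resumeHook } from "workflow/api";
import type { Alert } from "./types"; // the Alert interface defined above (hypothetical path)

declare function alertPipelineWorkflow(alertId: string): Promise<void>; // @setup

// Hypothetical in-memory dedup set; a shared store is needed once ingestion
// runs on more than one instance.
const seenAlerts = new Set<string>();

export async function ingestAlert(alert: Alert) {
  const token = `alert:${alert.alertId}`;
  if (!seenAlerts.has(alert.alertId)) {
    seenAlerts.add(alert.alertId);
    await start(alertPipelineWorkflow, [alert.alertId]);
    // As in the route above, the hook may not be registered yet, so a short
    // retry loop around the first resumeHook() call is still advisable.
  }
  // Every duplicate maps to the same deterministic token, so the payload
  // reaches the single workflow instance that owns this alert.
  await resumeHook(token, alert);
}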
Real-Time Alert Dispatch
When a critical event needs immediate attention, fan out notifications to multiple channels in parallel using Promise.all. Each channel is its own step, so a failure in one (e.g., Slack API is down) does not block the others, and each is retried independently.
import { createWebhook, FatalError, RetryableError, getStepMetadata } from "workflow";
interface CriticalEvent {
title: string;
description: string;
severity: "P1" | "P2";
source: string;
}
async function sendSlackAlert(event: CriticalEvent) {
"use step";
const { stepId } = getStepMetadata();
const response = await fetch("https://hooks.slack.com/services/...", {
method: "POST",
headers: { "Idempotency-Key": stepId },
body: JSON.stringify({
text: `*${event.severity}: ${event.title}*\n${event.description}`,
}),
});
if (response.status === 429) {
throw new RetryableError("Slack rate limited", { retryAfter: "30s" });
}
if (!response.ok) {
throw new FatalError(`Slack API error: ${response.status}`);
}
}
async function sendEmailAlert(event: CriticalEvent) {
  "use step";
  await fetch("https://api.sendgrid.com/v3/mail/send", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.SENDGRID_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      personalizations: [{ to: [{ email: "oncall@example.com" }] }],
      from: { email: "alerts@example.com" },
      subject: `${event.severity}: ${event.title}`,
      content: [{ type: "text/plain", value: event.description }],
    }),
  });
}
async function createPagerDutyIncident(event: CriticalEvent) {
"use step";
await fetch("https://events.pagerduty.com/v2/enqueue", {
method: "POST",
body: JSON.stringify({
routing_key: process.env.PAGERDUTY_KEY,
event_action: "trigger",
payload: {
summary: `${event.severity}: ${event.title}`,
source: event.source,
severity: event.severity === "P1" ? "critical" : "error",
},
}),
});
}
export async function instantAlertWorkflow() {
"use workflow";
const webhook = createWebhook();
const request = await webhook;
const event: CriticalEvent = await request.json();
// Fan out to all channels in parallel
await Promise.all([
sendSlackAlert(event),
sendEmailAlert(event),
createPagerDutyIncident(event),
]);
}

Because each notification is a separate step, the framework retries failures independently. If PagerDuty returns a 500, Slack and email still succeed, and the PagerDuty step retries on its own schedule.
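If you would rather record partial failures than fail the run once a channel exhausts its retries, one option is Promise.allSettled. A sketch reusing the step functions above:

export async function instantAlertAllSettledWorkflow() {
  "use workflow";
  const webhook = createWebhook();
  const request = await webhook;
  const event: CriticalEvent = await request.json();

  // allSettled never rejects, so a FatalError in one channel does not abort
  // the other notifications or the workflow run.
  const results = await Promise.allSettled([
    sendSlackAlert(event),
    sendEmailAlert(event),
    createPagerDutyIncident(event),
  ]);

  const failures = results.filter((r) => r.status === "rejected").length;
  console.log(`Dispatched to ${results.length - failures} of ${results.length} channels`);
}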
External System Monitoring
Not all monitoring is event-driven. Sometimes you need to poll external systems on a schedule. The core pattern is a sleep() loop that checks a service in a step and alerts on failure:
import { sleep } from "workflow";
declare function checkServiceHealth(endpoint: string): Promise<{ healthy: boolean; latency: number }>; // @setup
declare function sendAlert(service: string, latency: number): Promise<void>; // @setup
export async function monitorServiceWorkflow(service: string, endpoint: string) {
"use workflow";
while (true) {
const status = await checkServiceHealth(endpoint);
if (!status.healthy) {
await sendAlert(service, status.latency);
}
await sleep("5m");
}
}

This uses the durable polling pattern — sleep() consumes no compute while waiting, and the workflow resumes at the correct time even after restarts. For more advanced scheduling patterns including consecutive failure tracking, graceful shutdown, and cron-like dispatching, see Scheduling & Cron.
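As one illustration of consecutive failure tracking, the loop can alert only after several failed checks in a row, filtering out transient blips. A sketch reusing the helpers declared above, with an assumed threshold of three:

export async function monitorWithFailureThresholdWorkflow(service: string, endpoint: string) {
  "use workflow";
  let consecutiveFailures = 0;
  while (true) {
    const status = await checkServiceHealth(endpoint);
    if (!status.healthy) {
      consecutiveFailures++;
      // Alert once when the service has failed three checks in a row.
      if (consecutiveFailures === 3) {
        await sendAlert(service, status.latency);
      }
    } else {
      consecutiveFailures = 0;
    }
    await sleep("5m");
  }
}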
sleep() accepts duration strings like "5m", "1h", or "30s", as well as Date objects for sleeping until a specific time. See the sleep() API reference for all supported formats.
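For example, inside a workflow you can pause either for a relative duration or until an absolute point in time (the date below is purely illustrative):

// Relative duration
await sleep("30s");

// Absolute time (illustrative date)
await sleep(new Date("2025-06-01T06:00:00Z"));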
Content Security Scanning
Workflows can also monitor content against security or policy rules. This pattern receives content via webhook, scans it in a step, and takes action on violations.
import { createWebhook } from "workflow";
interface ContentEvent {
contentId: string;
body: string;
author: string;
type: "post" | "comment" | "message";
}
interface ScanResult {
passed: boolean;
violations: string[];
}
async function scanContent(event: ContentEvent): Promise<ScanResult> {
"use step";
const violations: string[] = [];
// Check against policy rules
const blockedPatterns = [/credential/i, /api[_-]?key/i, /password\s*=/i];
for (const pattern of blockedPatterns) {
if (pattern.test(event.body)) {
violations.push(`Blocked pattern: ${pattern.source}`);
}
}
return { passed: violations.length === 0, violations };
}
async function quarantineContent(contentId: string, violations: string[]) {
"use step";
// Move content to review queue
await fetch("https://api.internal/content/quarantine", {
method: "POST",
body: JSON.stringify({ contentId, violations }),
});
}
async function notifySecurityTeam(event: ContentEvent, result: ScanResult) {
"use step";
await fetch("https://hooks.slack.com/services/...", {
method: "POST",
body: JSON.stringify({
text: `Content violation in ${event.type} by ${event.author}: ${result.violations.join(", ")}`,
}),
});
}
export async function contentSecurityWorkflow() {
"use workflow";
const webhook = createWebhook();
for await (const request of webhook) {
const event: ContentEvent = await request.json();
const result = await scanContent(event);
if (!result.passed) {
await Promise.all([
quarantineContent(event.contentId, result.violations),
notifySecurityTeam(event, result),
]);
}
}
}

The scanning step has full Node.js access, so it can call external scanning APIs, run regex-based rules, or invoke ML models. When a violation is found, the workflow quarantines the content and notifies the security team in parallel.
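For instance, scanContent could delegate to an external scanning service instead of local regex rules. A sketch in which the endpoint and response shape are assumptions for illustration:

async function scanContentWithExternalService(event: ContentEvent): Promise<ScanResult> {
  "use step";
  // Hypothetical internal scanning service; the endpoint and response shape
  // are illustrative, not a real API.
  const response = await fetch("https://scanner.internal/v1/scan", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text: event.body }),
  });
  const data = (await response.json()) as { violations: string[] };
  return { passed: data.violations.length === 0, violations: data.violations };
}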
Related Documentation
- Errors & Retrying - Handle errors inside your own steps with retry semantics
- Hooks & Webhooks - Deep dive on hooks, webhooks, and custom tokens
- Common Patterns - Sequential, parallel, timeout, and composition patterns
- createWebhook() API Reference - Full webhook API documentation
- createHook() API Reference - Full hook API documentation
- sleep() API Reference - Sleep and scheduling API