j17 - Event sourcing for the rest of us

Sagas orchestrate long-running workflows across multiple aggregates. Unlike implications (which react automatically), sagas are explicit: a trigger event starts them, they execute steps in order, and they handle failure with compensation.

When to use sagas

Good for:
- Multi-step business processes (checkout, onboarding, provisioning)
- Operations spanning external services (payment APIs, email, webhooks)
- Processes needing human approval
- Anything where partial failure requires rollback

Not for:
- Simple single-aggregate updates (use regular events)
- Automatic reactions to events (use implications)
- Single delayed actions (use scheduled events)

You might not need a saga

If it's a single delayed action -- send a reminder in 24 hours, expire a token next week -- use a scheduled event. Sagas are for multi-step workflows where steps depend on each other and failures need to unwind previous work.

Defining sagas

Sagas are defined in modules.sagas in your spec:

{
  "modules": {
    "sagas": {
      "checkout": {
        "trigger": {
          "aggregate_type": "cart",
          "event_type": "checkout_started"
        },
        "steps": [
          {
            "name": "reserve_inventory",
            "emit": {
              "aggregate_type": "inventory",
              "id": "$.data.product_id",
              "event_type": "reservation_requested",
              "data": {
                "quantity": "$.data.quantity"
              }
            },
            "await": {
              "aggregate_type": "inventory",
              "event_types": ["was_reserved", "reservation_failed"]
            },
            "compensate": {
              "aggregate_type": "inventory",
              "id": "$.data.product_id",
              "event_type": "reservation_released",
              "data": {}
            },
            "timeout_ms": 30000
          },
          {
            "name": "charge_payment",
            "condition": {"equals": ["$prev.type", "was_reserved"]},
            "emit": {
              "aggregate_type": "payment",
              "id": "$.data.payment_id",
              "event_type": "charge_requested",
              "data": {"amount": "$.data.total"}
            },
            "await": {
              "aggregate_type": "payment",
              "event_types": ["was_charged", "charge_failed"]
            },
            "compensate": {
              "aggregate_type": "payment",
              "id": "$.data.payment_id",
              "event_type": "refund_requested",
              "data": {}
            },
            "timeout_ms": 30000
          },
          {
            "name": "notify",
            "condition": {"equals": ["$prev.type", "was_charged"]},
            "emit": {
              "aggregate_type": "notification",
              "id": "$.data.customer_id",
              "event_type": "order_confirmation_sent",
              "data": {
                "transaction_id": "$prev.data.transaction_id"
              }
            }
          }
        ],
        "on_complete": {
          "aggregate_type": "order",
          "id": "$.data.order_id",
          "event_type": "checkout_completed",
          "data": {}
        },
        "on_failed": {
          "aggregate_type": "order",
          "id": "$.data.order_id",
          "event_type": "checkout_failed",
          "data": {"error": "$error.message"}
        }
      }
    }
  }
}

Trigger

"trigger": {
  "aggregate_type": "cart",
  "event_type": "checkout_started"
}

When checkout_started is written to any cart, this saga starts.

Steps

Each step has a name, an event to emit, and optionally:
- await: event types to wait for (the step completes when one arrives)
- compensate: event to emit if a later step fails
- timeout_ms: how long to wait (required for steps with await)
- condition: the step only runs if this evaluates to true

Referencing data

Templates pull data from the trigger event and previous steps. Sagas have access to all standard event paths ($.key, $.id, $.type, $.data.*, $.metadata.*, @.*) plus saga-specific context:

Template	Description
`$prev.type`	Previous step's response type
`$prev.data.*`	Previous step's response data
`$context.<step>.*`	Earlier step's result by name
`$error.message`	Error message (in `on_failed`)
`$error.step`	Failed step name (in `on_failed`)

@.* gives you the trigger aggregate's computed state at the moment the saga was created. This is a snapshot: if the aggregate changes after the saga starts, steps still see the original state. Use this when a saga step needs data from the aggregate that wasn't part of the trigger event (e.g., a customer_id that was set at creation time but isn't carried in the data of a later trigger event such as had_followup_converted). ($.state.* is a deprecated alias for @.*.)

Type preservation. Templates pass values through without type coercion. If @.customer_id resolves to an object but the target schema expects "type": "string", the step will fail with a type mismatch. Make sure the data types in your aggregate state match what the target event schema declares.

Optional fields. Append ? to any template path to make it optional. If the path resolves to null, the key is omitted from the emitted event data instead of being set to null. This is useful when the trigger event may or may not include a field:

"data": {
  "customer_id": "$.data.customer_id",
  "is_tax_exempt": "$.data.is_tax_exempt?"
}

If is_tax_exempt isn't present in the trigger event, it's simply left out of the emitted event rather than being set to null (which would fail a "type": "boolean" schema check).
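The optional-field rule can be modelled in a few lines of Python. This is only an illustrative sketch, not j17's implementation: the names lookup and resolve_data are hypothetical, it handles only $.-style paths against the trigger event, and it assumes a non-optional path that resolves to null is emitted as null.

```python
from typing import Any


def lookup(event: dict, path: str) -> Any:
    """Resolve a '$.a.b' path against the trigger event; None if any part is absent."""
    node: Any = event
    for part in path[2:].split("."):  # drop the leading '$.'
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


def resolve_data(template: dict, event: dict) -> dict:
    """Build emitted event data from a template, omitting optional keys that resolve to null."""
    out = {}
    for key, value in template.items():
        if not (isinstance(value, str) and value.startswith("$.")):
            out[key] = value  # literal values pass through unchanged
            continue
        optional = value.endswith("?")
        resolved = lookup(event, value[:-1] if optional else value)
        if resolved is None and optional:
            continue  # optional and missing: leave the key out entirely
        out[key] = resolved  # no type coercion: the value is passed through as-is
    return out


trigger = {"key": "cart_1", "data": {"customer_id": "c_42"}}
template = {
    "customer_id": "$.data.customer_id",
    "is_tax_exempt": "$.data.is_tax_exempt?",
}
emitted = resolve_data(template, trigger)
```

Here emitted contains only customer_id, so a "type": "boolean" check on is_tax_exempt never sees a null.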

In the checkout example above, the condition on charge_payment checks that inventory was actually reserved before charging. If the previous step's response was reservation_failed rather than was_reserved, this step is skipped.
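As a rough mental model of how an equals condition gates a step, consider the Python sketch below. The functions resolve_operand and should_run are hypothetical, and the sketch assumes that any operand other than $prev.type is a literal; the real evaluator may support more templates and operators.

```python
def resolve_operand(operand, prev):
    """Resolve '$prev.type'; any other operand is treated as a literal (assumption)."""
    if operand == "$prev.type":
        return prev["type"]
    return operand


def should_run(step, prev):
    """Return True if the step has no condition or its 'equals' condition holds."""
    condition = step.get("condition")
    if condition is None:
        return True  # steps without a condition always run
    left, right = condition["equals"]
    return resolve_operand(left, prev) == resolve_operand(right, prev)


charge_payment = {
    "name": "charge_payment",
    "condition": {"equals": ["$prev.type", "was_reserved"]},
}
runs_after_success = should_run(charge_payment, {"type": "was_reserved", "data": {}})
runs_after_failure = should_run(charge_payment, {"type": "reservation_failed", "data": {}})
```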

How sagas execute

1. Trigger event is written, saga starts
2. Execute step 1: Emit inventory.reservation_requested
3. Wait: For a matching await event or timeout
4. On success: Proceed to step 2
5. On failure: Run compensations in reverse order, mark saga failed

A step with await emits its event, then waits for a response matching one of the await.event_types. If no matching response arrives within timeout_ms, the step fails.
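The await rule can be pictured as "first matching event before the deadline wins". The Python sketch below is a simplified model under several assumptions: first_match is a hypothetical name, events are dicts with aggregate_type and type keys, arrival times are measured in milliseconds since the step emitted, and an event arriving exactly at timeout_ms still counts.

```python
def first_match(await_spec, timeout_ms, arrivals):
    """Return the first event that satisfies await_spec within timeout_ms, else None.

    arrivals is a list of (elapsed_ms, event) pairs.
    """
    for elapsed_ms, event in sorted(arrivals, key=lambda pair: pair[0]):
        if elapsed_ms > timeout_ms:
            break  # deadline passed: the step fails with a timeout
        if (event["aggregate_type"] == await_spec["aggregate_type"]
                and event["type"] in await_spec["event_types"]):
            return event
    return None


await_spec = {
    "aggregate_type": "inventory",
    "event_types": ["was_reserved", "reservation_failed"],
}
arrivals = [
    (120, {"aggregate_type": "payment", "type": "was_charged"}),     # ignored: wrong aggregate
    (450, {"aggregate_type": "inventory", "type": "was_reserved"}),  # matches
]
response = first_match(await_spec, 30000, arrivals)
timed_out = first_match(await_spec, 400, arrivals)
```

In this model the matched event becomes what later steps see as $prev, and a None result corresponds to the step failing on timeout.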

Steps without await

Steps without await succeed immediately after emitting. Use this for fire-and-forget actions like logging:

{
  "name": "audit_log",
  "emit": {
    "aggregate_type": "audit",
    "id": "global",
    "event_type": "checkout_attempted",
    "data": {"cart_id": "$.key"}
  }
}

Compensation

When a step fails, the saga walks backward through completed steps and emits their compensate events:

Step 3 fails
-> Compensate step 2
-> Compensate step 1
-> Saga marked failed

Steps without compensate are skipped during rollback (like notify -- nothing to undo).

If compensation itself fails after 3 attempts (max_compensation_attempts), the saga is dead-lettered for manual review.
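The rollback order described above can be sketched in Python as follows. This is an illustration rather than the runner's actual code: compensation_plan is a hypothetical helper, and it assumes every step before the failed one completed (none were skipped by a condition).

```python
def compensation_plan(steps, failed_step):
    """List the compensate event types to emit, newest completed step first."""
    completed = []
    for step in steps:
        if step["name"] == failed_step:
            break  # the failed step itself never completed
        completed.append(step)
    else:
        raise ValueError(f"unknown step: {failed_step}")
    # Walk backward; steps without a compensate block are skipped.
    return [
        step["compensate"]["event_type"]
        for step in reversed(completed)
        if "compensate" in step
    ]


CHECKOUT_STEPS = [
    {"name": "reserve_inventory", "compensate": {"event_type": "reservation_released"}},
    {"name": "charge_payment", "compensate": {"event_type": "refund_requested"}},
    {"name": "notify"},
]
plan = compensation_plan(CHECKOUT_STEPS, "notify")
```

If notify fails, plan lists refund_requested first and reservation_released second, matching the "Step 3 fails" walk shown above.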

Lifecycle events

on_complete fires when all steps succeed. on_failed fires after compensation completes. Both are optional.

Saga state

Query saga status via the admin API:

GET /_admin/sagas/:id
Authorization: Bearer <operator-jwt>

Response:

{
  "saga_id": "saga_abc123",
  "type": "checkout",
  "status": "running",
  "current_step": "charge_payment",
  "started_at": 1705312800,
  "steps": [
    {"name": "reserve_inventory", "status": "completed", "completed_at": 1705312801},
    {"name": "charge_payment", "status": "waiting", "waiting_since": 1705312802}
  ]
}
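A response like this is easy to summarise in a monitoring script. The Python sketch below finds steps that are still waiting and how long they have been waiting. The helper waiting_steps is hypothetical, and it assumes waiting_since is a Unix timestamp in seconds, which the sample values suggest but the response does not state.

```python
import json

# The sample response shown above, parsed from JSON.
saga = json.loads("""
{
  "saga_id": "saga_abc123",
  "type": "checkout",
  "status": "running",
  "current_step": "charge_payment",
  "started_at": 1705312800,
  "steps": [
    {"name": "reserve_inventory", "status": "completed", "completed_at": 1705312801},
    {"name": "charge_payment", "status": "waiting", "waiting_since": 1705312802}
  ]
}
""")


def waiting_steps(saga, now):
    """Return (step name, seconds waited) for every step with status 'waiting'."""
    return [
        (step["name"], now - step["waiting_since"])
        for step in saga["steps"]
        if step["status"] == "waiting"
    ]


stalled = waiting_steps(saga, now=1705312832)
```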

Admin API

GET  /_admin/sagas?status=running     # List sagas
GET  /_admin/sagas/:id                # Get saga details
POST /_admin/sagas/:id/retry          # Retry failed saga

The list endpoint accepts the query parameters status (running, completed, failed, dead_lettered), environment, limit, and offset.

Retry picks up where it left off, restarting from the failed step. The saga runner polls for work every 1000ms.
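For scripting against these endpoints, the requests can be built with Python's standard library. This is a minimal sketch: BASE_URL is a placeholder, and list_sagas_request and retry_saga_request are hypothetical helper names. Only the paths, the status, limit, and offset parameters, and the bearer token header come from this page.

```python
import urllib.parse
import urllib.request

BASE_URL = "https://example.invalid"  # placeholder: replace with your instance URL


def list_sagas_request(token, status, limit=None, offset=None):
    """Build a GET /_admin/sagas request filtered by status."""
    params = {"status": status}
    if limit is not None:
        params["limit"] = limit
    if offset is not None:
        params["offset"] = offset
    url = f"{BASE_URL}/_admin/sagas?{urllib.parse.urlencode(params)}"
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"}, method="GET"
    )


def retry_saga_request(token, saga_id):
    """Build a POST /_admin/sagas/:id/retry request."""
    safe_id = urllib.parse.quote(saga_id, safe="")
    url = f"{BASE_URL}/_admin/sagas/{safe_id}/retry"
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"}, method="POST"
    )


list_req = list_sagas_request("operator-jwt", "failed", limit=10)
retry_req = retry_saga_request("operator-jwt", "saga_abc123")
```

Passing either request to urllib.request.urlopen would send it; the sketch stops short of that so it can run without a live instance.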

Dead letters

If compensation fails 3 times, the saga is dead-lettered. This requires manual intervention:

1. Check the dashboard (dead letters appear prominently on the home page)
2. Fix the underlying issue
3. Retry via the admin API

Example: User onboarding

{
  "modules": {
    "sagas": {
      "user_onboarding": {
        "trigger": {
          "aggregate_type": "user",
          "event_type": "was_created"
        },
        "steps": [
          {
            "name": "send_welcome_email",
            "emit": {
              "aggregate_type": "email",
              "id": "$.data.user_id",
              "event_type": "was_queued",
              "data": {"template": "welcome", "email": "$.data.email"}
            }
          },
          {
            "name": "provision_trial",
            "emit": {
              "aggregate_type": "subscription",
              "id": "$.data.user_id",
              "event_type": "trial_was_started",
              "data": {"plan": "starter"}
            },
            "compensate": {
              "aggregate_type": "subscription",
              "id": "$.data.user_id",
              "event_type": "trial_was_cancelled",
              "data": {}
            }
          }
        ]
      }
    }
  }
}

Compared to implications

	Implications	Sagas
Trigger	Automatic on event	Trigger event in spec
Scope	Event chain (atomic)	Multi-step workflow
Compensation	Manual	Built-in
Human interaction	No	Yes
Failure handling	Retry	Compensation + retry
Use case	Simple reactions	Complex processes

Use implications for if-this-then-that. Use sagas for processes requiring coordination and rollback.

Limitations

- Max 50 steps per saga
- Max 100 concurrent sagas per instance
- Max 3 compensation attempts before dead-lettering
- Saga chain depth limit: 3. A saga step can emit an event that triggers another saga, which can trigger a third. Beyond depth 3, saga and scheduled event hooks are skipped and an audit log entry is recorded. If you see saga_depth_exceeded in your audit log, simplify your saga chain or consolidate steps. Listener/webhook hooks are unaffected by the depth limit.

Best practices

Keep steps small. Each step should do one thing. Easier to compensate, easier to debug.

Design compensations first. Before writing the action, know how to undo it.

Set realistic timeouts. Account for external API latency and, where a step needs approval, human response time.

Monitor compensation rate. High compensation means the process design needs rethinking.

Troubleshooting

emit_failed: type mismatch

The most common saga step failure. The step error looks like:

{
  "message": "emit_failed",
  "type": "validation_error",
  "detail": "type mismatch: expected string, got object",
  "path": "data.customer_id"
}

This means the value resolved by the template doesn't match the target event's schema type. The path tells you which property failed, and detail tells you what was expected vs. what was provided.

Common causes:
- A merge handler stored an object where a string was expected. For example, if @.customer_id resolves to {"name": "Alice", "id": "abc"} instead of "abc", a schema expecting "type": "string" will reject it.
- An integer in state is being passed to a schema expecting "type": "string". Templates do not coerce types.
- A set handler with "value": "$.data" copied the entire event payload into a field, creating an object where a scalar was expected.

To diagnose:
1. Check the step error via the admin API: GET /_admin/sagas/:id
2. Look at error.path to identify which field failed
3. Look at error.detail for the expected vs. actual type
4. Check the trigger aggregate's state to see what the template actually resolved to

To fix: adjust the handler that builds the source aggregate state so the field has the correct type, or adjust the target event schema to accept the type being provided.
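When checking aggregate state by hand, it can help to compare the resolved values against the target schema's declared types before the saga runs. The Python sketch below does that for flat properties. It is an offline aid, not part of j17: find_mismatches and json_type_name are hypothetical names, the type names follow JSON Schema, and the message format merely imitates the detail string shown above.

```python
def json_type_name(value):
    """Map a Python value to its JSON Schema type name."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):
        return "array"
    return "null" if value is None else type(value).__name__


def find_mismatches(data, properties):
    """Return (path, detail) pairs for values whose type differs from the schema."""
    problems = []
    for key, schema in properties.items():
        expected = schema.get("type")
        if key not in data or expected is None:
            continue
        actual = json_type_name(data[key])
        # JSON Schema treats every integer as a valid number.
        if actual == expected or (expected == "number" and actual == "integer"):
            continue
        problems.append(
            (f"data.{key}", f"type mismatch: expected {expected}, got {actual}")
        )
    return problems


target_properties = {"customer_id": {"type": "string"}, "quantity": {"type": "integer"}}
resolved = {"customer_id": {"name": "Alice", "id": "abc"}, "quantity": 2}
problems = find_mismatches(resolved, target_properties)
```

Here problems flags data.customer_id as an object where a string was expected, which is the first common cause listed above.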

emit_failed: missing required property

{
  "message": "emit_failed",
  "type": "validation_error",
  "detail": "missing required property",
  "path": "data"
}

A template resolved to null because the path doesn't exist in the source data. For example, @.email returns null if the trigger aggregate's state doesn't have an email field, and the resulting event data omits the key entirely.

To fix: verify that the trigger event or aggregate state always includes the fields your saga steps reference, or make the field optional in the target schema.

General debugging

1. Inspect the saga: GET /_admin/sagas/:id shows each step's status and error details
2. Check step errors: failed steps include type, detail, and path fields that identify exactly what went wrong
3. Verify template sources: use the aggregates endpoint to confirm what @.* templates resolve to at runtime
4. Check schema alignment: ensure the data types in your source (trigger event data, aggregate state) match the declared types in the target event schema
