Cost attribution, reporting & management

How pond measures what a run costs, attributes it to the dimensions an enterprise bills on (team / tenant / cost-center / env), reports it, and enforces budgets — across any provider and harness, never tailored to one.

Status: M1–M5 shipped. Built incrementally (see Build sequence). M1 — the labels spine (runs.labels / telemetry_events.labels, JSONB+GIN, migration 0005), the required-label policy (POND_REQUIRED_RUN_LABELS), the per-provider usage extractor registry (swarm/src/sandbox.py), and the exposed rollup (GET /v1/usage, GET /v1/admin/usage, pond usage). M2 — the rate-card pricing primitive + pluggable pricer (app/pricing.py, model pricing JSONB, migration 0006), cached-meter extraction, usage_source/cost_source markers, and usage folded into the signed attestation (a :v2 canonical-message bump in app/crypto/attestation.py + swarm/src/security.py). M3 — the usage_rollups daily rollup table + idempotent recompute job (app/usage_rollup.py, migration 0007), a rollup-backed daily report, and a line-item CSV/JSON export. M4 — budgets (migration 0008) with admission control at submit (block → 422, warn → log), budget.alert threshold telemetry, and a scope-aware watchdog that cancels over-budget runs mid-flight (app/budgets.py). M5 — provider-invoice reconciliation (app/reconciliation.py, cost_reconciliations, migration 0009): an actual-vs-estimated true-up per (provider, period) yielding a reconciliation factor — the chargeback-grade number, kept distinct from the estimate. The invariants below must hold from each commit.

Code today: app/telemetry.py (emit), app/agent_usage.py (rollups + pricing), app/models.py (telemetry_events, agent_models), app/routers/orch.py (/orch/jobs/{jid}/telemetry ingest), app/run_watchdog.py (budget cancel), swarm/src/sandbox.py (broker usage capture), app/obs.py (OTel metrics).

The shape of the problem

“Cost” is three jobs with different owners. They stack — each needs the one below.

Measurement — what did this cost? Tokens in/out per model call, priced, attributed. The factual spine.
Reporting — who spent what, over what period? Rollups, slices, exports.
Management — don’t exceed X. Budgets, admission control, alerts, kills.

Primary audience: enterprise FinOps / eng-lead — chargeback/showback by team/tenant/cost-center. That target drives every decision below.

What exists today (the foundation)

Pond already has a real cost spine; the gaps are attribution breadth, exposure, pricing fidelity, and cross-run control.

Capability	State	Where
Append-only event store, server-stamped attribution	built	`telemetry_events` (`models.py`), `orch.py` ingest
Token + cost capture at the credential broker	built	`swarm/src/sandbox.py` `_extract_usage()`
Provider-reported cost when present, else per-Mtok compute	built	`agent_usage.py` `_cost()`
Read-time rollups (run / project / model / harness / window)	built, mostly unexposed	`agent_usage.py` `aggregate()`
Per-run cost ceiling + wall-clock, preemptive cancel	built	`run_watchdog.py`
OTel cost/token/call metrics	built	`obs.py`

Two facts shape the whole design:

Attribution is trustworthy; magnitude is not (on BYO). Pond stamps run_id/project_id/stage_key/job_id server-side at ingest — a worker cannot spoof what a cost belongs to. But the token counts originate in the worker’s broker, so an untrusted/BYO pool could under- or over-report. This is the load-bearing constraint (see Trust of the number).
aggregate() already computes the FinOps rollup and has no endpoint. Exposing what’s already written is the single highest-leverage step.

Decisions

Attribution model → generic labels + required-label policy

Pond stays product-neutral, so attribution is an open labels: dict[str,str] map on run submission — not baked-in tenant/team columns. Stored on runs (JSONB, GIN-indexed) and denormalized onto each telemetry_event at ingest, stamped server-side from the run exactly like project_id — so labels inherit the no-spoof property and rollups stay a single-table group-by (no join).

project_id / workflow_id remain first-class soft refs; labels are the open extension. The FinOps-critical part: pond can enforce required label keys per consumer/org — reject a run missing costCenter/team. Chargeback is impossible if any spend can go untagged. (AWS cost-allocation-tags / GCP labels pattern.) Conventional keys — tenant, costCenter, team, env, user — are documented for portable reports; the map stays open.

Pricing → a rate-card primitive, not fixed columns

Fixed pricing columns (input_cost_per_mtok, cached_input_cost_per_mtok, per_request_cost, …) are an if/else ladder wearing a schema: the moment a provider invents a meter (reasoning tokens, image tokens, a conditional discount) you’re editing the model table. So pricing is the same shape as usage — both open meter maps, and cost is their dot product.

Usage is already an open bag: {tokens_in, tokens_out, cached_tokens_in, requests: 1, …}, filled by the extractor registry. New meter = new key.

Price mirrors it as a rate card — a list of {meter, unit_price, per} line items, with a currency (default USD):

{"currency": "USD", "rates": [
  {"meter": "tokens_in",        "unit_price": 3.0,  "per": 1000000},
  {"meter": "tokens_out",       "unit_price": 15.0, "per": 1000000},
  {"meter": "cached_tokens_in", "unit_price": 0.30, "per": 1000000},
  {"meter": "requests",         "unit_price": 0.001,"per": 1}
]}

Cost = Σ usage[meter] / per × unit_price.

So cached / cache-write / per-request stop being special — they’re rows. input/output are rows. The only contract is that the extractor and the rate card agree on meter names (the shared, open meter namespace — see neutrality). Back-compat: the existing *_cost_per_mtok columns read as a 2-line rate card, so current config keeps working unchanged.

Pricing is applied by a pluggable pricer — cost(usage, pricing) → money, a registry keyed like the extractors, default "rate_card". This is the escape hatch for genuinely non-linear estimate logic (volume tiers, conditional discounts) without ever putting an eval’d expression DSL in the trusted control plane: a customer registers a pricer (or a consumer-side stage_hooks.py hook) instead. We commit only to the linear rate card now; tiers/predicates are deferred until a real provider needs them.

Every cost record carries cost_source: provider_reported | computed, stored as estimated_cost_usd explicitly so a later reconciled_cost_usd lands without rewriting history. The layering, weakest-claim to strongest: rate card (covers ~all real pricing, extensible over meters) → provider- reported cost (self-pricing providers like OpenRouter — used verbatim today) → custom pricer hook (bespoke estimate logic) → invoice reconciliation (financial truth). Correctness on weird pricing comes from the provider/invoice, not from out-predicting it.

Trust of the number → estimate vs. reconciled

Do not try to make an untrusted worker’s number financially binding — that is unwinnable. Instead, two numbers with different jobs:

Worker-reported = real-time estimate. Drives UX, dashboards, and budget enforcement. Folded into the worker’s signed attestation (already signed — near-free) for non-repudiation/audit: proves the worker asserted it.
Provider invoice = chargeback truth. The reconciled figure (reconciled_cost_usd) comes from the provider’s own accounting per period, keyed by API key/metadata. On a trusted pool, estimate ≈ truth and reconciliation is a true-up; on BYO you bill from the provider, not the tenant’s self-report. Honest, rather than pretending a signature makes a number true.

Enforcement → layered, building on the watchdog

Per-run ceiling exists. Add, in order: cross-run/time budgets, admission control (pre-flight estimate vs remaining budget), soft alerts (thresholds, separate from the hard kill), and a scope-aware watchdog (project budget exhausted → cancel running runs per policy).

Provider & harness neutrality

The non-negotiable: cost must never become codex-shaped or Anthropic-shaped. This rests on one invariant, two registries, and a shared meter namespace.

Invariant — capture at the broker, never the harness. Usage is read at the credential broker, an HTTP proxy on the agent’s model-API path (swarm/src/sandbox.py). It is below whatever CLI runs upstream — codex, claude, aider, a custom harness are all invisible to it. We measure tokens by watching API traffic, never by parsing a CLI’s stdout. The tempting shortcut (parse codex’s usage line) would silently couple us to codex and is forbidden. Harnesses remain a registry (agent_harnesses, harness.* capability) — adding one is a row, not a code change; the cost layer must not regress that.

Registry — per-provider usage extractors. Response shapes differ by provider, so _extract_usage becomes a registry keyed by the model’s provider, not an if/else ladder:

Provider	Tokens	Cached	Cost
`anthropic`	`usage.input_tokens` / `output_tokens`	`cache_read_input_tokens` / `cache_creation_input_tokens`	computed
`openai` (+ Azure, compatible)	`usage.prompt_tokens` / `completion_tokens`	`prompt_tokens_details.cached_tokens`	computed
`openrouter`	OpenAI shape	OpenAI shape	inline `cost`
`bedrock` / `vertex` / `custom`	provider shape	if present	computed
default	OpenAI-compatible	if present	computed

Adding a provider = registering an extractor (same pattern as harnesses / executors). An unknown provider falls back to the OpenAI-compatible default and degrades gracefully rather than recording zero.

Failure mode to design for now. Some provider/transport combos return usage only in a terminal streaming event, or not at all. The broker must parse streaming usage events (not just JSON bodies), record a usage_source: provider_body | stream_event | estimated_from_chars | unavailable marker so reports are honest about confidence, and fall back to a tokenizer approximation (flagged low-confidence, never silently dropped) when a provider gives nothing. Low-confidence estimates are exactly why chargeback truth comes from provider reconciliation, which is provider-native.

Shared meter namespace — the usage⋅price contract. The extractor writes a usage map {meter: quantity}; the rate card prices a map {meter: rate}; cost is their dot product. Both sides are open: a new provider meter is a new key on both, no schema change. The only coupling is that extractor and rate card agree on meter names — so the meter vocabulary (tokens_in, tokens_out, cached_tokens_in, requests, …) is the contract to keep stable, documented next to the extractor registry.

Net per layer: pricing is keyed by (provider, model_id) with org/project scope, so a mixed Anthropic + OpenAI + Bedrock fleet prices and reports side by side; rollups group by provider/model/harness as plain data dimensions; budgets are denominated in normalized cost (USD via the rate card’s currency), so one budget spans providers uniformly.

The three layers

Layer 1 — Measurement & attribution

labels (JSONB+GIN) on runs → stamped onto telemetry at ingest; required-label policy; per-provider extractor registry; the rate-card pricing primitive + pluggable pricer (over the shared meter namespace); cost_source + usage_source markers; usage folded into the signed attestation.

Layer 2 — Reporting

Expose aggregate(): GET /v1/usage (consumer-scoped) + GET /v1/admin/usage (org-wide), with group_by (model | harness | project | stage | label key | time-bucket), a time range, and filters. Add line-item exports (CSV/JSON, per-run or per-model.call) — FinOps tools (Cloudability / CloudHealth / Apptio) and warehouses ingest raw rows, not just rollups. For scale, add a daily rollup table (scope × label-set × model × day) from a rollup job; serve reports from it, keep raw telemetry for drill-down.

Layer 3 — Management

A budgets table: scope (org | project | label-match), period (day | month | rolling-N), limit, action (warn | block). Admission control at POST /v1/runs checks remaining budget against a pre-submission estimate (historical avg for the workflow, or max-tokens × price). Soft alerts at 50/80/100% emit telemetry events + a webhook, separate from the hard kill. Scope-aware watchdog extends the per-run ceiling to budget scopes.

Build sequence

Each step independently shippable.

#	Delivers	Contents
M1 ✅	showback + neutrality from day one	`labels` spine + required-label policy; per-provider extractor registry (so we’re never codex/Anthropic-shaped); expose `aggregate()` as `/v1/usage` + `/v1/admin/usage` with label group-by
M2 ✅	pricing fidelity	rate-card pricing (`{meter, unit_price, per}` summed) + pluggable pricer, replacing fixed cost columns (legacy columns read as a 2-line card); broker emits extra meters (cached) + streaming usage; `cost_source`/`usage_source`; usage folded into the signed attestation (a `:v2` canonical-message bump on both codepaths — worker signs its reported meter map for audit/non-repudiation; the billed numbers travel the separate unsigned telemetry path, see Known limitations). Deferred (minor): the `estimated_from_chars` unavailable path
M3 ✅	scale + FinOps ingestion	`usage_rollups` table (migration `0007`) + idempotent recompute-day job (`app/usage_rollup.py`, `POST /v1/admin/usage/rollup`); rollup-backed daily report (`GET /v1/admin/usage/daily`); line-item CSV/JSON export (`GET /v1/admin/usage/export`); `pond usage rollup/daily/export`
M4 ✅	control	`budgets` table (migration `0008`, scope org/project/label × period day/month/rolling, action warn/block) + CRUD; admission control at submit/preflight with a recent-mean estimate (`app/budgets.py`); `budget.alert` threshold telemetry (50/80/100%, deduped); scope-aware watchdog cancels over-budget runs mid-flight; `pond budget create/list/status/delete`
M5 ✅	chargeback-grade truth	provider-invoice import (`app/reconciliation.py`, `cost_reconciliations`, migration `0009`); actual-vs-estimated true-up → a reconciliation factor per (provider, period); `POST/GET /v1/admin/reconciliations`, `pond reconcile import/list`

M1 = showback. M1–M4 = showback + control. M5 = true chargeback. The extractor registry is in M1 on purpose — neutrality is a foundation, not a retrofit.

Invariants

Usage is captured at the broker, never the harness. No cost code parses any agent CLI’s output; it reads model-API traffic only.
Adding a provider is a registry entry, not a branch; an unknown provider falls back to the OpenAI-compatible extractor and degrades gracefully.
Pricing is a rate card over open meters, not fixed columns. A new meter is a new line item on both the usage map and the rate card; cost is their dot product. Non-linear/bespoke pricing goes through a pluggable pricer, provider- reported cost, or reconciliation — never an eval’d expression in core.
Attribution is server-stamped and unspoofable — labels, project_id, run_id, stage_key, job_id are set by pond at ingest, not by the worker.
No untagged spend under a required-label policy — a run missing a required label key is refused.
The worker signs its reported usage (v2 attestation): the canonical message binds a digest of the reported meter map, so a relaying orchestrator can’t alter that signed map undetected. Audit/non-repudiation only — not the billed number (which travels the unsigned telemetry path) and not financial truth (that’s reconciliation). See Known limitations.
Cost is clamped at ≥ 0 — a mis-authored negative rate or a negative provider-reported cost can never produce negative cost, which would otherwise silently defeat the budget kill-switch and under-count spend.
Two cost numbers stay distinct: worker-reported estimated_cost_usd (real-time, drives budgets) and provider-reconciled reconciled_cost_usd (chargeback truth). The worker’s number is never treated as financially binding on an untrusted pool.
Usage is never silently dropped — when a provider reports nothing, the record carries usage_source = unavailable | estimated_from_chars, flagged low-confidence.
Budgets are provider-neutral — denominated in normalized cost, spanning a mixed-provider fleet uniformly.

Known limitations & follow-ups

Surfaced by the M1–M5 verification audit; recorded here rather than hidden.

Deploy order (attestation v1→v2). A current worker always signs :v2; a pre-M2 pond verifier only checks :v1 and would mark such results attestation_verified=False. Upgrade pond (the verifier) before workers. The reverse (new pond, old :v1 worker) verifies fine.
Signed usage ≠ billed usage. The attestation signs the worker’s aggregate usage (non-repudiation). Budgets/rollups/reconciliation bill from the separate, unsigned /orch/.../telemetry path. Nothing yet cross-checks the two — a reconcile step (compare the attested digest to the telemetry sum, flag divergence) is a follow-up. Don’t treat signed usage as the tamper-proof bill.
Budget spend is a live telemetry scan. spend_micros prices every event in the period in Python, per applicable budget, on the submit/preflight/watchdog paths — O(period volume). Fine at modest scale; the follow-up is rollup-backed spend (sum usage_rollups.cost_micros over closed days + a bounded current-day live scan).
Admission is enforced at the /v1 boundary, not in runservice.submit_run. A future in-process caller bypasses admission + required-labels; the watchdog still backstops block budgets mid-flight, but warn budgets and the pre-submission estimate are not backstopped.
Alerts piggyback on the run watchdog. Threshold alerts fire while a matching run is live; a crossing with no applicable run live (e.g. between runs) lags until the next one. Dedup is best-effort (a per-(budget, threshold, period) query, not an atomic constraint) so concurrent watchdogs can rarely double-emit a soft alert. An independent sweep is the follow-up.
Reconciliation is a factor true-up. Spend with no registered model and no provider attr is bucketed (unknown); because the factor scales priced spend, large unpriced spend inflates the factor — register models for an accurate true-up. The window must align to a reconciliation period for an exact number (cross-period is approximate).
No FX. Costs sum in the rate card’s stated currency with no conversion — use one currency per deployment (USD default); a mixed-currency fleet would sum incorrectly. Multi-currency is a follow-up.
Budget periods reset at UTC midnight (no per-budget timezone / business calendar). rolling has no fixed period, so its alert dedup window slides.
GIN index migrations build non-concurrently (0005, 0007+) — an ACCESS EXCLUSIVE lock for the build on a large telemetry_events. Build CONCURRENTLY or use a maintenance window on a busy deployment.
Response casing is inconsistent. /v1/admin/usage (an AgentUsageOut model) is camelCase, but /usage/daily and /usage/reconciled return raw dicts → snake_case (per_label, label_key). Harmless but a consumer wart; wrap them in aliased models to unify. (Surfaced by the live e2e shakedown.)