Almost every vendor calls their work "traceable" and "auditable." Almost none of them can show what that means in code. These four techniques are what it means in ours. Each one is drawn from a system we've already shipped.
A claim without a receipt is a marketing statement. When a reviewer asks "how do you know," the answer shouldn't be "the analyst said so."
Every quantitative claim on a program analysis site is backed by a YAML entry colocated with the component that displays it. The YAML captures the exact claim text, a confidence level (HIGH, MODERATE, or LOW), and a reproducible source — usually a SQL query against the warehouse. Before any deploy, a verification pass walks the entire site, runs every query, and compares the results to what the ledger expects. A discrepancy stops the build.
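As a sketch of what that pre-deploy pass might look like (the function names and the comparison tolerance are assumptions, not the shipped code; the query runner is injected here so the shape of the check is visible without a warehouse connection):

```typescript
// Hypothetical sketch of a pre-deploy claims-ledger check.
type Confidence = "HIGH" | "MODERATE" | "LOW";

interface ClaimEntry {
  claimId: string;
  claim: string;
  confidence: Confidence;
  expectedResults: Record<string, number>;
}

// In the shipped system this would run the entry's SQL against the warehouse.
type QueryRunner = (claimId: string) => Record<string, number>;

function verifyClaim(
  entry: ClaimEntry,
  run: QueryRunner,
  tolerance = 1e-9,
): string[] {
  const actual = run(entry.claimId);
  const errors: string[] = [];
  for (const [key, expected] of Object.entries(entry.expectedResults)) {
    const got = actual[key];
    if (got === undefined || Math.abs(got - expected) > tolerance) {
      errors.push(`${entry.claimId}.${key}: expected ${expected}, got ${got}`);
    }
  }
  return errors;
}

// A discrepancy stops the build: any non-empty error list fails the deploy.
function verifyLedger(entries: ClaimEntry[], run: QueryRunner): void {
  const errors = entries.flatMap((e) => verifyClaim(e, run));
  if (errors.length > 0) {
    throw new Error(`Ledger check failed:\n${errors.join("\n")}`);
  }
}
```

The ledger entry such a pass checks against looks like this: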
claim_id: china_optomechanics_share
claim: "China accounts for 36.9% of optomechanics publication share."
confidence: HIGH
sources:
  - type: database_query
    database: supabase
    table: program_citations
    query: |
      SELECT country, COUNT(DISTINCT citing_doi) AS papers
      FROM program_citations
      WHERE research_community = 'optomechanics'
      GROUP BY country
      ORDER BY papers DESC;
expected_results:
  china_share: 0.369

AI output looks the same whether it came from a verified source or a fabrication. A researcher reading an AI summary has no way to tell what the archive actually says from what the model guessed.
Every node in the archive's knowledge graph carries a label. Verified means the content came from an archive source. Inferred means the AI generated it during reasoning. Verified nodes render as filled circles. Inferred nodes render as dashed outlines. Every answer in the chat panel resolves its citations to specific passages in specific documents. Inferred claims get a separate visual tag, so a careful reader can always tell what the archive says from what the model extrapolated.
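A sketch of how such a label might travel with a node (the type names and fields are assumptions, not the archive's actual schema):

```typescript
// Hypothetical sketch: provenance is a discriminated union carried by every
// node, so an unlabeled node is a type error rather than a silent default
// to "verified".
type Provenance =
  | { kind: "verified"; sourceDocId: string; passage: string }
  | { kind: "inferred"; derivedFrom: string[] }; // ids of nodes used in reasoning

interface GraphNode {
  id: string;
  label: string;
  provenance: Provenance;
}

// Filled circle for verified, dashed outline for inferred.
function nodeStyle(node: GraphNode): { stroke: string; fill: boolean } {
  switch (node.provenance.kind) {
    case "verified":
      return { stroke: "solid", fill: true };
    case "inferred":
      return { stroke: "dashed", fill: false };
  }
}
```

Because the renderer dispatches on the same field the chat panel uses for citation tags, the two views cannot drift apart.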
An AI that's eager to give the answer is a terrible teaching tool. It's also a terrible decision-support tool in any setting where the human has to own the call.
The Georgetown platform runs a delegation contract written directly into the code. The AI cannot verify assumptions. It cannot define success criteria. It cannot advance the team past a phase gate. Those gates are PL/pgSQL stored procedures — the database itself enforces the constraint, not a hope that the model will behave. The contract is small. The effect is large. It changes the whole shape of what the product is for.
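As an illustrative sketch of that division of labor (the function, its wiring, and the procedure name are assumptions; the real enforcement lives in PL/pgSQL), the application-side call might look like this:

```typescript
// Hypothetical sketch: the application never decides whether a gate opens.
// It relays the verdict of a stored procedure. The runner is injected here
// so the control flow is visible without a live database.
type Actor = "human" | "ai";
type GateProcedure = (teamId: string) => Promise<boolean>;

async function advancePhaseGate(
  runGateProcedure: GateProcedure, // e.g. SELECT advance_phase_gate($1) -- assumed name
  teamId: string,
  actor: Actor,
): Promise<boolean> {
  if (actor === "ai") {
    // Mirrors canAdvancePhaseGate: false — the AI is refused before the
    // database even sees the request.
    throw new Error("delegation contract: AI cannot advance a phase gate");
  }
  // The PL/pgSQL procedure re-checks every precondition itself, so no
  // client-side bug can call its way past the gate.
  return runGateProcedure(teamId);
}
```

The contract the product actually ships is small: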
export const DELEGATION_CONTRACT = {
  // The AI observes and assists. The human decides.
  canVerifyAssumptions: false,
  canDefineSuccessCriteria: false,
  canAdvancePhaseGate: false,
  // Gates are enforced at the database layer, not in application code,
  // so no agent can call its way past them.
  gateEnforcement: 'plpgsql_stored_procedure',
} as const;

A recommendation you can't trace to its source isn't defensible. A complex pipeline with nine reasoning stages can produce a confident conclusion that nobody on the team can walk backward through.
Deepfield carries W3C PROV-compliant lineage through every stage of its pipeline. Pick any final recommendation and you can walk backward: the course of action it came from, the reasoning tree that scored it, the assumptions those rollouts depended on, the asymmetries they exploited, the graph structure, the extracted entities, the source passages. Every module returns a `Result<T,E>`, so failures propagate explicitly. Silent fallbacks that hide a broken call are forbidden by the architecture.
recommendation: "Invest in domain X before Q3"
  <- course_of_action: COA-07 (Pareto rank 2)
  <- reasoning_rollout: rollout_1834 (PRM score 0.81)
  <- assumption: "competitor lag is ~14 months"
  <- asymmetry: learning_rate_gap (confidence 0.72)
  <- graph_node: entity_3019 (competitor capability)
  <- extraction: passage from source_882
  <- research_iteration: gap_fill_3
  <- intake_query: initial user question

"Defensible" is either a claim or a property of the system. In our work, it's a property of the system.
Tell us about the decision you're trying to improve. We'll schedule a briefing with our principals to understand your environment and explore a potential fit.
Schedule a Briefing