0 → 1 builder journey

Search Metric
Analyzer

Building a complex AI system through systematic architecture — without writing code

01 Understand
02 Hypothesize
03 Dispatch
04 Synthesize
02

The Problem:
Knowledge Is Scattered

Every investigation re-assembles the same puzzle from scratch. The structured legwork repeats, while the judgment calls that matter get squeezed.

What if we reclaimed that time: let a tool do the structured legwork, so humans focus on the judgment calls that actually matter?
04

Why a 4-Stage Pipeline?

Because that's how a senior DS actually investigates. The pipeline encodes that discipline.

01

Understand

What moved? By how much? Is the data even trustworthy?

02

Hypothesize

What could explain this? Rank theories by likelihood.

03

Dispatch

Test each theory against the data. Supported, rejected, or inconclusive.

04

Synthesize

What's the answer? How confident are we? What should we do next?
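The four stages read as a control flow; here is a minimal Python sketch of that flow (all function names and return values are hypothetical illustrations, not the tool's real API):

```python
# Minimal sketch of the 4-stage investigation pipeline.
# Every name and value here is a hypothetical illustration.

def understand(metric_event):
    # Stage 1: what moved, by how much, is the data trustworthy?
    return {"metric": metric_event["metric"],
            "delta": metric_event["delta"],
            "data_quality": "PASS"}

def hypothesize(findings):
    # Stage 2: rank candidate explanations by likelihood.
    return ["ranking_model_change", "ai_adoption_effect", "connector_outage"]

def dispatch(hypotheses, findings):
    # Stage 3: test each theory; supported, rejected, or inconclusive.
    return {h: ("supported" if h == "ranking_model_change" else "rejected")
            for h in hypotheses}

def synthesize(verdicts):
    # Stage 4: answer, confidence, next step.
    root_cause = next(h for h, v in verdicts.items() if v == "supported")
    return {"root_cause": root_cause, "confidence": "HIGH",
            "next_step": "rollback review"}

def investigate(metric_event):
    findings = understand(metric_event)
    hypotheses = hypothesize(findings)
    verdicts = dispatch(hypotheses, findings)
    return synthesize(verdicts)

report = investigate({"metric": "click_quality", "delta": -0.062})
print(report["root_cause"])  # ranking_model_change
```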

05

Demo: A Real Investigation

Click Quality drops 6.2% for Standard tier. The tool runs:

Understand

Data quality passes. Step-change detected 3 days ago. Concentrated in Standard tier.

Hypothesize

Top theory: ranking model change. AI metrics stable — rules out AI adoption effect.

Dispatch

Standard tier explains 92% of the movement. Top-3 positions most affected.

Synthesize

High confidence: ranking model regression. Recommends rollback review → Ranking team.

06

Real Output: Investigation Report

INV-2024-037 · Complete

Why did Click Quality drop 6.2% for Standard tier last week?

Bottom Line: Ranking model v3.2 rollout degraded result relevance for Standard tier. Top-3 position click-through rates dropped 8.1%. Recommend rollback review with Ranking team.

Click Quality: -6.2% (Standard tier)
Search Quality: -1.1% (AI offset)
AI Trigger: +2.4% (Stable)
Confidence: HIGH (3 evidence lines)
Explained: 92% of movement

Structured output ready to drop into Slack or an incident review — root cause, evidence, confidence, and next steps.
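A report like this maps naturally onto a small data structure; a hypothetical sketch of how it could render as a Slack-ready message (field names and formatting are assumptions, not the tool's actual schema):

```python
from dataclasses import dataclass, field

# Hypothetical shape of the structured report; all field names
# are assumptions for illustration.

@dataclass
class InvestigationReport:
    inv_id: str
    question: str
    bottom_line: str
    confidence: str
    explained_pct: int
    metrics: dict = field(default_factory=dict)

    def to_slack(self) -> str:
        # Render root cause, evidence, confidence, and next steps
        # as one Slack-ready message.
        lines = [
            f"*{self.inv_id}* · {self.question}",
            f"*Bottom line:* {self.bottom_line}",
            f"*Confidence:* {self.confidence} · explains {self.explained_pct}% of movement",
        ]
        for name, delta in self.metrics.items():
            lines.append(f"• {name}: {delta}")
        return "\n".join(lines)

report = InvestigationReport(
    inv_id="INV-2024-037",
    question="Why did Click Quality drop 6.2% for Standard tier?",
    bottom_line="Ranking model v3.2 rollout degraded Standard-tier relevance.",
    confidence="HIGH",
    explained_pct=92,
    metrics={"Click Quality": "-6.2%", "Search Quality": "-1.1%", "AI Trigger": "+2.4%"},
)
print(report.to_slack())
```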

07

Full Transparency:
Execution Trace

Every decision is logged. Reviewers can expand any phase to see exactly what the tool did and why.

Phase 0: Classification · 1.2s
  KNOWLEDGE   Load metric definitions · 0.3s
  REASONING   Classify metric type → click-based · 0.9s
Phase 1: Data Quality & Schema · 3.8s
  SQL QUERY   Validate data completeness (7-day window) · 1.1s
  KNOWLEDGE   Load co-movement patterns · 0.2s
  REASONING   Data quality gate → PASS · 0.8s
  SQL QUERY   Dimensional decomposition (6 dims) · 1.7s
Phase 2: Hypothesis Testing · 5.2s
  REASONING   Generate 7 candidate hypotheses · 1.4s
  SQL QUERY   Test ranking model change hypothesis · 2.1s
Phase 3: Synthesis & Report · 2.1s
  REASONING   Synthesize findings → root cause · 1.5s

Total: 12.3s · 4 phases · 11 steps · Full audit trail for every investigation
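An audit trail like this needs only a small append-only log keyed by phase and step type; a minimal sketch, with step kinds and timings as illustrative values:

```python
# Minimal sketch of a phase/step audit trail.
# Step kinds, descriptions, and timings are illustrative values.

class Trace:
    def __init__(self):
        self.steps = []

    def log(self, phase, kind, description, seconds):
        # Append-only: every decision the tool makes gets a row.
        self.steps.append({"phase": phase, "kind": kind,
                           "desc": description, "secs": seconds})

    def summary(self):
        phases = {s["phase"] for s in self.steps}
        total = sum(s["secs"] for s in self.steps)
        return f"Total: {total:.1f}s · {len(phases)} phases · {len(self.steps)} steps"

trace = Trace()
trace.log(0, "KNOWLEDGE", "Load metric definitions", 0.3)
trace.log(0, "REASONING", "Classify metric type -> click-based", 0.9)
trace.log(1, "SQL QUERY", "Validate data completeness", 1.1)
print(trace.summary())  # Total: 2.3s · 2 phases · 3 steps
```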

08

The Domain:
Enterprise Search

A multi-layered system where a change at any level ripples through every metric downstream.

Query Understanding
Document Retrieval
Result Ranking
Final Results + AI Answers

50+ Data Sources

Confluence, Jira, Slack, Google Drive, SharePoint — each a potential source of metric movement when something breaks.

Segments Mask Each Other

Standard, Premium, Enterprise tiers. AI answers on top. A regression in one segment can mask improvement in another.

The tool encodes domain knowledge as structured YAML — metric formulas, co-movement patterns, known incident signatures, hypothesis priority order — not just prompt instructions.

09

How It Works Under the Hood

[Architecture diagram]
User Question → Phase 0: Classify → Lead Agent → DISPATCH → Sub-Agent H0 (Rate Decomp) · Sub-Agent H1 (Playbook) · Sub-Agent H2 (User or Playbook) → Report + Transparency Log
Knowledge sources: Playbook · SEV Archive · Metric Registry · Schema Catalog · Corrections Index
Sub-agent actions: Query, Compare, Explain
10

The Builder Journey

Not vibe coding. This system was built by giving AI the right context — specs, tests, and domain knowledge — so it could build something complex on my behalf.

Method
Specs Before Code
Every feature starts with a written spec: what it does, what "done" looks like, edge cases. The AI builds against the spec, not a vague prompt.
Method
Tests Before Code
571 tests define the expected behavior. The AI writes code that passes them — not code that "looks right."
Method
Expert Reviews Before Building
Architecture decisions stress-tested against DS Lead, PM Lead, and Senior Eng personas — simulating the IC review that separates good from great.
11

Domain Expert on Demand

Domain-specific AI agents fail without expert pressure-testing. Generic LLMs get easy cases right and hard cases catastrophically wrong.

Why Generic AI Fails Here

  • Doesn't know your metric formulas
  • Can't tell a real regression from an AI adoption signal
  • Hallucinates plausible-sounding but wrong diagnoses

Persona-Based Review

  • DS Lead — metric validity, methodology gaps
  • PM Lead — business value, adoption risk
  • Senior Eng — system design, failure modes

One review caught that the system's quality checks were weakest at the most critical stage — the final diagnosis. That single finding reshaped the entire v2.

12

v2: Better Diagnosis Quality,
Structurally Enforced

v1 worked but relied on the AI "following instructions." v2 makes quality checks structural — the system validates its own reasoning at every stage.

What changed
Validation Rules
11 domain rules checked at every stage boundary. Example: "If AI metrics are stable, do not attribute click quality drops to AI adoption."
What changed
Decision Audit Trail
Every key decision — which direction is "bad," which theories were tested, what conclusion was drawn — is logged and reviewable.
What changed
Graduated Strictness
Early stages (data parsing) fail hard on errors. Later stages (final diagnosis) get a second chance before failing — matching the stakes at each step.
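Graduated strictness can be expressed as a small wrapper around each stage; a sketch under the assumption that stages raise on validation failure (all names are hypothetical):

```python
# Sketch of graduated strictness: early stages fail hard,
# later stages get one retry, then degrade to a soft warning.
# Names and the exception convention are hypothetical.

def run_stage(stage_fn, *, strict: bool, retries: int = 1):
    try:
        return stage_fn()
    except ValueError:
        if strict:
            raise                       # data-parsing stage: fail hard
        for _ in range(retries):        # diagnosis stage: second chance
            try:
                return stage_fn()
            except ValueError:
                continue
        return {"status": "SOFT_WARN"}  # degrade gracefully, log a warning

calls = {"n": 0}
def flaky_diagnosis():
    # Fails once, succeeds on the retry.
    calls["n"] += 1
    if calls["n"] < 2:
        raise ValueError("inconsistent evidence")
    return {"status": "OK"}

print(run_stage(flaky_diagnosis, strict=False))  # {'status': 'OK'}
```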
13

Building Rigor, Honestly

Rigorous offline and online eval is still being built. Here's where we are.

What We Have

  • 571 tests — all passing, 0 failures
  • 6 eval scenarios with scoring rubrics
  • Known-answer tests — must find the right root cause, must not blame the wrong thing

What We're Building

  • More scenarios — edge cases, adversarial inputs
  • Online eval — compare against real human investigations
  • Confidence calibration — does "High confidence" actually mean high accuracy?
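The calibration question above reduces to bucketing known-answer results by stated confidence and comparing observed accuracy; a sketch with made-up data:

```python
from collections import defaultdict

# Sketch of a confidence-calibration check. Data values are
# illustrative, not real eval results.

def calibration(results):
    # results: list of (stated_confidence, was_correct) pairs.
    buckets = defaultdict(list)
    for conf, correct in results:
        buckets[conf].append(correct)
    # Observed accuracy per stated confidence level.
    return {conf: sum(v) / len(v) for conf, v in buckets.items()}

observed = calibration([
    ("HIGH", True), ("HIGH", True), ("HIGH", True), ("HIGH", False),
    ("MEDIUM", True), ("MEDIUM", False),
])
print(observed["HIGH"])    # 0.75
print(observed["MEDIUM"])  # 0.5
```

If "HIGH" diagnoses are right only 75% of the time, the label is miscalibrated and should be tightened.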
14

Why AI-Driven DS
Solutions Are Hard

Building it is 30%. Getting people to trust and use it is 70%.

Challenge
Domain Accuracy
One bad diagnosis erodes months of trust. Domain constraints are non-negotiable.
Challenge
Hallucination Risk
A plausible-sounding wrong root cause is worse than no answer. Senior ICs will test it on known cases.
Challenge
Internal Adoption
Trust is earned one investigation at a time. Show the work, state confidence honestly, be useful on day one.

The critical path: show the work, earn IC trust, start with structured legwork, add judgment later.

15

What's Next

W3

The Tool Learns From Corrections

When a DS says "that's wrong," the system remembers. Accumulated corrections improve future investigations.
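One plausible shape for such a corrections store, keyed by the signal signature of the investigation (structure and names are assumptions, not the tool's design):

```python
# Hypothetical sketch of a corrections index: when a DS marks a
# diagnosis wrong, record the case so future runs consult it first.

class CorrectionsIndex:
    def __init__(self):
        self._by_signature = {}

    def record(self, signature, wrong_diagnosis, correct_diagnosis):
        self._by_signature[signature] = {
            "rejected": wrong_diagnosis,
            "accepted": correct_diagnosis,
        }

    def lookup(self, signature):
        # Consulted before hypothesis ranking on future investigations.
        return self._by_signature.get(signature)

index = CorrectionsIndex()
index.record(
    signature=("click_quality", "DOWN", "ai_trigger", "UP"),
    wrong_diagnosis="ranking_regression",
    correct_diagnosis="ai_adoption_effect",
)
hit = index.lookup(("click_quality", "DOWN", "ai_trigger", "UP"))
print(hit["accepted"])  # ai_adoption_effect
```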

W4

End-to-End Automated Pipeline

Full investigation runs autonomously. Self-assessment at the final stage: "How confident am I in this diagnosis?"

v3

Eval Against Real Investigations

Compare tool output against how experienced DSs actually diagnosed the same metric movements.

North star: a tool that earns trust by being transparent, domain-accurate, and honest about what it doesn't know.

Eng Focus
E1

What the Tool Knows
About Your System

Domain knowledge is encoded as structured YAML — not prompt instructions. The tool reasons from data, not vibes.

# The AI adoption trap — encoded as a rule, not a suggestion
pattern: ai_answers_working
signal:
  click_quality: DOWN
  search_quality_success: STABLE or UP
  ai_trigger: UP
verdict: POSITIVE — AI answers cannibalizing clicks by design
action: Do NOT treat as regression

# Connector outage signature — skip 30min of decomposition
pattern: connector_auth_expiry
signal:
  zero_result_rate: SPIKE
  connector_health: FAILURES in single connector
  onset: gradual
shortcut: Jump to connector root cause directly

5 YAML knowledge files: metric definitions, co-movement patterns, historical incidents, hypothesis priority, architecture tradeoffs.
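A pattern like `ai_answers_working` can be matched mechanically once loaded; a self-contained sketch with the pattern inlined as a dict (in the real tool it would presumably come from `yaml.safe_load` on a knowledge file):

```python
# Sketch of matching an observed metric signature against an
# encoded pattern. The pattern mirrors ai_answers_working above;
# the matcher itself is a hypothetical illustration.

AI_ANSWERS_WORKING = {
    "click_quality": "DOWN",
    "search_quality_success": ("STABLE", "UP"),  # either direction matches
    "ai_trigger": "UP",
}

def matches(pattern, observed):
    for metric, expected in pattern.items():
        allowed = expected if isinstance(expected, tuple) else (expected,)
        if observed.get(metric) not in allowed:
            return False
    return True

observed = {"click_quality": "DOWN",
            "search_quality_success": "STABLE",
            "ai_trigger": "UP"}
if matches(AI_ANSWERS_WORKING, observed):
    print("Do NOT treat as regression")  # verdict: AI adoption, by design
```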

Eng Focus
E2

v2 Architecture:
How Enforcement Works

Quality rules run at stage boundaries — not as prompt instructions the AI might ignore.

# Seam validator: runs between HYPOTHESIZE → DISPATCH
rule: hypotheses_consistent_with_co_movement
check: "If AI metrics stable, no hypothesis may
        attribute click_quality drop to AI adoption"
on_violation: REJECT hypothesis, log to trace

# Seam validator: runs at SYNTHESIZE output
rule: effect_size_proportionality
check: "Root cause segment contribution must be
        proportional to claimed severity (P0/P1/P2)"
on_violation: RETRY once, then SOFT WARN
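The co-movement rule above could be enforced in code at the HYPOTHESIZE → DISPATCH boundary; a minimal Python sketch (function and hypothesis names are illustrative, not the tool's real identifiers):

```python
# Sketch of a seam validator at the HYPOTHESIZE -> DISPATCH boundary:
# the rule runs in code, not as a prompt instruction the AI might ignore.

def validate_hypotheses(hypotheses, co_movement):
    kept, trace = [], []
    for h in hypotheses:
        # Rule: if AI metrics are stable, no hypothesis may attribute
        # a click-quality drop to AI adoption.
        if h == "ai_adoption_effect" and co_movement["ai_trigger"] == "STABLE":
            trace.append(f"REJECT {h}: AI metrics stable")
            continue
        kept.append(h)
    return kept, trace

kept, trace = validate_hypotheses(
    ["ranking_model_change", "ai_adoption_effect"],
    co_movement={"ai_trigger": "STABLE"},
)
print(kept)   # ['ranking_model_change']
print(trace)  # ['REJECT ai_adoption_effect: AI metrics stable']
```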
Eng Focus
E3

Build Your Own:
The Reusable Pattern

This isn't a one-off project. It's a pattern any eng team can follow to build domain-specific AI agents.

01

Pick a Repetitive Investigation

What does your team debug the same way every time? Incident triage, regression analysis, capacity planning — any structured legwork.

02

Encode Domain Knowledge as Data

YAML files, not prompt instructions. Metric formulas, known patterns, decision trees. Structured and versionable.

03

Build Stages That Mirror Expert Thinking

How does your best IC investigate? Encode that sequence. Add validation rules between stages.

04

Review With Domain Experts Before Shipping

Have your senior ICs stress-test the tool on known cases. Their "that's wrong" is your best test suite.

Thank You

Search Metric Analyzer — built from zero through specs, tests, and expert reviews.

questions welcome

Built with

Claude Code

Specs + Tests + Expert Reviews

571 tests · 6 eval scenarios · 11 validation rules