0 → 1 builder journey

Search Metric
Analyzer

Building a complex AI system through systematic architecture — without writing code

01 Understand
02 Hypothesize
03 Dispatch
04 Synthesize
02

The Problem:
Knowledge Is Scattered

Every investigation re-assembles the same puzzle from scratch. The structured legwork repeats, while the judgment calls that matter get squeezed.

What if we reclaimed that time: let a tool do the structured legwork, so humans focus on the judgment calls that actually matter?
04

Why a 4-Stage Pipeline?

Because that's how a senior DS actually investigates. The pipeline encodes that discipline.

01

Understand

What moved? By how much? Is the data even trustworthy?

02

Hypothesize

What could explain this? Rank theories by likelihood.

03

Dispatch

Test each theory against the data. Supported, rejected, or inconclusive.

04

Synthesize

What's the answer? How confident are we? What should we do next?
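The four stages read as a control flow; here is a minimal Python sketch of that flow (all function names and return values are hypothetical illustrations, not the tool's real API):

```python
# Minimal sketch of the 4-stage investigation pipeline.
# Every name and value here is a hypothetical illustration.

def understand(metric_event):
    # Stage 1: what moved, by how much, is the data trustworthy?
    return {"metric": metric_event["metric"],
            "delta": metric_event["delta"],
            "data_quality": "PASS"}

def hypothesize(findings):
    # Stage 2: rank candidate explanations by likelihood.
    return ["ranking_model_change", "ai_adoption_effect", "connector_outage"]

def dispatch(hypotheses, findings):
    # Stage 3: test each theory; supported, rejected, or inconclusive.
    return {h: ("supported" if h == "ranking_model_change" else "rejected")
            for h in hypotheses}

def synthesize(verdicts):
    # Stage 4: answer, confidence, next step.
    root_cause = next(h for h, v in verdicts.items() if v == "supported")
    return {"root_cause": root_cause, "confidence": "HIGH",
            "next_step": "rollback review"}

def investigate(metric_event):
    findings = understand(metric_event)
    hypotheses = hypothesize(findings)
    verdicts = dispatch(hypotheses, findings)
    return synthesize(verdicts)

report = investigate({"metric": "click_quality", "delta": -0.062})
print(report["root_cause"])  # ranking_model_change
```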

05

Demo: A Real Investigation

Click Quality drops 6.2% for Standard tier. The tool runs:

Understand

Data quality passes. Step-change detected 3 days ago. Concentrated in Standard tier.

Hypothesize

Top theory: ranking model change. AI metrics stable — rules out AI adoption effect.

Dispatch

Standard tier explains 92% of the movement. Top-3 positions most affected.

Synthesize

High confidence: ranking model regression. Recommends rollback review → Ranking team.

06

Real Output: Investigation Report

INV-2024-037 · Complete

Why did Click Quality drop 6.2% for Standard tier last week?

Bottom Line: Ranking model v3.2 rollout degraded result relevance for Standard tier. Top-3 position click-through rates dropped 8.1%. Recommend rollback review with Ranking team.

Click Quality: -6.2% (Standard tier)
Search Quality: -1.1% (AI offset)
AI Trigger: +2.4% (Stable)
Confidence: HIGH (3 evidence lines)
Explained: 92% of movement

Structured output ready to drop into Slack or an incident review — root cause, evidence, confidence, and next steps.
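A report like this maps naturally onto a small data structure; a hypothetical sketch of how it could render as a Slack-ready message (field names and formatting are assumptions, not the tool's actual schema):

```python
from dataclasses import dataclass, field

# Hypothetical shape of the structured report; all field names
# are assumptions for illustration.

@dataclass
class InvestigationReport:
    inv_id: str
    question: str
    bottom_line: str
    confidence: str
    explained_pct: int
    metrics: dict = field(default_factory=dict)

    def to_slack(self) -> str:
        # Render root cause, evidence, confidence, and next steps
        # as one Slack-ready message.
        lines = [
            f"*{self.inv_id}* · {self.question}",
            f"*Bottom line:* {self.bottom_line}",
            f"*Confidence:* {self.confidence} · explains {self.explained_pct}% of movement",
        ]
        for name, delta in self.metrics.items():
            lines.append(f"• {name}: {delta}")
        return "\n".join(lines)

report = InvestigationReport(
    inv_id="INV-2024-037",
    question="Why did Click Quality drop 6.2% for Standard tier?",
    bottom_line="Ranking model v3.2 rollout degraded Standard-tier relevance.",
    confidence="HIGH",
    explained_pct=92,
    metrics={"Click Quality": "-6.2%", "Search Quality": "-1.1%", "AI Trigger": "+2.4%"},
)
print(report.to_slack())
```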

07

Full Transparency:
Execution Trace

Every decision is logged. Reviewers can expand any phase to see exactly what the tool did and why.

Phase 0: Classification · 1.2s
  KNOWLEDGE   Load metric definitions · 0.3s
  REASONING   Classify metric type → click-based · 0.9s
Phase 1: Data Quality & Schema · 3.8s
  SQL QUERY   Validate data completeness (7-day window) · 1.1s
  KNOWLEDGE   Load co-movement patterns · 0.2s
  REASONING   Data quality gate → PASS · 0.8s
  SQL QUERY   Dimensional decomposition (6 dims) · 1.7s
Phase 2: Hypothesis Testing · 5.2s
  REASONING   Generate 7 candidate hypotheses · 1.4s
  SQL QUERY   Test ranking model change hypothesis · 2.1s
Phase 3: Synthesis & Report · 2.1s
  REASONING   Synthesize findings → root cause · 1.5s

Total: 12.3s · 4 phases · 11 steps · Full audit trail for every investigation
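An audit trail like this needs only a small append-only log keyed by phase and step type; a minimal sketch, with step kinds and timings as illustrative values:

```python
# Minimal sketch of a phase/step audit trail.
# Step kinds, descriptions, and timings are illustrative values.

class Trace:
    def __init__(self):
        self.steps = []

    def log(self, phase, kind, description, seconds):
        # Append-only: every decision the tool makes gets a row.
        self.steps.append({"phase": phase, "kind": kind,
                           "desc": description, "secs": seconds})

    def summary(self):
        phases = {s["phase"] for s in self.steps}
        total = sum(s["secs"] for s in self.steps)
        return f"Total: {total:.1f}s · {len(phases)} phases · {len(self.steps)} steps"

trace = Trace()
trace.log(0, "KNOWLEDGE", "Load metric definitions", 0.3)
trace.log(0, "REASONING", "Classify metric type -> click-based", 0.9)
trace.log(1, "SQL QUERY", "Validate data completeness", 1.1)
print(trace.summary())  # Total: 2.3s · 2 phases · 3 steps
```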

08

The Domain:
Enterprise Search

A multi-layered system where a change at any level ripples through every metric downstream.

Query Understanding
Document Retrieval
Result Ranking
Final Results + AI Answers

50+ Data Sources

Confluence, Jira, Slack, Google Drive, SharePoint — each a potential source of metric movement when something breaks.

Segments Mask Each Other

Standard, Premium, Enterprise tiers. AI answers on top. A regression in one segment can mask improvement in another.

The tool encodes domain knowledge as structured YAML — metric formulas, co-movement patterns, known incident signatures, hypothesis priority order — not just prompt instructions.

09

How It Works Under the Hood

[Architecture diagram]
User Question → Phase 0: Classify → Lead Agent → DISPATCH → Sub-Agent H0 (Rate Decomp) · Sub-Agent H1 (Playbook) · Sub-Agent H2 (User or Playbook) → Report + Transparency Log
Knowledge sources: Playbook · SEV Archive · Metric Registry · Schema Catalog · Corrections Index
Sub-agent actions: Query, Compare, Explain
10

The Builder Journey

Not vibe coding. This system was built by giving AI the right context — specs, tests, and domain knowledge — so it could build something complex on my behalf.

Method
Specs Before Code
Every feature starts with a written spec: what it does, what "done" looks like, edge cases. The AI builds against the spec, not a vague prompt.
Method
Tests Before Code
571 tests define the expected behavior. The AI writes code that passes them — not code that "looks right."
Method
Expert Reviews Before Building
Architecture decisions stress-tested against DS Lead, PM Lead, and Senior Eng personas — simulating the IC review that separates good from great.
11

Domain Expert on Demand

Domain-specific AI agents fail without expert pressure-testing. Generic LLMs get easy cases right and hard cases catastrophically wrong.

Why Generic AI Fails Here

  • Doesn't know your metric formulas
  • Can't tell a real regression from an AI adoption signal
  • Hallucinates plausible-sounding but wrong diagnoses

Persona-Based Review

  • DS Lead — metric validity, methodology gaps
  • PM Lead — business value, adoption risk
  • Senior Eng — system design, failure modes

One review caught that the system's quality checks were weakest at the most critical stage — the final diagnosis. That single finding reshaped the entire v2.

12

v2: Better Diagnosis Quality,
Structurally Enforced

v1 worked but relied on the AI "following instructions." v2 makes quality checks structural — the system validates its own reasoning at every stage.

What changed
Validation Rules
11 domain rules checked at every stage boundary. Example: "If AI metrics are stable, do not attribute click quality drops to AI adoption."
What changed
Decision Audit Trail
Every key decision — which direction is "bad," which theories were tested, what conclusion was drawn — is logged and reviewable.
What changed
Graduated Strictness
Early stages (data parsing) fail hard on errors. Later stages (final diagnosis) get a second chance before failing — matching the stakes at each step.
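Graduated strictness can be expressed as a small wrapper around each stage; a sketch under the assumption that stages raise on validation failure (all names are hypothetical):

```python
# Sketch of graduated strictness: early stages fail hard,
# later stages get one retry, then degrade to a soft warning.
# Names and the exception convention are hypothetical.

def run_stage(stage_fn, *, strict: bool, retries: int = 1):
    try:
        return stage_fn()
    except ValueError:
        if strict:
            raise                       # data-parsing stage: fail hard
        for _ in range(retries):        # diagnosis stage: second chance
            try:
                return stage_fn()
            except ValueError:
                continue
        return {"status": "SOFT_WARN"}  # degrade gracefully, log a warning

calls = {"n": 0}
def flaky_diagnosis():
    # Fails once, succeeds on the retry.
    calls["n"] += 1
    if calls["n"] < 2:
        raise ValueError("inconsistent evidence")
    return {"status": "OK"}

print(run_stage(flaky_diagnosis, strict=False))  # {'status': 'OK'}
```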
13

Building Rigor, Honestly

Rigorous offline and online eval is still being built. Here's where we are.

What We Have

  • 571 tests — all passing, 0 failures
  • 6 eval scenarios with scoring rubrics
  • Known-answer tests — must find the right root cause, must not blame the wrong thing

What We're Building

  • More scenarios — edge cases, adversarial inputs
  • Online eval — compare against real human investigations
  • Confidence calibration — does "High confidence" actually mean high accuracy?
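The calibration question above reduces to bucketing known-answer results by stated confidence and comparing observed accuracy; a sketch with made-up data:

```python
from collections import defaultdict

# Sketch of a confidence-calibration check. Data values are
# illustrative, not real eval results.

def calibration(results):
    # results: list of (stated_confidence, was_correct) pairs.
    buckets = defaultdict(list)
    for conf, correct in results:
        buckets[conf].append(correct)
    # Observed accuracy per stated confidence level.
    return {conf: sum(v) / len(v) for conf, v in buckets.items()}

observed = calibration([
    ("HIGH", True), ("HIGH", True), ("HIGH", True), ("HIGH", False),
    ("MEDIUM", True), ("MEDIUM", False),
])
print(observed["HIGH"])    # 0.75
print(observed["MEDIUM"])  # 0.5
```

If "HIGH" diagnoses are right only 75% of the time, the label is miscalibrated and should be tightened.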
14

Why AI-Driven DS
Solutions Are Hard

Building it is 30%. Getting people to trust and use it is 70%.

Challenge
Domain Accuracy
One bad diagnosis erodes months of trust. Domain constraints are non-negotiable.
Challenge
Hallucination Risk
A plausible-sounding wrong root cause is worse than no answer. Senior ICs will test it on known cases.
Challenge
Internal Adoption
Trust is earned one investigation at a time. Show the work, state confidence honestly, be useful on day one.

The critical path: show the work, earn IC trust, start with structured legwork, add judgment later.

15

What's Next

W3

The Tool Learns From Corrections

When a DS says "that's wrong," the system remembers. Accumulated corrections improve future investigations.
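One plausible shape for such a corrections store, keyed by the signal signature of the investigation (structure and names are assumptions, not the tool's design):

```python
# Hypothetical sketch of a corrections index: when a DS marks a
# diagnosis wrong, record the case so future runs consult it first.

class CorrectionsIndex:
    def __init__(self):
        self._by_signature = {}

    def record(self, signature, wrong_diagnosis, correct_diagnosis):
        self._by_signature[signature] = {
            "rejected": wrong_diagnosis,
            "accepted": correct_diagnosis,
        }

    def lookup(self, signature):
        # Consulted before hypothesis ranking on future investigations.
        return self._by_signature.get(signature)

index = CorrectionsIndex()
index.record(
    signature=("click_quality", "DOWN", "ai_trigger", "UP"),
    wrong_diagnosis="ranking_regression",
    correct_diagnosis="ai_adoption_effect",
)
hit = index.lookup(("click_quality", "DOWN", "ai_trigger", "UP"))
print(hit["accepted"])  # ai_adoption_effect
```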

W4

End-to-End Automated Pipeline

Full investigation runs autonomously. Self-assessment at the final stage: "How confident am I in this diagnosis?"

v3

Eval Against Real Investigations

Compare tool output against how experienced DSs actually diagnosed the same metric movements.

North star: a tool that earns trust by being transparent, domain-accurate, and honest about what it doesn't know.

Eng Focus
E1

What the Tool Knows
About Your System

Domain knowledge is encoded as structured YAML — not prompt instructions. The tool reasons from data, not vibes.

# The AI adoption trap — encoded as a rule, not a suggestion
pattern: ai_answers_working
signal:
  click_quality: DOWN
  search_quality_success: STABLE or UP
  ai_trigger: UP
verdict: POSITIVE — AI answers cannibalizing clicks by design
action: Do NOT treat as regression

# Connector outage signature — skip 30min of decomposition
pattern: connector_auth_expiry
signal:
  zero_result_rate: SPIKE
  connector_health: FAILURES in single connector
  onset: gradual
shortcut: Jump to connector root cause directly

5 YAML knowledge files: metric definitions, co-movement patterns, historical incidents, hypothesis priority, architecture tradeoffs.
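A pattern like `ai_answers_working` can be matched mechanically once loaded; a self-contained sketch with the pattern inlined as a dict (in the real tool it would presumably come from `yaml.safe_load` on a knowledge file):

```python
# Sketch of matching an observed metric signature against an
# encoded pattern. The pattern mirrors ai_answers_working above;
# the matcher itself is a hypothetical illustration.

AI_ANSWERS_WORKING = {
    "click_quality": "DOWN",
    "search_quality_success": ("STABLE", "UP"),  # either direction matches
    "ai_trigger": "UP",
}

def matches(pattern, observed):
    for metric, expected in pattern.items():
        allowed = expected if isinstance(expected, tuple) else (expected,)
        if observed.get(metric) not in allowed:
            return False
    return True

observed = {"click_quality": "DOWN",
            "search_quality_success": "STABLE",
            "ai_trigger": "UP"}
if matches(AI_ANSWERS_WORKING, observed):
    print("Do NOT treat as regression")  # verdict: AI adoption, by design
```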

Eng Focus
E2

v2 Architecture:
How Enforcement Works

Quality rules run at stage boundaries — not as prompt instructions the AI might ignore.

# Seam validator: runs between HYPOTHESIZE → DISPATCH
rule: hypotheses_consistent_with_co_movement
check: "If AI metrics stable, no hypothesis may
        attribute click_quality drop to AI adoption"
on_violation: REJECT hypothesis, log to trace

# Seam validator: runs at SYNTHESIZE output
rule: effect_size_proportionality
check: "Root cause segment contribution must be
        proportional to claimed severity (P0/P1/P2)"
on_violation: RETRY once, then SOFT WARN
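The co-movement rule above could be enforced in code at the HYPOTHESIZE → DISPATCH boundary; a minimal Python sketch (function and hypothesis names are illustrative, not the tool's real identifiers):

```python
# Sketch of a seam validator at the HYPOTHESIZE -> DISPATCH boundary:
# the rule runs in code, not as a prompt instruction the AI might ignore.

def validate_hypotheses(hypotheses, co_movement):
    kept, trace = [], []
    for h in hypotheses:
        # Rule: if AI metrics are stable, no hypothesis may attribute
        # a click-quality drop to AI adoption.
        if h == "ai_adoption_effect" and co_movement["ai_trigger"] == "STABLE":
            trace.append(f"REJECT {h}: AI metrics stable")
            continue
        kept.append(h)
    return kept, trace

kept, trace = validate_hypotheses(
    ["ranking_model_change", "ai_adoption_effect"],
    co_movement={"ai_trigger": "STABLE"},
)
print(kept)   # ['ranking_model_change']
print(trace)  # ['REJECT ai_adoption_effect: AI metrics stable']
```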
Eng Focus
E3

Build Your Own:
The Reusable Pattern

This isn't a one-off project. It's a pattern any eng team can follow to build domain-specific AI agents.

01

Pick a Repetitive Investigation

What does your team debug the same way every time? Incident triage, regression analysis, capacity planning — any structured legwork.

02

Encode Domain Knowledge as Data

YAML files, not prompt instructions. Metric formulas, known patterns, decision trees. Structured and versionable.

03

Build Stages That Mirror Expert Thinking

How does your best IC investigate? Encode that sequence. Add validation rules between stages.

04

Review With Domain Experts Before Shipping

Have your senior ICs stress-test the tool on known cases. Their "that's wrong" is your best test suite.

Thank You

Search Metric Analyzer — built from zero through specs, tests, and expert reviews.

questions welcome

Built with

Claude Code

Specs + Tests + Expert Reviews

571 tests · 6 eval scenarios · 11 validation rules