Building a complex AI system through systematic architecture — without writing code
Every investigation reassembles the same puzzle from scratch. The structured legwork repeats, while the judgment calls that matter get squeezed out.
What if a tool did the structured legwork, so data scientists could focus on the judgment calls that actually matter?
That's how a senior DS actually investigates. The pipeline encodes that discipline.
1. Detect: What moved? By how much? Is the data even trustworthy?
2. Hypothesize: What could explain this? Rank theories by likelihood.
3. Dispatch: Test each theory against the data. Supported, rejected, or inconclusive.
4. Synthesize: What's the answer? How confident are we? What should we do next?
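The four-phase loop above can be sketched in a few lines. This is a minimal illustration, not the tool's actual interfaces: the `Finding` record and phase function names are assumptions for the sketch.

```python
from dataclasses import dataclass, field

@dataclass
class Finding:
    """Carries evidence and the audit trail through the investigation."""
    metric: str
    notes: list = field(default_factory=list)

def detect(f):        # What moved? Is the data trustworthy?
    f.notes.append("detect: step-change found, data quality OK")
    return f

def hypothesize(f):   # Rank candidate explanations by likelihood
    f.notes.append("hypothesize: ranked 3 theories")
    return f

def dispatch(f):      # Test each theory against the data
    f.notes.append("dispatch: theory 1 supported, others rejected")
    return f

def synthesize(f):    # Answer, confidence, next steps
    f.notes.append("synthesize: high confidence, recommend rollback review")
    return f

PHASES = [detect, hypothesize, dispatch, synthesize]

def investigate(metric: str) -> Finding:
    finding = Finding(metric)
    for phase in PHASES:   # every phase appends to the audit trail
        finding = phase(finding)
    return finding
```

Running `investigate("click_quality")` returns a `Finding` whose notes double as the audit trail reviewers can expand later.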
Click Quality drops 6.2% for Standard tier. The tool runs:
Data quality passes. Step-change detected 3 days ago. Concentrated in Standard tier.
Top theory: ranking model change. AI metrics stable — rules out AI adoption effect.
Standard tier explains 92% of the movement. Top-3 positions most affected.
High confidence: ranking model regression. Recommends rollback review → Ranking team.
Structured output ready to drop into Slack or an incident review — root cause, evidence, confidence, and next steps.
Every decision is logged. Reviewers can expand any phase to see exactly what the tool did and why.
Total: 12.3s · 4 phases · 11 steps · Full audit trail for every investigation
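The "Standard tier explains 92%" figure comes from a standard segment decomposition: each tier's contribution is its share of the total metric delta. A sketch with illustrative numbers (not the real incident data):

```python
def segment_contributions(before: dict, after: dict) -> dict:
    """Share of the total metric movement attributable to each segment.

    Values are traffic-weighted metric levels, so per-segment deltas
    sum to the overall delta.
    """
    deltas = {seg: after[seg] - before[seg] for seg in before}
    total = sum(deltas.values())
    return {seg: d / total for seg, d in deltas.items()}

# Illustrative numbers only: an overall drop where Standard moves most.
before = {"Standard": 0.5000, "Premium": 0.3000, "Enterprise": 0.1977 + 0.0023}
after  = {"Standard": 0.4448, "Premium": 0.2975, "Enterprise": 0.1977}

contrib = segment_contributions(before, after)
# contrib["Standard"] is about 0.92: Standard explains ~92% of the movement
```

Contributions always sum to 1, which is itself a useful sanity check the tool can assert at the seam.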
A multi-layered system where a change at any level ripples through every metric downstream.
Confluence, Jira, Slack, Google Drive, SharePoint — each a potential source of metric movement when something breaks.
Standard, Premium, Enterprise tiers. AI answers on top. A regression in one segment can mask improvement in another.
The tool encodes domain knowledge as structured YAML — metric formulas, co-movement patterns, known incident signatures, hypothesis priority order — not just prompt instructions.
Not vibe coding. This system was built by giving AI the right context — specs, tests, and domain knowledge — so it could build something complex on my behalf.
Domain-specific AI agents fail without expert pressure-testing. Generic LLMs get easy cases right and hard cases catastrophically wrong.
One review caught that the system's quality checks were weakest at the most critical stage — the final diagnosis. That single finding reshaped the entire v2.
v1 worked but relied on the AI "following instructions." v2 makes quality checks structural — the system validates its own reasoning at every stage.
Rigorous offline and online eval is still being built. Here's where we are.
Building it is 30%. Getting people to trust and use it is 70%.
The critical path: show the work, earn IC trust, start with structured legwork, add judgment later.
When a DS says "that's wrong," the system remembers. Accumulated corrections improve future investigations.
Full investigation runs autonomously. Self-assessment at the final stage: "How confident am I in this diagnosis?"
Compare tool output against how experienced DSs actually diagnosed the same metric movements.
North star: a tool that earns trust by being transparent, domain-accurate, and honest about what it doesn't know.
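The "system remembers" loop from the roadmap above could start as something as simple as an append-only correction log that gets replayed into future investigations. The file path and record shape here are assumptions, not the tool's actual format:

```python
import json
from pathlib import Path

class CorrectionStore:
    """Append-only log of reviewer corrections (illustrative sketch)."""

    def __init__(self, path: str = "corrections.jsonl"):
        self.path = Path(path)

    def record(self, case_id: str, tool_verdict: str, correction: str):
        """Log one 'that's wrong' from a DS, with what the tool had said."""
        entry = {"case": case_id,
                 "tool_said": tool_verdict,
                 "expert_said": correction}
        with self.path.open("a") as f:
            f.write(json.dumps(entry) + "\n")

    def load(self) -> list:
        """Replay all past corrections, e.g. into the hypothesis-ranking stage."""
        if not self.path.exists():
            return []
        return [json.loads(line) for line in self.path.read_text().splitlines()]
```

The design choice that matters: corrections are data, not prompt edits, so they version, diff, and audit like everything else in the system.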
Domain knowledge is encoded as structured YAML — not prompt instructions. The tool reasons from data, not vibes.
# The AI adoption trap — encoded as a rule, not a suggestion
pattern: ai_answers_working
signal:
  click_quality: DOWN
  search_quality_success: STABLE or UP
  ai_trigger: UP
verdict: POSITIVE — AI answers cannibalizing clicks by design
action: Do NOT treat as regression

# Connector outage signature — skip 30min of decomposition
pattern: connector_auth_expiry
signal:
  zero_result_rate: SPIKE
  connector_health: FAILURES in single connector
onset: gradual
shortcut: Jump to connector root cause directly
5 YAML knowledge files: metric definitions, co-movement patterns, historical incidents, hypothesis priority, architecture tradeoffs.
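Matching an observed movement against those encoded signatures might look like the following sketch. The patterns are inlined as dicts for brevity (the real tool loads them from the YAML files), and the signal values are simplified:

```python
# Simplified versions of two knowledge-file signatures (illustrative).
PATTERNS = [
    {
        "name": "ai_answers_working",
        "signal": {"click_quality": "DOWN",
                   "search_quality_success": "STABLE",
                   "ai_trigger": "UP"},
        "verdict": "POSITIVE: cannibalization by design, not a regression",
    },
    {
        "name": "connector_auth_expiry",
        "signal": {"zero_result_rate": "SPIKE",
                   "connector_health": "FAILURES"},
        "verdict": "connector outage: jump straight to connector root cause",
    },
]

def match_patterns(observed: dict, patterns=PATTERNS) -> list:
    """Return every known signature whose signals all hold in the observation."""
    return [p["name"] for p in patterns
            if all(observed.get(m) == level for m, level in p["signal"].items())]

observed = {"zero_result_rate": "SPIKE",
            "connector_health": "FAILURES",
            "click_quality": "STABLE"}
# match_patterns(observed) returns ["connector_auth_expiry"]
```

A hit on a known signature is what lets the tool take the "shortcut" the YAML describes instead of running the full decomposition.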
Quality rules run at stage boundaries — not as prompt instructions the AI might ignore.
# Seam validator: runs between HYPOTHESIZE → DISPATCH
rule: hypotheses_consistent_with_co_movement
check: "If AI metrics stable, no hypothesis may attribute click_quality drop to AI adoption"
on_violation: REJECT hypothesis, log to trace

# Seam validator: runs at SYNTHESIZE output
rule: effect_size_proportionality
check: "Root cause segment contribution must be proportional to claimed severity (P0/P1/P2)"
on_violation: RETRY once, then SOFT WARN
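A seam runner enforcing the first rule could be sketched as follows. The state shape and function names are assumptions for illustration, not the tool's internals:

```python
def no_ai_attribution_if_ai_stable(state: dict) -> list:
    """Seam rule for the HYPOTHESIZE -> DISPATCH boundary (illustrative)."""
    if state["ai_metrics"] == "STABLE":
        return [h for h in state["hypotheses"] if h["cause"] != "ai_adoption"]
    return state["hypotheses"]

def run_seam(state: dict, validators: list) -> dict:
    """Apply each validator at a stage boundary; rejections go to the trace."""
    for validate in validators:
        kept = validate(state)
        for h in state["hypotheses"]:
            if h not in kept:
                state["trace"].append(f"REJECT {h['cause']}: {validate.__name__}")
        state["hypotheses"] = kept
    return state

state = run_seam(
    {"ai_metrics": "STABLE",
     "hypotheses": [{"cause": "ranking_model_change"}, {"cause": "ai_adoption"}],
     "trace": []},
    [no_ai_attribution_if_ai_stable],
)
# The ai_adoption hypothesis is rejected and the rejection is logged to the trace.
```

Because the validator runs at the boundary, a hypothesis the AI "shouldn't" have produced is structurally removed and audited rather than merely discouraged in a prompt.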
This isn't a one-off project. It's a pattern any eng team can follow to build domain-specific AI agents.
What does your team debug the same way every time? Incident triage, regression analysis, capacity planning — any structured legwork.
YAML files, not prompt instructions. Metric formulas, known patterns, decision trees. Structured and versionable.
How does your best IC investigate? Encode that sequence. Add validation rules between stages.
Have your senior ICs stress-test the tool on known cases. Their "that's wrong" is your best test suite.
Search Metric Analyzer — built from zero through specs, tests, and expert reviews.
Built with Claude Code · Specs + Tests + Expert Reviews
571 tests · 6 eval scenarios · 11 validation rules