Embedded AI Engineer · Take-home submission

Five opportunities found. Two designed in full. One built to run.

Professional Services, COGS, and Support runs on a hidden, repeating task: a person reading messy input, interpreting it, and routing it by hand. This submission finds five high-value instances of that task, designs two of them in full, and builds one that runs.

Function

Pro Services, COGS & Support

Phases

Identify · Design · Build

Author

Jesse, Implementation Team Manager

Use the navigation up top to jump anywhere, or move through in order with the controls at the bottom. An AI assistant grounded in the full submission is available any time via the button on the right.

The through-line

Five instances of the same shape

These are not five unrelated ideas. They are five instances of one recurring problem inside the function. The inputs differ, the systems differ, but the architectural shape is identical every time.

01

Messy input

Unstructured data arrives from a customer, rep, or system, too variable for rules-based automation.

02

Human as transform

A person reads, interprets, cleans, classifies, and routes it. This step is the bottleneck.

03

LLM with grounding

A grounded model does the transform faster and consistently, at any hour.

04

Human gate

A review checkpoint preserves oversight before anything commits downstream.

Because the shape repeats, the first pipeline built creates reusable patterns and shared infrastructure that compound across every build after it. This is a repeatable strategy, not a collection of point solutions.

Phase 1

Five opportunities, ranked by impact

Ranked by labor hours at stake, timeline compression, breadth of beneficiaries, and proximity to a buildable implementation. Expand any card for the short version.

Combined conservative estimate: 3,600 to 6,300 labor hours per year in steady state. The range discounts for overlap between opportunities rather than stacking best cases. For context, the AI team is working toward eliminating 55,000 manual hours this year across many contributors; these five are a measurable contribution to that, not a claim on the whole.

Phase 2 · Architecture A · A classification task

WebCentral Revisions Intelligence Pipeline

WebCentral runs 30 to 40 revision rounds a week. Each one loses one to four days in a manual chain: a PM splits compound requests, classifies each as design or dev, copies them to OneNote, flags scope, and routes to an art director who, on two of four ADs, skips the review entirely. This replaces that chain with classification and routing that runs in seconds and applies one consistent scope standard on every batch.

The flow

The same four-step shape from the framing, made concrete. Input, the model layer, the routing branch, and the human gate that holds on the review path. A caption below explains why the numbering jumps from 01 to 03.

01 · Messy input

Trigger & context assembly

A PADS webhook fires on form submission. n8n pulls the project record and SharePoint URL from Cloud Coach and determines tier (Standard / Premium / Ultimate), then assembles a two-layer retrieval corpus.

03 · LLM with grounding

Two-layer RAG + single-pass classification

Layer 1 is a scope ruleset applied to every tier: what counts as a structural change, what is categorically out of scope. Layer 2 is what this customer actually approved (palette, wireframe, design system), varying by tier. In one pass the model splits compound entries, labels each design / dev / ambiguous, checks both layers, scores confidence, and drafts a PM clarification email if anything is flagged.

Routing decision

Confidence threshold branch

Confidence ≥ 0.75, no flags

Auto-route: write the DRAFT OneNote page and create the art-director ticket.

Below 0.75, or any scope flag

Route to PM first with the structured output, flag detail, and draft email.

04 · Human gate

PM gates the review path only

Every batch writes a DRAFT-marked OneNote page for visibility, the shadow-mode safeguard from day one. But the PM only blocks on the review path: sub-0.75 confidence or any scope flag. On clean, high-confidence batches the flow auto-creates and assigns the art-director ticket with no PM gate, and the PM gets an informational notice only. Art directors flag misclassifications with the existing Cloud Coach tag, which feeds evaluation.

The numbering jumps from 01 to 03 on purpose. Step 02 in the framing is the human transform, the manual bottleneck this pipeline exists to remove, so it is absent from the diagram by design. The flow goes straight from messy input to the model that replaces the hand-off.

Model selection

Structured input, constrained JSON output, pattern-matching against a known ruleset. That is a small-fast-model task. All three candidates reach through OpenRouter on one key, so this is a capability-and-cost call, not an infrastructure one.

Claude Haiku 4.5 Default
Current-gen, near Sonnet-4 reasoning, the capability headroom

$1.00 / $5.00

Claude Haiku 3.5 Eval can promote
Likely sufficient; the cheaper fallback if it scores equally

~$0.26 / ~$1.32

GPT-4o mini
Cost floor, but an aging line that loses on longevity, not price

$0.15 / $0.60

Self-hosted Mistral 7B via Ollama is named as the path if volume scales or governance requires keeping content off third-party APIs. At 30 to 40 runs a week the dollar difference is trivial, so the default takes the headroom and the eval settles the tie on real data.

Evaluation: offline before online

The unlock is that the golden set already exists. WebCentral PMs have hand-parsed, classified, and scope-flagged batches for over 18 months, every result archived as a OneNote page. That archive is a labeled corpus. No hand-labeling to fund.

Deterministic tier

Item count, role assignment, and scope-catching score directly against the archived page where the answer is unambiguous.

LLM-as-judge tier

PMs reworded freely, so semantic match (does this item describe the same request) is rated by a separate model call rather than string comparison.

Regression gate

Every prompt change or model swap re-runs the golden set and is blocked if scores regress. This is where the Haiku 4.5-vs-3.5 decision is settled on score, not assertion.

Feedback loop

Production misclassifications (AD tags, PM fallbacks) become new labeled rows, so each version is tested against the real failures of the one before it.

What gets measured online

Metric	Target
OneNote confirmation rate	> 85% at 60 days
PM review rate	< 30% at steady state
AD correction rate	< 5%
Scope flag precision	> 70%
Processing time (webhook to ticket)	< 5 minutes

Key tradeoffs

Two-layer RAG over single-layer

A project record alone misses categorical scope issues; a ruleset alone misses customer-specific deviations. Both layers are necessary.

Threshold routing over always-PM-first

Mandatory PM review on every batch would recreate the bottleneck. The OneNote confirmation and eval layer are the safeguards instead.

n8n over LangChain

Deterministic shape, one branch point, document-level retrieval against a small corpus. Agentic orchestration would add a dependency the task does not justify.

SharePoint URL pulled from Cloud Coach

Replaces fuzzy folder matching and its ~20% orphan rate with a populated field. One-field addition to a call already being made.

Failure modes instrumented: empty SharePoint URL halts the write rather than guessing, missing retrieval corpus halts the auto-route path, and revision text is treated as untrusted user content to guard against prompt injection.

Phase 2 · Architecture B · A synthesis task

Project Initiation Intelligence Brief

When a project enters the queue, an implementation manager is supposed to read the full account history before assigning a PM. The handoff form meant to capture it is used on one product, filled out poorly, and a cross-product rollout depends on sales buy-in that history says will not come. This pipeline makes that question moot: the information already lives in Cloud Coach and Gong, so the brief is produced automatically, for every product, with no form for sales to fill.

Same class of problem, handled two completely different ways. The contrast is the point: the task type changes, and almost every downstream decision changes with it.

Architecture A

Revisions Pipeline

A classification task

The work

Parse, label, and check items against a known scope ruleset. Bounded pattern-matching.

Model, and why

Small and fast. Claude Haiku 4.5. Constrained JSON output against a ruleset does not need a reasoning-heavy model.

Evaluation

Retrospective. 18 months of archived OneNote pages are labeled ground truth, graded by deterministic scoring plus LLM-as-judge for semantic match.

Human gate

PM confirmation on the review path, informational notice on the auto path. Every batch writes a DRAFT page either way, the shadow-mode safeguard from day one.

Architecture B

Initiation Brief

A synthesis task

The work

Read across email, transcripts, and records, then surface what matters and what is off. Open-ended reasoning.

Model, and why

Larger, for reasoning capacity. Claude Sonnet 4.6. A small model here misses the signal, which is worse than no brief because it creates false confidence.

Evaluation

Forward-looking. No consistent ground truth exists yet, so a PM feedback module built into project close accumulates a labeled dataset over time.

Human gate

Non-gating IM review in place. The brief informs an assignment decision the IM already makes; it does not block it.

The flow

01 · Messy input

Scheduled pull, twice daily

Aligned to assignment windows, n8n queries Cloud Coach for new projects and pulls the opportunity record, the Cirrus Insight email log, cases and deals, and Gong call transcripts and AI summaries per project. No real-time SLA, so a batch loop is simpler than a webhook.

Preprocessing

Deterministic, before the model sees anything

Subject-line dedup keeps only the newest email per thread (it carries the full quoted history), cutting the corpus sharply. A scope delta compares the page-crawl field against the content line item. Gong summaries pass in full. This is what keeps most accounts inside a single LLM call.

03 · LLM with grounding

Single-call synthesis into seven sections

Engagement history, scope integrity, contract friction, expressed goals, escalation flags, integration opportunities, recommended kickoff focus. Each section carries a HIGH / MEDIUM / LOW confidence indicator so a reader knows what to trust. LlamaIndex retrieval is the named escalation path for the rare account that overflows context, not a launch dependency.

04 · Human gate

IM review, in place, non-gating

The brief writes into a structured Cloud Coach object, the CRF replacement itself, visible to IM, PM, and managers. The IM reads it during normal assignment review and can annotate or resolve flags. It improves a decision the human already makes; it does not gate assignment.

Model selection

Synthesis across deduplicated email, transcripts, and records, surfacing anomalies a human IM would catch. A small model here produces something technically a brief but missing the signal, which is worse than no brief because it creates false confidence.

Claude Sonnet 4.6 Default
Current-gen synthesis, near prior-Opus capability; ~$0.06 per brief

$3.00 / $15.00

GPT-4.1 mini Cost-conservative
A genuine option if cost ever outweighs synthesis quality

$0.40 / $1.60

At a heavy hypothetical of 100 projects a week, Sonnet 4.6 runs $25 to $45 a month. Against a tool replacing hundreds of review hours a year, the model cost is immaterial, so the spend goes to the dimension that decides whether IMs trust the output.

Why the evaluation is forward-looking

Architecture A grades against 18 months of consistent archived output. Architecture B cannot, and that is the problem statement, not a gap. The CRF has been filled out poorly for years; grading against that archive would mean measuring against a broken answer key. So the eval is forward-looking from day one: a PM feedback module built into the end-of-project CRF completion (a one-to-five rating plus a conditional reason field) accumulates into a labeled dataset as volume builds. The asymmetry between the two architectures is deliberate.

What gets measured

Metric	Target
Brief generation rate	100%
Escalation flag precision	> 60% at 90 days
PM usefulness rating	> 3.5 / 5.0 average
Thin-record rate	Track (no target)

Key tradeoffs

Sonnet 4.6 over a small model

Synthesis needs reasoning capacity small models lack. The cost gap is a few dollars a week; the quality gap is whether IMs keep reading the brief.

No retrieval layer at launch

Dedup and compact Gong summaries fit the common case in one call. LlamaIndex is named for the overflow edge case, keeping the flow orchestrable in n8n.

CRF object over a standalone document

The brief writes into Cloud Coach as the literal CRF replacement. Review, feedback, and manager visibility live in one record that travels the customer lifecycle.

Scheduled batch over per-project webhook

Assignment is a batched daily activity. Real-time delivery would add notification noise for no operational benefit.

The one real risk: governance

The data this pipeline reads already lives in Cloud Coach and is already viewable by anyone with access there. This is not net-new internal exposure. The only new question is routing that data out to an external model provider, and that is a compliance and contractual clearance, not a technical wall. Because the data is already centralized, it is a more contained question than a fresh data-access request, and the same clearance unblocks every later pipeline that touches these sources.

If that clearance fails, the costed fallback is Ollama running an open-weight model on CivicPlus infrastructure, keeping all content in-house. The tradeoff is stated plainly: a small local model is weaker at exactly the multi-source synthesis this architecture argues for, so self-hosting trades away the capability that justified Sonnet 4.6 in the first place. It is the fallback if clearance fails, not the default.

Phase 3

The working n8n flow

Architecture A, built. When a revision form is submitted, the workflow resolves the project, assembles the two-layer scope context, sends it to Claude Haiku 4.5 via OpenRouter for a single classification pass, writes a DRAFT OneNote page, and routes the batch by confidence and scope. It executes end to end. Every node needing a CivicPlus credential is stubbed, with the real API call documented in its notes.

◆ Open the live workflow ↓ JSON export

The eleven nodes, in order

1

PADS Form Submission
Manual Trigger · stands in for the production PADS webhook

Stub

2

Raw PADS Form Payload
Code · injects the realistic test submission

Stub

3

Cloud Coach API
Code · resolves project by POC email, then PM, with fuzzy org fallback

Stub

4

Locate OneNote Notebook
Code · two Graph calls in production (notebook, then section)

Stub

5

Assemble RAG Context
Code · builds the two-layer scope corpus and the request body

Real

6

LLM Classification
HTTP · live OpenRouter call to anthropic/claude-haiku-4-5

Real

7

Parse LLM Response
Code · parses JSON, computes the routing decision

Real

8

Write OneNote Page
Code · builds the DRAFT page; fires on every batch, both paths

Stub

9

Route by Confidence
Switch · the one branch point in the flow

Real

10

Auto-route: ticket + PM notify
Code · creates the AD ticket and an informational PM notice

Stub

11

PM Review Notification
Code · sends flags and the draft clarification email to the PM

Stub

The two real-logic Code nodes (5 and 7) carry the pipeline's actual intelligence; the one real external call (6) does the classification. Everything stubbed is a CivicPlus credential boundary, not a design gap, and each stub documents its production endpoint inline.

One run, start to finish

A real Standard-tier batch from a test "Westminster CO" submission. The customer dumped multiple requests into one text block, exactly the behavior the manual process struggles with. Here is what went in and what the model returned.

Raw input

1. Change the header background to dark navy and make the logo bigger, also the font in the nav looks too small.

2. The about us page needs a new photo and the staff bios should be moved below the mission statement.

3. Add a search bar to the homepage. We also want to add a live chat widget and integrate our Facebook feed.

Parsed, classified, scope-checked

designHeader background to dark navy⚑

designMake logo bigger⚑

designIncrease nav font size✓

designUpdate About Us photo✓

designMove staff bios below mission⚑

devAdd homepage search bar⚑

devAdd live chat widget⚑

devIntegrate Facebook feed⚑

One text block became eight discrete, classified items. Batch confidence 0.88, above the 0.75 threshold, but scope flags are present, so the flow correctly routes to PM review rather than auto-routing. A draft clarification email to the PM is generated alongside it.

$0.0037

model cost, this run

1,166

tokens (528 in / 638 out)

8

items from one text block

Honest notes on the build

A deliberate over-flag at Standard tier

The model flagged the header color and logo size conservatively. Without a Figma wireframe in context, it cannot tell an in-palette color from an out-of-palette one, so it flags both. At Premium and Ultimate tiers the Figma data resolves this, called out live rather than hidden.

One addition to the design

The OneNote notebook lookup was split into its own node to make the SharePoint resolution visible on the canvas and reflect that production needs two Graph calls before the write. An improvement to the narrative, not a logic change.

The eval-variant was not built

Timeline, not a design objection. The variant would swap the trigger for a golden-set fetch and the OneNote write for a scoring node. The full eval story is walkable from Architecture A.

The routing threshold, scope ruleset, OneNote page structure, due-date logic, and two-branch outcome are all implemented as designed in Architecture A. Nothing was removed that affects the pipeline's logic.

Close

Why this submission, from inside the function

These opportunities are not hypothetical. They come from years of living the pain points, and from tools already built and running against them: a mass-notifications import pipeline with fuzzy matching, a Chrome-extension agenda importer evolving toward headless Playwright, an accessibility scanner with an OpenAI feature and C-suite visibility. An external candidate can guess where the work is slow. This is knowing.

Three levels of depth on one class of problem: five instances found, two designed in full, one built to run.

Right-sizing, both directions

A small fast model for classification, a synthesis model where missing the signal erodes trust. The decision is the tier; the eval settles the specific model.

Evaluation as a first-class part

A retrospective golden set where consistent ground truth exists, a forward-looking feedback module where it does not. The asymmetry is deliberate.

Built with AI, visibly

This presentation, its assistant, and the architectures were developed with AI in the loop throughout, the way the role itself expects to work.