Most cold-email operators in 2026 are still optimising on a metric that died in September 2021.
Apple Mail Privacy Protection rolled out with iOS 15. Apple's proxy servers pre-fetch every tracking pixel in every email sent to an Apple Mail user, regardless of whether a human ever opens the message. The open beacon fires automatically. Litmus, beehiiv, and Mailforge have spent four years documenting the same phenomenon: roughly 49% of all tracked opens now come from MPP, and reported open rates have inflated by 10 to 15 percentage points across the board. A campaign reporting 38% opens is doing somewhere closer to 24% real human engagement, and the exact gap is unknowable on any individual account.
So most operators look at the next metric down. Reply rate is closer to truth. It still lies. It counts unsubscribes, auto-responders, the polite "remove me from your list" reply, and the angry "stop emailing me" reply. None of those is a buyer raising her hand.
The only metric worth optimising for is positive reply rate. The leads who actually wanted to talk. In Instantly that surfaces as i_status=1 on the lead record and ai_interest_value=1 on the reply. Everything else is noise dressed up as signal.
This piece is the AI showcase that sits inside the cold outbound engagements we run for clients. It's the layer most outbound shops cannot operate, because it requires Claude in the loop reading the actual reply text every cycle and making real decisions about what to ship next. The walkthrough below is the playbook, end to end, for anyone who wants to operate the same loop on their own programme. The earlier layers (infrastructure, lead generation, personalisation) sit underneath this one but it stands alone if you've already got volume going out.
The single metric
Positive reply rate is the only number we tune.
It is computed cleanly. Total positive replies divided by total emails sent in the cycle window, where positive is defined by Instantly's AI interest classifier. The classifier reads the reply text and tags it with i_status=1 (interested), i_status=2 (meeting booked), i_status=3 (not interested), or no status at all (auto-responder, unsubscribe, bounce, gibberish). Only the first two count.
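As a minimal sketch of that computation (the i_status values mirror the classifier tags above; the row shape itself is illustrative, not Instantly's schema):

```python
# Illustrative only: i_status mirrors the classifier tags described above;
# the row structure is an assumption, not Instantly's actual schema.
POSITIVE_STATUSES = {1, 2}  # 1 = interested, 2 = meeting booked

def positive_reply_rate(sent_rows: list[dict]) -> float:
    """Positive replies divided by total emails sent in the cycle window."""
    sends = len(sent_rows)
    positives = sum(1 for row in sent_rows
                    if row.get("reply_i_status") in POSITIVE_STATUSES)
    return positives / sends if sends else 0.0
```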
Why nothing else gets a vote. Open rate has been functionally dead since iOS 15 shipped Apple Mail Privacy Protection in September 2021. Mailforge and beehiiv have published the same finding repeatedly: the open beacon fires for every Apple Mail recipient regardless of human engagement, inflating reported opens, and Apple Mail accounts for roughly half the inbox share in B2B. Optimising copy on opens means optimising copy for an Apple proxy server that does not buy software.
Click rate barely applies. We do not put links in cold email body copy because links trip spam filters and signal phishing. There is nothing to click.
Total reply rate is too noisy at the volumes we run. A typical Pipeline campaign sends 3,000 emails a day across 80 to 150 mailboxes. At an industry-average 3% reply rate, that is 90 replies a day, of which maybe 10 are actually positive. The other 80 are negatives, OOOs, "wrong person" replies, "not now" replies, and the rare angry one. Optimising on the 90 means optimising for the 80 you do not want.
Positive reply rate strips all of that out. SalesHandy puts the median B2B positive reply rate at around 2 to 3% in 2026. Top quartile is 5%+ and the operators clearing 10% are doing it on the back of signal-personalised campaigns and a tight ICP. That is the number we move.
The nine-phase loop
A cron job fires every six hours. When any enrolled campaign has 250 new sends per variant since the previous cycle, the orchestrator wakes up and runs nine phases in order. The whole thing takes about three minutes per cycle on a healthy campaign and produces a markdown report stored in Supabase before going back to sleep.
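The wake-up condition is simple enough to sketch. The threshold comes from above; the structure below is illustrative rather than the production orchestrator:

```python
# Hypothetical wake-up check run by the six-hourly cron. The input shape
# (per-variant counts of sends since the last cycle) is illustrative.
MIN_NEW_SENDS_PER_VARIANT = 250

def campaign_is_due(new_sends_by_variant: dict[str, int]) -> bool:
    """True when every variant has accumulated enough fresh sends to judge."""
    return bool(new_sends_by_variant) and all(
        n >= MIN_NEW_SENDS_PER_VARIANT for n in new_sends_by_variant.values()
    )
```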
1. Sync
We pull the new sent and reply data from Instantly, incrementally. The previous cycle's last-seen timestamp is the cursor, and only emails sent since that cursor flow into the staging tables. The reply classifier output (i_status, ai_interest_value, reply text) lands alongside each row. We deliberately fetch only the delta, so a 90-day campaign re-syncs in seconds.
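A sketch of the delta sync, with the Instantly fetch and the Supabase staging write hidden behind hypothetical helpers rather than real endpoints:

```python
from datetime import datetime

# Sketch of the cursor-based delta sync. fetch_emails_since and upsert_staging
# are hypothetical stand-ins for the Instantly API call and the Supabase
# staging insert; "sent_at" as a field name is equally illustrative.
def sync_campaign(campaign_id: str, cursor: datetime,
                  fetch_emails_since, upsert_staging) -> datetime:
    """Pull only emails sent after the previous cycle's cursor."""
    rows = fetch_emails_since(campaign_id, since=cursor)
    for row in rows:
        # Classifier output (i_status, ai_interest_value, reply text) rides
        # along on each row and lands in staging unchanged.
        upsert_staging(row)
    # Advance the cursor to the newest send we just saw.
    return max((row["sent_at"] for row in rows), default=cursor)
```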
2. Compute
The metrics computer runs over the full campaign history. The latest sync's delta gets folded in alongside everything from prior cycles. Every variant gets per-step send counts, positive reply counts, positive reply rates, Fisher's exact p-values against the other variants in its step, and per-spintax-option performance. The trajectory section tracks rate movement across runs, so a slow drift becomes visible long before any single cycle would catch it.
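The pairwise comparison itself is a standard Fisher's exact test. A minimal sketch, assuming two variants per step and scipy for the test:

```python
from scipy.stats import fisher_exact

# Per-variant rollup plus the pairwise Fisher's exact test. Inputs are the
# per-variant (sends, positives) counts for one step; sends are assumed > 0.
def variant_stats(sends_a: int, pos_a: int, sends_b: int, pos_b: int) -> dict:
    table = [[pos_a, sends_a - pos_a],
             [pos_b, sends_b - pos_b]]
    _, p_value = fisher_exact(table)  # two-sided by default
    return {
        "rate_a": pos_a / sends_a,
        "rate_b": pos_b / sends_b,
        "p_value": p_value,
    }
```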
3. Revert check
Before any new decision lands, we check whether the previous cycle's pushed changes hurt overall positive reply rate. If the rate has dropped versus the snapshot taken at the previous push, the system rolls the campaign back to the last known good sequences. The cycle aborts after the revert. No new copy gets generated on a regression cycle, because the regression is the signal.
4. Assess prior
Every prior cycle wrote pending entries to experiment_log with the hypothesis, the variant changed, and the prior performance baseline. This phase backfills the post-change performance on those entries now that enough sends have accumulated. Hypotheses with positive deltas get marked confirmed. Flat or negative deltas get marked failed. This is what makes the chain cumulative. Without it the loop would forget every prior cycle and re-run the same failed experiments forever.
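Roughly, and with illustrative field names rather than the real experiment_log schema:

```python
# Sketch of the outcome assessment. experiment_log rows are assumed to carry a
# baseline rate recorded at push time and a "pending" status; field names are
# illustrative. current_rate_for_variant is a hypothetical lookup helper.
def assess_pending(entries: list[dict], current_rate_for_variant) -> None:
    for entry in entries:
        if entry["status"] != "pending":
            continue
        post_rate = current_rate_for_variant(entry["variant_index"])
        entry["post_change_rate"] = post_rate
        # Positive delta confirms the hypothesis; flat or negative fails it.
        entry["status"] = ("confirmed"
                           if post_rate > entry["baseline_rate"]
                           else "failed")
```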
5. Decide
For each step, the decision engine runs four tests against the variant pair. Fisher's exact test below p=0.20 declares a statistical winner. Zero positive replies on one variant after 250 sends declares a structural loser. Two consecutive cycles favouring the same variant declares a directional winner. A rate ratio at or above 3:1 with at least three positives on the winner declares a confidence winner. Failing all four leaves the pair as inconclusive.
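The four tests compose into a small decision function. A sketch, with the thresholds taken from above and everything else illustrative:

```python
def decide(sends_a, pos_a, sends_b, pos_b, p_value, previous_leader=None):
    """Return (winner, reason) or (None, "inconclusive"). Sends are assumed
    nonzero because the cycle only wakes on 250+ new sends per variant."""
    rate_a, rate_b = pos_a / sends_a, pos_b / sends_b
    leader = "A" if rate_a >= rate_b else "B"
    lead_pos, trail_pos = (pos_a, pos_b) if leader == "A" else (pos_b, pos_a)
    trail_sends = sends_b if leader == "A" else sends_a
    lead_rate, trail_rate = max(rate_a, rate_b), min(rate_a, rate_b)

    if p_value < 0.20:
        return leader, "statistical winner (Fisher's exact)"
    if trail_pos == 0 and trail_sends >= 250 and lead_pos > 0:
        return leader, "structural loser on the trailing variant"
    if previous_leader == leader:
        return leader, "directional winner (two consecutive cycles)"
    if lead_pos >= 3 and (trail_rate == 0 or lead_rate / trail_rate >= 3):
        return leader, "confidence winner (>= 3:1 rate ratio)"
    return None, "inconclusive"
```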
6. Generate
The Claude Code CLI runs in headless mode (claude -p). The system prompt is the writing methodology in full. The user prompt carries the business context, both variant templates, the spintax performance data, every email that received a positive reply with the recipient's actual reply text, sample non-reply emails for contrast, the experiment history, and the trajectory. Sonnet returns a JSON object with analysis, hypothesis, subject template, body template, and spintax rationale.
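The call shape, as a sketch: the only flag assumed is -p (print mode), and the helper below simply prepends the system prompt to the user prompt rather than mirroring how we attach it in production:

```python
import json
import subprocess

# Sketch of the headless generation call. Only the -p print flag is assumed;
# how the system prompt is attached varies, so this sketch just concatenates.
def generate_variant(system_prompt: str, user_prompt: str) -> dict:
    result = subprocess.run(
        ["claude", "-p", f"{system_prompt}\n\n{user_prompt}"],
        capture_output=True, text=True, check=True, timeout=300,
    )
    # The prompt instructs the model to answer with a single JSON object
    # carrying analysis, hypothesis, subject, body, and spintax rationale.
    return json.loads(result.stdout.strip())
```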
7. Push
The pusher takes a snapshot of the current sequences first. Then it pauses the campaign in Instantly so no leads see a half-edited template. It PATCHes the new variant into the sequence and re-activates the campaign. The whole thing sits inside a try/finally that re-activates on any exception, so a transient API failure cannot leave the campaign frozen mid-edit.
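The shape that matters is the try/finally. A sketch, with the Instantly calls hidden behind hypothetical wrapper methods:

```python
# snapshot_sequences, pause_campaign, patch_variant and activate_campaign are
# hypothetical wrappers around the Instantly API, not real endpoints. The
# try/finally guarantees re-activation even when the PATCH throws.
def push_variant(campaign_id: str, step: int, new_variant: dict, api) -> dict:
    snapshot = api.snapshot_sequences(campaign_id)   # rollback target
    api.pause_campaign(campaign_id)                  # no half-edited sends
    try:
        api.patch_variant(campaign_id, step=step, variant=new_variant)
    finally:
        api.activate_campaign(campaign_id)           # never leave it frozen
    return snapshot
```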
8. Log
The new variant gets appended to experiment_log with the hypothesis, the variant index, the prior performance baseline, and a pending status awaiting the next cycle's outcome assessment. The analysis_runs row records changes pushed, the revert flag, the model version, and the run state. Concurrent cycles on the same campaign are blocked by a 2-hour stale-lock timeout in the same table.
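The lock check is a timestamp comparison against the most recent running row. A sketch with illustrative helpers in place of the real Supabase queries:

```python
from datetime import datetime, timedelta, timezone

STALE_LOCK_AFTER = timedelta(hours=2)

# get_active_run and mark_run_stale stand in for queries against the
# analysis_runs table; the column names are illustrative.
def acquire_lock(campaign_id: str, get_active_run, mark_run_stale) -> bool:
    run = get_active_run(campaign_id)          # most recent row still running
    if run is None:
        return True
    age = datetime.now(timezone.utc) - run["started_at"]
    if age > STALE_LOCK_AFTER:
        mark_run_stale(run["id"])              # previous cycle died mid-run
        return True
    return False                               # a live cycle holds the lock
```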
9. Report
Sonnet writes a full markdown report covering the metrics, the decisions made, the hypotheses fired, the spintax movements, the winning email analysis, and what the next cycle is watching for. The report lands in cycle_reports and is queryable by campaign id. I read every report by hand for the first three cycles of any new campaign. After that the system runs unattended and the reports become an audit trail.
The hypothesis chain
Every copy change that ships carries a single specific testable hypothesis. Without one, the loop would just spin copy at random and end up where it started.
A hypothesis from a real cycle reads like this: "the question-led opening outperforms the statement opening because the recipient processes a question as a request for engagement, which lifts the cognitive priority of the email above other inbox items competing for the same five seconds of attention." It names the change, names the mechanism, and names the audience-level reason it should work for this specific list.
The next cycle either confirms it or fails it. If positive reply rate on the new variant beats the prior baseline by a margin that clears the decision tests, the hypothesis is marked confirmed and feeds back into the writing prompt as a known-good pattern for future cycles on this campaign. If it flatlines or regresses, the hypothesis goes into a do-not-repeat list. The dedup runs as a string-similarity check, so a lightly reworded retry of a failed hypothesis still gets caught.
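The dedup itself can be as simple as a similarity ratio over normalised hypothesis text. A sketch (the 0.8 cutoff is illustrative):

```python
from difflib import SequenceMatcher

# Sketch of the do-not-repeat check. The point is that a lightly reworded
# retry of a failed hypothesis still matches an entry on the list.
def is_repeat_hypothesis(candidate: str, failed_hypotheses: list[str],
                         threshold: float = 0.8) -> bool:
    candidate = candidate.lower().strip()
    return any(
        SequenceMatcher(None, candidate, failed.lower().strip()).ratio() >= threshold
        for failed in failed_hypotheses
    )
```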
A real chain across three cycles on a UK accountancy practice campaign in March:
Cycle 1 confirmed: question-led opening beats statement opening (1.8% to 2.4%, p=0.18). Cycle 2 confirmed: a Companies House filing reference in the personalisation beats a generic industry observation (2.4% to 3.1%, p=0.09). The question-led opening was preserved. Cycle 3 failed: a one-week guarantee window underperformed a four-week guarantee in the offer (3.1% to 2.6%, regression, reverted at the next cycle's revert check).
By cycle 4 the writing prompt knew that question-led openings work, Companies House signals work, and tight guarantee windows do not work for this audience. Cycle 4's hypothesis stacked on the first two and skipped the third. Cycle 5 stacked again. The chain compounds. This is the cumulative-learning property the broader Claude-skills approach builds on, and it is what separates a real optimisation loop from a copy-spinner.
Spintax stays sacred
The system never reduces spintax group count. Every replacement variant must carry at least as many spintax groups as the variant it replaces. Group count is a hard constraint inside the validation step, and a generated variant that fails it gets rejected before push.
Spintax is what stops outbound email from being clustered as templated mail by Google's spam filters. A campaign sending 3,000 messages a day with no variation lights up filter signatures within a week. The same 3,000 messages with five spintax groups of three options each can resolve to 243 distinct surface forms, and the filter signature collapses.
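Counting groups and surface forms is a one-regex job for the common {a|b|c} syntax. A sketch of the validation constraint (a production parser would also handle nesting and escapes):

```python
import re
from math import prod

# Assumes the common {option one|option two|option three} spintax syntax.
# Variable tags without a pipe are ignored, which is the intended behaviour.
SPINTAX_GROUP = re.compile(r"\{([^{}]+\|[^{}]+)\}")

def spintax_groups(template: str) -> list[list[str]]:
    return [m.group(1).split("|") for m in SPINTAX_GROUP.finditer(template)]

def surface_forms(template: str) -> int:
    # Five groups of three options resolve to 3**5 = 243 distinct forms.
    return prod(len(options) for options in spintax_groups(template))

def replacement_is_valid(old_template: str, new_template: str) -> bool:
    # Hard constraint: a replacement may never carry fewer spintax groups.
    return len(spintax_groups(new_template)) >= len(spintax_groups(old_template))
```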
What the system does is optimise the options inside the groups. The metrics computer reports per-option positive reply rate within each group. An option with zero positives across 100 sends becomes a candidate for replacement. An option with a clear lift over its sibling options stays untouched. The new replacement option carries a hypothesis of its own and gets logged like any other change.
A concrete example. A Pipeline cycle on a UK training-provider campaign in February showed the following spintax performance on the opening greeting group:
Hi {{firstName}}: 2.1% positive reply rate across 1,210 sends.
Hey {{firstName}}: 2.3% positive reply rate across 1,180 sends.
{{firstName}} (bare name, no greeting): 0.4% positive reply rate across 1,205 sends.
The bare-name option was replaced with a new option (Morning {{firstName}}) testing whether a time-of-day cue lifts the same warmth signal that Hey carries. Group count stayed at three. The losing option was retired without touching either of the working ones.
Reading the replies
The metrics tell you which variant is winning. The replies tell you why. Operators who only look at the numbers leave the highest-leverage insight on the floor every single cycle.
The discipline is to read every winning email word by word before generating new copy. The metrics output exposes the resolved version of every email that triggered a positive reply, with all spintax options resolved and all variable tags substituted. These are the actual messages a human read on a Tuesday morning before deciding to reply yes. Note the subject line that got past her glance, the opening sentence that pulled her into paragraph two, the offer phrasing she chose to reference back, and the CTA wording she answered.
Then read the non-reply emails sent in the same cycle window. Same template, different spintax resolutions, different recipient. What is present in the winners that is absent in the silent ones? Often it is a single word in the personalisation, a single phrase in the offer, or a specific spintax option in the CTA group. The contrast surfaces the unit of variation that mattered.
Then read the reply text itself, every reply. What did the positive responder write back? Did she quote a number from the email? Did she ask a question that referenced the offer? Did she pick up the casual register or the formal one? Her words are the strongest signal you have, because they are a real human telling you what landed.
A real reply we got on a Bristol-based logistics campaign in January read: "the Companies House filing thing made me look twice, happy to chat for 15 minutes." That single sentence confirmed the personalisation hypothesis (Companies House reference works on this audience), the call-length hypothesis (15 minutes is the right ask), and the casual register hypothesis (lowercase, no salutation, conversational tone). One reply, three confirmed signals, all of which fed back into the next cycle's writing prompt.
Every cycle, every campaign, every report. We read the replies.
Auto-revert as the safety net
Copy can only get better between cycles. Never worse. This is the monotonic-progress property, and it is enforced by an automatic revert trigger at the start of every cycle.
The mechanism is simple. Every push records a snapshot of the sequences as they stood immediately before the change, and the overall campaign positive reply rate at that moment. The next cycle's first non-trivial step is to compare the current overall positive reply rate against the snapshot. If the rate has dropped, the system rolls the sequences back to the snapshot, marks the previous push's hypothesis as failed in experiment_log, and aborts the cycle without generating new copy.
The next cycle starts clean from the restored snapshot. The failed hypothesis goes into the do-not-repeat list. The chain resumes from a known-good state.
Two safeguards make the comparison honest. First, only emails sent 24 hours or more before the cycle window count toward the performance assessment, because positive replies typically arrive within the first 18 hours of an email landing in the inbox. Without the 24-hour rule, a fresh push would get judged before any of its sends had time to reply, and the system would either constantly revert on noise or fail to revert on real regressions. Second, the revert compares overall positive reply rate, not per-variant. A change that helps one variant while accidentally tanking the campaign average gets caught. A change that helps both variants does not get reverted just because one moved faster than the other. The campaign-level metric is the floor.
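A sketch of the check, mirroring both safeguards; the row shape and the restore/flag helpers are illustrative:

```python
from datetime import datetime, timedelta, timezone

MATURITY_WINDOW = timedelta(hours=24)

# Only sends old enough to have had a fair chance of a reply count, and the
# comparison is campaign-level positive reply rate, not per-variant.
def maybe_revert(snapshot_rate: float, sent_rows: list[dict],
                 restore_snapshot, mark_hypothesis_failed) -> bool:
    cutoff = datetime.now(timezone.utc) - MATURITY_WINDOW
    mature = [r for r in sent_rows if r["sent_at"] <= cutoff]
    if not mature:
        return False                              # nothing old enough to judge
    positives = sum(1 for r in mature if r.get("i_status") in (1, 2))
    current_rate = positives / len(mature)
    if current_rate < snapshot_rate:              # campaign-level regression
        restore_snapshot()
        mark_hypothesis_failed()
        return True                               # cycle aborts here
    return False
```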
A real-shape cycle
Numbers below are anonymised from a January cycle on a UK SaaS campaign.
Cycle 7 metrics, computed on the 24-hour-eligible window:
Step 0, Variant A: 1,034 sends, 19 positive replies, 1.8% positive reply rate. Step 0, Variant B: 1,028 sends, 25 positive replies, 2.4% positive reply rate.
Fisher's exact on the 2x2 table returned p=0.18. Just inside the p<0.20 threshold for a statistical winner. The decision engine declared Variant B the winner.
Variant B's defining feature was a question-led opening: "are you still running quotes through the QuotedRight builder?" Variant A opened with a statement: "noticed you've been running quotes through the QuotedRight builder for about six months now." Same personalisation signal underneath, different cognitive ask at the top.
Generation phase ran. The system kept Variant B unchanged because the winner gets preserved by default. It generated a new Variant A testing a fresh hypothesis: "a shorter subject line lifts the inbox-glance pass rate at the same reply quality." The new subject was three words instead of seven. The body kept Variant B's question-led opening so the test isolated the subject change.
Push went through cleanly. Snapshot, pause, PATCH, activate, all inside the try/finally. The cycle logged the new Variant A to experiment_log with the hypothesis and the prior baseline, and went back to sleep.
Roughly 1,000 sends per variant later, the next cycle ran:
Step 0, new Variant A: 1,041 sends, 28 positive replies, 2.7% positive reply rate. Step 0, Variant B (unchanged): 1,022 sends, 25 positive replies, 2.4% positive reply rate.
Fisher's exact on the new pair returned p=0.41. Inconclusive on a single cycle. The trajectory section flagged that the new Variant A had now outperformed Variant B by about 0.3 percentage points across two consecutive cycle windows. The decision engine deferred the call. Cycle 9 confirmed: shorter subject won, the hypothesis stuck, and the next cycle stacked a new test (a casual signature variant) on top of the now-stable Variant A.
What you are actually running
A continuously improving copywriter that never sleeps, never forgets a hypothesis, never repeats a failed experiment, and never lets the campaign get worse between cycles. Six hours between checks. Three minutes per cycle. Fully unattended after the first three cycles of calibration.
The version of AI most agencies are selling in 2026 is content generation. Output volume, dashboard reporting, model-of-the-month, dressed up as transformation. None of it moves a number a CFO would notice. The loop we operate moves positive reply rate from 1.8% to 2.4% to 3.1% across cycles, and keeps it moving for as long as the campaign runs. Moving from 1.8% to 3.1% is roughly a 1.7x improvement on positive reply rate at constant send volume, which is a 1.7x improvement on booked meetings at the same budget. That is the number a CFO notices.
This loop sits inside the cold outbound engagements we run for clients. We operate the lead generation, the personalisation, the sending infrastructure, the inbox monitoring, and this optimisation loop, end to end. The client sees booked meetings. The infrastructure stays under our roof.
The same loop sits inside Sales OS, the custom AI platform we build when a client wants the optimisation engine running on their own outbound. Different deployment shape, same primitives. The Claude Code CLI runs against their Instantly account, the Supabase tables sit in their cloud, and the cron job runs on their schedule. The hypothesis chain is theirs, the failed-experiment list is theirs, and the snapshot history is theirs.
For the broader pattern this loop comes out of, the Claude-skills piece walks through how we operationalise Claude as the reasoning layer inside an otherwise deterministic system. The optimisation engine is one example. Everything around the model is built so the model only ever has to do the thing the model is best at.


