Most teams use AI badly
The pattern is everywhere. Someone on the team writes a long prompt that produces a useful answer once. They paste it into ChatGPT every week or so when they remember. The prompt drifts a little each time as they tweak it. The output drifts with it. Six months later there is no version history, no record of which version of the prompt actually worked, and no way for anyone else on the team to run the same workflow without learning it from scratch.
That is the small version of a much bigger problem. MIT's 2025 GenAI study found that roughly 95% of enterprise AI pilots produce no measurable P&L impact, after $30 to $40 billion of corporate spend (Fortune, August 2025). Gartner had already warned in July 2024 that 30% of generative AI projects would be abandoned after proof of concept by the end of 2025. McKinsey's November 2025 State of AI report rounded out the picture: 88% of companies use AI somewhere, only 5.5% see meaningful financial returns, and only 39% can point to any measurable EBIT effect at all.
The models are fine. Most AI work lives in someone's head, in a paste-buffer, or in a Slack message that scrolls away. There is no infrastructure underneath it. A piece of work that runs once and disappears can't compound, can't be improved, and can't be handed to anyone else.
Anthropic's Agent Skills format is one of the cleaner answers to that problem. It turns a prompt into a reusable, version-controlled, discoverable piece of infrastructure that any Claude instance on your team can pick up and run. This piece walks through what a skill actually is, how we use them at Tannto to run real sales operations, and how to build your first one tomorrow.
What a skill actually is
A skill is a directory on disk. The minimum content is one file called SKILL.md with YAML frontmatter at the top and a markdown body underneath. The frontmatter has two required fields: name (lowercase letters, numbers, hyphens, max 64 characters) and description (max 1024 characters, written in third person, explaining what the skill does and when to use it).
Here is the smallest possible skill:
```markdown
---
name: writing-cold-emails
description: Writes cold email copy for B2B outreach in the writer's voice. Use when the user wants to draft cold emails, follow-ups, or sales sequences.
---

# Writing cold emails

Use this writing framework: subject line, teaser, three-sentence body, single-question CTA. Keep British spellings. Strip filler phrases.
```
That is it. Drop that file into ~/.claude/skills/writing-cold-emails/SKILL.md and Claude will load the metadata at startup and pull in the body when the user asks for cold-email help.
The architecture worth understanding is what Anthropic calls progressive disclosure. There are three loading levels.
Level 1 is metadata, around 100 tokens per skill. The name and description are pre-loaded into Claude's system prompt at every conversation start, so Claude knows the skill exists and when it should fire. You can have fifty skills installed and the metadata cost still sits under 5,000 tokens.
Level 2 is the SKILL.md body, ideally under 5,000 tokens. Claude only reads it when the description matches what the user asked for. The cold-email-writing skill stays dormant until someone says "write a cold email", at which point Claude reads the body via bash and loads the rules into context.
Level 3 is supporting files, effectively unlimited. Reference documents, schemas, code examples, executable scripts. None of it consumes context until Claude opens the file or runs the script. If the script outputs a result, only the output enters context, not the script source.
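To make the levels concrete, here is one plausible layout for the cold-email skill. The supporting file names are illustrative, not prescribed by the format:

```
writing-cold-emails/
├── SKILL.md          # Level 1: frontmatter metadata, ~100 tokens, always loaded
│                     # Level 2: the body below it, read only when the skill fires
├── examples.md       # Level 3: read only if Claude opens it
├── tone-rules.md     # Level 3: same
└── scripts/
    └── check.py      # Level 3: only its output enters context when it runs
```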
The full architecture is documented at platform.claude.com/docs. Anthropic launched skills on 16 October 2025 and made the specification an open standard on 18 December 2025, with a partner directory featuring Atlassian, Canva, Cloudflare, Figma, Notion, Stripe, and Zapier.
The two principles that decide if a skill is good
Two principles decide whether a skill is good. Both come straight from Anthropic's authoring guide and both are easy to fail.
The first is concision. The model is already very smart; the skill should only contain context the model doesn't already have. The temptation when writing a skill is to behave like an over-helpful tutor and explain everything from first principles. Anthropic's own guide gives the canonical example. The bad version reads like this:
```
PDF (Portable Document Format) files are a common file format that contains text, images, and other content. To extract text from a PDF, you'll need to use a library. There are many libraries available for PDF processing, but pdfplumber is recommended because it's easy to use and handles most cases well. First, you'll need to install it using pip...
```
The good version is a few lines:

Use pdfplumber for text extraction.

```python
import pdfplumber

with pdfplumber.open("file.pdf") as pdf:
    text = pdf.pages[0].extract_text()
```
Same outcome, around a third of the tokens, no condescension. Every sentence in a skill should be challenged with one question: would Claude already know this without me writing it down? If yes, cut it.
The second principle is appropriate degrees of freedom. Match the level of specificity in the skill to how fragile the underlying task is. Anthropic uses a robot-on-a-path analogy. Sometimes the robot is on a narrow bridge with cliffs on both sides, where there is one safe path and any deviation breaks something. Sometimes the robot is in an open field with no hazards, where many routes lead to the same place.
A database migration is the bridge. Specific commands, specific flags, specific order, no improvisation. The skill should be tight: "Run exactly this script. Do not modify the command or add additional flags." Low freedom.
A code review is the open field. Many valid approaches. Heuristics matter more than rules. The skill should give general direction and trust the model: analyse the structure, check for edge cases, verify project conventions, suggest improvements. High freedom.
Most skills get this wrong in one direction or the other. Either they over-specify a creative task and produce mechanical output that misses the actual judgement call, or they under-specify a fragile task and produce confident garbage that breaks production. Our cold-email writing skill is mostly high freedom because copywriting depends on context. The execution layer underneath it (the API calls that push copy into Instantly) is low freedom because the order of operations matters. Different tasks, different specificity, same skill format.
Three production skills, dissected
We run roughly twenty skills in production at Tannto across cold outbound, lead enrichment, content, follow-up, and internal operations. Three are worth showing because each illustrates a different way the skill format earns its keep.
optimising-cold-emails
This is the autonomous loop that improves cold-email copy every 1,000 sends. A cron job fires every six hours, checks whether any enrolled campaign has crossed the threshold, and if so triggers a nine-phase state machine:

- SYNC pulls new sent and reply data from Instantly.
- COMPUTE calculates positive-reply rates with full run history.
- REVERT CHECK rolls back to the previous winning copy if the most recent change made things worse.
- ASSESS PRIOR backfills outcome data on pending experiments.
- DECIDE applies a multi-criteria confidence framework (Fisher's exact test at p < 0.20, zero replies after 250 sends, two consecutive runs favouring the same winner, rate ratio of 3:1 or higher with at least three winner positives).
- GENERATE invokes Claude with the writing playbook and the experiment history to produce replacement copy.
- PUSH snapshots the existing campaign, pauses it, patches in the new copy, and reactivates it.
- LOG and REPORT close the loop in Supabase.
The full system is closer to 4,000 lines of code, prompts, schemas, and references. Without the skill format, that body would have to be reloaded into Claude's context every cycle, which would make the loop economic nonsense. With the skill format, the SKILL.md is around 350 lines and most of the heavy material lives in supporting files that Claude reads only when the relevant phase fires. The cron job runs unattended for weeks. The loop itself is covered in detail in the cold-email copy optimisation piece.
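To make the DECIDE phase concrete, here is a minimal sketch of how those criteria could combine. The function and field names are hypothetical, and the assumption that any single criterion is enough to promote a variant is ours, not a description of the production code:

```python
from scipy.stats import fisher_exact

def should_promote(a_replies, a_sends, b_replies, b_sends,
                   consecutive_wins_for_b):
    """Hypothetical DECIDE check: does variant B beat incumbent A?"""
    a_rate = a_replies / a_sends if a_sends else 0.0
    b_rate = b_replies / b_sends if b_sends else 0.0

    # Criterion 1: Fisher's exact test on the 2x2 reply table, loose alpha.
    _, p_value = fisher_exact([[b_replies, b_sends - b_replies],
                               [a_replies, a_sends - a_replies]])
    significant = p_value < 0.20 and b_rate > a_rate

    # Criterion 2: the incumbent is dead after a fair sample.
    incumbent_dead = a_replies == 0 and a_sends >= 250 and b_replies > 0

    # Criterion 3: two consecutive runs favouring the same winner.
    repeated_winner = consecutive_wins_for_b >= 2

    # Criterion 4: rate ratio of 3:1 or higher, at least three winner positives.
    dominant = a_rate > 0 and b_rate / a_rate >= 3 and b_replies >= 3

    # How the criteria combine is an assumption; any one triggers promotion here.
    return significant or incumbent_dead or repeated_winner or dominant
```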
follow-up-nurture
This is the skill the team runs every morning to handle the inbox. The user types "morning nurture" or "run my follow-ups". Claude fetches every interested lead from Instantly, applies escalation-stage logic (first nudge at three days, second at seven days, third at fourteen, transition to Gmail Takeover at stage four), drafts the next email for each lead in voice, presents the drafts inline, and sends only the ones the user approves. Meeting reminders fire automatically. No-show recovery has its own template family.
The skill encodes three things that would otherwise have to live in someone's head. First, the cadence rules: when a lead is due, when it has been replied to, when it should be flagged for manual review. Second, the email-copy rules: British spellings, two-to-four sentences, contractions, no exclamation marks, reference something specific from the prior thread. Third, the channel-transition logic for handing a lead off from Instantly to Gmail when the relationship has warmed up. Anyone on the team can run "morning nurture" and Claude executes the same workflow. The output is consistent across operators because the skill is the source of truth.
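As a sketch, the cadence rules reduce to a small lookup. The day thresholds come from the rules above; the function name, the stage encoding, and measuring from the last touch rather than the first reply are all assumptions:

```python
from datetime import datetime, timezone

# Stage -> days before the next nudge is due (from the cadence above).
NUDGE_DAYS = {0: 3, 1: 7, 2: 14}

def next_step(last_touch: datetime, stage: int) -> str | None:
    if stage >= 3:
        return "gmail-takeover"      # stage four: hand off from Instantly to Gmail
    days_since = (datetime.now(timezone.utc) - last_touch).days
    if days_since >= NUDGE_DAYS[stage]:
        return f"nudge-{stage + 1}"  # draft the next in-voice email for approval
    return None                      # not due yet; skip this lead today
```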
ai-pulse
This is the briefing skill we use to keep up with what is actually shipping in AI. The user runs it weekly. Claude searches Anthropic's blog, the Claude platform docs, Claude Code release notes, OpenAI announcements, Google AI blog posts, Cursor's changelog, two YouTube creators (Chase AI, Nick Saraev), and a small set of business-AI news sources. Then it filters the lot through one question that is hard-coded into the skill: how does this make or save money for a B2B sales operator? Anything that doesn't pass that filter gets cut. Output is a single short brief: what shipped, what changed, what to do about it.
The reason ai-pulse exists is that the AI news cycle now turns weekly. Claude Opus 4.7 shipped on 16 April 2026. The Model Context Protocol moved to the Linux Foundation in December 2025, with OpenAI, Google, and Microsoft as co-sponsors, and roughly 78% of enterprise AI teams now have at least one MCP-backed agent in production. Google launched the Gemini Enterprise Agent Platform at Cloud Next '26 in April. Skills themselves became an open standard in December 2025. Reading everything yourself is a part-time job most operators don't have. Reading none of it means you find out about a model retirement when your production tooling silently degrades. A skill that runs weekly, reads thirty sources in parallel, and surfaces the five things that actually matter for revenue is the version of "stay current" we trust.
Skill-authoring patterns that actually work
A short pass through the patterns that show up in every skill we trust.
Progressive disclosure, one level deep. SKILL.md is the table of contents. It points to reference files like forms.md, examples.md, validation-rules.md. Those reference files do not point to other reference files. Claude tends to partial-read nested chains (running head -100 on the second file rather than reading it in full), so the rule is one hop from SKILL.md to the supporting material, no further.
Forward slashes in paths. Always. Even on Windows. scripts/helper.py, never scripts\helper.py. Backslashes break on Linux runtimes and almost every production execution environment is Linux underneath.
Gerund-form skill names. writing-cold-emails, analysing-spreadsheets, processing-pdfs. The verb-plus-ing form makes the skill's job obvious at a glance. Avoid generic names like helper, utils, or tools. They produce duplicate skills six months in.
Third-person descriptions. Anthropic's guide is firm on this. "Processes Excel files and generates reports", not "I can help you process Excel files". The description is injected into the system prompt, and inconsistent point-of-view causes discovery problems.
Workflows with checklists for complex tasks. Anything multi-step gets a copyable checklist at the top of the skill. Claude works through it visibly and the user can see where it is.
Feedback loops on quality-critical tasks. Validate, fix, re-validate. Our optimising-cold-emails skill runs check.py on every generated copy variant and rewrites once if a violation hits. The humaniser skill we use to polish writing runs a 60-item audit on every output and rewrites if it fails.
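In sketch form the loop is tiny. Here check_copy and rewrite are stand-ins for whatever validator and generator a given skill wires together:

```python
def validate_fix_revalidate(draft, check_copy, rewrite, fix_budget=1):
    """Validate, fix, re-validate; stop when clean or the budget runs out."""
    for _ in range(fix_budget + 1):
        violations = check_copy(draft)      # e.g. filler phrases, length, spelling
        if not violations:
            return draft
        draft = rewrite(draft, violations)  # one targeted rewrite per pass
    return draft                            # best effort after the fix budget
```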
Avoid time-sensitive content. "If you're doing this before August 2025" rots the moment it is written. Use a "current method" section and an "old patterns" section in collapsed details if historical context matters.
Consistent terminology. Pick "API endpoint", not "URL" sometimes and "API route" other times. Consistency lets Claude generalise from one section of the skill to another.
The full author's checklist is at platform.claude.com/docs/agent-skills/quickstart.
Five anti-patterns that ruin skills
Five failures we keep seeing in skills that look fine on the page and break in production.
Verbose system-prompt-style explanations. A skill that opens with "In this skill, I will help you understand the principles of effective email writing..." is almost always padded with context Claude already has. The symptom in production is bloated context windows and slow responses. Cut everything Claude already knows.
Trying to be helpful by listing every option. "You can use pypdf, or pdfplumber, or PyMuPDF, or pdf2image, or PyPDF2, or PDFMiner..." Claude will pick one at random. Pick a default. Add a single escape hatch ("for scanned PDFs needing OCR, use pdf2image with pytesseract"). The decision belongs in the skill at authoring time, not in the model at runtime.
Deep-nested file references. SKILL.md references advanced.md, which references details.md, which is where the actual content lives. Claude tends to preview the second hop with head and never reaches the third. The symptom is Claude confidently writing code that does the wrong thing because it never read the rule that would have prevented it.
Voodoo constants in scripts. TIMEOUT = 47 with no comment. MAX_RETRIES = 5 with no comment. If you don't know why the value is what it is, neither does Claude when something breaks. Document the reasoning beside the number, or pick a defensible default.
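The fix is mechanical. The values and reasons below are invented for illustration; the point is that the comment carries the reasoning:

```python
# Before: voodoo. Nobody, including Claude, knows why 47.
TIMEOUT = 47
MAX_RETRIES = 5

# After: the reasoning lives beside the number (values illustrative).
TIMEOUT = 45      # stay under the upstream gateway's ~50s idle cutoff
MAX_RETRIES = 3   # transient rate-limit errors clear within three backoffs
```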
Punting errors back to Claude. Scripts that throw a generic exception and let the model figure it out. The symptom is silent failures because Claude rationalises a wrong-shaped error response into a plausible-looking success. Handle errors in the script, with explicit messages that tell Claude what to do next.
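A sketch of the alternative, using a placeholder endpoint rather than any real API. The error messages do the next-step thinking so the model doesn't have to guess:

```python
import sys
import requests

BASE_URL = "https://api.example.com"  # placeholder endpoint for this sketch

def push_copy(campaign_id: str, payload: dict) -> None:
    resp = requests.patch(f"{BASE_URL}/campaigns/{campaign_id}",
                          json=payload, timeout=30)
    if resp.status_code == 404:
        # Explicit next step, not a bare traceback for Claude to rationalise.
        sys.exit(f"Campaign {campaign_id} not found. Re-run the sync step to "
                 "refresh campaign IDs, then retry.")
    if resp.status_code == 429:
        sys.exit("Rate limited. Wait 60 seconds and re-run with the same "
                 "arguments; the operation is safe to retry.")
    resp.raise_for_status()  # anything else is a real bug; fail loudly
```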
How this connects to the broader AI thesis
The version of AI worth investing in is the version that moves a number a CFO would notice. Most AI work doesn't.
MIT's 2025 study found 95% of GenAI pilots produced no measurable P&L impact. Gartner forecast 30% of GenAI projects abandoned after proof of concept by the end of 2025. McKinsey's November 2025 State of AI showed only 5.5% of companies seeing meaningful financial returns despite 88% adoption. The pattern across all three datasets is the same. Most AI gets pointed at fluffy work (summarise this email, draft this reply, take notes from this call) that nobody in a leadership meeting tracks.
Skills change the maths because they make the AI work compoundable. A workflow that lives in a skill is version-controlled, discoverable, runnable by anyone on the team, and improvable based on what actually shipped. Once you have one skill that books meetings or generates copy that converts, you build the next one on top of it. Our cold-email writing skill feeds the optimising-cold-emails loop. The loop's outputs feed the follow-up-nurture skill. The nurture skill, when a meeting closes, feeds the contract-hosting skill. The pipeline is a stack of skills, each one a small piece of operational infrastructure that does a specific job continuously and hands its output to the next one in the line.
That is the Sales OS thesis. Sales operations look like an org chart today. The future shape is closer to a stack of skills, each running continuously, improving on its own numbers, with humans on the conversations and the relationships, and the platform handling the rest.
Every layer of the cold-outbound stack we run for clients is a Tannto skill. Lead generation, personalisation, the optimisation loop, the inbox monitoring, the lot. The fact that the work runs without a daily human in the loop is what the skill format makes possible. That same architecture sits at the centre of the custom platforms we build for clients.
What to do tomorrow
Three concrete suggestions if you want to try this.
First, take the most-repeated prompt your team uses. The one someone has saved in a Notion doc, in their Slack drafts, or in their head. Open a new file called SKILL.md. Paste the prompt body into the markdown section. Add YAML frontmatter at the top with a name (lowercase, hyphenated) and a description that explains what the skill does and when to use it. Save it to ~/.claude/skills/<your-skill-name>/SKILL.md. Restart Claude Code. The skill is live.
Second, run it twice on real work. Notice what context you keep needing to add manually. Names of products, internal jargon, escalation rules, formatting preferences. That extra context is exactly what should go into the skill body. Add it. Run the skill again. You should be able to invoke it cold and get the right output without typing the same instructions every time.
Third, read Anthropic's skill-authoring guide end to end at platform.claude.com/docs/agent-skills/quickstart. It is the cheapest way to avoid the anti-patterns above. Most of the patterns we use in production came from re-reading that document at different stages of building.
If your team is doing real cold outbound and you want to see what a fully-skilled stack looks like, the Pipeline service we operate end to end is where to start. If you want help working out where AI actually moves a number for your business (including the version of the answer where the answer is "AI doesn't belong here"), the discovery call is half an hour and free.


