# Spec-Driven Document Drafting: Why It Beats Plain ChatGPT
How Rakenne skills turn complex document drafting from ad-hoc prompting into repeatable, validated workflows—with examples and output comparisons.
Drafting complex, regulated documents—policies, control narratives, CAPA reports, authorization packages—with a generic chat interface is tempting: you paste a prompt and get a draft. The problem is consistency, structure, and compliance. One-off prompts don’t enforce workflows, don’t run checks, and don’t give the model a way to correct itself. Spec-driven document drafting in Rakenne turns that around: skills define workflows, load references, and use extension tools so the agent can validate, fix, and maintain quality. This article explains why that approach is dramatically better than “regular prompting on ChatGPT” and shows concrete examples and output comparisons.
## Why spec-driven beats ad-hoc prompting
| Concern | Plain ChatGPT (or similar) | Rakenne with skills |
|---|---|---|
| Workflow | You describe steps in the prompt; the model may skip or reorder them. | The skill defines a fixed workflow (scope → load reference → draft → validate); the agent follows it. |
| References | You paste or attach docs; context gets noisy and key clauses get lost. | References live in the skill; the agent loads them on demand and keeps context focused. |
| Structure | Output format is whatever the model infers; sections and numbering drift. | Skills specify sections, criteria, and document structure; templates and examples keep output aligned. |
| Checks | No built-in way to verify coverage, completeness, or logic. | Extension tools run deterministic checks (e.g. TSC coverage, 5 Whys gate, effectiveness date); the agent fixes and re-runs until they pass. |
| Self-correction | The model may say “I’ve included X” without actually including it. | Validation tools return PASS/FAIL and concrete findings; the agent must address them before proceeding. |
| Repeatability | Each session depends on how you phrased the prompt. | Same skill produces the same workflow and checks every time; only the content varies. |
In short: spec-driven drafting gives the LLM a spec (workflow + references + structure) and tools (extensions) to check and correct itself. Plain prompting does not.
## How Rakenne skills implement spec-driven drafting
In Rakenne, each skill is a small package the pi agent can trigger. It usually includes:
- SKILL.md — Name, description (when to use), and a workflow: ordered steps (e.g. define scope → load reference → draft → validate).
- References — Markdown files (e.g. standards, criteria lists, schemas) that the agent loads when the skill is active, so the draft stays aligned to authority.
- Extension tools (optional) — TypeScript tools registered with the agent (extension.ts) that run deterministic checks on the document (coverage, logic gates, completeness). The agent calls them, reads the result, and revises until checks pass.
Project-level context (document type, domain, glossary) lives in AGENTS.md in the workspace, so every conversation in that project is scoped the same way.
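To make the idea concrete, here is a minimal sketch of what one deterministic check inside such an extension might look like. The type and function names are illustrative assumptions, not the real Rakenne API: the point is that the tool returns a machine-readable PASS/FAIL plus findings the agent can act on.

```typescript
// Illustrative sketch of a deterministic document check (hypothetical names,
// not the actual Rakenne extension API).
type CheckResult = { pass: boolean; findings: string[] };

// Verifies that every required section heading appears in the draft,
// and reports each missing one as a concrete finding.
function checkRequiredSections(doc: string, required: string[]): CheckResult {
  const findings = required
    .filter((section) => !doc.includes(section))
    .map((section) => `Missing section: ${section}`);
  return { pass: findings.length === 0, findings };
}
```

Because the result is structured rather than free text, the agent can loop: draft, run the check, read the findings, revise, and re-run until `pass` is true.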
Below we use real Rakenne skills to illustrate workflows, checks, and self-correction, then compare outputs.
## Example 1: HIQA medication policy (workflow + reference)
Skill: HIQA Designated Centre Medication. Drafts a medication policy for Irish designated centres aligned to HIQA standards.
Workflow in the skill:
- Define scope — Centre type, self-administration vs administered medicines, existing procedures.
- Load reference — Read references/hiqa-nssbh.md for NSSBH Theme 3 (safe care, medication safety).
- Draft — Produce a policy covering ordering, storage, prescribing, administration, self-administration, MAR charts, errors/incidents, disposal, training, review.
- Cross-refer — Align with incident reporting and care/support planning.
The reference file spells out HIQA’s eight themes and governance expectations so the agent doesn’t guess.
Rakenne (with skill): The agent follows the workflow, pulls in NSSBH Theme 3 and related governance text from the reference, and produces a policy with named sections (ordering, storage, administration, errors, training, etc.) and explicit alignment to “NSSBH Theme 3” and “designated centre” context.
Plain ChatGPT: You might prompt: “Write a medication policy for a care home in Ireland.” The model often returns a generic “medication policy” that could apply anywhere: vague “periodic review,” “appropriate staff,” and no tie-in to HIQA or NSSBH. You get no guarantee that Theme 3 (or storage, MAR, errors, disposal) is systematically covered, and no built-in step to “load reference then draft.”
## Example 2: SOC 2 control narratives (workflow + validation tools)
Skill: SOC 2 Control Narrative Author. Builds SOC 2 documentation with control narratives, TSC mapping, and evidence placeholders.
Workflow:
- Scope — Which TSC categories are in scope (Security always; optionally Availability, Processing Integrity, Confidentiality, Privacy).
- Control narratives — For each in-scope criterion (CC1–CC9, A1, PI1, C1, P1–P8): narrative + evidence reference.
- Evidence — Add evidence placeholders per narrative.
- Validate — Run extension tools and fix until they pass.
Extension tools:
- check_trust_services_criteria_coverage — Ensures each in-scope TSC criterion has a narrative and evidence reference; flags unmapped criteria.
- soc2_narrative_reliability_check — Applies the SOC 2 Reliability Rubric: flags vague phrasing (“reviewed periodically,” “management maintains security”) and pushes for specificity (who, how, when, where, named technology).
Rakenne (with skill): The agent drafts narratives, then runs both tools. If CC4 has no evidence reference or the narrative says “access is reviewed periodically,” the tools return FAIL with concrete findings. The agent revises (adds evidence, specifies “quarterly by the security team via IdP access report”) and re-runs until PASS.
Plain ChatGPT: You ask for “SOC 2 control narratives for Security.” You get text that often mentions “controls” and “policies” but doesn’t systematically cover CC1–CC9, doesn’t map each criterion to a narrative + evidence, and uses exactly the kind of vague language the rubric rejects (“reviewed periodically,” “appropriate controls”). There is no automated check, so gaps and vagueness only show up in audit prep.
## Example 3: CAPA report (workflow + logic gates)
Skill: CAPA Report. Drafts corrective and preventive action (CAPA) reports for non-conformities (ISO 9001 / ISO 13485).
Workflow:
- Non-conformity — Finding, scope, containment.
- Root cause — Complete a 5 Whys section (Why 1 … Why 5). Do not propose actions until this is done.
- Gate — Run root_cause_logic_gate on the document. Proceed only when it passes.
- Corrective / preventive actions — Actions, owners, due dates.
- Implementation — Record completion.
- Effectiveness check — Set a future date for verification. Run verification_of_effectiveness_timer before closing; do not close until it passes.
Extension tools:
- root_cause_logic_gate — Checks for a Root Cause / 5 Whys section and at least five distinct why-levels. Fails if the document jumps to “solutions” before a proper 5 Whys.
- verification_of_effectiveness_timer — Ensures an Effectiveness Check date exists and is in the future. Prevents closing the CAPA before verification is scheduled.
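Both gates are simple, deterministic rules, which is exactly why they are trustworthy. Here is a hedged sketch of how they might work; the function names mirror the tools above, but the implementations are assumptions for illustration (for instance, the real gate likely checks distinctness of answers, not just "Why N" labels):

```typescript
// Illustrative sketch of the two CAPA gates (assumed logic, not the real tools).

// Counts distinct "Why 1" ... "Why 5" levels in the document.
// Note: this naive regex only matches single-digit levels.
function rootCauseLogicGate(doc: string): { pass: boolean; message: string } {
  const levels = new Set<string>();
  for (const m of doc.matchAll(/Why\s+(\d)/g)) levels.add(m[1]);
  return levels.size >= 5
    ? { pass: true, message: `Why levels detected: ${levels.size}/5.` }
    : {
        pass: false,
        message: `Why levels detected: ${levels.size}/5. ` +
          `Complete a 5 Whys analysis before proposing solutions.`,
      };
}

// Requires an Effectiveness Check date that lies in the future.
// ISO dates (YYYY-MM-DD) compare correctly as strings.
function effectivenessTimer(
  effectivenessDate: string | null,
  today: string
): { pass: boolean; message: string } {
  if (!effectivenessDate) {
    return { pass: false, message: "No Effectiveness Check date found." };
  }
  return effectivenessDate > today
    ? { pass: true, message: `Effectiveness check scheduled for ${effectivenessDate}.` }
    : { pass: false, message: `Effectiveness Check date ${effectivenessDate} is not in the future.` };
}
```

A three-why draft fails the first gate with an actionable message, and a past-dated effectiveness check fails the second, so the CAPA cannot be closed prematurely.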
Rakenne (with skill): The agent drafts the non-conformity and root cause. If it only has three why-levels or skips to “we will train staff,” the gate returns FAIL with “Why levels detected: 3/5. Complete a 5 Whys analysis before proposing solutions.” The agent adds Why 4 and Why 5, re-runs the gate, then continues to actions. Before closing, it must set a future effectiveness date; if it puts “2025-01-15” and today is later, the timer fails and the agent must set a future date.
Plain ChatGPT: You ask for “a CAPA for an audit finding about document control.” You often get a single paragraph of root cause and a list of actions in one shot—no enforced 5 Whys, no gate, and no check that effectiveness verification is scheduled in the future. The output looks plausible but wouldn’t pass a strict ISO 9001/13485 review.
## Example 4: FedRAMP authorization package (workflow + completeness check)
Skill: FedRAMP Authorization Package. Builds the full package: SSP, attachments, SAP, SAR, and POA&M.
Workflow (simplified): Categorize system → system description → authorization boundary → data flows → control implementations (with baseline reference) → SSP attachments → Validate with fedramp_package_completeness_check. Fix and re-run until PASS. Separate sub-workflows for SAP/SAR and POA&M, with the same tool used to flag missing controls, missing attachment sections, or POA&M items without remediation plans or past-due without justification.
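One of the POA&M rules mentioned above can be sketched as a deterministic check. This is an illustrative assumption about the tool's logic, not the real fedramp_package_completeness_check: every item needs a remediation plan, and past-due items need a justification.

```typescript
// Hypothetical sketch of a POA&M completeness rule (illustrative only).
interface PoamItem {
  id: string;
  dueDate: string;          // ISO date, e.g. "2025-03-01"
  remediationPlan?: string;
  justification?: string;   // required once the item is past due
}

// Returns one finding per violated rule; an empty array means PASS.
function checkPoamItems(items: PoamItem[], today: string): string[] {
  const findings: string[] = [];
  for (const item of items) {
    if (!item.remediationPlan) {
      findings.push(`${item.id}: missing remediation plan`);
    }
    if (item.dueDate < today && !item.justification) {
      findings.push(`${item.id}: past due without justification`);
    }
  }
  return findings;
}
```

Baseline-coverage and attachment checks follow the same pattern: enumerate what the package must contain, diff it against what the draft actually contains, and report the gaps.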
Rakenne (with skill): The agent follows the workflow, loads baseline and attachment references, drafts sections, and runs the completeness check. Gaps (e.g. unimplemented controls without justification, missing PIA section) are reported; the agent fills them and re-validates.
Plain ChatGPT: You ask for “a FedRAMP SSP for a SaaS product.” You get a long narrative that may look like an SSP but doesn’t follow the real FedRAMP structure, control-by-control implementation, or attachment checklist. There’s no way to verify baseline coverage or attachment completeness; you discover gaps manually later.
## Summary: same model, different results
The underlying LLM can be the same. The difference is how it’s used:
- Plain prompting: One-shot or back-and-forth with no fixed workflow, no authoritative references loaded by the skill, no validation tools. Output is generic, structure drifts, and the model can’t reliably “check and correct” itself.
- Spec-driven in Rakenne: Skills define workflow, references, and extension tools. The agent follows the workflow, loads the right references, drafts, then runs tools that return PASS/FAIL and concrete findings. It corrects until checks pass and maintains the desired structure and quality.
For complex, regulated documents—policies, control narratives, CAPAs, authorization packages—spec-driven drafting with workflows and validation tools is not a small upgrade; it’s the difference between “maybe good enough” and “audit-ready and repeatable.”
To try it, pick a Rakenne skill for your domain, create a project with that workflow, and compare the structured, validated output to what you get from a single ChatGPT prompt.