ISO 27001 ISMS Benchmark: Rakenne vs GPT-4o on Audit-Ready Documentation
A side-by-side benchmark comparing Rakenne's ISO 27001 skills against GPT-4o on control name accuracy, hallucination rates, cross-document traceability, and audit readiness — with real output examples from both systems.
ISO 27001 isn’t one document — it’s a system of interdependent artifacts where a mistake in the risk register cascades through every downstream deliverable. That makes it the perfect stress test for AI compliance tools.
We ran the same 8-phase ISO 27001 workflow through Rakenne and GPT-4o to see which one could produce audit-ready ISMS documentation. We scored both outputs with an automated quality benchmark measuring hallucination rates, control name accuracy, cross-document references, and completeness.
The results were stark: 99% vs 6% on control name accuracy, 18 vs 0 internal cross-references, and 5.7x faster completion.
The Benchmark Setup
Fictional client: CloudSync Solutions — a B2B SaaS company with 85 employees, headquartered in São Paulo, Brazil. They run their project management platform on GCP (GKE, Cloud SQL, Cloud Storage), use GitHub for code, Google Workspace for collaboration, and Okta for SSO. Key suppliers: AWS CloudFront (CDN), Stripe (payments), SendGrid (email). They must comply with LGPD and have contractual data protection obligations from enterprise clients.
The test: 8 phases of the ISO 27001 workflow, using the same prompts for both systems:
- Organization Profile
- ISMS Scope Definition
- Asset Inventory
- Risk Assessment (5x5 matrix, threshold at 12)
- Statement of Applicability (93 Annex A controls)
- Policy Generation (3 core policies)
- Internal Audit Report
- Management Review
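Phase 4's 5x5 matrix with a treatment threshold of 12 reduces to simple arithmetic. A minimal sketch (function names are ours, not Rakenne's; we assume scores at or above the threshold require treatment):

```python
def risk_score(likelihood: int, impact: int) -> int:
    """Inherent risk on the 5x5 matrix: both axes run 1-5."""
    if not (1 <= likelihood <= 5 and 1 <= impact <= 5):
        raise ValueError("likelihood and impact must be between 1 and 5")
    return likelihood * impact

def needs_treatment(score: int, threshold: int = 12) -> bool:
    """Assumed rule: scores at or above the threshold must be treated."""
    return score >= threshold

# Unauthorized access to customer data: likelihood 4, impact 5 -> 20, treat.
print(needs_treatment(risk_score(4, 5)))  # True
```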
Rakenne used the iso27001-isms template with 23 skills and 40+ validation tools. GPT-4o used an empty project with no skills, no reference files, and no validation tools — just the same prompts with a consultant role prefix.
The Scorecard
| Metric | Rakenne | GPT-4o | Winner |
|---|---|---|---|
| Hallucinations per 1K words | 0.05 | 21.49 | Rakenne |
| Control name accuracy | 99% | 6% | Rakenne |
| Legacy 2013 control refs | 0 | 8 | Rakenne |
| Internal cross-references | 18 | 0 | Rakenne |
| Annex A coverage (of 93) | 100% | 98% | Rakenne |
| Stale year references | 1 | 2 | Rakenne |
| Duration | 18 min | 103 min | Rakenne |
| Files produced | 20 | 13 | Rakenne |
| Word count | 19,306 | 4,188 | Rakenne |
| Cost per 1M generated words | $63.19 | $119.39 | Rakenne |
Finding 1: GPT-4o Fabricated 82 of 87 Control Names
This is the most consequential finding. When GPT-4o generated the Statement of Applicability covering all 93 Annex A controls, it fabricated the control names — producing plausible-sounding titles that don’t match the actual ISO 27001:2022 standard.
Real examples from GPT-4o’s output (excerpt):
| Control | Official ISO 27001:2022 Title | What GPT-4o Wrote |
|---|---|---|
| A.5.2 | Information security roles and responsibilities | Review of the Information Security policy |
| A.5.3 | Segregation of duties | Internal communication of the Information Security policy |
| A.5.4 | Management responsibilities | External communication with relevant stakeholders |
| A.5.5 | Contact with authorities | Remedial action in case of non-compliance |
| A.7.8 | Equipment siting and protection | Policy on document langchain |
An auditor reviewing this SoA would immediately see that the control names don’t match the standard. Every row is wrong. The control numbers are valid (A.5.1 through A.8.34) — but the names are hallucinated from a mix of 2013-era knowledge and general security terminology.
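This kind of mismatch is mechanically detectable. A sketch of the keyword-overlap check described in the methodology below (the stopword list and 50% threshold are our assumptions, not the scorer's actual tuning):

```python
def title_matches(official: str, generated: str, min_overlap: float = 0.5) -> bool:
    """Flag a hallucinated control name when keyword overlap with the
    official ISO 27001:2022 title falls below a threshold."""
    stop = {"of", "the", "and", "with", "in", "on", "a", "an", "for", "to"}
    official_kw = {w for w in official.lower().split() if w not in stop}
    generated_kw = {w for w in generated.lower().split() if w not in stop}
    if not official_kw:
        return False
    return len(official_kw & generated_kw) / len(official_kw) >= min_overlap

# A.5.3's real title shares no keywords with GPT-4o's invented one.
print(title_matches("Segregation of duties",
                    "Internal communication of the Information Security policy"))  # False
```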
Rakenne’s output for the same controls (excerpt):
| Category | Total | Included | Excluded |
|---|---|---|---|
| Organizational (A.5) | 37 | 37 | 0 |
| People (A.6) | 8 | 8 | 0 |
| Physical (A.7) | 14 | 13 | 1 |
| Technological (A.8) | 34 | 33 | 1 |
| Total | 93 | 91 | 2 |
Rakenne uses reference files containing the official 93 control titles from ISO 27001:2022. The agent can’t hallucinate control names because it reads them from the reference data rather than generating from memory.
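The pattern is lookup, not generation. Sketched below (the dictionary stands in for Rakenne's reference file; the entries shown are real 2022 titles, the function is illustrative):

```python
# A slice of the official ISO 27001:2022 Annex A titles, keyed by control ID.
OFFICIAL_TITLES = {
    "A.5.2": "Information security roles and responsibilities",
    "A.5.3": "Segregation of duties",
    "A.5.4": "Management responsibilities",
    "A.7.8": "Equipment siting and protection",
}

def control_title(control_id: str) -> str:
    """Read the title from reference data; an unknown ID raises a KeyError
    rather than letting the model invent a plausible-sounding name."""
    return OFFICIAL_TITLES[control_id]

print(control_title("A.5.3"))  # Segregation of duties
```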
Finding 2: GPT-4o Used Deprecated 2013 Control Categories
GPT-4o’s Risk Register mapped risks to control categories that don’t exist in ISO 27001:2022:
From GPT-4o’s Risk Register (excerpt):
| # | Risk | Threats | L | I | Score | Treatment | Controls |
|---|---|---|---|---|---|---|---|
| 1 | Unauthorized access to customer data | External attackers, insider threat | 4 | 5 | 20 | Mitigate | A.9 (Access Control), A.10 (Cryptography) |
| 2 | Cloud infrastructure misconfiguration | Human error | 3 | 5 | 15 | Mitigate | A.12 (Operations Security) |
| 3 | Insider threats | Disgruntled employees | 3 | 4 | 12 | Mitigate | A.7 (Human Resource Security), A.13 (Communications Security) |
A.9, A.10, A.12, A.13, A.14, A.15 — these are ISO 27001:2013 control domains. The 2022 standard restructured all controls into just four themes: A.5 (Organizational), A.6 (People), A.7 (Physical), A.8 (Technological). An auditor would flag every one of these references.
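Because the 2022 Annex stops at A.8, legacy references are easy to flag with a pattern match. A sketch of the "A.9+ detection" the methodology section mentions (the regex is ours):

```python
import re

# ISO 27001:2022 only has themes A.5-A.8; any A.9 through A.18 reference
# can only come from the withdrawn 2013 control domains.
LEGACY_REF = re.compile(r"\bA\.(?:9|1[0-8])(?:\.\d+)?\b")

def find_legacy_refs(text: str) -> list[str]:
    """Return every 2013-era control reference found in a document."""
    return [m.group(0) for m in LEGACY_REF.finditer(text)]

print(find_legacy_refs("Mitigate via A.9 (Access Control) and A.12"))
# ['A.9', 'A.12']
```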
From Rakenne’s Risk Register (excerpt):
| ID | Assets | Threat | Vulnerability | L | I | Inherent | Treatment | Controls | Owner | RL | RI | Residual |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| R-01 | Customer Project Data, User Credentials & Auth Tokens, Customer PII (LGPD), Financial Records (Stripe) | Unauthorized access (External) | Weak access controls / Lack of MFA | 4 | 5 | 20 | Treat | A.5.15, A.5.16, A.5.18, A.5.34, A.8.2, A.8.3, A.8.5 | Maria Santos (CISO) | 2 | 5 | 10 |
Rakenne references specific 2022 controls (A.5.15 Access control, A.8.5 Secure authentication) rather than deprecated category-level references. Each risk maps to concrete controls with residual risk scores and named owners.
Finding 3: Rakenne Documents Reference Each Other — GPT-4o’s Don’t
Rakenne produced 18 internal cross-references between output documents. GPT-4o produced zero.
From Rakenne’s Information Security Policy (excerpt):
11. Related Documents
| Document | Reference |
|---|---|
| ISMS Scope Statement | output/ISMS-Scope-Statement.md |
| Risk Assessment Methodology | output/Risk-Assessment-Methodology.md |
| Statement of Applicability | output/Statement-of-Applicability.md |
| Risk Management Policy | output/POL-002-Risk-Management-Policy.md |
| Access Control Policy | output/POL-003-Access-Control-Policy.md |
From GPT-4o’s Information Security Policy (excerpt):
6. Related Documents
- POL-002: Risk Management Policy
- POL-003: Access Control Policy
- Statement of Applicability (SoA)
GPT-4o lists document titles but doesn’t link to actual files. An auditor tracing the thread from risk to control to policy to evidence would find no navigable references.
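The difference is testable: a reference only counts if it names a file the workflow actually produced. A sketch (the `output/` path convention follows Rakenne's excerpts above; the function is illustrative):

```python
import re

FILE_REF = re.compile(r"output/[\w\-]+\.md")

def count_cross_references(doc_text: str, produced_files: set[str]) -> int:
    """Count distinct file references that resolve to real documents;
    a bare title without a path scores zero."""
    refs = set(FILE_REF.findall(doc_text))
    return sum(1 for ref in refs if ref in produced_files)

produced = {"output/ISMS-Scope-Statement.md",
            "output/Statement-of-Applicability.md"}
rakenne_policy = ("See output/ISMS-Scope-Statement.md and "
                  "output/Statement-of-Applicability.md.")
gpt4o_policy = "- POL-002: Risk Management Policy\n- Statement of Applicability (SoA)"

print(count_cross_references(rakenne_policy, produced))  # 2
print(count_cross_references(gpt4o_policy, produced))    # 0
```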
Finding 4: GPT-4o Policies Have Wrong Dates
From GPT-4o’s policy (excerpt):
| Field | Value |
|---|---|
| Document ID | POL-001 |
| Version | 1.0 |
| Date | 2024-05-20 |
The benchmark ran on April 1, 2026. GPT-4o dated the policy “2024-05-20” — nearly two years in the past. This suggests the model pulled dates from its training data rather than the current context.
From Rakenne’s policy (excerpt):
Effective date: 2026-04-01
Next review date: 2027-04-01
Document owner: Maria Santos (CISO)
Approved by: Joao Silva (CTO)
Rakenne generates dates from the current context and includes classification level, change history, and next review date — all required by ISO 27001 Clause 7.5 for document control.
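The scorer's date check needs nothing more than year extraction compared against the run date. A sketch (the one-year staleness threshold is our assumption):

```python
import re
from datetime import date

def stale_years(doc_text: str, today: date, max_age_years: int = 1) -> list[int]:
    """Flag four-digit years more than `max_age_years` behind the run date."""
    years = {int(y) for y in re.findall(r"\b(20\d{2})\b", doc_text)}
    return sorted(y for y in years if today.year - y > max_age_years)

print(stale_years("Date | 2024-05-20", date(2026, 4, 1)))        # [2024]
print(stale_years("Effective date: 2026-04-01", date(2026, 4, 1)))  # []
```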
Finding 5: Management Review Quality
ISO 27001 Clause 9.3 requires 10 specific inputs for management review. Both systems addressed them, but the depth differed dramatically.
Rakenne’s action tracker (excerpt):
| Action ID | Description | Owner | Due Date | Expected Outcome | Status |
|---|---|---|---|---|---|
| MR-01-A01 | Complete retrospective access review for GCP/Okta (NC-001). | Maria Santos | 2026-09-10 | Signed review log available. | Not started |
| MR-01-A02 | Run and document a full backup restore test (NC-002). | Joao Silva | 2026-09-10 | Restore log with 100% success. | Not started |
| MR-01-A03 | Update POL-001 with new Management Sponsor signature. | Maria Santos | 2026-09-05 | Updated policy published. | Not started |
| MR-01-A04 | Launch “Security Month” training push to hit 100%. | Maria Santos | 2026-09-30 | 100% completion report generated. | Not started |
GPT-4o’s action tracker (excerpt):
| Action | Owner | Due Date | Status |
|---|---|---|---|
| Improve incident response procedure | Maria Santos | 30 Sep 2026 | Open |
| Schedule backup verification tests | Joao Silva | 30 Jun 2026 | Open |
| Complete comprehensive access review | Ana Oliveira | 30 Jun 2026 | Open |
Rakenne’s actions link to specific audit non-conformities (NC-001, NC-002), include measurable expected outcomes (“Signed review log available”), and were validated by the action completeness checker tool to ensure every action has an owner, due date, and expected outcome. GPT-4o produced generic actions with no linkage to audit findings and no measurable outcomes.
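The completeness rule is simple to state: no action ships without an owner, a due date, and a measurable outcome. A sketch of such a checker (field names are ours, not the tool's schema):

```python
def missing_fields(action: dict) -> list[str]:
    """Return the fields an action row is missing; an empty list
    means the action is traceable and measurable."""
    required = ("owner", "due_date", "expected_outcome")
    return [field for field in required if not action.get(field)]

rakenne_row = {"id": "MR-01-A01", "owner": "Maria Santos",
               "due_date": "2026-09-10",
               "expected_outcome": "Signed review log available."}
gpt4o_row = {"action": "Improve incident response procedure",
             "owner": "Maria Santos", "due_date": "2026-09-30"}

print(missing_fields(rakenne_row))  # []
print(missing_fields(gpt4o_row))    # ['expected_outcome']
```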
Execution Performance
| Metric | Rakenne | GPT-4o |
|---|---|---|
| Total duration | 18 minutes | 103 minutes |
| Total tokens | 1,136,104 | 153,247 |
| Words per minute | 1,068 | 41 |
| Cost per 1M generated words | $63.19 | $119.39 |
Rakenne completed 5.7x faster despite processing 7.4x more tokens. It produced 4.6x more content — making it 47% cheaper per generated word ($63 vs $119 per 1M generated words). GPT-4o’s duration includes multiple 600-second timeouts (the SoA timed out and needed retry) and garbled output retries.
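The headline ratios fall straight out of the table:

```python
# Figures from the execution performance table above.
rakenne_rate, gpt4o_rate = 63.19, 119.39  # $ per 1M generated words
rakenne_min, gpt4o_min = 18, 103          # total duration in minutes

speedup = gpt4o_min / rakenne_min         # 103 / 18 = 5.72...
savings = 1 - rakenne_rate / gpt4o_rate   # 1 - 63.19/119.39 = 0.47...

print(f"{speedup:.1f}x faster, {savings:.0%} cheaper per generated word")
# 5.7x faster, 47% cheaper per generated word
```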
Why the Difference Exists
The gap isn’t about model intelligence — GPT-4o is a capable model. The difference comes from three architectural advantages Rakenne offers:
1. Reference Files Prevent Hallucination
Rakenne’s ISO 27001 skills include reference files with the official 93 Annex A control titles, clause requirements, and document templates. The agent reads these files before generating output. It can’t hallucinate “Review of the Information Security policy” for A.5.2 because it reads “Information security roles and responsibilities” from the reference file.
GPT-4o relies entirely on training data. Its ISO 27001 knowledge is a blend of 2013 and 2022 standard versions, producing a control naming scheme that exists in neither.
2. Validation Tools Catch Errors Before Output
Rakenne’s skills include 40+ extension tools that validate output as it’s produced:
- Risk methodology validator checks the 5x5 matrix is complete
- Risk entry validator ensures every risk has all required fields
- Residual risk validator confirms residual scores don’t exceed inherent scores
- SoA control justification audit cross-references the SoA against scope and risk assessment
- Audit impartiality checker verifies auditor independence
- Action completeness checker ensures every management review action has an owner and due date
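The residual risk check, for instance, is a one-line invariant. A sketch (field names are illustrative, not the validator's actual schema):

```python
def validate_residual(risk: dict) -> list[str]:
    """Treatment must not make things worse: residual risk may never
    exceed inherent risk."""
    errors = []
    inherent = risk["likelihood"] * risk["impact"]
    residual = risk["residual_likelihood"] * risk["residual_impact"]
    if residual > inherent:
        errors.append(f"{risk['id']}: residual {residual} exceeds inherent {inherent}")
    return errors

# R-01 from the Rakenne risk register: 4x5 inherent, 2x5 residual.
r01 = {"id": "R-01", "likelihood": 4, "impact": 5,
       "residual_likelihood": 2, "residual_impact": 5}
print(validate_residual(r01))  # []
```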
GPT-4o has no validation tools. It produces output and moves on — there’s no mechanism to check its own work.
3. Shared Context Creates a Document System
Each Rakenne skill reads the output of previous skills. The Risk Assessment reads the Asset Inventory. The SoA reads the Risk Register. The Policy Generator reads the SoA. The Management Review reads the Audit Report.
GPT-4o processes each prompt independently. Even within the same conversation, it doesn’t read back the files it wrote — it generates new content from the conversation context, which is why cross-references are absent and data consistency degrades across phases.
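That dependency chain can be sketched as a pipeline where each phase's context is the text of its declared inputs (names and structure are ours, not Rakenne's internals):

```python
def run_pipeline(phases, generate):
    """Each phase receives the documents it depends on, so later
    documents can cite earlier ones by filename."""
    workspace: dict[str, str] = {}
    for output, inputs in phases:
        context = {name: workspace[name] for name in inputs}
        workspace[output] = generate(output, context)
    return workspace

phases = [("Asset-Inventory.md", []),
          ("Risk-Register.md", ["Asset-Inventory.md"]),
          ("Statement-of-Applicability.md", ["Risk-Register.md"])]

# Stub generator: a real run would call the model with prompt + context.
docs = run_pipeline(phases, lambda out, ctx: f"{out} read {sorted(ctx)}")
print(docs["Risk-Register.md"])  # Risk-Register.md read ['Asset-Inventory.md']
```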
Benchmark Methodology
This comparison was run using Rakenne’s automated quality scorer, which measures:
- Hallucination detection: Invalid ISO clauses, invalid Annex A controls, wrong control names (keyword overlap against official titles), legacy 2013 categories (A.9+ detection)
- Reference integrity: ISO clause accuracy, Annex A accuracy, cross-document reference validation
- Consistency analysis: Company name, role titles, terminology synonyms across documents
- Completeness scoring: Mandatory fields per document type checked against ISO requirements
- Date consistency: Year extraction with staleness threshold
- Internal cross-references: Document interconnectedness measured by file references
- Execution stats: Token usage, cost, duration from session files
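The per-1K-words normalization makes runs of different lengths comparable. A sketch (the raw count below is back-solved from the scorecard for illustration, not taken from the scorer's logs):

```python
def hallucinations_per_1k_words(count: int, word_count: int) -> float:
    """Normalize hallucination count per 1,000 generated words."""
    return count / word_count * 1000

# Back-solving from the scorecard: ~90 hallucinations across GPT-4o's
# 4,188 words reproduces the reported 21.49 rate.
print(round(hallucinations_per_1k_words(90, 4_188), 2))  # 21.49
```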
Both projects used the same Rakenne backend infrastructure (workspace file I/O, session management). The only difference was the skill layer and LLM model. All output examples in this article are real, unedited excerpts from the benchmark run.
Try It Yourself
Rakenne’s ISO 27001 skills are available with no signup required. Start with the Organization Profile, then work through Risk Assessment, SoA, and Policy Generation. See how the validation tools catch issues GPT-4o would miss.
This benchmark was conducted on April 2, 2026 using Rakenne and OpenAI GPT-4o. Both platforms are updated frequently — your results may vary.