ISO 27001 ISMS Benchmark: Rakenne vs GPT-4o on Audit-Ready Documentation

A side-by-side benchmark comparing Rakenne's ISO 27001 skills against GPT-4o on control name accuracy, hallucination rates, cross-document traceability, and audit readiness — with real output examples from both systems.

Author: Ricardo Cabral · Founder

ISO 27001 isn’t one document — it’s a system of interdependent artifacts where a mistake in the risk register cascades through every downstream deliverable. That makes it the perfect stress test for AI compliance tools.

We ran the same 8-phase ISO 27001 workflow through Rakenne and GPT-4o to see which one could produce audit-ready ISMS documentation. We scored both outputs with an automated quality benchmark measuring hallucination rates, control name accuracy, cross-document references, and completeness.

The results were stark: 99% vs 6% on control name accuracy, 18 vs 0 internal cross-references, and 5.7x faster completion.


The Benchmark Setup

Fictional client: CloudSync Solutions — a B2B SaaS company with 85 employees, headquartered in Sao Paulo, Brazil. They run their project management platform on GCP (GKE, Cloud SQL, Cloud Storage), use GitHub for code, Google Workspace for collaboration, and Okta for SSO. Key suppliers: AWS CloudFront (CDN), Stripe (payments), SendGrid (email). They must comply with LGPD and have contractual data protection obligations from enterprise clients.

The test: 8 phases of the ISO 27001 workflow, using the same prompts for both systems:

  1. Organization Profile
  2. ISMS Scope Definition
  3. Asset Inventory
  4. Risk Assessment (5x5 matrix, threshold at 12)
  5. Statement of Applicability (93 Annex A controls)
  6. Policy Generation (3 core policies)
  7. Internal Audit Report
  8. Management Review
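The phase-4 scoring rule is simple enough to sketch: likelihood times impact on 1–5 scales, with scores at or above the threshold of 12 requiring treatment. A minimal sketch in Python (the treat/accept labels are my shorthand, not the workflow's exact vocabulary):

```python
# Hedged sketch of the 5x5 risk scoring used in phase 4.
# Likelihood and impact are each rated 1-5; score = L x I.
THRESHOLD = 12  # scores at or above this require treatment

def risk_decision(likelihood: int, impact: int) -> tuple[int, str]:
    score = likelihood * impact
    return score, "treat" if score >= THRESHOLD else "accept"

print(risk_decision(4, 5))  # (20, 'treat')
print(risk_decision(3, 3))  # (9, 'accept')
```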

Rakenne used the iso27001-isms template with 23 skills and 40+ validation tools. GPT-4o used an empty project with no skills, no reference files, and no validation tools — just the same prompts with a consultant role prefix.


The Scorecard

| Metric | Rakenne | GPT-4o | Winner |
| --- | --- | --- | --- |
| Hallucinations per 1K words | 0.05 | 21.49 | Rakenne |
| Control name accuracy | 99% | 6% | Rakenne |
| Legacy 2013 control refs | 0 | 8 | Rakenne |
| Internal cross-references | 18 | 0 | Rakenne |
| Annex A coverage (of 93) | 100% | 98% | Rakenne |
| Stale year references | 1 | 2 | Rakenne |
| Duration | 18 min | 103 min | Rakenne |
| Files produced | 20 | 13 | Rakenne |
| Word count | 19,306 | 4,188 | Rakenne |
| Cost per 1M generated words | $63.19 | $119.39 | Rakenne |

Finding 1: GPT-4o Fabricated 82 of 87 Control Names

This is the most consequential finding. When GPT-4o generated the Statement of Applicability covering all 93 Annex A controls, it fabricated the control names — producing plausible-sounding titles that don’t match the actual ISO 27001:2022 standard.

Real examples from GPT-4o’s output (excerpt):

| Control | Official ISO 27001:2022 Title | What GPT-4o Wrote |
| --- | --- | --- |
| A.5.2 | Information security roles and responsibilities | Review of the Information Security policy |
| A.5.3 | Segregation of duties | Internal communication of the Information Security policy |
| A.5.4 | Management responsibilities | External communication with relevant stakeholders |
| A.5.5 | Contact with authorities | Remedial action in case of non-compliance |
| A.7.8 | Equipment siting and protection | Policy on document langchain |

An auditor reviewing this SoA would immediately see that the control names don’t match the standard. Every row is wrong. The control numbers are valid (A.5.1 through A.8.34) — but the names are hallucinated from a mix of 2013-era knowledge and general security terminology.

Rakenne’s output for the same controls (excerpt):

| Category | Total | Included | Excluded |
| --- | --- | --- | --- |
| Organizational (A.5) | 37 | 37 | 0 |
| People (A.6) | 8 | 8 | 0 |
| Physical (A.7) | 14 | 13 | 1 |
| Technological (A.8) | 34 | 33 | 1 |
| Total | 93 | 91 | 2 |

Rakenne uses reference files containing the official 93 control titles from ISO 27001:2022. The agent can’t hallucinate control names because it reads them from the reference data rather than generating from memory.
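The mechanism can be illustrated in a few lines of Python. This is a hedged sketch, not Rakenne's actual code; the `ANNEX_A_2022` dict stands in for the reference file and lists only the controls quoted above:

```python
# Sketch of reference-backed control naming: titles are read from data,
# never generated. Entries below are the official ISO 27001:2022 titles
# for the controls discussed in this article.
ANNEX_A_2022 = {
    "A.5.2": "Information security roles and responsibilities",
    "A.5.3": "Segregation of duties",
    "A.5.4": "Management responsibilities",
    "A.5.5": "Contact with authorities",
    "A.7.8": "Equipment siting and protection",
}

def control_title(control_id: str) -> str:
    """Look up the official title; refuse to invent one."""
    try:
        return ANNEX_A_2022[control_id]
    except KeyError:
        raise ValueError(f"Unknown control: {control_id}")

print(control_title("A.5.2"))  # Information security roles and responsibilities
```

Because the title comes from a lookup rather than generation, a wrong answer surfaces as a hard error instead of a plausible-sounding fabrication.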


Finding 2: GPT-4o Used Deprecated 2013 Control Categories

GPT-4o’s Risk Register mapped risks to control categories that don’t exist in ISO 27001:2022:

From GPT-4o’s Risk Register (excerpt):

| # | Risk | Threats | L | I | Score | Treatment | Controls |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Unauthorized access to customer data | External attackers, insider threat | 4 | 5 | 20 | Mitigate | A.9 (Access Control), A.10 (Cryptography) |
| 2 | Cloud infrastructure misconfiguration | Human error | 3 | 5 | 15 | Mitigate | A.12 (Operations Security) |
| 3 | Insider threats | Disgruntled employees | 3 | 4 | 12 | Mitigate | A.7 (Human Resource Security), A.13 (Communications Security) |

A.9, A.10, A.12, A.13, A.14, A.15 — these are ISO 27001:2013 control domains. The 2022 standard restructured all controls into just four themes: A.5 (Organizational), A.6 (People), A.7 (Physical), A.8 (Technological). An auditor would flag every one of these references.
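The benchmark methodology calls this "A.9+ detection". A plausible sketch of such a check (the regex and function name are mine, not the scorer's actual implementation):

```python
import re

# Flag references to deprecated ISO 27001:2013 control domains.
# The 2022 standard only defines A.5 through A.8, so any A.9-A.18
# reference is a legacy artifact.
LEGACY_PATTERN = re.compile(r"\bA\.(9|1[0-8])\b")

def find_legacy_refs(text: str) -> list[str]:
    return ["A." + m for m in LEGACY_PATTERN.findall(text)]

row = "Mitigate via A.9 (Access Control), A.10 (Cryptography) and A.8.5"
print(find_legacy_refs(row))  # ['A.9', 'A.10']
```

Note that valid 2022 references such as A.8.5 or A.5.15 pass through untouched.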

From Rakenne’s Risk Register (excerpt):

| ID | Assets | Threat | Vulnerability | L | I | Inherent | Treatment | Controls | Owner | RL | RI | Residual |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| R-01 | Customer Project Data, User Credentials & Auth Tokens, Customer PII (LGPD), Financial Records (Stripe) | Unauthorized access (External) | Weak access controls / Lack of MFA | 4 | 5 | 20 | Treat | A.5.15, A.5.16, A.5.18, A.5.34, A.8.2, A.8.3, A.8.5 | Maria Santos (CISO) | 2 | 5 | 10 |

Rakenne references specific 2022 controls (A.5.15 Access control, A.8.5 Secure authentication) rather than deprecated category-level references. Each risk maps to concrete controls with residual risk scores and named owners.


Finding 3: Rakenne Documents Reference Each Other — GPT-4o’s Don’t

Rakenne produced 18 internal cross-references between output documents. GPT-4o produced zero.

From Rakenne’s Information Security Policy (excerpt):

11. Related Documents

| Document | Reference |
| --- | --- |
| ISMS Scope Statement | output/ISMS-Scope-Statement.md |
| Risk Assessment Methodology | output/Risk-Assessment-Methodology.md |
| Statement of Applicability | output/Statement-of-Applicability.md |
| Risk Management Policy | output/POL-002-Risk-Management-Policy.md |
| Access Control Policy | output/POL-003-Access-Control-Policy.md |

From GPT-4o’s Information Security Policy (excerpt):

6. Related Documents

  • POL-002: Risk Management Policy
  • POL-003: Access Control Policy
  • Statement of Applicability (SoA)

GPT-4o lists document titles but doesn’t link to actual files. An auditor tracing the thread from risk to control to policy to evidence would find no navigable references.
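A navigable-reference check of the kind the benchmark describes could look like the sketch below. The `output/*.md` layout mirrors the excerpts above, but `broken_refs` and its regex are illustrative assumptions:

```python
import re
from pathlib import Path

# Verify that file references in a document point at files that exist
# in the workspace, and report any that don't.
FILE_REF = re.compile(r"output/[\w\-]+\.md")

def broken_refs(doc_text: str, workspace: Path) -> list[str]:
    refs = FILE_REF.findall(doc_text)
    return [r for r in refs if not (workspace / r).exists()]
```

Run against the workspace root, this surfaces exactly the dangling links an auditor would otherwise discover while tracing a thread from risk to policy.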


Finding 4: GPT-4o Policies Have Wrong Dates

From GPT-4o’s policy (excerpt):

| Field | Value |
| --- | --- |
| Document ID | POL-001 |
| Version | 1.0 |
| Date | 2024-05-20 |

The benchmark ran on April 1, 2026. GPT-4o dated the policy “2024-05-20” — two years in the past. This suggests the model used dates from its training data rather than the current context.

From Rakenne’s policy (excerpt):

Effective date: 2026-04-01

Next review date: 2027-04-01

Document owner: Maria Santos (CISO)

Approved by: Joao Silva (CTO)

Rakenne generates dates from the current context and includes classification level, change history, and next review date — all required by ISO 27001 Clause 7.5 for document control.
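The staleness check behind this finding can be sketched in a few lines. The 365-day threshold here is an assumption for illustration; the article only states that a staleness threshold exists:

```python
from datetime import date

# Flag document dates that fall too far behind the benchmark run date.
def is_stale(doc_date: date, run_date: date, max_age_days: int = 365) -> bool:
    return (run_date - doc_date).days > max_age_days

print(is_stale(date(2024, 5, 20), date(2026, 4, 1)))  # True: GPT-4o's date
print(is_stale(date(2026, 4, 1), date(2026, 4, 1)))   # False: Rakenne's date
```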


Finding 5: Management Review Quality

ISO 27001 Clause 9.3 requires 10 specific inputs for management review. Both systems addressed them, but the depth differed dramatically.

Rakenne’s action tracker (excerpt):

| Action ID | Description | Owner | Due Date | Expected Outcome | Status |
| --- | --- | --- | --- | --- | --- |
| MR-01-A01 | Complete retrospective access review for GCP/Okta (NC-001). | Maria Santos | 2026-09-10 | Signed review log available. | Not started |
| MR-01-A02 | Run and document a full backup restore test (NC-002). | Joao Silva | 2026-09-10 | Restore log with 100% success. | Not started |
| MR-01-A03 | Update POL-001 with new Management Sponsor signature. | Maria Santos | 2026-09-05 | Updated policy published. | Not started |
| MR-01-A04 | Launch "Security Month" training push to hit 100%. | Maria Santos | 2026-09-30 | 100% completion report generated. | Not started |

GPT-4o’s action tracker (excerpt):

| Action | Owner | Due Date | Status |
| --- | --- | --- | --- |
| Improve incident response procedure | Maria Santos | 30 Sep 2026 | Open |
| Schedule backup verification tests | Joao Silva | 30 Jun 2026 | Open |
| Complete comprehensive access review | Ana Oliveira | 30 Jun 2026 | Open |

Rakenne’s actions link to specific audit non-conformities (NC-001, NC-002), include measurable expected outcomes (“Signed review log available”), and were validated by the action completeness checker tool to ensure every action has an owner, due date, and expected outcome. GPT-4o produced generic actions with no linkage to audit findings and no measurable outcomes.
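A completeness check of the kind described might look like this. Field names follow the tracker excerpt, while the function itself and the "ACT-1" id on the GPT-4o-style row are hypothetical:

```python
# Every action row must carry an owner, a due date, and an expected outcome;
# return the ids of rows that miss any of them.
REQUIRED = ("owner", "due_date", "expected_outcome")

def incomplete_actions(actions: list[dict]) -> list[str]:
    return [
        a["id"] for a in actions
        if any(not a.get(field) for field in REQUIRED)
    ]

rakenne_row = {"id": "MR-01-A01", "owner": "Maria Santos",
               "due_date": "2026-09-10",
               "expected_outcome": "Signed review log available."}
generic_row = {"id": "ACT-1", "owner": "Maria Santos",      # hypothetical id
               "due_date": "30 Sep 2026", "expected_outcome": ""}
print(incomplete_actions([rakenne_row, generic_row]))  # ['ACT-1']
```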


Execution Performance

| Metric | Rakenne | GPT-4o |
| --- | --- | --- |
| Total duration | 18 minutes | 103 minutes |
| Total tokens | 1,136,104 | 153,247 |
| Words per minute | 1,068 | 41 |
| Cost per 1M generated words | $63.19 | $119.39 |

Rakenne completed 5.7x faster despite processing 7.4x more tokens. It also produced 4.6x more content, making it 47% cheaper per generated word ($63 vs $119 per 1M words). GPT-4o's duration includes multiple 600-second timeouts (the SoA timed out and needed a retry) and retries after garbled output.
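The headline ratios follow directly from the scorecard numbers:

```python
# Reproducing the performance ratios from the table above.
speedup = 103 / 18                 # duration: GPT-4o vs Rakenne
token_ratio = 1_136_104 / 153_247  # tokens processed
word_ratio = 19_306 / 4_188        # words produced
savings = 1 - 63.19 / 119.39       # cost per 1M generated words

print(f"{speedup:.1f}x faster, {token_ratio:.1f}x tokens, "
      f"{word_ratio:.1f}x words, {savings:.0%} cheaper")
# 5.7x faster, 7.4x tokens, 4.6x words, 47% cheaper
```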


Why the Difference Exists

The gap isn’t about model intelligence — GPT-4o is a capable model. The difference comes from three architectural advantages Rakenne offers:

1. Reference Files Prevent Hallucination

Rakenne’s ISO 27001 skills include reference files with the official 93 Annex A control titles, clause requirements, and document templates. The agent reads these files before generating output. It can’t hallucinate “Review of the Information Security policy” for A.5.2 because it reads “Information security roles and responsibilities” from the reference file.

GPT-4o relies entirely on training data. Its ISO 27001 knowledge is a blend of 2013 and 2022 standard versions, producing a control naming scheme that exists in neither.

2. Validation Tools Catch Errors Before Output

Rakenne’s skills include 40+ extension tools that validate output as it’s produced — checking generated control names against the official Annex A titles, verifying that cross-document file references resolve, and confirming that every action item carries an owner, due date, and expected outcome.

GPT-4o has no validation tools. It produces output and moves on — there’s no mechanism to check its own work.

3. Shared Context Creates a Document System

Each Rakenne skill reads the output of previous skills. The Risk Assessment reads the Asset Inventory. The SoA reads the Risk Register. The Policy Generator reads the SoA. The Management Review reads the Audit Report.

GPT-4o processes each prompt independently. Even within the same conversation, it doesn’t read back the files it wrote — it generates new content from the conversation context, which is why cross-references are absent and data consistency degrades across phases.


Benchmark Methodology

This comparison was run using Rakenne’s automated quality scorer, which measures:

  • Hallucination detection: Invalid ISO clauses, invalid Annex A controls, wrong control names (keyword overlap against official titles), legacy 2013 categories (A.9+ detection)
  • Reference integrity: ISO clause accuracy, Annex A accuracy, cross-document reference validation
  • Consistency analysis: Company name, role titles, terminology synonyms across documents
  • Completeness scoring: Mandatory fields per document type checked against ISO requirements
  • Date consistency: Year extraction with staleness threshold
  • Internal cross-references: Document interconnectedness measured by file references
  • Execution stats: Token usage, cost, duration from session files
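The "keyword overlap against official titles" check can be approximated as below; the tokenization and any pass/fail threshold are assumptions about the scorer's internals:

```python
# Score a generated control title by word overlap with the official title.
# 1.0 means every official word appears; low scores indicate fabrication.
def title_overlap(generated: str, official: str) -> float:
    gen = set(generated.lower().split())
    off = set(official.lower().split())
    return len(gen & off) / len(off) if off else 0.0

official = "Information security roles and responsibilities"
good = "Information security roles and responsibilities"
bad = "Review of the Information Security policy"     # GPT-4o's A.5.2 output
print(round(title_overlap(good, official), 2))  # 1.0
print(round(title_overlap(bad, official), 2))   # 0.4
```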

Both projects used the same Rakenne backend infrastructure (workspace file I/O, session management). The only difference was the skill layer and LLM model. All output examples in this article are real, unedited excerpts from the benchmark run.


Try It Yourself

Rakenne’s ISO 27001 skills are available with no signup required. Start with the Organization Profile, then work through Risk Assessment, SoA, and Policy Generation. See how the validation tools catch issues ChatGPT would miss.

Get Started Free — No Sign-Up


This benchmark was conducted on April 2, 2026 using Rakenne and OpenAI GPT-4o. Both platforms are updated frequently — your results may vary.
