ISO 27001 ISMS Benchmark: Rakenne vs GPT-4o on Audit-Ready Documentation
A side-by-side benchmark comparing Rakenne's ISO 27001 skills against GPT-4o on control name accuracy, hallucination rates, cross-document traceability, and audit readiness — with real output examples from both systems.
ISO 27001 isn’t one document — it’s a system of interdependent artifacts where a mistake in the risk register cascades through every downstream deliverable. That makes it the perfect stress test for AI compliance tools.
We ran the same 8-phase ISO 27001 workflow through Rakenne and GPT-4o to see which one could produce audit-ready ISMS documentation. We scored both outputs with an automated quality benchmark measuring hallucination rates, control name accuracy, cross-document references, and completeness.
The results were stark: 99% vs 6% on control name accuracy, 18 vs 0 internal cross-references, and 5.7x faster completion.
The Benchmark Setup
Fictional client: CloudSync Solutions — a B2B SaaS company with 85 employees, headquartered in São Paulo, Brazil. They run their project management platform on GCP (GKE, Cloud SQL, Cloud Storage), use GitHub for code, Google Workspace for collaboration, and Okta for SSO. Key suppliers: AWS CloudFront (CDN), Stripe (payments), SendGrid (email). They must comply with LGPD and have contractual data protection obligations from enterprise clients.
The test: 8 phases of the ISO 27001 workflow, using the same prompts for both systems:
- Organization Profile
- ISMS Scope Definition
- Asset Inventory
- Risk Assessment (5x5 matrix, threshold at 12)
- Statement of Applicability (93 Annex A controls)
- Policy Generation (3 core policies)
- Internal Audit Report
- Management Review
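Phase 4's 5x5 matrix with a treatment threshold of 12 reduces to simple arithmetic. A minimal sketch (function names are ours, not Rakenne's; we assume scores at or above the threshold require treatment):

```python
def risk_score(likelihood: int, impact: int) -> int:
    """Inherent risk on the 5x5 matrix: both axes run 1-5."""
    if not (1 <= likelihood <= 5 and 1 <= impact <= 5):
        raise ValueError("likelihood and impact must be between 1 and 5")
    return likelihood * impact

def needs_treatment(score: int, threshold: int = 12) -> bool:
    """Assumed rule: scores at or above the threshold must be treated."""
    return score >= threshold

# Unauthorized access to customer data: likelihood 4, impact 5 -> 20, treat.
print(needs_treatment(risk_score(4, 5)))  # True
```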
Rakenne used the iso27001-isms template with 23 skills and 40+ validation tools. GPT-4o used an empty project with no skills, no reference files, and no validation tools — just the same prompts with a consultant role prefix.
The Scorecard
| Metric | Rakenne | GPT-4o | Winner |
|---|---|---|---|
| Hallucinations per 1K words | 0.05 | 21.49 | Rakenne |
| Control name accuracy | 99% | 6% | Rakenne |
| Legacy 2013 control refs | 0 | 8 | Rakenne |
| Internal cross-references | 18 | 0 | Rakenne |
| Annex A coverage (of 93) | 100% | 98% | Rakenne |
| Stale year references | 1 | 2 | Rakenne |
| Duration | 18 min | 103 min | Rakenne |
| Files produced | 20 | 13 | Rakenne |
| Word count | 19,306 | 4,188 | Rakenne |
| Cost per 1M generated words | $63.19 | $119.39 | Rakenne |
Finding 1: GPT-4o Fabricated 82 of 87 Control Names
This is the most consequential finding. When GPT-4o generated the Statement of Applicability covering all 93 Annex A controls, it fabricated the control names — producing plausible-sounding titles that don’t match the actual ISO 27001:2022 standard.
Real examples from GPT-4o’s output (excerpt):
| Control | Official ISO 27001:2022 Title | What GPT-4o Wrote |
|---|---|---|
| A.5.2 | Information security roles and responsibilities | Review of the Information Security policy |
| A.5.3 | Segregation of duties | Internal communication of the Information Security policy |
| A.5.4 | Management responsibilities | External communication with relevant stakeholders |
| A.5.5 | Contact with authorities | Remedial action in case of non-compliance |
| A.7.8 | Equipment siting and protection | Policy on document langchain |
An auditor reviewing this SoA would immediately see that the control names don’t match the standard. Every row is wrong. The control numbers are valid (A.5.1 through A.8.34) — but the names are hallucinated from a mix of 2013-era knowledge and general security terminology.
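This kind of mismatch is mechanically detectable. A sketch of the keyword-overlap check described in the methodology below (the stopword list and 50% threshold are our assumptions, not the scorer's actual tuning):

```python
def title_matches(official: str, generated: str, min_overlap: float = 0.5) -> bool:
    """Flag a hallucinated control name when keyword overlap with the
    official ISO 27001:2022 title falls below a threshold."""
    stop = {"of", "the", "and", "with", "in", "on", "a", "an", "for", "to"}
    official_kw = {w for w in official.lower().split() if w not in stop}
    generated_kw = {w for w in generated.lower().split() if w not in stop}
    if not official_kw:
        return False
    return len(official_kw & generated_kw) / len(official_kw) >= min_overlap

# A.5.3's real title shares no keywords with GPT-4o's invented one.
print(title_matches("Segregation of duties",
                    "Internal communication of the Information Security policy"))  # False
```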
Rakenne’s output for the same controls (excerpt):
| Category | Total | Included | Excluded |
|---|---|---|---|
| Organizational (A.5) | 37 | 37 | 0 |
| People (A.6) | 8 | 8 | 0 |
| Physical (A.7) | 14 | 13 | 1 |
| Technological (A.8) | 34 | 33 | 1 |
| Total | 93 | 91 | 2 |
Rakenne uses reference files containing the official 93 control titles from ISO 27001:2022. The agent can’t hallucinate control names because it reads them from the reference data rather than generating from memory.
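The pattern is lookup, not generation. Sketched below (the dictionary stands in for Rakenne's reference file; the entries shown are real 2022 titles, the function is illustrative):

```python
# A slice of the official ISO 27001:2022 Annex A titles, keyed by control ID.
OFFICIAL_TITLES = {
    "A.5.2": "Information security roles and responsibilities",
    "A.5.3": "Segregation of duties",
    "A.5.4": "Management responsibilities",
    "A.7.8": "Equipment siting and protection",
}

def control_title(control_id: str) -> str:
    """Read the title from reference data; an unknown ID raises a KeyError
    rather than letting the model invent a plausible-sounding name."""
    return OFFICIAL_TITLES[control_id]

print(control_title("A.5.3"))  # Segregation of duties
```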
Finding 2: GPT-4o Used Deprecated 2013 Control Categories
GPT-4o’s Risk Register mapped risks to control categories that don’t exist in ISO 27001:2022:
From GPT-4o’s Risk Register (excerpt):
| # | Risk | Threats | L | I | Score | Treatment | Controls |
|---|---|---|---|---|---|---|---|
| 1 | Unauthorized access to customer data | External attackers, insider threat | 4 | 5 | 20 | Mitigate | A.9 (Access Control), A.10 (Cryptography) |
| 2 | Cloud infrastructure misconfiguration | Human error | 3 | 5 | 15 | Mitigate | A.12 (Operations Security) |
| 3 | Insider threats | Disgruntled employees | 3 | 4 | 12 | Mitigate | A.7 (Human Resource Security), A.13 (Communications Security) |
A.9, A.10, A.12, A.13, A.14, A.15 — these are ISO 27001:2013 control domains. The 2022 standard restructured all controls into just four themes: A.5 (Organizational), A.6 (People), A.7 (Physical), A.8 (Technological). An auditor would flag every one of these references.
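Because the 2022 Annex stops at A.8, legacy references are easy to flag with a pattern match. A sketch of the "A.9+ detection" the methodology section mentions (the regex is ours):

```python
import re

# ISO 27001:2022 only has themes A.5-A.8; any A.9 through A.18 reference
# can only come from the withdrawn 2013 control domains.
LEGACY_REF = re.compile(r"\bA\.(?:9|1[0-8])(?:\.\d+)?\b")

def find_legacy_refs(text: str) -> list[str]:
    """Return every 2013-era control reference found in a document."""
    return [m.group(0) for m in LEGACY_REF.finditer(text)]

print(find_legacy_refs("Mitigate via A.9 (Access Control) and A.12"))
# ['A.9', 'A.12']
```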
From Rakenne’s Risk Register (excerpt):
| ID | Assets | Threat | Vulnerability | L | I | Inherent | Treatment | Controls | Owner | RL | RI | Residual |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| R-01 | Customer Project Data, User Credentials & Auth Tokens, Customer PII (LGPD), Financial Records (Stripe) | Unauthorized access (External) | Weak access controls / Lack of MFA | 4 | 5 | 20 | Treat | A.5.15, A.5.16, A.5.18, A.5.34, A.8.2, A.8.3, A.8.5 | Maria Santos (CISO) | 2 | 5 | 10 |
Rakenne references specific 2022 controls (A.5.15 Access control, A.8.5 Secure authentication) rather than deprecated category-level references. Each risk maps to concrete controls with residual risk scores and named owners.
Finding 3: Rakenne Documents Reference Each Other — GPT-4o’s Don’t
Rakenne produced 18 internal cross-references between output documents. GPT-4o produced zero.
From Rakenne’s Information Security Policy (excerpt):
11. Related Documents
| Document | Reference |
|---|---|
| ISMS Scope Statement | output/ISMS-Scope-Statement.md |
| Risk Assessment Methodology | output/Risk-Assessment-Methodology.md |
| Statement of Applicability | output/Statement-of-Applicability.md |
| Risk Management Policy | output/POL-002-Risk-Management-Policy.md |
| Access Control Policy | output/POL-003-Access-Control-Policy.md |
From GPT-4o’s Information Security Policy (excerpt):
6. Related Documents
- POL-002: Risk Management Policy
- POL-003: Access Control Policy
- Statement of Applicability (SoA)
GPT-4o lists document titles but doesn’t link to actual files. An auditor tracing the thread from risk to control to policy to evidence would find no navigable references.
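The difference is testable: a reference only counts if it names a file the workflow actually produced. A sketch (the `output/` path convention follows Rakenne's excerpts above; the function is illustrative):

```python
import re

FILE_REF = re.compile(r"output/[\w\-]+\.md")

def count_cross_references(doc_text: str, produced_files: set[str]) -> int:
    """Count distinct file references that resolve to real documents;
    a bare title without a path scores zero."""
    refs = set(FILE_REF.findall(doc_text))
    return sum(1 for ref in refs if ref in produced_files)

produced = {"output/ISMS-Scope-Statement.md",
            "output/Statement-of-Applicability.md"}
rakenne_policy = ("See output/ISMS-Scope-Statement.md and "
                  "output/Statement-of-Applicability.md.")
gpt4o_policy = "- POL-002: Risk Management Policy\n- Statement of Applicability (SoA)"

print(count_cross_references(rakenne_policy, produced))  # 2
print(count_cross_references(gpt4o_policy, produced))    # 0
```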
Finding 4: GPT-4o Policies Have Wrong Dates
From GPT-4o’s policy (excerpt):
| Field | Value |
|---|---|
| Document ID | POL-001 |
| Version | 1.0 |
| Date | 2024-05-20 |
The benchmark ran on April 1, 2026. GPT-4o dated the policy “2024-05-20” — nearly two years in the past. This suggests the model pulled dates from its training data rather than the current context.
From Rakenne’s policy (excerpt):
Effective date: 2026-04-01
Next review date: 2027-04-01
Document owner: Maria Santos (CISO)
Approved by: Joao Silva (CTO)
Rakenne generates dates from the current context and includes classification level, change history, and next review date — all required by ISO 27001 Clause 7.5 for document control.
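The scorer's date check needs nothing more than year extraction compared against the run date. A sketch (the one-year staleness threshold is our assumption):

```python
import re
from datetime import date

def stale_years(doc_text: str, today: date, max_age_years: int = 1) -> list[int]:
    """Flag four-digit years more than `max_age_years` behind the run date."""
    years = {int(y) for y in re.findall(r"\b(20\d{2})\b", doc_text)}
    return sorted(y for y in years if today.year - y > max_age_years)

print(stale_years("Date | 2024-05-20", date(2026, 4, 1)))        # [2024]
print(stale_years("Effective date: 2026-04-01", date(2026, 4, 1)))  # []
```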
Finding 5: Management Review Quality
ISO 27001 Clause 9.3 requires 10 specific inputs for management review. Both systems addressed them, but the depth differed dramatically.
Rakenne’s action tracker (excerpt):
| Action ID | Description | Owner | Due Date | Expected Outcome | Status |
|---|---|---|---|---|---|
| MR-01-A01 | Complete retrospective access review for GCP/Okta (NC-001). | Maria Santos | 2026-09-10 | Signed review log available. | Not started |
| MR-01-A02 | Run and document a full backup restore test (NC-002). | Joao Silva | 2026-09-10 | Restore log with 100% success. | Not started |
| MR-01-A03 | Update POL-001 with new Management Sponsor signature. | Maria Santos | 2026-09-05 | Updated policy published. | Not started |
| MR-01-A04 | Launch “Security Month” training push to hit 100%. | Maria Santos | 2026-09-30 | 100% completion report generated. | Not started |
GPT-4o’s action tracker (excerpt):
| Action | Owner | Due Date | Status |
|---|---|---|---|
| Improve incident response procedure | Maria Santos | 30 Sep 2026 | Open |
| Schedule backup verification tests | Joao Silva | 30 Jun 2026 | Open |
| Complete comprehensive access review | Ana Oliveira | 30 Jun 2026 | Open |
Rakenne’s actions link to specific audit non-conformities (NC-001, NC-002), include measurable expected outcomes (“Signed review log available”), and were validated by the action completeness checker tool to ensure every action has an owner, due date, and expected outcome. GPT-4o produced generic actions with no linkage to audit findings and no measurable outcomes.
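The completeness rule is simple to state: no action ships without an owner, a due date, and a measurable outcome. A sketch of such a checker (field names are ours, not the tool's schema):

```python
def missing_fields(action: dict) -> list[str]:
    """Return the fields an action row is missing; an empty list
    means the action is traceable and measurable."""
    required = ("owner", "due_date", "expected_outcome")
    return [field for field in required if not action.get(field)]

rakenne_row = {"id": "MR-01-A01", "owner": "Maria Santos",
               "due_date": "2026-09-10",
               "expected_outcome": "Signed review log available."}
gpt4o_row = {"action": "Improve incident response procedure",
             "owner": "Maria Santos", "due_date": "2026-09-30"}

print(missing_fields(rakenne_row))  # []
print(missing_fields(gpt4o_row))    # ['expected_outcome']
```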
Execution Performance
| Metric | Rakenne | GPT-4o |
|---|---|---|
| Total duration | 18 minutes | 103 minutes |
| Total tokens | 1,136,104 | 153,247 |
| Words per minute | 1,068 | 41 |
| Cost per 1M generated words | $63.19 | $119.39 |
Rakenne completed 5.7x faster despite processing 7.4x more tokens. It produced 4.6x more content — making it 47% cheaper per generated word ($63 vs $119 per 1M generated words). GPT-4o’s duration includes multiple 600-second timeouts (the SoA timed out and needed retry) and garbled output retries.
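The headline ratios fall straight out of the table:

```python
# Figures from the execution performance table above.
rakenne_rate, gpt4o_rate = 63.19, 119.39  # $ per 1M generated words
rakenne_min, gpt4o_min = 18, 103          # total duration in minutes

speedup = gpt4o_min / rakenne_min         # 103 / 18 = 5.72...
savings = 1 - rakenne_rate / gpt4o_rate   # 1 - 63.19/119.39 = 0.47...

print(f"{speedup:.1f}x faster, {savings:.0%} cheaper per generated word")
# 5.7x faster, 47% cheaper per generated word
```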
Why the Difference Exists
The gap isn’t about model intelligence — GPT-4o is a capable model. The difference comes from three architectural advantages Rakenne offers:
1. Reference Files Prevent Hallucination
Rakenne’s ISO 27001 skills include reference files with the official 93 Annex A control titles, clause requirements, and document templates. The agent reads these files before generating output. It can’t hallucinate “Review of the Information Security policy” for A.5.2 because it reads “Information security roles and responsibilities” from the reference file.
GPT-4o relies entirely on training data. Its ISO 27001 knowledge is a blend of 2013 and 2022 standard versions, producing a control naming scheme that exists in neither.
2. Validation Tools Catch Errors Before Output
Rakenne’s skills include 40+ extension tools that validate output as it’s produced:
- Risk methodology validator checks the 5x5 matrix is complete
- Risk entry validator ensures every risk has all required fields
- Residual risk validator confirms residual scores don’t exceed inherent scores
- SoA control justification audit cross-references the SoA against scope and risk assessment
- Audit impartiality checker verifies auditor independence
- Action completeness checker ensures every management review action has an owner and due date
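The residual risk check, for instance, is a one-line invariant. A sketch (field names are illustrative, not the validator's actual schema):

```python
def validate_residual(risk: dict) -> list[str]:
    """Treatment must not make things worse: residual risk may never
    exceed inherent risk."""
    errors = []
    inherent = risk["likelihood"] * risk["impact"]
    residual = risk["residual_likelihood"] * risk["residual_impact"]
    if residual > inherent:
        errors.append(f"{risk['id']}: residual {residual} exceeds inherent {inherent}")
    return errors

# R-01 from the Rakenne risk register: 4x5 inherent, 2x5 residual.
r01 = {"id": "R-01", "likelihood": 4, "impact": 5,
       "residual_likelihood": 2, "residual_impact": 5}
print(validate_residual(r01))  # []
```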
GPT-4o has no validation tools. It produces output and moves on — there’s no mechanism to check its own work.
3. Shared Context Creates a Document System
Each Rakenne skill reads the output of previous skills. The Risk Assessment reads the Asset Inventory. The SoA reads the Risk Register. The Policy Generator reads the SoA. The Management Review reads the Audit Report.
GPT-4o processes each prompt independently. Even within the same conversation, it doesn’t read back the files it wrote — it generates new content from the conversation context, which is why cross-references are absent and data consistency degrades across phases.
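That dependency chain can be sketched as a pipeline where each phase's context is the text of its declared inputs (names and structure are ours, not Rakenne's internals):

```python
def run_pipeline(phases, generate):
    """Each phase receives the documents it depends on, so later
    documents can cite earlier ones by filename."""
    workspace: dict[str, str] = {}
    for output, inputs in phases:
        context = {name: workspace[name] for name in inputs}
        workspace[output] = generate(output, context)
    return workspace

phases = [("Asset-Inventory.md", []),
          ("Risk-Register.md", ["Asset-Inventory.md"]),
          ("Statement-of-Applicability.md", ["Risk-Register.md"])]

# Stub generator: a real run would call the model with prompt + context.
docs = run_pipeline(phases, lambda out, ctx: f"{out} read {sorted(ctx)}")
print(docs["Risk-Register.md"])  # Risk-Register.md read ['Asset-Inventory.md']
```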
Benchmark Methodology
This comparison was run using Rakenne’s automated quality scorer, which measures:
- Hallucination detection: Invalid ISO clauses, invalid Annex A controls, wrong control names (keyword overlap against official titles), legacy 2013 categories (A.9+ detection)
- Reference integrity: ISO clause accuracy, Annex A accuracy, cross-document reference validation
- Consistency analysis: Company name, role titles, terminology synonyms across documents
- Completeness scoring: Mandatory fields per document type checked against ISO requirements
- Date consistency: Year extraction with staleness threshold
- Internal cross-references: Document interconnectedness measured by file references
- Execution stats: Token usage, cost, duration from session files
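The per-1K-words normalization makes runs of different lengths comparable. A sketch (the raw count below is back-solved from the scorecard for illustration, not taken from the scorer's logs):

```python
def hallucinations_per_1k_words(count: int, word_count: int) -> float:
    """Normalize hallucination count per 1,000 generated words."""
    return count / word_count * 1000

# Back-solving from the scorecard: ~90 hallucinations across GPT-4o's
# 4,188 words reproduces the reported 21.49 rate.
print(round(hallucinations_per_1k_words(90, 4_188), 2))  # 21.49
```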
Both projects used the same Rakenne backend infrastructure (workspace file I/O, session management). The only difference was the skill layer and LLM model. All output examples in this article are real, unedited excerpts from the benchmark run.
Try It Yourself
Rakenne’s ISO 27001 skills are available with no signup required. Start with the Organization Profile, then work through Risk Assessment, SoA, and Policy Generation. See how the validation tools catch issues GPT-4o would miss.
This benchmark was conducted on April 2, 2026 using Rakenne and OpenAI GPT-4o. Both platforms are updated frequently — your results may vary.