
Re-Architecting a Knowledge Base: Flat Files to Schema-Validated YAML

15 May 2026 · 3 min read

Knowledge Base, YAML, Architecture, Migration, Schema, Automation

The Problem

The knowledge base had grown organically — a flat folder of markdown files with no structure, no validation, and no way to know what was current or stale. Key problems:

No schema — every file had its own format
No index — finding anything required searching
Token budget invisible — no tracking of context window usage
No validation — malformed entries went undetected
Stale files mixed — current and archived content lived side by side

The Solution: 4-Tier YAML Knowledge Base

Structure

03_Knowledge_Base/
├── 00_directives/       # Core rules, principles, never-changing
├── 01_hot/              # Active context, current session
├── 02_warm/             # Reference, lessons, research
│   ├── reference/
│   ├── research/
│   └── templates/
└── 03_cold/             # Archive, obsolete, raw backups
    ├── raw/
    └── archive/

Each tier has a specific purpose and token budget:

Tier	Budget (t = tokens)	Purpose
00_directives	300t	Core rules, always loaded
01_hot	640t	Active session context
02_warm	1,000t	Per-session reference cap
03_cold	460t	Archive, loaded on demand
Total	2,400t	
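
As a rough illustration, the budget table could live in code as a small lookup so the total is computed rather than typed by hand. The names TIER_BUDGETS and totalBudget are hypothetical and not taken from the project's scripts; only the numbers come from the table.

```javascript
// Tier budgets in tokens, copied from the table above.
// The object shape and the names used here are illustrative assumptions.
const TIER_BUDGETS = Object.freeze({
  "00_directives": 300,
  "01_hot": 640,
  "02_warm": 1000, // per-session cap, not a sum over every warm file
  "03_cold": 460,
});

// Sum the per-tier budgets to get the overall ceiling.
function totalBudget(budgets) {
  return Object.values(budgets).reduce((sum, tokens) => sum + tokens, 0);
}

console.log(totalBudget(TIER_BUDGETS)); // 2400
```

Deriving the total from the per-tier values means a change to one tier cannot silently drift away from the stated 2,400t ceiling.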

Schema Validation

Every YAML file is validated against a schema that enforces required fields, field types, and cross-references. A validate-schema.js script runs before each commit:

node _scripts/validate-schema.js --verbose
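
The post does not show the schema itself, so the following is only a minimal sketch of the three checks named above (required fields, type checking, cross-references), applied to an entry that has already been parsed from YAML. The field names (id, tier, summary, tokens, updated, refs) and the function validateEntry are assumptions made for illustration, not the contents of validate-schema.js.

```javascript
// Hypothetical schema: field name -> expected JavaScript type.
// The real schema used by validate-schema.js is not shown in this post.
const ENTRY_SCHEMA = {
  id: "string",
  tier: "string",
  summary: "string",
  tokens: "number",
  updated: "string",
};

const KNOWN_TIERS = ["00_directives", "01_hot", "02_warm", "03_cold"];

// Returns a list of human-readable problems; an empty list means valid.
function validateEntry(entry, knownIds = new Set()) {
  const errors = [];
  // Required-field and type checks.
  for (const [field, type] of Object.entries(ENTRY_SCHEMA)) {
    if (!(field in entry)) {
      errors.push(`missing required field: ${field}`);
    } else if (typeof entry[field] !== type) {
      errors.push(`${field} should be ${type}, got ${typeof entry[field]}`);
    }
  }
  if (typeof entry.tier === "string" && !KNOWN_TIERS.includes(entry.tier)) {
    errors.push(`unknown tier: ${entry.tier}`);
  }
  // Cross-reference check: every id listed in the optional `refs` array
  // must exist somewhere in the knowledge base.
  const refs = Array.isArray(entry.refs) ? entry.refs : [];
  for (const ref of refs) {
    if (!knownIds.has(ref)) errors.push(`dangling reference: ${ref}`);
  }
  return errors;
}
```

A pre-commit wrapper would typically call something like this for every file and exit with a non-zero status if any error list is non-empty, which is what blocks the commit.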

Auto-Generated Compass

A regenerate-compass.js script scans the entire knowledge base and produces _compass.yaml, an index of every entry with its path, summary, and last-updated timestamp.
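
One way the index-building step might look, as a sketch only: it takes files that have already been read and parsed, and returns the rows that would be written to _compass.yaml. The input shape ({ path, data }), the field names summary and updated, and the function buildCompass are assumptions; reading from disk and YAML serialization are left out.

```javascript
// Build index rows from already-parsed entries.
// Paths are assumed to be relative to the knowledge base root,
// e.g. "02_warm/reference/api.yaml".
function buildCompass(files) {
  return files
    // Skip generated or tooling files such as _compass.yaml and _scripts/.
    .filter((file) => !file.path.startsWith("_"))
    .map((file) => ({
      path: file.path,
      tier: file.path.split("/")[0], // first path segment is the tier folder
      summary: file.data.summary ?? "",
      updated: file.data.updated ?? null,
    }))
    // Sort by path so regenerating the index gives a stable, diff-friendly order.
    .sort((a, b) => (a.path < b.path ? -1 : a.path > b.path ? 1 : 0));
}
```

Because the index is derived entirely from the files themselves, rerunning the script after any change brings it back in sync without manual edits.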

Migration Stats

Metric	Value
YAML files created	32
Markdown files archived	20
Agent files updated	13
Stale paths fixed	6
Files deleted	4
Restore point	Full pre-migration snapshot

Token Budget Fix

A critical bug was discovered: the warm pool was being calculated as the sum of all warm files, when it is actually a per-session cap. After the fix, only the warm files referenced in a session count toward the 1,000t cap: if a session references 5 files, the tokens of those 5 files are counted, not the total of every file in 02_warm.
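
The difference between the two calculations can be sketched as follows. The function names and the { path, tokens } record shape are hypothetical, and the token counts in the usage example are invented purely to show the behaviour; only the 1,000t cap comes from the post.

```javascript
const WARM_CAP = 1000; // per-session cap from the tier table

// Buggy behaviour described above: every warm file counts, referenced or not.
function warmUsageAllFiles(warmFiles) {
  return warmFiles.reduce((sum, f) => sum + f.tokens, 0);
}

// Fixed behaviour: only files referenced in this session count toward the cap.
function warmUsageForSession(warmFiles, referencedPaths) {
  const referenced = new Set(referencedPaths);
  return warmFiles
    .filter((f) => referenced.has(f.path))
    .reduce((sum, f) => sum + f.tokens, 0);
}

// Invented example data, not measurements from the real knowledge base.
const warmFiles = [
  { path: "02_warm/reference/a.yaml", tokens: 400 },
  { path: "02_warm/research/b.yaml", tokens: 500 },
  { path: "02_warm/templates/c.yaml", tokens: 300 },
];
const session = ["02_warm/reference/a.yaml", "02_warm/templates/c.yaml"];

console.log(warmUsageAllFiles(warmFiles) > WARM_CAP);            // true: 1200 looks over budget
console.log(warmUsageForSession(warmFiles, session) > WARM_CAP); // false: 700 is within the cap
```

With the old calculation, simply adding reference material to 02_warm would eventually trip the budget check even if no session ever loaded it.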

Audit Results

Baseline over budget by 150t (2,550 / 2,400): acceptable with model fallback on the free tier
Hot entries (860t) exceed their allocation (640t) by 220t: needs pruning in the next session
All schemas valid — no validation errors
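
The arithmetic behind the first two audit findings is simple enough to sketch. The helper name auditLine is hypothetical; the usage and budget figures are the ones reported above, and per-tier usage for the other tiers is not given in the post, so it is not included.

```javascript
// Compare measured usage against a budget and report any overage.
function auditLine(label, used, budget) {
  const diff = used - budget;
  return {
    label,
    used,
    budget,
    over: Math.max(diff, 0), // 0 when within budget
    status: diff > 0 ? "over" : "ok",
  };
}

const report = [
  auditLine("baseline", 2550, 2400),
  auditLine("01_hot", 860, 640),
];

for (const r of report) {
  console.log(`${r.label}: ${r.used}/${r.budget} (${r.status}, +${r.over}t)`);
}
// baseline: 2550/2400 (over, +150t)
// 01_hot: 860/640 (over, +220t)
```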

Key Results

32 structured YAML files replacing flat markdown
Schema validation enforced pre-commit
Auto-generated index (_compass.yaml) always up to date
Token budget accurately tracked per tier
13 agent files updated with new KB paths
Restore point created for rollback safety
20 stale markdown files archived and 4 files deleted

Takeaways

Structure prevents entropy. A 4-tier system with clear boundaries keeps information findable.
Validate early, validate often. Schema validation catches mistakes before they compound.
Token budgets need monitoring. It's easy to exceed limits without visibility.
Auto-generated indexes scale. Manual indexes become stale; regenerated ones don't.
Always snapshot before migration. A restore point saved the session when a path broke.