The main Strata post hand-waves the algorithm. “Split text at entropy transitions the way BSP splits 3D space at planes.” That’s the concept. This post is the implementation: the specific mechanics that turn a raw file into a structural outline.
Finding the Split Points
The algorithm checks four structural signals before touching entropy. In order:
Separator lines (// ========, # --------). Lines with 8+ repeated separator characters are unconditional split points, scored at 5.0 — the highest possible. When a developer has already marked section boundaries, the algorithm respects them. These are the pre-placed portals of the BSP.
Blank lines. Every blank line is a candidate, regardless of bracket depth. A blank line inside a class body at depth 1 competes equally with one at depth 0. This is deliberately permissive — the boundary budget (below) handles the overflow.
Bracket depth returns. When the cumulative bracket depth (incremented on { [ (, decremented on } ] )) returns to zero from a positive value, a top-level block just closed. This catches dense code without blank lines: minified JavaScript, compact Java. XML/HTML gets the same treatment via tag depth tracking.
Significant dedent. When indentation drops by 2+ levels (8+ spaces or 2+ tabs), something structural happened. This is how the algorithm handles Python, YAML, and other indent-significant languages where brackets don’t exist.
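The four signals above can be sketched in one candidate-collection pass. This is a minimal illustration, not the actual implementation: the function and type names are hypothetical, and it ignores strings and comments when counting brackets.

```typescript
// Sketch of the four structural signals (names hypothetical; string/comment
// contents are not excluded from bracket counting in this simplified version).
type Candidate = { line: number; kind: "separator" | "blank" | "bracketReturn" | "dedent" };

const SEPARATOR = /([=\-#*_~])\1{7,}/; // 8+ repeated separator characters

function structuralCandidates(lines: string[]): Candidate[] {
  const out: Candidate[] = [];
  let depth = 0;
  let prevIndent = 0;
  for (let i = 0; i < lines.length; i++) {
    const line = lines[i];
    if (SEPARATOR.test(line)) out.push({ line: i, kind: "separator" });
    else if (line.trim() === "") out.push({ line: i, kind: "blank" });
    const before = depth;
    for (const ch of line) {
      if ("{[(".includes(ch)) depth++;
      else if ("}])".includes(ch)) depth--;
    }
    // Cumulative bracket depth returned to zero: a top-level block closed.
    if (before > 0 && depth === 0) out.push({ line: i, kind: "bracketReturn" });
    const indent = line.length - line.trimStart().length;
    // Significant dedent: indentation dropped by 8+ spaces.
    if (line.trim() !== "" && prevIndent - indent >= 8) out.push({ line: i, kind: "dedent" });
    if (line.trim() !== "") prevIndent = indent;
  }
  return out;
}
```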
The key detail: every one of these boundaries gets scored by the entropy gradient at that point. A blank line between an import block and a class definition (strong entropy transition) scores up to 3.0. A blank line between two identical import statements (weak transition) scores near 0. Entropy isn’t a separate pass — it’s baked into the scoring of every structural signal.
The gradient scoring itself adapts per-region. The algorithm computes the 90th percentile of all gradients in the region and uses that as the scale, so even files with small absolute gradients still get meaningful score differentiation. A file where the strongest transition is 0.2 bits and a file where it’s 2.0 bits both produce scores spread across the full 0-1 range.
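The percentile scaling is easy to see in miniature. A hedged sketch (the function name and the exact percentile indexing are assumptions):

```typescript
// Hypothetical sketch: normalize gradient magnitudes by the region's 90th
// percentile so low-contrast files still spread scores across the 0-1 range.
function percentileScale(gradients: number[]): number[] {
  const sorted = gradients.map(Math.abs).sort((a, b) => a - b);
  const idx = Math.min(sorted.length - 1, Math.floor(sorted.length * 0.9));
  const p90 = sorted[idx] || 1; // guard against an all-zero region
  return gradients.map((g) => Math.min(1, Math.abs(g) / p90));
}
```

A region whose strongest gradient is 0.2 bits and one whose strongest is 2.0 bits both map their top transition to 1.0, which is the point of the per-region scale.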
The boundary budget
At each recursion level, the algorithm limits how many boundaries survive. This is the “choose the best splitting plane” step.
At depth 0, the budget is regionLength / 100 — a 1,000-line file gets roughly 10 section-level boundaries. All candidates compete: blank lines, bracket returns, dedents, separators. The top N by entropy-weighted score survive. The rest are discarded.
Depth 0 uses a wide entropy smoothing window (proportional to file size, minimum 30 lines) to find file-level section transitions. It also rescores boundaries against this coarse-grained entropy, so that a blank line at a major section transition outranks one between similar paragraphs. Deeper levels use a 15-line window for scoring structural boundaries — still entropy-weighted, but at a finer resolution.
Separator lines are exempt from budget cuts. They always survive.
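The budget step reduces to a top-N cut with a carve-out. A sketch under the stated rules (one boundary per 100 lines at depth 0, separators exempt); the names are illustrative:

```typescript
type Scored = { line: number; score: number; isSeparator: boolean };

// Sketch of the depth-0 boundary budget: keep roughly regionLength / 100
// candidates by score, but never cut separator lines.
function applyBudget(candidates: Scored[], regionLength: number): Scored[] {
  const budget = Math.max(1, Math.floor(regionLength / 100));
  const separators = candidates.filter((c) => c.isSeparator);
  const survivors = candidates
    .filter((c) => !c.isSeparator)
    .sort((a, b) => b.score - a.score) // entropy-weighted score, descending
    .slice(0, budget);
  return [...separators, ...survivors].sort((a, b) => a.line - b.line);
}
```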
When structure finds nothing
Inside a 400-line class body with no blank lines at bracket depth 0, no bracket returns, and no significant dedents, the structural shortcuts produce zero boundaries. This is where entropy works alone.
findEntropyBoundaries() smooths per-line entropy over a window scaled to region size (clamped to 5-25 lines), computes the gradient at every point, and selects the top N strongest peaks. N scales with both region size (~1 per 40 lines) and entropy variability: the coefficient of variation (CV = standard deviation / mean) of non-blank line entropies determines how aggressively to split.
This is where the CV guard earns its keep. If CV < 0.15, entropy is too uniform for meaningful boundaries — the data looks the same everywhere. A 400-line block of SQL INSERT statements stays as one node. A 400-line class body with methods at varying complexity gets decomposed into individual methods. Same mechanism, opposite outcomes, driven by the data. Without the CV guard, entropy over-splits uniform data at arbitrary points.
The variability scaling in detail:
- CV ≈ 0.2 (moderately uniform) → ~67% of base split count → fewer, more confident splits
- CV ≈ 0.3 (mixed code) → 100% → normal splitting
- CV > 0.4 (varied class body) → up to 130% → more splits to capture the complexity
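Putting the CV guard and the scaling together, a sketch (the linear interpolation between the listed anchor points is an assumption; the 0.15 cutoff and ~1-per-40-lines base are from the text):

```typescript
// Sketch: CV guard plus variability scaling for the entropy-only split count.
function splitCount(lineEntropies: number[], regionLines: number): number {
  const n = lineEntropies.length;
  const mean = lineEntropies.reduce((a, b) => a + b, 0) / n;
  const variance = lineEntropies.reduce((a, b) => a + (b - mean) ** 2, 0) / n;
  const cv = mean > 0 ? Math.sqrt(variance) / mean : 0;
  if (cv < 0.15) return 0; // too uniform: no meaningful boundaries
  const base = Math.max(1, Math.round(regionLines / 40)); // ~1 per 40 lines
  // ~67% of base at CV 0.2, 100% at CV 0.3, capped at 130% for CV > ~0.4.
  const factor = Math.min(1.3, Math.max(0.67, cv / 0.3));
  return Math.round(base * factor);
}
```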
Peaks get snapped to nearby blank lines (within 3 lines) when possible, aligning entropy-detected transitions with the visual breaks humans already use.
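The snap itself is a small search. A sketch (tie-breaking toward the earlier line is an assumption):

```typescript
// Sketch: snap an entropy peak to the nearest blank line within `radius`
// lines, preferring the earlier line on ties; otherwise keep the peak.
function snapToBlank(peak: number, lines: string[], radius = 3): number {
  for (let d = 0; d <= radius; d++) {
    for (const i of [peak - d, peak + d]) {
      if (i >= 0 && i < lines.length && lines[i].trim() === "") return i;
    }
  }
  return peak;
}
```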
As a last resort, if no peaks are strong enough for multi-boundary splitting, findEntropyTransition() finds the single strongest transition (minimum 0.3 bits, smoothed over a 5-line window) and makes a binary split. True BSP: one plane, two halves, recurse.
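The binary fallback is the simplest piece of the whole pipeline. A sketch of the selection step (the smoothing itself is assumed done upstream; the 0.3-bit floor is from the text):

```typescript
// Sketch of the binary-split fallback: return the index of the single
// strongest transition in smoothed entropy, or null if none reaches 0.3 bits.
function entropyTransition(smoothed: number[], minDelta = 0.3): number | null {
  let best = -1;
  let bestDelta = 0;
  for (let i = 1; i < smoothed.length; i++) {
    const delta = Math.abs(smoothed[i] - smoothed[i - 1]);
    if (delta > bestDelta) { bestDelta = delta; best = i; }
  }
  return bestDelta >= minDelta ? best : null;
}
```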
Depth control
The BSP tree doesn’t recurse uniformly. Each region’s average entropy is compared against the file-wide distribution via z-score:
- z > 0.5 (complex code, high entropy) → +1 max depth → more detail in the outline
- z < -0.5 (repetitive data, imports) → -1 max depth → less detail, summarized
Base max depth scales with file size: 1 for files under 300 lines, 4 for files up to 2,000 lines, 3 for files up to 50,000 lines, 2 for anything larger. The entropy adjustment means a complex 200-line region inside a large file can still get explored to depth 5, while a repetitive 200-line region stops at depth 3.
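The two rules combine into a per-region depth cap. A sketch using the thresholds from the text:

```typescript
// Sketch: file-size base depth plus the per-region z-score adjustment.
function maxDepth(totalLines: number, regionZ: number): number {
  let base: number;
  if (totalLines < 300) base = 1;
  else if (totalLines <= 2000) base = 4;
  else if (totalLines <= 50000) base = 3;
  else base = 2;
  if (regionZ > 0.5) base += 1;       // complex region: explore deeper
  else if (regionZ < -0.5) base -= 1; // repetitive region: summarize
  return base;
}
```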
That handles the splitting. The next question: what happens when the tree is full of nodes that all look the same?
Collapsing Repetition
After the BSP tree is built, a post-processing pass collapses repetitive siblings — the LOD pruning from the main post. Here’s what drives the collapse decisions.
Consecutive sibling nodes are compared via Jaccard similarity on character trigrams (the set of all 3-character substrings in the joined lines of each region). Trigrams catch structural repetition even when identifiers differ: getUser(id) and getOrder(id) share trigrams like get, (id, and id).
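Trigram Jaccard is a few lines. A self-contained sketch:

```typescript
// Sketch: Jaccard similarity on character trigrams of two regions.
function trigrams(text: string): Set<string> {
  const s = new Set<string>();
  for (let i = 0; i + 3 <= text.length; i++) s.add(text.slice(i, i + 3));
  return s;
}

function jaccard(a: string, b: string): number {
  const ta = trigrams(a);
  const tb = trigrams(b);
  let inter = 0;
  for (const t of ta) if (tb.has(t)) inter++;
  const union = ta.size + tb.size - inter;
  return union === 0 ? 0 : inter / union;
}
```

Identical regions score 1.0; structurally similar methods with different identifiers score high; unrelated code scores near 0.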
Two checks gate the collapse:
Similarity threshold (adaptive). The base threshold is 0.7 Jaccard similarity. It shifts based on the pair’s entropy relative to the file-wide distribution:
- Low-entropy pair (z < 0, repetitive data) → threshold drops toward 0.5 → collapses more easily
- High-entropy pair (z > 0, distinct logic) → threshold rises toward 0.9 → resists collapsing
The shift is 0.1 per unit z-score, clamped to [0.5, 0.9]. This means a pair of repetitive import blocks (low entropy, high similarity) collapses readily, while a pair of structurally similar but logically distinct functions (high entropy, moderate similarity) stays separate.
Size ratio (>= 0.3). The smaller sibling must be at least 30% the size of the larger. This prevents collapsing a 5-line utility function with a 200-line class just because they share some boilerplate.
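Both gates fit in one predicate. A sketch using the constants from the text (0.7 base threshold, 0.1 shift per unit z, [0.5, 0.9] clamp, 30% size ratio):

```typescript
// Sketch of the two collapse gates: adaptive similarity threshold and
// minimum size ratio between sibling regions.
function shouldCollapse(
  similarity: number, // trigram Jaccard of the pair
  pairZ: number,      // pair entropy z-score vs. the file-wide distribution
  sizeA: number,
  sizeB: number
): boolean {
  const threshold = Math.min(0.9, Math.max(0.5, 0.7 + 0.1 * pairZ));
  const ratio = Math.min(sizeA, sizeB) / Math.max(sizeA, sizeB);
  return similarity >= threshold && ratio >= 0.3;
}
```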
Runs of 3+ consecutive siblings that pass both checks collapse into a single representative node showing the count and a sample label. The character savings can be extreme:
- 26,000 lines of C++ (60 repetitive classes): 60 nodes → 3 nodes, 167 characters total
- 45,000 lines of XML: hundreds of nodes → 100 nodes, 6,429 characters
- 607-line mixed XML (varied content): 18 nodes — structure that varies is preserved
167 characters for 26,000 lines of C++. The agent gets the shape of the file for less than a tweet.
Two cleanup passes follow:
Closing-delimiter merge. Nodes whose label is just }, );, or similar get absorbed into the preceding sibling. These are structural artifacts from bracket-depth boundary detection with no navigational value.
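Over node labels, the merge is a single fold. A sketch (the exact delimiter set is an assumption):

```typescript
// Sketch: absorb labels made only of closing delimiters into the previous
// sibling's label.
const CLOSER = /^[}\])\s;,]*$/;

function mergeClosers(labels: string[]): string[] {
  const out: string[] = [];
  for (const label of labels) {
    if (out.length > 0 && label.trim() !== "" && CLOSER.test(label)) {
      out[out.length - 1] += " " + label.trim(); // merge into preceding node
    } else {
      out.push(label);
    }
  }
  return out;
}
```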
Render depth cap. The outline renderer caps display at 2 levels of hierarchy. The BSP tree retains deeper structure, but the outline only shows sections and their immediate children. Deeper detail is available on demand via targeted reads — the agent uses the line ranges in the outline to request exactly the code it needs.
That covers what happens inside a single file. But files don’t exist in isolation.
Cross-File Connections
The main post describes the connection graph output. Here are the TF-IDF mechanics that produce it.
The structural tree already separates headers (node labels — function names, class names, type definitions) from bodies (everything else in the node). This header/body distinction maps directly to definitions and references without parsing imports.
Token extraction
Every identifier (3+ alphanumeric characters) is extracted from both headers and bodies. Compound identifiers are split: buildDynamicContext becomes build, Dynamic, Context. Each token is tagged with its position (header or body), file, block label, line number, and block size.
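The compound split is a lookahead split on uppercase letters plus the 3-character floor. A sketch:

```typescript
// Sketch: split a camelCase/PascalCase identifier into parts, keeping only
// parts of 3+ characters (matching the extraction rule).
function splitIdentifier(token: string): string[] {
  return token
    .split(/(?=[A-Z])/) // split before each uppercase letter
    .filter((part) => part.length >= 3);
}
```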
IDF filtering
Tokens appearing in more than half the indexed files are filtered out — language noise like return, const, function, import. The IDF weight, log2(total files / files containing the token), ensures rare, meaningful identifiers drive the connections while common keywords are ignored.
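The filter and the weight collapse into one function. A sketch (returning 0 for filtered tokens is an assumption about how "ignored" is represented):

```typescript
// Sketch: IDF weight with the >50% document-frequency cutoff.
function idf(totalFiles: number, filesWithToken: number): number {
  if (filesWithToken > totalFiles / 2) return 0; // language noise: filtered
  return Math.log2(totalFiles / filesWithToken);
}
```

A token in 1 of 16 files weighs 4 bits; one in 12 of 16 files is discarded outright.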
Connection types
For each shared token between two files, the header/body positions determine the relationship:
| Source file | Other file | Connection type |
|---|---|---|
| Token in header (defines) | Token in body (references) | API dependency → |
| Token in body (references) | Token in header (defines) | API dependency ← |
| Token in both headers | Token in both headers | Shared interface ↔ |
| Token in body only | Token in body only | Conceptual coupling ↔ |
Connections are ranked by cumulative IDF weight of their shared tokens. The top 5 appear in the outline header. A file that shares rare identifiers like TurnExecutor and StreamingSession with another file gets a strong connection; a file that only shares config and handler gets a weak one (or none, after IDF filtering).
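The table reduces to a position check per shared token. A sketch (the type labels and the precedence when a token appears in both positions are assumptions):

```typescript
// Sketch: map a shared token's header/body positions in two files to a
// connection type. Precedence when a token is in both positions is assumed.
type Pos = { header: boolean; body: boolean };

function connectionType(src: Pos, other: Pos): string {
  if (src.header && other.body) return "API dependency ->";
  if (src.body && other.header) return "API dependency <-";
  if (src.header && other.header) return "shared interface <->";
  return "conceptual coupling <->";
}
```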
Persistence
The index persists to disk (in .strata/cross-file-index.json) and grows as the agent explores — it only contains files that have been read, but connections accumulate across sessions. An LRU eviction policy caps the index at 256 files. The index is invalidated and rebuilt lazily: when a file is re-analyzed (after an edit invalidates its cache), its tokens are re-extracted and the connection graph updates.
No parser. No language grammar. No AST. Entropy gradients, trigram similarity, and TF-IDF weights applied to raw text. The rest is budget management.