Preprint · v0.1 · 2026-05-25

ADHD: Parallel Divergent Ideation for Coding Agents

Tree-of-thought with cognitive-frame branching, generator–critic separation, and pruning.

Udit Akhouri Raj · github.com/UditAkhourii/adhd

Abstract

Large language model agents exhibit premature convergence: when asked to ideate on an open-ended design problem they default to the first plausible candidate and polish it, producing competent but forgettable output. We introduce ADHD, a method that fans out N parallel divergent branches under structurally different cognitive frames (e.g. regulator, speedrunner, biology, $0 budget), with no cross-branch context, then converges via a separate critic pass that scores and clusters every idea and deepens only the top-K survivors. ADHD differs from Tree-of-Thought along three load-bearing axes: branches are isolated rather than shared, branching is driven by vantage-point reframing rather than next-step variation, and the generator/critic split is enforced mechanically (separate LLM calls with opposite system prompts) rather than promised by a single context. Across six open-ended engineering problems judged by an independent LLM-as-judge, ADHD wins 5/6 against a single-shot baseline at the same model, with mean improvements of +5.17 in novelty, +4.17 in breadth, and +7.67 in trap detection on a 0–10 rubric. We argue ADHD is the right inference-time structure for creative, interdisciplinary, and design-shaped tasks where the failure mode is not a wrong answer but an obvious one.

Introduction

A modern LLM, prompted with "give me a few ways to do X", will almost always produce the same three answers a senior practitioner would. This is not a bug at the token level — those are the high-probability completions — but it is a failure at the task level whenever the user's purpose is to escape the high-probability answer. We call this failure mode premature convergence: the model evaluates as it generates, the early tokens anchor the late tokens, and the output is the centroid of the training distribution dressed up as a recommendation.

Premature convergence is most costly in exactly the regimes where ideation matters most: architecture decisions, API and SDK design, debugging fuzzy intermittent failures, refactor planning, naming, positioning, and any task whose deliverable is a set of viable options rather than a single answer. In these tasks the textbook answer is often the trap, and the interesting answer lives in what the original divergent-ideation skill calls "the awkward middle, past the first three".^[1]

Existing inference-time methods address adjacent problems. Chain-of-Thought (CoT)^[2] makes one head reason more slowly along one path, exposing the intermediate steps so the model does not skip them. Tree-of-Thought (ToT)^[3] makes one head search over candidate next-steps with backtracking. Self-consistency sampling^[4] draws multiple traces and majority-votes. Mixture-of-Agents^[5] and multi-agent debate^[6] sample multiple full responses and aggregate. All of these are valuable, but all of them optimise for correctness on a closed answer space. None of them is shaped right for the open-ended case where there is no ground truth, no test you can run on a partial solution, and the metric of interest is the range of non-obvious viable options.

We propose ADHD: a method that produces such a range by structurally preventing the generator from converging during divergence, and only converging in a separate, posterior critic pass. ADHD borrows the tree structure of ToT but replaces its branching driver (next-step search) with vantage-point reframing, and replaces ToT's intermingled generator/evaluator with two strictly separated LLM calls. The result, on the evaluations we report below, is a method that wins clearly against a single-shot baseline on novelty, breadth, and trap detection — the dimensions premature convergence destroys.

CoT makes one head think slower. ToT makes one head search wider. ADHD makes many heads think differently, in parallel, then has a critic pick.

Related work

Single-trace methods

Chain-of-Thought^[2] elicits intermediate reasoning by prompting (or fine-tuning) the model to "think step by step". It is decisively useful on multi-step problems with verifiable answers (arithmetic, symbolic reasoning) but it is a single linear trace: each step is conditioned on the previous, which is precisely the anchoring dynamic ADHD is designed to break. Self-Consistency^[4] samples many CoT traces and majority-votes the final answer; it improves robustness but assumes a discrete correct answer, which ideation does not have.

Multi-branch search methods

Tree-of-Thought^[3] generalises CoT to a tree of intermediate "thoughts" with explicit search (BFS or DFS) and an evaluator function that scores partial states. ToT is the closest neighbour of ADHD, and ADHD can be described as a ToT variant. The differences are not cosmetic: (i) ToT's branches share a single conversational context so anchoring still occurs across steps, (ii) ToT's branching driver is next-step variation (try numeric value x vs y), which produces nearby ideas rather than structurally different ones, and (iii) ToT typically interleaves generator and evaluator within the same model call.

Multi-agent and aggregation methods

Multi-Agent Debate^[6] has multiple instances critique each other across rounds; this can improve factuality but converges aggressively toward consensus, which is the opposite of what ideation needs. Mixture-of-Agents^[5] stacks layers of LLMs that read each other's outputs; it improves quality on benchmarks but, again, the per-layer aggregation step is designed to converge. ReAct^[7] interleaves reasoning with tool use, which is orthogonal to the ideation question we address.

Method-acting and persona prompting

A separate strand of work assigns the model a role — "you are an expert X" — to bias output style or domain knowledge. ADHD's cognitive frames superficially resemble this but differ in intent: frames are not chosen for expertise but for structural distortion. The "10-year-old" frame is not asked to be correct; it is asked to ignore convention. The "speedrunner" frame is not asked to be authoritative; it is asked to look for glitches. Frames are vantage-point operators, not credentials.

Source: the Divergent Ideation skill

ADHD operationalises a written skill on divergent ideation^[1] that prescribes a divergence/convergence loop with explicit anti-patterns ("convergence disguised as divergence", "weird-for-weird's-sake with no convergence", "refusing to commit"). Our contribution is to turn that prose into a mechanically enforceable runtime: separate LLM calls, isolated branches, and scoring-then-deepening rather than scoring-during-generating.

Method

ADHD is a two-phase loop with a hard mechanical separation between phases.

Fig. 1 — The ADHD loop. N isolated divergence calls (left) under different cognitive frames; a separate scoring/clustering call (centre); top-K deepening calls that expand the survivors (right). Branches do not share context during Phase 1.

Phase 1 — Diverge

Given a problem p, we select N frames F₁, …, F_N from a library of 15 (e.g. hardware engineer, regulator, biology, logistics, game design, markets, inversion, $0 budget, remove the load-bearing assumption, speedrunner, ant colony, 3am on-call). For each frame we make a fresh, parallel LLM call with:

a generator-only system prompt that forbids evaluation, ranking, or hedging,
the problem statement,
the frame's vantage prompt (e.g. "You think in latency, memory layout, and physical constraints. Re-ask this problem as if it were a hardware/firmware problem"),
a JSON-only output instruction asking for k short, distinct candidate ideas.

Critically, the N calls do not share context. The regulator branch never reads what the speedrunner branch produced. Anchoring is eliminated by construction, not by prompting.
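The isolation property can be illustrated with a minimal sketch. This is not the library's source: `CognitiveFrame`, `LlmCall`, and `divergeInIsolation` are hypothetical names, and the model call is injected as a plain function so the sketch makes no claim about any SDK's API. The point is only that each branch prompt is built from the problem and its own frame, and nothing else.

```typescript
// Hypothetical shapes for illustration; the library's real types may differ.
type CognitiveFrame = { id: string; vantagePrompt: string };
type LlmCall = (systemPrompt: string, userPrompt: string) => Promise<string>;

// Opening line as quoted in the paper's description of the divergence prompt.
const GENERATOR_SYSTEM_PROMPT =
  "You are in DIVERGENT mode. You are a generator, not a critic.";

// Runs one fresh call per frame, all in parallel. No branch ever receives
// another branch's output, so there is no shared history to anchor on.
async function divergeInIsolation(
  problem: string,
  frames: CognitiveFrame[],
  ideasPerFrame: number,
  callLlm: LlmCall
): Promise<Map<string, string[]>> {
  const branches = frames.map(
    async (frame): Promise<[string, string[]]> => {
      // The prompt is assembled from the problem and this frame only.
      const userPrompt = [
        `Problem: ${problem}`,
        `Frame: ${frame.vantagePrompt}`,
        `Return a JSON array of ${ideasPerFrame} short, distinct ideas. JSON only.`,
      ].join("\n");
      const raw = await callLlm(GENERATOR_SYSTEM_PROMPT, userPrompt);
      const parsed: unknown = JSON.parse(raw);
      const ideas = Array.isArray(parsed) ? parsed.map((x) => String(x)) : [];
      return [frame.id, ideas];
    }
  );
  return new Map(await Promise.all(branches));
}
```

Injecting `callLlm` keeps the sketch testable with a stub and makes the invariant easy to audit: the only inputs a branch can see are the arguments used to build `userPrompt`.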

The frame library is tagged (code, design, general, wild). When codeMode is enabled (the default) we bias selection toward engineering-relevant tags but always reserve one slot for a wild frame to preserve range.
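One way the tag-biased selection with a reserved wild slot might look is sketched below. The four tag names come from the text; everything else is an assumption made for illustration: the names `TaggedFrame` and `selectFrames`, the reading of "engineering-relevant" as the code and design tags, and the use of a shuffle to break ties.

```typescript
type FrameTag = "code" | "design" | "general" | "wild";
type TaggedFrame = { id: string; tags: FrameTag[] };

// Fisher-Yates shuffle on a copy, with an injectable random source for testing.
function shuffled<T>(items: T[], random: () => number): T[] {
  const copy = items.slice();
  for (let i = copy.length - 1; i > 0; i--) {
    const j = Math.floor(random() * (i + 1));
    const tmp = copy[i];
    copy[i] = copy[j];
    copy[j] = tmp;
  }
  return copy;
}

const hasTag = (frame: TaggedFrame, tag: FrameTag): boolean =>
  frame.tags.indexOf(tag) >= 0;

function selectFrames(
  library: TaggedFrame[],
  n: number,
  codeMode: boolean,
  random: () => number = Math.random
): TaggedFrame[] {
  const pool = shuffled(library, random);
  if (!codeMode) return pool.slice(0, n);

  // Assumption: "engineering-relevant" means tagged code or design.
  const isEngineering = (f: TaggedFrame) => hasTag(f, "code") || hasTag(f, "design");
  const wild = pool.filter((f) => hasTag(f, "wild"));
  const tame = pool.filter((f) => !hasTag(f, "wild"));
  const engineeringFirst = tame
    .filter(isEngineering)
    .concat(tame.filter((f) => !isEngineering(f)));

  // Reserve exactly one slot for a wild frame, then fill the rest with the
  // engineering-biased ordering. May return fewer than n if the library is small.
  const reserved = wild.slice(0, Math.min(1, n));
  return engineeringFirst.slice(0, n - reserved.length).concat(reserved);
}
```

Partitioning rather than weighting keeps the behaviour easy to reason about: in code mode the wild slot is filled whenever the library contains a wild frame, regardless of how many engineering frames are available.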

Phase 2 — Focus

With the pool of N × k ideas in hand, we run three further steps:

Score. A single critic call scores every idea on three axes (novelty, viability, fit), each 0–10, and may flag any idea as a trap with a one-line reason ("looks attractive but is …"). The critic uses a system prompt that explicitly asks for adversarial reading.
Cluster. A second critic call groups ideas into 3–6 clusters by their underlying angle, not their surface keywords ("remove-the-server plays", "cache-shaped plays", "batched-time-window plays"). This step surfaces the shape of the candidate space.
Deepen. For the top-K ideas (ranked by weighted score, excluding traps), we make K parallel focus calls. Each produces (a) a 4–8 sentence sketch of how the idea would work, (b) the load-bearing risk, (c) the first concrete step a builder would take, and (d) 3–5 child ideas (variations, hybrids, unlocks). These child ideas become the second-level connections — the "connecting the dots" pass.
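The hand-off from scoring to deepening can be sketched as a small ranking function. The three axes and the trap flag follow the description above; the type names, the function names, and in particular the equal weights are assumptions, since the text says "weighted score" without stating the weights.

```typescript
type ScoredIdea = {
  text: string;
  novelty: number; // 0-10
  viability: number; // 0-10
  fit: number; // 0-10
  trapReason?: string; // set when the critic flags the idea as a trap
};
type AxisWeights = { novelty: number; viability: number; fit: number };

// Assumption: equal weights. The actual weighting is not specified in the text.
const EQUAL_WEIGHTS: AxisWeights = { novelty: 1, viability: 1, fit: 1 };

function weightedScore(idea: ScoredIdea, w: AxisWeights): number {
  return w.novelty * idea.novelty + w.viability * idea.viability + w.fit * idea.fit;
}

// Returns the top-K non-trap ideas, best first. These are the ideas that
// would each get their own deepen call.
function pickSurvivors(
  ideas: ScoredIdea[],
  topK: number,
  w: AxisWeights = EQUAL_WEIGHTS
): ScoredIdea[] {
  return ideas
    .filter((idea) => idea.trapReason === undefined) // traps are never deepened
    .sort((a, b) => weightedScore(b, w) - weightedScore(a, w))
    .slice(0, topK);
}

// Traps are kept for the final report rather than discarded.
function trapList(ideas: ScoredIdea[]): ScoredIdea[] {
  return ideas.filter((idea) => idea.trapReason !== undefined);
}
```

Filtering traps before ranking matters: a trap can score highly on novelty and fit, so ranking first and filtering afterwards would let it displace a viable idea from the top-K.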

The final output is the wide set (clustered), a 2–4 idea shortlist with the non-obvious-but-viable pick flagged explicitly, the trap list, the deepened sketches with their child ideas, and one wildcard provocation drawn from the highest-novelty leaf.

Why the separations matter

Three invariants are load-bearing. Removing any of them collapses ADHD into a method that already exists.

Isolation, not search. CoT and ToT branches share a context window; by step 4 the model has anchored on what it wrote in steps 1–3. ADHD's N branches are N distinct LLM calls with no shared history. Anchoring is mechanically impossible across branches.
Frames, not next-step variation. ToT typically varies the next move within a search problem. ADHD varies the entire vantage point of the generator. It is not "what step comes next from here"; it is "re-ask the whole question as if you were an immune system". This produces structurally different ideas, not nearby ones, which is the prerequisite for surfacing cross-domain transplants.
Generator–critic split is mechanical. The generator system prompt forbids evaluation. The critic system prompt forbids generation. They are different calls. A single model evaluating as it generates is exactly the "critic strangles the generator" failure the original skill warns against^[1].

Why prompted alternatives don't replicate parallel divergence

A natural objection is that ADHD merely spends more tokens: if a single agent is instructed to "consider alternatives" or "list five options first", is the result not the same wide set? It is not, and the reason is the isolation invariant above rather than a difference of degree. A single chain instructed to enumerate alternatives generates them into one shared context, sequentially: it anchors on whichever candidate it emits first and reasons forward from there, so the attention pattern pulls each subsequent "alternative" toward the first even while nominally exploring others. The enumerated options are variations on a theme, not structurally distinct angles — "list N options" is Chain-of-Thought with a numbered list, and inherits its anchoring. ADHD's N branches share no context during divergence, so there is no first answer to anchor on; distinctness is produced by construction (isolation plus frame distortion) rather than requested and hoped for. Put compactly: prompted enumeration varies the output; ADHD varies the generator, and only the latter escapes the anchor.

We have informal external corroboration: a practitioner reports that running the same prompt five times across parallel agents surfaces genuinely distinct ideas roughly three times as often as a single agent prompted to "list 5 options first". We report this as directional rather than measured: it is one practitioner's anecdote, not a controlled result. A controlled comparison of parallel-N branches against a single-prompt-list-N condition on the evaluation suite would quantify the gap directly; we leave it as future work.

Implementation

We implement ADHD as a Node/TypeScript library on top of the Claude Agent SDK^[8]. The package ships a CLI (adhd "<problem>"), a programmatic API (run(opts) → RunResult), and a frame library that is extensible in five lines per frame. A default run uses N = 5 frames, k = 6 ideas per frame, K = 3 deepened survivors, concurrency 4. Total LLM calls per run: N divergences + 1 score + 1 cluster + K deepens, which is 10 at the defaults. Call count is a poor proxy for cost, however; see Cost for the substrate-multiplied token accounting.
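The call arithmetic and the concurrency cap can be made concrete with a short sketch. Both `callsPerRun` and `mapWithConcurrency` are hypothetical helpers written for illustration, not functions exported by the package; the limiter is one common way to cap in-flight promises and may differ from what the library does internally.

```typescript
// N divergence calls + 1 score + 1 cluster + K deepen calls.
function callsPerRun(n: number, k: number): number {
  return n + 1 + 1 + k;
}

// Runs worker over items with at most `limit` calls in flight at once,
// preserving input order in the result.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  worker: (item: T) => Promise<R>
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0;

  // Each lane pulls the next unclaimed index until none remain.
  async function lane(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index]);
    }
  }

  const lanes: Promise<void>[] = [];
  for (let i = 0; i < Math.min(limit, items.length); i++) {
    lanes.push(lane());
  }
  await Promise.all(lanes);
  return results;
}

console.log(callsPerRun(5, 3)); // 5 + 1 + 1 + 3
```

With N = 5 and a cap of 4, the fifth divergence call starts as soon as any of the first four finishes, which is why a run's latency is closer to two call durations than five for the divergence phase.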

Each phase uses a system prompt tuned for its posture. Divergence prompts begin with "You are in DIVERGENT mode. You are a generator, not a critic" and enumerate constraints (JSON only, no prose, no ranking, the first three obvious answers are banned, push past them). The scoring prompt begins with "You are in CONVERGENT mode. You are now the critic" and supplies the rubric. The deepen prompt begins with "You are in FOCUS mode". These prompts are designed to be self-evidently incompatible, so the model cannot drift between them within a single call.

The implementation is roughly 600 lines of TypeScript and is released under MIT licence at github.com/UditAkhourii/adhd. The package is published to npm as adhd-agent and is installable with npm install adhd-agent (library) or npm install -g adhd-agent (CLI binary adhd).

Evaluation

Setup

We compare ADHD against a single-shot baseline at the same underlying model. The baseline receives a senior-engineer system prompt and the problem statement and is asked to produce a useful answer with approaches, tradeoffs, and a recommendation. This baseline is deliberately strong: it is what an experienced practitioner would actually do at a chat prompt.

Problems

Six open-ended engineering problems were used, chosen to span systems, distributed systems, UX/reliability, debugging, refactor, and naming:

lru-100ms — thread-safe LRU cache surviving restart with ≤100 ms of write loss.
llm-hang-cli — retry/timeout/UX strategy for a CLI whose LLM occasionally hangs 90 s.
rate-limit-leader — rate limiter correct across leader election with no warm handoff.
fuzzy-bug — 0.1% intermittent API timeouts, no obvious pattern. Generate hypothesis classes.
monolith-split — decomposition strategy for a 200k-line Rails monolith.
naming-feature-flag — names for a feature-flag service signalling control and reversibility.

Judging

Each pair (ADHD output, baseline output) is scored by an independent LLM-as-judge call with a "skeptical staff engineer" system prompt. The judge sees both outputs blinded as A/B, in an order randomised per problem (the assignment is recorded for later de-biasing), and scores on five dimensions: breadth (range of structurally distinct angles), novelty (non-obvious-but-viable ideas), trap_detection (does it name ideas that look good but aren't, with reasons), actionability (does the top pick have a sketch + named risk + first concrete step), and builder_usefulness (which is more useful to the engineer who actually has to ship). Each dimension is 0–10. The judge then declares an overall winner of A, B, or tie, and writes a one-line summary.

To reduce same-model bias, the judge system prompt is explicit about adversarial reading and the rubric. A/B labels are de-anonymised only after all six runs are complete. We acknowledge that LLM-as-judge can favour outputs of similar surface character to its own training distribution; we address this under Limitations.
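The blinding protocol described in this section can be sketched in a few lines. This is an illustration under assumptions, not the evaluation harness itself: `blindPair`, `unblind`, and the `BlindedPair` shape are hypothetical names, and the random source is injectable so the assignment can be reproduced.

```typescript
type Arm = "adhd" | "baseline";
type BlindLabel = "A" | "B";
type BlindedPair = {
  shown: { A: string; B: string }; // what the judge sees
  key: { A: Arm; B: Arm }; // recorded separately, never shown to the judge
};

// Randomly decides which arm is presented as A and records the assignment.
function blindPair(
  adhdOutput: string,
  baselineOutput: string,
  random: () => number = Math.random
): BlindedPair {
  const adhdIsA = random() < 0.5;
  return adhdIsA
    ? { shown: { A: adhdOutput, B: baselineOutput }, key: { A: "adhd", B: "baseline" } }
    : { shown: { A: baselineOutput, B: adhdOutput }, key: { A: "baseline", B: "adhd" } };
}

// Maps the judge's blinded verdict back to an arm once judging is complete.
function unblind(verdict: BlindLabel | "tie", key: BlindedPair["key"]): Arm | "tie" {
  return verdict === "tie" ? "tie" : key[verdict];
}
```

Keeping `key` out of the judge prompt and applying `unblind` only after all verdicts are in mirrors the order of operations described above.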

Results

Aggregate results across the six problems (mean score per dimension):

Dimension	ADHD	Baseline	Δ
breadth	9.00	4.83	+4.17
novelty	7.83	2.67	+5.17
trap_detection	9.50	1.83	+7.67
actionability	9.50	6.50	+3.00
builder_usefulness	7.67	6.83	+0.83

Per-problem overall winners: ADHD wins on lru-100ms, rate-limit-leader, fuzzy-bug, monolith-split, and naming-feature-flag. The baseline wins on llm-hang-cli. Final tally: ADHD 5W / 1L / 0T. Full per-problem verdicts and transcripts are committed to the repository as EVALS.md and bench/results.json.

Analysis

Where ADHD wins

The largest delta is trap detection (+7.67). The baseline rarely names ideas that look good but are wrong; ADHD's scoring pass explicitly flags traps with reasons. Two examples from the evaluation runs:

On llm-hang-cli, ADHD flagged a "multi-rail redundancy" idea (fire identical requests to 2–3 endpoints simultaneously) as a trap because it doubles or triples API costs and may violate rate limits — a real concern any builder would otherwise discover only at the bill.
On lru-100ms, ADHD flagged a "treat disk as NVRAM with timer-interrupt commits" idea as a trap because timer-interrupt precision under load requires a kernel-level implementation most user-space apps cannot guarantee. The baseline included a similar idea uncritically.

The novelty delta (+5.17) is driven by the cross-domain frames. The most striking example, on llm-hang-cli, is a first-byte vs chunk-idle dual timer design — distinguishing NEVER_CONNECTED, STALLED_MID_STREAM, and COMPLETED_SLOW failure modes — which the baseline did not surface and which is, in our judgement, the correct architecture for streaming LLM clients. It arose from the regulator frame's question "what must be distinguishable in the audit trail?". Similarly, on fuzzy-bug, the biology frame surfaced a "fever-response circuit-breaker" idea that resolves to progressive degradation tiers (Opus → Sonnet → Haiku → cached) — concrete, shippable, and not in the baseline.

The breadth delta (+4.17) reflects the cluster pass: the baseline tends to list four or five variations on a single underlying angle, while ADHD surfaces 6–9 structurally different angles per problem.

Where ADHD loses

The one loss, on llm-hang-cli, is informative. The judge wrote: "B [ADHD] explores vastly more creative territory and expertly identifies traps, but A [baseline] delivers a pragmatic, immediately implementable solution that an engineer can ship today." ADHD scored higher on breadth, novelty, and trap detection on this problem, but lost on builder_usefulness — the judge preferred the baseline's tighter, polished, ship-today shape over ADHD's richer but rougher pile.

This matches the failure mode we expect. When the problem is well-understood with a known good answer, a single polished answer beats a wide set with the same answer buried in it. ADHD pays its cost in presentation overhead; that cost is worth it precisely when the wide set contains ideas the polished answer missed. On llm-hang-cli the baseline already knew the right answer; on the other five problems it did not.

Cost

A default ADHD run uses 10 LLM calls (5 divergence + 1 score + 1 cluster + 3 deepen) versus 1 for the baseline. Wall-clock latency at concurrency 4 is typically 30–90 s. Call count, however, understates token cost. Because each divergence branch is an isolated context, the base substrate that prefixes every call (the system prompt and, when ADHD runs as a skill inside an agent session, the host's context and tool definitions) is paid once per branch, before any divergence token is generated. The honest cost is therefore:

cost ≈ N × (base_context + branch_output) + critic_context + K × deepen_context,

where the N × multiplier on base_context dominates whenever the substrate is large. Two regimes follow: the standalone library carries a minimal substrate (problem + frame only), so its premium is close to the naive 5–10× call ratio; the skill-inside-an-agent regime re-loads the host substrate on every branch, so its premium is higher and grows with N. We frame this honestly: ADHD is for decision points, not inner loops. The right mental model is to spend a small, computable token premium to widen a high-stakes decision. The dollar figure depends on N, the base substrate, and current per-token pricing, and should be computed per deployment rather than quoted as a single headline number.
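The cost expression can be evaluated directly to see how the two regimes diverge. The sketch below implements the formula term by term; the token counts are made-up round numbers chosen only to illustrate the shape of the result and are not measurements from any run.

```typescript
type CostInputs = {
  n: number; // number of divergence branches (N)
  baseContext: number; // tokens of substrate prefixed to every branch
  branchOutput: number; // tokens generated per branch
  criticContext: number; // tokens for the critic term of the formula
  k: number; // number of deepened survivors (K)
  deepenContext: number; // tokens per deepen call
};

// cost ≈ N × (base_context + branch_output) + critic_context + K × deepen_context
function estimateTokens(c: CostInputs): number {
  return c.n * (c.baseContext + c.branchOutput) + c.criticContext + c.k * c.deepenContext;
}

// Illustrative inputs only: everything is held fixed except the substrate size.
function withSubstrate(baseContext: number): CostInputs {
  return {
    n: 5,
    baseContext,
    branchOutput: 600,
    criticContext: 5000,
    k: 3,
    deepenContext: 1500,
  };
}

const standalone = estimateTokens(withSubstrate(500)); // small substrate
const insideAgent = estimateTokens(withSubstrate(40000)); // large host context

console.log(standalone); // 15000
console.log(insideAgent); // 212500
```

In the second case the N × base_context term alone accounts for 200,000 of the 212,500 tokens, which is the sense in which the substrate dominates.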

Discussion and limitations

Limitations

Same-model judging. Our LLM-as-judge runs on the same model family as the generator. We mitigated with adversarial system prompts and randomised A/B order, but we cannot exclude familiarity bias entirely. A useful follow-up is cross-model judging (e.g. judge with a different vendor's model) and human ratings on a held-out subset.

Small problem set. Six problems is enough to see consistent direction but not enough to make strong quantitative claims. We released the harness so the set can be extended; adding a new problem is a four-line change.

Frame library is hand-authored. The 15 frames in the current library reflect our judgement about which vantage points produce distinct outputs on engineering problems. A frame can fail silently — producing paraphrases of another frame's ideas — without the harness catching it. Frame-quality evaluation is future work.

Confounded by deepen quality. The actionability delta is partly explained by the deepen pass, which gives ADHD a structural advantage (sketch + risk + first step) that the baseline's free-form prose does not enforce. A fairer ablation would equip the baseline with the same output schema; we expect ADHD's lead to shrink on this dimension but not on breadth, novelty, or trap detection.

Domain. All six problems are engineering-shaped. The frame library is biased toward engineering when codeMode is enabled. Whether ADHD's wins generalise to product strategy, scientific brainstorming, or pure creative writing is plausible but not demonstrated here.

When to use ADHD

ADHD is the right tool when (i) the problem is open-ended, (ii) the cost of the obvious answer being wrong is high, (iii) the user cannot articulate a ground truth in advance, and (iv) breadth and trap detection are worth a 5–10× LLM-call premium. It is the wrong tool for lookup questions, bug fixes with a known root cause, and any task where the answer is one search query away. The one-sentence test we propose: if a junior would Google it and find the answer, baseline wins; if a senior would say "hm, let me think about this differently for a minute", that is the moment ADHD replaces.

Use inside larger agents

The most interesting application surface is not the standalone CLI but a callable subroutine inside larger coding agents, invoked at decision points. A planning agent at a branch point with high uncertainty, a code-review agent asked "what could go wrong here", a debugging agent stuck after three patches, and a test-generation agent searching for adversarial inputs all benefit from a pause-and-widen step before committing to the next move. The library API run({...}) is designed for this.
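A pause-and-widen gate for the debugging case might look like the sketch below. Everything here is hypothetical: `Widen` stands in for a thin wrapper around the library's run call (whose option and result fields are not specified in this paper), and `DebugState`, `planNextMove`, and the threshold constant are invented for illustration. The threshold of three comes from the "stuck after three patches" example.

```typescript
// Hypothetical stand-in for a wrapper around the library's run(opts) call.
// It takes a problem statement and resolves to a shortlist of ideas.
type Widen = (problem: string) => Promise<string[]>;

type DebugState = { bugSummary: string; failedPatches: number };

type NextMove = { kind: "patch" } | { kind: "widen"; options: string[] };

// Taken from the text's example of a debugging agent stuck after three patches.
const STUCK_AFTER_PATCHES = 3;

// Keeps patching while attempts are cheap; widens only once the agent is stuck,
// so the extra calls are spent at a decision point rather than in the inner loop.
async function planNextMove(state: DebugState, widen: Widen): Promise<NextMove> {
  if (state.failedPatches < STUCK_AFTER_PATCHES) {
    return { kind: "patch" };
  }
  const options = await widen(`Hypothesis classes for: ${state.bugSummary}`);
  return { kind: "widen", options };
}
```

Passing `widen` in as a parameter keeps the gate independent of how the host agent actually invokes the library, and makes the trigger condition the only thing the sketch commits to.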

Conclusion

We have argued that LLM coding agents systematically converge prematurely on open-ended ideation tasks, and that this failure is structural rather than capability-bounded. We presented ADHD, an inference-time method that prevents convergence during a divergence phase by running N isolated parallel branches under cognitive-frame distortions, and converges in a separate critic pass that scores, clusters, and deepens only the survivors. ADHD differs from existing tree-of-thought methods along three load-bearing axes: branch isolation, frame-based branching, and mechanical generator–critic separation. On six open-ended engineering problems, ADHD wins 5/6 against a single-shot baseline at the same model, with the largest gains concentrated in trap detection, novelty, and breadth — the dimensions premature convergence destroys. The implementation is small, open-source, and intended to be used as a subroutine inside larger agents at decision points where the cost of the obvious answer is high. The full release is available at github.com/UditAkhourii/adhd.

References

[1] Divergent Ideation skill (source spec). Reproduced in SKILL.md at the project repository.
[2] Wei, J., Wang, X., Schuurmans, D., et al. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS 2022. arXiv:2201.11903
[3] Yao, S., Yu, D., Zhao, J., et al. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. NeurIPS 2023. arXiv:2305.10601
[4] Wang, X., Wei, J., Schuurmans, D., et al. Self-Consistency Improves Chain of Thought Reasoning in Language Models. ICLR 2023. arXiv:2203.11171
[5] Wang, J., Wang, J., Athiwaratkun, B., et al. Mixture-of-Agents Enhances Large Language Model Capabilities. 2024. arXiv:2406.04692
[6] Du, Y., Li, S., Torralba, A., et al. Improving Factuality and Reasoning in Language Models through Multiagent Debate. ICML 2024. arXiv:2305.14325
[7] Yao, S., Zhao, J., Yu, D., et al. ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023. arXiv:2210.03629
[8] Anthropic. Claude Agent SDK documentation. docs.claude.com