Active Research · 1.74B Parameters · TPU-trained

White Room Engine V3

A transformer trained from scratch on synthetic symbolic reasoning puzzles. No natural language. No memorization. Just logic. Can it actually learn to reason?

1.74B parameters (AttnRes architecture)
200 static variables (v0–v199)
18/30 types at 100% exact match
203K pre-training steps on TPU

How it works

The model uses 200 static symbolic variables (v0–v199). The same names appear in every puzzle, randomly assigned to roles. It can't memorize associations — it has to learn the abstract reasoning structure itself.
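A minimal sketch of what per-puzzle role assignment could look like. The role names and the `assign_roles` helper are hypothetical; only the pool of 200 names (v0–v199) comes from the description above.

```python
import random

NUM_VARIABLES = 200  # v0..v199, as stated in the text


def assign_roles(roles, rng):
    """Map each puzzle role to a distinct, randomly drawn variable name."""
    indices = rng.sample(range(NUM_VARIABLES), k=len(roles))
    return {role: f"v{i}" for role, i in zip(roles, indices)}


rng = random.Random(0)
roles = ["premise_a", "premise_b", "intermediate", "goal"]  # hypothetical role names
puzzle_1 = assign_roles(roles, rng)
puzzle_2 = assign_roles(roles, rng)
print(puzzle_1)
print(puzzle_2)
```

Because each puzzle draws a fresh assignment, a given name such as v23 carries no stable meaning across puzzles.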

Architecture

AttnRes — Attention over Depth Residuals

Instead of fixed residual accumulation, AttnRes uses softmax attention over depth. Each layer can dynamically route information from any previous block, enabling multi-hop reasoning chains without information dilution.
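The description above does not say how AttnRes scores earlier blocks, so the following is only an illustrative sketch of softmax attention over depth. The per-layer `query` vector and the scaled dot-product scoring are assumptions of this sketch, not the project's implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def depth_attention(block_outputs, query):
    """Mix earlier block outputs with softmax weights instead of summing them.

    block_outputs: one vector per previous block (hypothetical layout).
    query: a per-layer vector, assumed to be learned.
    """
    dim = len(query)
    scores = [sum(q * h for q, h in zip(query, out)) / math.sqrt(dim)
              for out in block_outputs]
    weights = softmax(scores)
    mixed = [sum(w * out[d] for w, out in zip(weights, block_outputs))
             for d in range(dim)]
    return mixed, weights


blocks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed, weights = depth_attention(blocks, query=[2.0, 0.0])

# Fixed residual accumulation, for contrast: every block contributes equally.
fixed_residual = [sum(out[d] for out in blocks) for d in range(2)]
print(weights, mixed, fixed_residual)
```

In this toy input the second block scores lowest against the query, so it receives the smallest weight, whereas the fixed sum treats all three blocks the same.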

Training

JAX/Flax on Google TPU Research Cloud

8-chip data parallelism on TPU v6e. 203,450 pre-training steps, followed by SFT (2 epochs, 147K sequences) and GRPO reinforcement learning (v1–v8). All training data generated and verified by formal solvers — zero label errors.

Design

Zero memorization by construction

200 static variables, randomly assigned per puzzle. No entity names, no natural language, no world knowledge. If the model gets the right answer, it's because it followed the logical chain — not because it remembered something from training.

What works: 100% exact match

These reasoning types are solved perfectly. The model produces the exact correct state for every variable, every time.

100% EM

Backward Chaining

Given a query, work backwards through the rule chain to determine if it can be proven. Requires tracking which facts are needed and whether they're available.

[RULE] [IF] v8 [TRUE] [THEN] v7 [TRUE]
[RULE] [IF] v154 [TRUE] [AND] v6 [TRUE] [THEN] v143 [TRUE]
[RULE] [IF] v143 [TRUE] [AND] v179 [TRUE] [THEN] v139 [TRUE]
[RULE] [IF] v139 [TRUE] [THEN] v107 [TRUE]
[RULE] [IF] v107 [TRUE] [THEN] v50 [TRUE]
... 5 more rules
[GIVEN] v6 [TRUE], v179 [TRUE], v8 [TRUE], v55 [FALSE]
[QUERY] v154
── Reasoning trace ──
[TRC] 1 v55 [FALSE] [GIVEN]
[TRC] 2 v23 [FALSE] [BACKTRACK]
[TRC] 3 v166 [FALSE] [BACKTRACK]
[TRC] ... chain continues ...
[TRC] 11 v154 [FALSE] [BACKTRACK]
Answer: v154 [FALSE]

The model traces backwards through 11 steps, determining that v154 cannot be proven TRUE because the chain depends on v55 which is given as FALSE.
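A minimal backward-chaining sketch of the kind of procedure this puzzle type exercises. The rule set below is a hypothetical toy example, not the elided rules of the puzzle above, and treating anything unprovable as FALSE is an assumption of the sketch.

```python
def prove(goal, rules, facts, visiting=None):
    """Return True if `goal` can be derived as TRUE by backward chaining.

    rules: list of (premises, conclusion) pairs; every literal means "[TRUE]".
    facts: dict of given values, e.g. {"v1": True, "v5": False}.
    """
    visiting = visiting or set()
    if goal in facts:
        return facts[goal]
    if goal in visiting:  # guard against cyclic rule chains
        return False
    visiting = visiting | {goal}
    for premises, conclusion in rules:
        if conclusion == goal and all(
            prove(p, rules, facts, visiting) for p in premises
        ):
            return True
    return False


# Hypothetical toy puzzle.
rules = [
    (["v1", "v2"], "v3"),
    (["v3"], "v4"),
    (["v5"], "v6"),
]
facts = {"v1": True, "v2": True, "v5": False}

print(prove("v4", rules, facts))  # True: v1 and v2 give v3, which gives v4
print(prove("v6", rules, facts))  # False: its only support, v5, is given as FALSE
```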

100% EM

Causal Reasoning

Multi-hop causal chains with branching effects. The model must propagate causes through a network of rules, handling both positive and negative effects.

[RULE] [IF] v138 [TRUE] [AND] v3 [TRUE] [THEN] v23 [TRUE]
[RULE] [IF] v23 [TRUE] [THEN] v71 [FALSE]
[RULE] [IF] v63 [FALSE] [THEN] v166 [TRUE]
[RULE] [IF] v166 [TRUE] [THEN] v140 [FALSE]
... 6 more rules
[GIVEN] v138 [TRUE], v3 [TRUE], v118 [FALSE], v60 [TRUE]
[QUERY] v140
── Reasoning trace ──
[TRC] 1 v138 [TRUE] + v3 [TRUE] → v23 [TRUE] [YIELDS]
[TRC] 2 v23 → v71 [FALSE] [YIELDS]
[TRC] 3 v63 [FALSE] → v166 [TRUE] [YIELDS]
[TRC] 4 v166 → v140 [FALSE] [YIELDS]
Answer: v140 [FALSE]

A 4-hop causal chain: v138+v3 causes v23, which negates v71, which (via default) makes v63 FALSE, which triggers v166, which finally negates v140.
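The four rules shown can be propagated with a small forward-chaining loop. This sketch uses only the rules visible in the example (the six elided rules are left out) and assumes that a variable which is neither given nor set by any rule defaults to FALSE, which is how v63 is treated here.

```python
def propagate(rules, given):
    """Forward-chain signed rules until nothing changes.

    rules: list of (conditions, (var, value)); conditions maps var -> required value.
    Assumption: a variable that is neither given nor the effect of any rule
    defaults to FALSE.
    """
    state = dict(given)
    settable = {effect[0] for _, effect in rules}
    mentioned = {var for cond, _ in rules for var in cond}
    for var in mentioned - settable - set(state):
        state[var] = False  # default value (assumed)
    changed = True
    while changed:
        changed = False
        for cond, (var, value) in rules:
            if var not in state and all(state.get(c) == v for c, v in cond.items()):
                state[var] = value
                changed = True
    return state


rules = [
    ({"v138": True, "v3": True}, ("v23", True)),
    ({"v23": True}, ("v71", False)),
    ({"v63": False}, ("v166", True)),
    ({"v166": True}, ("v140", False)),
]
given = {"v138": True, "v3": True, "v118": False, "v60": True}

state = propagate(rules, given)
print(state["v140"])  # False
```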

100% EM

Spatial Flood Fill

Graph reachability via flood propagation. Given active source nodes and an adjacency graph, determine which nodes become active through spreading activation.

[RULE] [IF] [ACTIVE] [AND] [ADJACENT] [THEN] [TRUE]
[RULE] v49 [ADJACENT] v103
[RULE] v103 [ADJACENT] v75
[RULE] v75 [ADJACENT] v107
[RULE] v49 [ADJACENT] v173
[RULE] v173 [ADJACENT] v97
[RULE] v97 [ADJACENT] v107
[GIVEN] v49 [TRUE]
[QUERY] v107
── Reasoning trace ──
[TRC] 1 v49 [TRUE] [GIVEN]
[TRC] 2 v103 [TRUE] [YIELDS]
[TRC] 3 v173 [TRUE] [YIELDS]
[TRC] 4 v75 [TRUE] [YIELDS]
[TRC] 5 v97 [TRUE] [YIELDS]
[TRC] 6 v107 [TRUE] [YIELDS]
Answer: v107 [TRUE]

The model floods activation from v49 through the graph, reaching v107 via two independent paths (v49→v103→v75→v107 and v49→v173→v97→v107).
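The same propagation can be written as a breadth-first flood over the adjacency rules. In this sketch each [ADJACENT] rule is read as a directed edge from the first variable to the second, which is an assumption; the `flood` helper is illustrative.

```python
from collections import deque


def flood(edges, sources):
    """Return nodes in the order they become active under breadth-first spreading."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, []).append(b)  # directed a -> b (assumed)
    active = set(sources)
    order = list(sources)
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, []):
            if nxt not in active:
                active.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


edges = [
    ("v49", "v103"), ("v103", "v75"), ("v75", "v107"),
    ("v49", "v173"), ("v173", "v97"), ("v97", "v107"),
]
order = flood(edges, ["v49"])
print(order)  # ['v49', 'v103', 'v173', 'v75', 'v97', 'v107']
```

The breadth-first order matches the order of activations in the reasoning trace above.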

The open challenge: 0% exact match

These types have high partial scores (F1 0.63–0.95) — the model gets the format right but makes reasoning errors on the hardest chains. Solving these is the current research frontier.

0% EM · F1 0.72

Labyrinth Navigation

Long-range path finding through a maze of conditional rules. Requires tracking 10–30 hops through a noisy graph where most paths are dead ends. The model must find the single valid derivation chain among dozens of distractors.

[RULE] [IF] v61 [TRUE] [AND] v184 [TRUE] [THEN] v104 [TRUE]
[RULE] [IF] v104 [TRUE] [AND] v176 [TRUE] [THEN] v132 [TRUE]
[RULE] [IF] v132 [TRUE] [AND] v48 [TRUE] [THEN] v21 [TRUE]
... 18 more rules (noise + signal mixed)
[GIVEN] v61 [TRUE], v184 [TRUE], v48 [TRUE], v176 [TRUE]
[QUERY] v177
── Required trace (14 hops) ──
[TRC] 1 v61 + v184 → v104 [YIELDS]
[TRC] 2 v104 + v176 → v132 [YIELDS]
[TRC] 3 v132 + v48 → v21 [YIELDS]
... 8 more hops ...
[TRC] 12 v121 + v48 → v103 [YIELDS]
[TRC] 13 v103 + v184 → v177 [YIELDS]
Answer: v177 [TRUE]

Why it fails: 14 sequential hops with alternating key variables (v184, v176, v48). Errors compound: if each token is correct with independent probability 0.95, a 150-token state is entirely correct with probability 0.95^150 ≈ 0.05%, which rounds to 0% exact match.
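The compounding effect is easy to check numerically. The sketch assumes token errors are independent, which is a simplification; both helper names are illustrative.

```python
def exact_match_probability(per_token_accuracy, num_tokens):
    """Probability that every token is correct, assuming independent errors."""
    return per_token_accuracy ** num_tokens


def required_accuracy(target_exact_match, num_tokens):
    """Per-token accuracy needed to reach a target exact-match rate."""
    return target_exact_match ** (1.0 / num_tokens)


p = exact_match_probability(0.95, 150)
needed = required_accuracy(0.5, 150)
print(f"exact match at 95% per token: {p:.4%}")
print(f"per-token accuracy for 50% exact match: {needed:.4%}")
```

Under this assumption, exact match on 150 tokens stays below 0.05%, and reaching even 50% exact match would need per-token accuracy above 99.5%.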

0% EM · F1 0.68

Elementary Cellular Automata (ECA)

Simulate cellular automaton rules over multiple timesteps. Each cell's next state depends on its neighborhood (left, center, right). Requires parallel state tracking across all cells simultaneously.

[RULE] [IF] [FALSE] [FALSE] [TRUE] [THEN] [BECOMES] [TRUE]
[RULE] [IF] [TRUE] [FALSE] [FALSE] [THEN] [BECOMES] [TRUE]
[RULE] [IF] [TRUE] [TRUE] [TRUE] [THEN] [BECOMES] [TRUE]
... 5 more neighborhood rules
[GIVEN] v33 [TRUE], v93 [TRUE], v135 [TRUE], v128 [TRUE], v138 [FALSE], v69 [FALSE]
[QUERY] state after 2 timesteps
── Required: simulate 6 cells × 2 steps ──
Step 0: [T, F, T, T, T, F]
Step 1: apply rules to each neighborhood...
Step 2: apply rules again...
Answer: final cell states after 2 iterations

Why it fails: ECA requires parallel computation, since all cells update simultaneously based on their neighbors. Chain-of-thought serializes this, forcing the model to track multiple cells' states in working memory. This is the "CoT Tax": sequential reasoning hurts parallel operations.
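A sketch of a simultaneous update, where every cell reads only the previous row. Rule 90 (next state = left XOR right) is used as a stand-in and is not the rule set of the puzzle above, whose five remaining rules are elided; the fixed FALSE boundary is also an assumption.

```python
def eca_step(cells, rule_table, boundary=False):
    """Update all cells at once from the previous row (fixed boundary assumed)."""
    padded = [boundary] + list(cells) + [boundary]
    return [rule_table[(padded[i], padded[i + 1], padded[i + 2])]
            for i in range(len(cells))]


def simulate(cells, rule_table, steps):
    """Return the full history of rows, starting with the initial one."""
    history = [list(cells)]
    for _ in range(steps):
        history.append(eca_step(history[-1], rule_table))
    return history


# Stand-in rule table: Rule 90, next = left XOR right (not the puzzle's rules).
RULE_90 = {
    (left, center, right): left != right
    for left in (False, True)
    for center in (False, True)
    for right in (False, True)
}

T, F = True, False
history = simulate([T, F, T, T, T, F], RULE_90, steps=2)
for step, row in enumerate(history):
    print(step, ["T" if cell else "F" for cell in row])
```

Each new row is built entirely from the old one, so no cell ever sees a neighbor's updated value within the same step.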

0% EM · F1 0.87

Planning with Preconditions

Find a valid action sequence where each action has preconditions that must be satisfied before execution. Requires forward search through a state space with dependencies between actions.

[RULE] [IF] v113 [TRUE] [THEN] v194 [TRUE]
[RULE] [IF] v113 [TRUE] [AND] v194 [TRUE] [THEN] v13 [TRUE]
[RULE] [IF] v13 [TRUE] [AND] v113 [TRUE] [THEN] v33 [TRUE]
[RULE] [IF] v113 [TRUE] [AND] v194 [TRUE] [THEN] v140 [TRUE]
[GIVEN] v73 [TRUE], v194 [TRUE], v52 [FALSE]
[QUERY] v13
── The trap ──
v13 requires v113 AND v194. v194 is given as TRUE. But v113 is NOT given. No rule produces v113. Therefore v13 cannot be derived.
Answer: v13 [FALSE]

Why it fails: the high F1 (0.87) means the model almost gets it right. It correctly identifies the relevant rules and variables but makes errors on precondition checking: it sometimes "assumes" a precondition is met when it is neither given nor derivable.
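The precondition check in the example above can be made explicit with a forward closure over the four rules shown. The helper names are illustrative, and treating an underivable variable as FALSE follows the example's stated answer.

```python
def derivable(rules, given_true):
    """Return every variable that can be derived as TRUE from the givens."""
    known = set(given_true)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and all(p in known for p in premises):
                known.add(conclusion)
                changed = True
    return known


def missing_preconditions(target, rules, known):
    """For each rule concluding `target`, list the premises that are not known."""
    return [[p for p in premises if p not in known]
            for premises, conclusion in rules if conclusion == target]


rules = [
    (["v113"], "v194"),
    (["v113", "v194"], "v13"),
    (["v13", "v113"], "v33"),
    (["v113", "v194"], "v140"),
]
given = {"v73": True, "v194": True, "v52": False}

known = derivable(rules, [var for var, value in given.items() if value])
print("v13" in known)                              # False
print(missing_preconditions("v13", rules, known))  # [['v113']]
```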

All 30 reasoning types

The full taxonomy of reasoning capabilities tested. Each type has easy, medium, and hard variants with increasing hop counts, variable counts, and red herrings.

backward_chain: 100%
mixed_direction: 100%
causal_chain: 100%
spatial_flood: 100%
spatial_barrier: 100%
spatial_multi_source: 100%
counterfactual_simple: 100%
counterfactual_cascade: 100%
counterfactual_minimal: 100%
disjunctive_simple: 100%
disjunctive_elimination: 100%
disjunctive_minimal: 100%
void_simple: 100%
void_partial_chain: 100%
void_ambiguous: 100%
abductive_simple: 100%
abductive_multi_hyp: 100%
abductive_chained: 100%
labyrinth: 0%
threshold: 0%
transitive_simple: 0%
transitive_multi_rel: 0%
eca_simple: 0%
eca_multi_rule: 0%
planning_simple: 0%
planning_precon: 0%
naf_simple: 0%
defeasible: 0%
induction_cycle: 0%
exclusion_mutual: 0%

Real-world applications

The same reasoning structures appear everywhere. The model learns abstract logic — the domain is just a mapping layer on top.

Medical Diagnosis

Backward chaining + abductive reasoning: given symptoms, work backwards through diagnostic rules to find the most likely cause.

v0 → fever
v1 → rash
v2 → cough
v3 → measles
v4 → flu
Rule: IF v0 [TRUE] AND v1 [TRUE] AND v2 [FALSE] → v3 [TRUE]

Supply Chain Logic

Causal chains + planning: determine if an order can ship by tracing dependencies through the fulfillment pipeline.

v0 → order_packed
v1 → carrier_confirmed
v2 → pickup_scheduled
v3 → can_ship
Rule: IF v0 [TRUE] AND v1 [TRUE] AND v2 [TRUE] → v3 [TRUE]

Network Security

Spatial flood + barrier: model attack propagation through a network. Which nodes are reachable from a compromised host, given firewall rules?

v0 → compromised_host
v1 → firewall_active
v2 → database_server
v3 → user_data
Rule: IF v0 [ACTIVE] AND [ADJACENT] AND NOT v1 → v2 [TRUE]

Legal Reasoning

Defeasible logic + priority: general rules can be overridden by more specific exceptions. Higher-priority rules defeat lower ones.

v0 → contract_valid
v1 → force_majeure
v2 → obligation_holds
Rule: IF v0 [TRUE] → v2 [TRUE]
Exception: IF v1 [TRUE] → v2 [FALSE] (priority: high)
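One way to read the rule and its exception is as a priority lookup. In this sketch the numeric priorities (0 for the general rule, 1 for the exception) and the `resolve` helper are assumptions made for illustration.

```python
def resolve(rules, facts, target):
    """Return the conclusion of the highest-priority applicable rule for `target`."""
    applicable = [
        rule for rule in rules
        if rule["then"][0] == target
        and all(facts.get(var) == value for var, value in rule["if"].items())
    ]
    if not applicable:
        return None  # no rule applies, so the target stays undetermined
    return max(applicable, key=lambda rule: rule["priority"])["then"][1]


rules = [
    {"if": {"v0": True}, "then": ("v2", True), "priority": 0},   # general rule
    {"if": {"v1": True}, "then": ("v2", False), "priority": 1},  # exception (priority: high)
]

# contract_valid without force_majeure: the obligation holds.
print(resolve(rules, {"v0": True, "v1": False}, "v2"))  # True
# contract_valid with force_majeure: the exception defeats the general rule.
print(resolve(rules, {"v0": True, "v1": True}, "v2"))   # False
```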

Crime Investigation

Counterfactual reasoning: if the suspect wasn't at the scene, would the evidence still hold? Test hypotheses by negating assumptions.

v0 → suspect_at_scene
v1 → dna_match
v2 → motive_established
v3 → prime_suspect
Counterfactual: IF NOT v0 → does v3 still hold?

Configuration Management

Mutual exclusion + constraint satisfaction: certain options are incompatible. Find a valid configuration that satisfies all constraints.

v0 → gpu_mode
v1 → low_power_mode
v2 → high_res_display
Constraint: NOT (v0 AND v1)
Constraint: IF v2 → v0
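With only three variables, the two constraints can be checked by brute-force enumeration. This is a sketch of the constraint check itself, not of how the model solves such puzzles.

```python
from itertools import product


def valid_configs():
    """Enumerate every assignment that satisfies both constraints."""
    configs = []
    for v0, v1, v2 in product([False, True], repeat=3):
        not_both = not (v0 and v1)  # Constraint: NOT (v0 AND v1)
        implies = (not v2) or v0    # Constraint: IF v2 -> v0
        if not_both and implies:
            configs.append(
                {"gpu_mode": v0, "low_power_mode": v1, "high_res_display": v2}
            )
    return configs


configs = valid_configs()
for config in configs:
    print(config)
```

Four of the eight assignments survive, and none of them combines low_power_mode with high_res_display, since the display requires gpu_mode and gpu_mode excludes low power.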

Research context

This is an ongoing research project exploring the limits of transformer reasoning.

The core question

Can a transformer learn systematic multi-hop reasoning if you eliminate all confounds from natural language? WRE isolates this question by removing memorization, world knowledge, and linguistic shortcuts. What remains is pure logical structure.

Key findings so far

18/30 types solved perfectly — the model genuinely learns multi-step reasoning for these categories

The CoT Tax — chain-of-thought reasoning hurts parallel operations (ECA, temporal_sync). Forcing serialization of inherently parallel computations degrades performance

RL can't teach new capabilities — extensive GRPO experiments (v1–v8, 6 curriculum configs) confirmed that if the model can't find correct solutions during sampling, the gradient is zero regardless of reward shaping

Format ≠ reasoning — the 12 failing types have high F1 (0.63–0.95), meaning the model produces correctly formatted outputs but makes logical errors. The generation circuit works; the reasoning circuit doesn't
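The zero-gradient finding follows from how group-relative advantages are commonly computed in GRPO: each sample's reward is normalized against its own group. The sketch below assumes that standard mean and standard deviation normalization and a binary correctness reward; whether this project's GRPO runs use exactly this form is not stated.

```python
import statistics


def group_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(reward - mean) / (std + eps) for reward in rewards]


# No sampled completion is correct: all rewards are equal, all advantages are zero.
all_wrong = group_advantages([0.0, 0.0, 0.0, 0.0])
# One completion is correct: that sample gets a positive advantage.
one_right = group_advantages([0.0, 0.0, 1.0, 0.0])

print(all_wrong)  # [0.0, 0.0, 0.0, 0.0]
print(one_right)
```

When every sample in a group receives the same reward, the policy-gradient term built from these advantages vanishes, so there is nothing for the update to reinforce.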

Technical stack

Framework: JAX/Flax
Compute: Google TPU Research Cloud (v6e-8)
Architecture: 1.74B params, AttnRes, 32 layers / 8 blocks
Tokenizer: Word-level, vocab ~320 tokens
Data: 28.4M traces, formally verified
Training: 203K steps pre-train + 6K steps SFT + GRPO v1–v8