A transformer trained from scratch on synthetic symbolic reasoning puzzles. No natural language. No memorization. Just logic. Can it actually learn to reason?
The model uses 200 static symbolic variables (v0–v199). The same names appear in every puzzle, randomly assigned to roles. It can't memorize associations — it has to learn the abstract reasoning structure itself.
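As a rough illustration of why fixed names carry no information, the sketch below draws a fresh name-to-role mapping for each puzzle. The role labels and the `assign_roles` helper are hypothetical and are not taken from the project's generator.

```python
import random

# The 200 static variable names described above: v0 through v199.
VARIABLES = [f"v{i}" for i in range(200)]

def assign_roles(roles, rng):
    """Map each puzzle role to a distinct, randomly chosen variable name."""
    names = rng.sample(VARIABLES, k=len(roles))
    return dict(zip(roles, names))

rng = random.Random(0)
roles = ["premise", "intermediate", "query"]  # hypothetical role labels
puzzle_a = assign_roles(roles, rng)
puzzle_b = assign_roles(roles, rng)
```

Because every puzzle redraws the mapping, a name such as v17 plays a different role each time, so only the rule structure is worth learning.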
Instead of fixed residual accumulation, AttnRes uses softmax attention over depth. Each layer can dynamically route information from any previous block, enabling multi-hop reasoning chains without information dilution.
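One way to picture attention over depth is the minimal sketch below. It assumes each previous block emits a vector and the current layer scores those vectors with a dot-product query; the function names, the query vector, and the scaling are illustrative assumptions, not the project's actual AttnRes code.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def depth_attention(query, block_outputs):
    """Mix earlier block outputs with softmax weights instead of a fixed sum.

    query: hypothetical per-layer query vector (list of floats).
    block_outputs: one vector per previous block, same length as query.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * h for q, h in zip(query, out)) / scale
              for out in block_outputs]
    weights = softmax(scores)
    mixed = [sum(w * out[i] for w, out in zip(weights, block_outputs))
             for i in range(len(query))]
    return mixed, weights

blocks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed, weights = depth_attention([2.0, 0.0], blocks)
```

With this query, the first and third blocks receive equal weight and the second receives less, so the layer reads mostly from the blocks that match what it is looking for rather than from an undifferentiated running sum.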
8-chip data parallelism on TPU v6e. 203,450 pre-training steps, followed by SFT (2 epochs, 147K sequences) and GRPO reinforcement learning (v1–v8). All training data generated and verified by formal solvers — zero label errors.
200 static variables, randomly assigned per puzzle. No entity names, no natural language, no world knowledge. If the model gets the right answer, it's because it followed the logical chain — not because it remembered something from training.
These reasoning types are solved perfectly. The model produces the exact correct state for every variable, every time.
Given a query, work backwards through the rule chain to determine if it can be proven. Requires tracking which facts are needed and whether they're available.
The model traces backwards through 11 steps, determining that v154 cannot be proven TRUE because the chain depends on v55 which is given as FALSE.
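The backward-chaining procedure can be sketched as a small recursive prover. The rule encoding below is an assumption (each head maps to a list of conjunctive bodies), and the toy chain is far shorter than the 11-step trace; v12 and v7 are made-up intermediate variables.

```python
def prove(goal, rules, facts, visiting=None):
    """Return True if `goal` can be derived as TRUE by backward chaining.

    rules: dict mapping a head variable to a list of bodies; each body is a
           list of variables that must all be TRUE (hypothetical encoding).
    facts: dict mapping variables to their given truth values.
    """
    if goal in facts:
        return facts[goal]
    visiting = visiting or set()
    if goal in visiting:  # guard against cyclic rules
        return False
    visiting = visiting | {goal}
    for body in rules.get(goal, []):
        if all(prove(v, rules, facts, visiting) for v in body):
            return True
    return False  # no rule body could be satisfied

# Toy chain: v154 needs v12, and v12 needs both v55 and v7.
rules = {"v154": [["v12"]], "v12": [["v55", "v7"]]}
blocked = prove("v154", rules, {"v55": False, "v7": True})
proven = prove("v154", rules, {"v55": True, "v7": True})
```

Here `blocked` is False because the only derivation passes through v55, which is given as FALSE; flipping that single fact makes the query provable.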
Multi-hop causal chains with branching effects. The model must propagate causes through a network of rules, handling both positive and negative effects.
A 4-hop causal chain: v138+v3 causes v23, which negates v71, which (via default) makes v63 FALSE, which triggers v166, which finally negates v140.
Graph reachability via flood propagation. Given active source nodes and an adjacency graph, determine which nodes become active through spreading activation.
The model floods activation from v49 through the graph, reaching v107 via two independent paths (v49→v103→v75→v107 and v49→v173→v97→v107).
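Flood propagation of this kind amounts to breadth-first reachability. The sketch below reuses the two paths named above and assumes directed edges; the extra edge from v5 to v6 is a made-up distractor that is not in the example.

```python
from collections import deque

def flood(sources, edges):
    """Return every node activated by spreading from `sources` along edges."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, []).append(dst)
    active = set(sources)
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, []):
            if nxt not in active:  # activate each node only once
                active.add(nxt)
                queue.append(nxt)
    return active

edges = [
    ("v49", "v103"), ("v103", "v75"), ("v75", "v107"),  # first path
    ("v49", "v173"), ("v173", "v97"), ("v97", "v107"),  # second path
    ("v5", "v6"),                                       # hypothetical distractor
]
active = flood(["v49"], edges)
```

The node v107 is reached by both paths but is added to `active` only once, and the distractor nodes stay inactive because nothing connects them to v49.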
These types have high partial scores (F1 0.63–0.95) — the model gets the format right but makes reasoning errors on the hardest chains. Solving these is the current research frontier.
Long-range path finding through a maze of conditional rules. Requires tracking 10–30 hops through a noisy graph where most paths are dead ends. The model must find the single valid derivation chain among dozens of distractors.
Why it fails: 14 sequential hops with alternating key variables (v184, v176, v48). One wrong step compounds: if token errors were independent, 95% per-token accuracy would give only about 0.05% exact match (0.95^150) on 150-token states, which is effectively 0%.
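The compounding arithmetic can be checked directly. The sketch below assumes token errors are independent, which is a simplification, and both function names are illustrative.

```python
def exact_match_probability(per_token_accuracy, num_tokens):
    """Chance that every token is correct, assuming independent errors."""
    return per_token_accuracy ** num_tokens

def required_accuracy(target_exact_match, num_tokens):
    """Per-token accuracy needed to hit a target exact-match rate."""
    return target_exact_match ** (1.0 / num_tokens)

p_exact = exact_match_probability(0.95, 150)  # roughly 0.00046
needed = required_accuracy(0.5, 150)          # roughly 0.9954
```

Under this assumption, even a 50% exact-match rate on 150-token states needs per-token accuracy above 99.5%.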
Simulate cellular automaton rules over multiple timesteps. Each cell's next state depends on its neighborhood (left, center, right). Requires parallel state tracking across all cells simultaneously.
Why it fails: ECA requires parallel computation, because all cells update simultaneously based on their neighbors. Chain-of-thought serializes this, forcing the model to track multiple cells' states in working memory. This is the "CoT Tax": sequential reasoning hurts parallel operations.
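For reference, one synchronous update of an elementary cellular automaton looks like the sketch below. The Wolfram rule-number encoding and the wrap-around boundary are assumptions; the puzzles may encode rules and edges differently.

```python
def eca_step(cells, rule):
    """Apply one synchronous ECA update.

    cells: list of 0/1 ints.
    rule: rule number in 0..255 (assumed Wolfram encoding).
    Boundaries wrap around, which is an assumption.
    """
    n = len(cells)
    nxt = []
    for i in range(n):
        # Every cell reads the OLD state; nothing is updated in place.
        left, center, right = cells[i - 1], cells[i], cells[(i + 1) % n]
        index = (left << 2) | (center << 1) | right
        nxt.append((rule >> index) & 1)
    return nxt

state = eca_step([0, 0, 1, 0, 0], rule=90)
```

Every new cell is computed from the previous row alone, which is why writing the update out cell by cell forces the whole old row to be held in memory until the last cell is done.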
Find a valid action sequence where each action has preconditions that must be satisfied before execution. Requires forward search through a state space with dependencies between actions.
Why it fails: high F1 (0.87) means the model almost gets it right. It correctly identifies the relevant rules and variables but makes errors on precondition checking — it sometimes "assumes" a precondition is met when it isn't explicitly given.
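A strict precondition check is easy to state in code. The sketch below does a breadth-first forward search over a hypothetical encoding in which each action maps to a set of preconditions and a set of added facts; the action names and variables are made up for illustration.

```python
from collections import deque

def find_plan(initial, goal, actions):
    """Breadth-first forward search for a valid action sequence.

    actions: dict name -> (preconditions, effects), both sets of facts.
    Effects only add facts in this simplified encoding.
    """
    start = frozenset(initial)
    goal = frozenset(goal)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, plan = queue.popleft()
        if goal <= state:
            return plan
        for name, (pre, eff) in actions.items():
            if pre <= state:  # every precondition must already hold
                nxt = state | eff
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, plan + [name]))
    return None  # no valid sequence exists

actions = {
    "a1": ({"v1"}, {"v2"}),
    "a2": ({"v2", "v3"}, {"v4"}),
    "a3": ({"v1"}, {"v3"}),
}
plan = find_plan({"v1"}, {"v4"}, actions)
no_plan = find_plan({"v3"}, {"v4"}, actions)
```

The second call returns None because v2 is never established, so a2 can never fire. Treating v2 as satisfied without it being given is the kind of error described above.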
The full taxonomy of reasoning capabilities tested. Each type has easy, medium, and hard variants with increasing hop counts, variable counts, and red herrings.
The same reasoning structures appear everywhere. The model learns abstract logic — the domain is just a mapping layer on top.
Backward chaining + abductive reasoning: given symptoms, work backwards through diagnostic rules to find the most likely cause.
Causal chains + planning: determine if an order can ship by tracing dependencies through the fulfillment pipeline.
Spatial flood + barrier: model attack propagation through a network. Which nodes are reachable from a compromised host, given firewall rules?
Defeasible logic + priority: general rules can be overridden by more specific exceptions. Higher-priority rules defeat lower ones.
Counterfactual reasoning: if the suspect wasn't at the scene, would the evidence still hold? Test hypotheses by negating assumptions.
Mutual exclusion + constraint satisfaction: certain options are incompatible. Find a valid configuration that satisfies all constraints.
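To make the defeasible-logic mapping above concrete, the sketch below resolves a single conclusion by priority. The tuple encoding of rules is an assumption, the variables are made up, and the function handles one inference step rather than a full chain.

```python
def defeasible_value(target, rules, facts, default=None):
    """Return the conclusion of the highest-priority applicable rule.

    rules: list of (priority, conditions, variable, value) tuples
           (hypothetical encoding; ties are not resolved specially).
    facts: dict mapping variables to given truth values.
    """
    applicable = [
        (priority, value)
        for priority, conditions, variable, value in rules
        if variable == target
        and all(facts.get(v) is True for v in conditions)
    ]
    if not applicable:
        return default
    return max(applicable, key=lambda pair: pair[0])[1]

rules = [
    (1, ["v10"], "v20", True),          # general rule
    (2, ["v10", "v11"], "v20", False),  # more specific exception
]
overridden = defeasible_value("v20", rules, {"v10": True, "v11": True})
general = defeasible_value("v20", rules, {"v10": True, "v11": False})
```

When the exception's extra condition holds, its higher priority defeats the general rule; otherwise the general rule's conclusion stands.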
This is an ongoing research project exploring the limits of transformer reasoning.
Can a transformer learn systematic multi-hop reasoning if you eliminate all confounds from natural language? WRE isolates this question by removing memorization, world knowledge, and linguistic shortcuts. What remains is pure logical structure.
• 18/30 types solved perfectly — the model genuinely learns multi-step reasoning for these categories
• The CoT Tax — chain-of-thought reasoning hurts parallel operations (ECA, temporal_sync). Forcing serialization of inherently parallel computations degrades performance
• RL can't teach new capabilities — extensive GRPO experiments (v1–v8, 6 curriculum configs) confirmed that if the model can't find correct solutions during sampling, the gradient is zero regardless of reward shaping
• Format ≠ reasoning — the 12 failing types have high F1 (0.63–0.95), meaning the model produces correctly formatted outputs but makes logical errors. The generation circuit works; the reasoning circuit doesn't
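The zero-gradient finding follows from how group-relative advantages are computed. The sketch below uses the common GRPO formulation (reward minus group mean, divided by group standard deviation); whether the project uses exactly this normalization is an assumption, and the reward values are illustrative.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group, GRPO-style."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Every sample fails: identical rewards, so every advantage is zero.
all_failed = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# One sample succeeds: it gets a positive advantage, the rest negative.
one_success = group_relative_advantages([0.0, 0.0, 1.0, 0.0])
```

When every sample in a group earns the same reward, the advantages vanish and the policy update carries no signal, so at least one success has to be sampled before RL can reinforce it.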
Framework: JAX/Flax
Compute: Google TPU Research Cloud (v6e-8)
Architecture: 1.74B params, AttnRes, 32 layers / 8 blocks
Tokenizer: Word-level, vocab ~320 tokens
Data: 28.4M traces, formally verified
Training: 203K steps pre-train + 6K steps SFT + GRPO v1–v8