Replies: 3 comments
-
This is a very interesting project. When building the CTM I actually tested extensively on Sudoku. First off: RoPE is relative, meaning that it has to be relative to something. Given that the CTM attends to these embeddings using cross-attention, what is it "relative to"? If it just sits at position 0 or 82, then it reverts to being a standard sinusoidal embedding (just an FYI). I have several suggestions:
I did get impressive performance when I tried this, but I will admit that the 3D RoPE embedding and some transformer featurisation were a necessity. Good luck!
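To make the "relative to what" point concrete, here is a minimal sketch of a 2D rotary embedding over the 9x9 grid (illustrative only, not the CTM code): half of each feature vector is rotated by the row index and the other half by the column index. The relative behaviour only appears when both the query side and the key side carry grid positions, so the attention score depends on (row_q - row_k, col_q - col_k); if the query side has no grid position, this collapses to an absolute encoding of the keys.

```python
import torch

def rope_angles(positions: torch.Tensor, dim: int, base: float = 10000.0) -> torch.Tensor:
    """Rotation angles for integer positions; returns shape (..., dim // 2)."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    return positions.float()[..., None] * inv_freq

def apply_rope(x: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
    """Rotate consecutive (even, odd) feature pairs of x by the given angles."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def apply_2d_rope(x: torch.Tensor, rows: torch.Tensor, cols: torch.Tensor) -> torch.Tensor:
    """x: (..., 81, d); rows, cols: (81,) grid indices in [0, 8]."""
    half = x.shape[-1] // 2  # first half encodes rows, second half encodes columns
    x_row = apply_rope(x[..., :half], rope_angles(rows, half))
    x_col = apply_rope(x[..., half:], rope_angles(cols, half))
    return torch.cat([x_row, x_col], dim=-1)

# Usage on the 81 Sudoku cells:
idx = torch.arange(81)
rows, cols = idx // 9, idx % 9
keys = torch.randn(1, 81, 64)               # per-cell key vectors
keys_rot = apply_2d_rope(keys, rows, cols)  # rotated by (row, col) position
```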
-
I tried inducing a curriculum in the dataset via data augmentation, where the augmented dataset was created by unmasking a fraction of the masked tokens; this unmasking fraction varied as linspace(0.0, 0.9, 10) (a rough sketch of the augmentation is at the end of this comment). I also reduced the number of ticks to 25. The results of this experiment were as follows:
The training curves above show that the pixel accuracies are now much higher, and even the puzzle accuracies reach a value of ~0.65 before the model starts to overfit. The model does learn to solve the puzzles in increasing order of difficulty (denoted by increasing values of mask ratio in the solved puzzles), but it's unable to solve the hard puzzles in the original dataset (with mask ratio values > 0.65). Moreover, the trained model is often able to partially solve the hard puzzles, reducing them from a mask ratio value of 0.6 to as low as 0.05, but it's unable to make further progress (as shown in the example below). A similar observation was made in this paper about HRM: https://arxiv.org/abs/2601.10679. My guess is that these observations imply the CTM does not learn a recursive solution, where it could reduce a difficult problem into an easier problem and reuse its learnt solution to the easier problem. Instead, the learnt solution is a higher-order function represented via the neural dynamics over ticks, which leads to different strategies for solving puzzles of different difficulty. I verified this by increasing the number of ticks from 25 to 50. The result of this experiment is shown below.
The learning curves above show that on increasing the number of ticks from 25 to 50, the model learns a different solution with much less overfitting. Another interesting experiment was to change the training curriculum by varying the unmasking fraction as linspace(0.0, 0.98, 50). This new curriculum provides a much smoother gradient in puzzle difficulty. The result of this experiment is shown below.
The learning curves above show that the model now converges to a lower test accuracy (~0.5). I am not sure why this is the case, since I did not expect the model to be so sensitive to the training curriculum.
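Roughly, the curriculum augmentation described above looks like the following sketch (MASK_TOKEN, the array shapes, and the function name are illustrative, not the exact script): for each level f in linspace(0.0, 0.9, 10), every puzzle gets a copy in which a fraction f of its blank cells is revealed from the solution.

```python
import numpy as np

MASK_TOKEN = 0  # placeholder id for a blank/masked cell

def build_curriculum(puzzles: np.ndarray, solutions: np.ndarray,
                     levels=np.linspace(0.0, 0.9, 10), seed: int = 0):
    """puzzles, solutions: (N, 81) int arrays; returns the augmented (inputs, targets)."""
    rng = np.random.default_rng(seed)
    aug_inputs, aug_targets = [], []
    for f in levels:
        for puz, sol in zip(puzzles, solutions):
            puz = puz.copy()
            blanks = np.flatnonzero(puz == MASK_TOKEN)
            reveal = rng.choice(blanks, size=int(round(f * len(blanks))), replace=False)
            puz[reveal] = sol[reveal]  # unmask a fraction f of the blank cells
            aug_inputs.append(puz)
            aug_targets.append(sol)
    return np.stack(aug_inputs), np.stack(aug_targets)
```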
-
Hi,
Thanks for your great work on CTM.
I've been trying to use the CTM to solve hard Sudoku puzzles. I'm using the dataset from Sapient (https://huggingface.co/datasets/sapientinc/sudoku-extreme-1k) with 1k augmentations for each grid (as done in HRM and TRM). My implementation is very similar to the provided parity example, with the sequence length changed to 81 and the number of classes changed to 10 (actually 11, since there is a pad token). Another change was adding a custom 2D positional RoPE to the backbone (the backbone is just an embedding layer), since both row and column indices matter for Sudoku (unlike parity).
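For concreteness, the encoding is along these lines (a sketch with illustrative names, not my exact code; it assumes each puzzle/solution comes as an 81-character string with '.' or '0' marking blanks, so adjust to the actual CSV format):

```python
import torch

PAD = 10  # 11 classes total: digits 0-9 plus a pad token for blank cells

def encode_puzzle(q: str, a: str):
    """q: 81-char puzzle string with blanks; a: 81-char solution string."""
    x = torch.tensor([PAD if c in '.0' else int(c) for c in q], dtype=torch.long)
    y = torch.tensor([int(c) for c in a], dtype=torch.long)
    return x, y  # (81,) input tokens with PAD at blanks, (81,) digit targets

# Row/column indices fed to the 2D RoPE in the backbone:
idx = torch.arange(81)
rows, cols = idx // 9, idx % 9
```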
But the training seems to get stuck at a local minimum, with train and test (pixel) accuracies of ~0.6, as shown in the figures.
I'm using the following hyperparams:
backbone = embedding with custom 2-d RoPE
pairing = random-pairing
use_most_certain = True
ticks = 100
d_model = 1024
d_input = 512
heads = 16
synch Out = 512
synch Action = 512
synch Self = 32
memory length = 64
memory hidden dim = 32
batch size = 128
lr = 1e-4
dropout = 0.2
weight decay = 0.0
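For reference, my understanding of use_most_certain, as a sketch (illustrative, not the CTM repo's exact code): the model emits a prediction at every internal tick, certainty is computed as one minus the normalized entropy of that prediction, and the output of the most certain tick is used.

```python
import math
import torch
import torch.nn.functional as F

def most_certain_tick(logits: torch.Tensor) -> torch.Tensor:
    """logits: (batch, ticks, 81, n_classes) -> logits of the most certain tick."""
    p = F.softmax(logits, dim=-1)
    entropy = -(p * torch.log(p.clamp_min(1e-9))).sum(-1)    # (batch, ticks, 81)
    certainty = 1.0 - entropy / math.log(logits.shape[-1])   # normalize by log(n_classes)
    best = certainty.mean(-1).argmax(dim=1)                  # most certain tick per sample
    return logits[torch.arange(logits.shape[0]), best]       # (batch, 81, n_classes)
```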
Do you have any intuition as to why the accuracies are hitting a ceiling and the training loss is not decreasing with further training?
Thanks