Replies: 3 comments
-
This is a very interesting project. When building the CTM I actually tested extensively on Sudoku. First off: RoPE is relative, meaning that it has to be relative to something. Given that the CTM attends to these embeddings using cross-attention, what is it "relative to"? If it just sits at position 0 or 82, then it reverts to being a standard sinusoidal embedding (just an FYI). I have several suggestions:
I did get impressive performance when I tried this, but I will admit that the 3D RoPE embedding and some transformer featurisation were a necessity. Good luck!
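To make the "relative to what" point concrete, here is a minimal sketch of a 2D rotary embedding over the 9x9 grid (illustrative only, not the CTM code): half of each feature vector is rotated by the row index and the other half by the column index. The relative behaviour only appears when both the query side and the key side carry grid positions, so the attention score depends on (row_q - row_k, col_q - col_k); if the query side has no grid position, this collapses to an absolute encoding of the keys.

```python
import torch

def rope_angles(positions: torch.Tensor, dim: int, base: float = 10000.0) -> torch.Tensor:
    """Rotation angles for integer positions; returns shape (..., dim // 2)."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    return positions.float()[..., None] * inv_freq

def apply_rope(x: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
    """Rotate consecutive (even, odd) feature pairs of x by the given angles."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def apply_2d_rope(x: torch.Tensor, rows: torch.Tensor, cols: torch.Tensor) -> torch.Tensor:
    """x: (..., 81, d); rows, cols: (81,) grid indices in [0, 8]."""
    half = x.shape[-1] // 2  # first half encodes rows, second half encodes columns
    x_row = apply_rope(x[..., :half], rope_angles(rows, half))
    x_col = apply_rope(x[..., half:], rope_angles(cols, half))
    return torch.cat([x_row, x_col], dim=-1)

# Usage on the 81 Sudoku cells:
idx = torch.arange(81)
rows, cols = idx // 9, idx % 9
keys = torch.randn(1, 81, 64)               # per-cell key vectors
keys_rot = apply_2d_rope(keys, rows, cols)  # rotated by (row, col) position
```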
-
I tried inducing a curriculum in the dataset via data augmentation, where the augmented dataset was created by unmasking a fraction of the masked tokens; this unmasking fraction varied as linspace(0.0, 0.9, 10) (a rough sketch of the augmentation is at the end of this comment). I also reduced the number of ticks to 25. The results of this experiment were as follows:
The training curves above show that the pixel accuracies are now much higher, and even the puzzle accuracies reach a value of ~0.65 before the model starts to overfit. The model does learn to solve the puzzles in increasing order of difficulty (denoted by increasing values of mask ratio in the solved puzzles), but it's unable to solve the hard puzzles in the original dataset (with mask ratio values > 0.65). Moreover, the trained model is often able to partially solve the hard puzzles, reducing them from a mask ratio value of 0.6 to as low as 0.05, but it's unable to make further progress (as shown in the example below). A similar observation was made in this paper about HRM: https://arxiv.org/abs/2601.10679. My guess is that these observations imply the CTM does not learn a recursive solution, where it could reduce a difficult problem into an easier problem and reuse its learnt solution to the easier problem. Instead, the learnt solution is a higher-order function represented via the neural dynamics over ticks, which leads to different strategies for solving puzzles of different difficulty. I verified this by increasing the number of ticks from 25 to 50. The result of this experiment is shown below.
The learning curves above show that on increasing the number of ticks from 25 to 50, the model learns a different solution with much less overfitting. Another interesting experiment was to change the training curriculum by varying the unmasking fraction as linspace(0.0, 0.98, 50). This new curriculum provides a much smoother gradient in puzzle difficulty. The result of this experiment is shown below.
The learning curves above show that the model now converges to a lower test accuracy (~0.5). I am not sure why this is the case, since I did not expect the model to be so sensitive to the training curriculum.
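Roughly, the curriculum augmentation described above looks like the following sketch (MASK_TOKEN, the array shapes, and the function name are illustrative, not the exact script): for each level f in linspace(0.0, 0.9, 10), every puzzle gets a copy in which a fraction f of its blank cells is revealed from the solution.

```python
import numpy as np

MASK_TOKEN = 0  # placeholder id for a blank/masked cell

def build_curriculum(puzzles: np.ndarray, solutions: np.ndarray,
                     levels=np.linspace(0.0, 0.9, 10), seed: int = 0):
    """puzzles, solutions: (N, 81) int arrays; returns the augmented (inputs, targets)."""
    rng = np.random.default_rng(seed)
    aug_inputs, aug_targets = [], []
    for f in levels:
        for puz, sol in zip(puzzles, solutions):
            puz = puz.copy()
            blanks = np.flatnonzero(puz == MASK_TOKEN)
            reveal = rng.choice(blanks, size=int(round(f * len(blanks))), replace=False)
            puz[reveal] = sol[reveal]  # unmask a fraction f of the blank cells
            aug_inputs.append(puz)
            aug_targets.append(sol)
    return np.stack(aug_inputs), np.stack(aug_targets)
```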
-
Hi,
Thanks for your great work on CTM.
I've been trying to use the CTM to solve hard Sudoku puzzles. I'm using the dataset from Sapient (https://huggingface.co/datasets/sapientinc/sudoku-extreme-1k) with 1k augmentations for each grid (as done in HRM and TRM). My implementation is very similar to the provided parity example, with the sequence length changed to 81 and the number of classes changed to 10 (actually 11, since there is a pad token). Another change was adding a custom 2D positional RoPE to the backbone (the backbone is just an embedding layer), since both row and column indices matter for Sudoku (unlike parity).
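For concreteness, the encoding is along these lines (a sketch with illustrative names, not my exact code; it assumes each puzzle/solution comes as an 81-character string with '.' or '0' marking blanks, so adjust to the actual CSV format):

```python
import torch

PAD = 10  # 11 classes total: digits 0-9 plus a pad token for blank cells

def encode_puzzle(q: str, a: str):
    """q: 81-char puzzle string with blanks; a: 81-char solution string."""
    x = torch.tensor([PAD if c in '.0' else int(c) for c in q], dtype=torch.long)
    y = torch.tensor([int(c) for c in a], dtype=torch.long)
    return x, y  # (81,) input tokens with PAD at blanks, (81,) digit targets

# Row/column indices fed to the 2D RoPE in the backbone:
idx = torch.arange(81)
rows, cols = idx // 9, idx % 9
```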
But the training seems to get stuck at a local minimum, with train and test (pixel) accuracies of ~0.6, as shown in the figures.
I'm using the following hyperparams:
backbone = embedding with custom 2-d RoPE
pairing = random-pairing
use_most_certain = True
ticks = 100
d_model = 1024
d_input = 512
heads = 16
synch Out = 512
synch Action = 512
synch Self = 32
memory length = 64
memory hidden dim = 32
batch size = 128
lr = 1e-4
dropout = 0.2
weight decay = 0.0
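For reference, my understanding of use_most_certain, as a sketch (illustrative, not the CTM repo's exact code): the model emits a prediction at every internal tick, certainty is computed as one minus the normalized entropy of that prediction, and the output of the most certain tick is used.

```python
import math
import torch
import torch.nn.functional as F

def most_certain_tick(logits: torch.Tensor) -> torch.Tensor:
    """logits: (batch, ticks, 81, n_classes) -> logits of the most certain tick."""
    p = F.softmax(logits, dim=-1)
    entropy = -(p * torch.log(p.clamp_min(1e-9))).sum(-1)    # (batch, ticks, 81)
    certainty = 1.0 - entropy / math.log(logits.shape[-1])   # normalize by log(n_classes)
    best = certainty.mean(-1).argmax(dim=1)                  # most certain tick per sample
    return logits[torch.arange(logits.shape[0]), best]       # (batch, 81, n_classes)
```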
Do you have any intuition as to why the accuracies are hitting a ceiling and the training loss is not decreasing with further training?
Thanks