Extracting Model Weights: Breaking Taylor Unswift Obfuscation

This repository contains the implementation accompanying the paper: “Extracting Original Weights Post-Model Obfuscation: An Attack on Taylor Unswift.”

We demonstrate that the Taylor Unswift method, which uses a truncated Taylor expansion to mask weight matrices in Transformer-style MLP layers, can be inverted. Our attack reconstructs the original weights with near-perfect accuracy using only inputs, outputs, and partial knowledge of the model.

Overview

Taylor Unswift: A technique that obfuscates the down-projection matrix $W_{d2}$ in Transformer MLPs by embedding it into a Taylor series expansion of the activation function.
Our Contribution: We show that this obfuscation is reversible. By rearranging the forward equations and applying the Moore–Penrose pseudoinverse, we can directly solve for $W_{d2}$.
Result: The recovered weights match the original with extremely high fidelity.

Average entry-wise distance between true and recovered weights: 0.0000016806

Features

Implementation of the forward obfuscated MLP (Taylor Unswift).
Attack pipeline to reconstruct weights from input–output pairs.
Utilities for measuring recovery accuracy (L2 error, entry-wise distance).
Example scripts demonstrating full recovery on LLaMA-style MLP layers.

Method

The core recovery equation is:

$$ W_{d2} = Z^{+} , \tilde{Y}, \quad Z = \sigma(X W_{d1}) $$

where

$X$ = input batch,
$W_{d1}$ = encoder matrix,
$\tilde{Y}$ = observed obfuscated output,
$Z^{+}$ = pseudoinverse of the activation-transformed input.

This linear inversion bypasses the obfuscation and recovers the original $W_{d2}$.

Results

Average entry-wise distance between recovered and original weights:
```
0.0000016806
```
Recovery requires only moderate computation and does not degrade performance on benchmarked inference.

Repository Structure

Extracting-Weights/
│── protection/                 # Appendix script for Taylor Unswift Core
│── taylor_expansion/         # Scripts to run Taylor Unswift
│── AttackSample.ipynb       # Attack sample
│── README.md            # Project documentation

Disclaimer

This code is released for academic and research purposes only. It is intended to evaluate the robustness of weight obfuscation methods and highlight potential vulnerabilities.

Citation

If you use this work, please cite:

@article{taylorunswift_attack,
  title={Breaking Taylor Unswift: Extracting Original Weights Post Model Obfuscation},
  author={Nguyen, Ethan},
  year={2025},
  note={arXiv preprint}
}

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
protection		protection
tylor_expansion		tylor_expansion
.gitignore		.gitignore
AttackSample.ipynb		AttackSample.ipynb
LICENSE		LICENSE
README.md		README.md
hidden-states-torch.bfloat16-n-20000-len-4096.pth.tar		hidden-states-torch.bfloat16-n-20000-len-4096.pth.tar
requirements.txt		requirements.txt
test_implementation.ipynb		test_implementation.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Extracting Model Weights: Breaking Taylor Unswift Obfuscation

Overview

Features

Method

Results

Repository Structure

Disclaimer

Citation

About

Uh oh!

Releases

Packages

Languages

License

ThatE10/Extracting-Model-Weights

Folders and files

Latest commit

History

Repository files navigation

Extracting Model Weights: Breaking Taylor Unswift Obfuscation

Overview

Features

Method

Results

Repository Structure

Disclaimer

Citation

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages