Papers for Vision Transformers (ViT) and Mechanistic Interpretability
Winter 2023
Advised by Dr. Blake Richards
Here is an incomplete list of papers I found helpful for building context on applying mechanistic interpretability to vision transformers (ViTs).
Vision Transformers
- Ibrahim Alabdulmohsin et al. “Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design”. In: arXiv preprint arXiv:2305.13035 (2023).
- Timothée Darcet et al. “Vision transformers need registers”. In: arXiv preprint arXiv:2309.16588 (2023).
- Alexey Dosovitskiy et al. “An image is worth 16x16 words: Transformers for image recognition at scale”. In: arXiv preprint arXiv:2010.11929 (2020).
- Salman Khan et al. “Transformers in vision: A survey”. In: ACM computing surveys (CSUR) 54.10s (2022), pp. 1–41.
- Muhammad Muzammal Naseer et al. “Intriguing properties of vision transformers”. In: Advances in Neural Information Processing Systems 34 (2021), pp. 23296–23308.
- Maithra Raghu et al. “Do vision transformers see like convolutional neural networks?”. In: Advances in Neural Information Processing Systems 34 (2021), pp. 12116–12128.
- Daquan Zhou et al. “Understanding the robustness in vision transformers”. In: International Conference on Machine Learning. PMLR. 2022, pp. 27378–27394.
Feature Visualization and Interpretability
- Shan Carter et al. “Exploring neural networks with activation atlases”. In: Distill (2019).
- Haozhe Chen et al. “Interpreting and Controlling Vision Foundation Models via Text Explanations”. In: arXiv preprint arXiv:2310.10591 (2023).
- Yossi Gandelsman, Alexei A Efros, and Jacob Steinhardt. “Interpreting CLIP’s Image Representation via Text-Based Decomposition”. In: arXiv preprint arXiv:2310.05916 (2023).
- Robert Geirhos et al. “Don’t trust your eyes: on the (un)reliability of feature visualizations”. In: arXiv preprint arXiv:2306.04719 (2023).
Mechanistic Interpretability
- Kumar K Agrawal et al. “α-ReQ: Assessing Representation Quality in Self-Supervised Learning by measuring eigenspectrum decay”. In: Advances in Neural Information Processing Systems 35 (2022), pp. 17626–17638.
- Trenton Bricken et al. “Towards monosemanticity: Decomposing language models with dictionary learning”. In: Transformer Circuits Thread (2023).
- Arthur Conmy et al. “Towards automated circuit discovery for mechanistic interpretability”. In: arXiv preprint arXiv:2304.14997 (2023).
- Nelson Elhage et al. “A mathematical framework for transformer circuits”. In: Transformer Circuits Thread 1 (2021).
- Nelson Elhage et al. “Toy models of superposition”. In: arXiv preprint arXiv:2209.10652 (2022).
- Kevin Wang et al. “Interpretability in the wild: a circuit for indirect object identification in GPT-2 small”. In: arXiv preprint arXiv:2211.00593 (2022).
Training Dynamics and Phase Transitions
- Eric J Michaud et al. “The quantization model of neural scaling”. In: arXiv preprint arXiv:2303.13506 (2023).
- Neel Nanda et al. “Progress measures for grokking via mechanistic interpretability”. In: arXiv preprint arXiv:2301.05217 (2023).
- Catherine Olsson et al. “In-context learning and induction heads”. In: arXiv preprint arXiv:2209.11895 (2022).