Papers for Vision Transformers (ViT) and Mechanistic Interpretability
Winter 2023
Advised by Dr. Blake Richards
Here is an incomplete list of papers I found helpful for building context on applying mechanistic interpretability to vision transformers (ViTs).
Vision Transformers
- Ibrahim Alabdulmohsin et al. “Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design”. In: arXiv preprint arXiv:2305.13035 (2023).
- Timothée Darcet et al. “Vision transformers need registers”. In: arXiv preprint arXiv:2309.16588 (2023).
- Alexey Dosovitskiy et al. “An image is worth 16x16 words: Transformers for image recognition at scale”. In: arXiv preprint arXiv:2010.11929 (2020).
- Salman Khan et al. “Transformers in vision: A survey”. In: ACM computing surveys (CSUR) 54.10s (2022), pp. 1–41.
- Muhammad Muzammal Naseer et al. “Intriguing properties of vision transformers”. In: Advances in Neural Information Processing Systems 34 (2021), pp. 23296–23308.
- Maithra Raghu et al. “Do vision transformers see like convolutional neural networks?”. In: Advances in Neural Information Processing Systems 34 (2021), pp. 12116–12128.
- Daquan Zhou et al. “Understanding the robustness in vision transformers”. In: International Conference on Machine Learning. PMLR. 2022, pp. 27378–27394.
Feature Visualization and Interpretability
- Shan Carter et al. “Exploring neural networks with activation atlases”. In: Distill (2019).
- Haozhe Chen et al. “Interpreting and Controlling Vision Foundation Models via Text Explanations”. In: arXiv preprint arXiv:2310.10591 (2023).
- Yossi Gandelsman, Alexei A Efros, and Jacob Steinhardt. “Interpreting CLIP’s Image Representation via Text-Based Decomposition”. In: arXiv preprint arXiv:2310.05916 (2023).
- Robert Geirhos et al. “Don’t trust your eyes: on the (un)reliability of feature visualizations”. In: arXiv preprint arXiv:2306.04719 (2023).
Mechanistic Interpretability
- Kumar K Agrawal et al. “α-ReQ: Assessing Representation Quality in Self-Supervised Learning by measuring eigenspectrum decay”. In: Advances in Neural Information Processing Systems 35 (2022), pp. 17626–17638.
- Trenton Bricken et al. “Towards monosemanticity: Decomposing language models with dictionary learning”. In: Transformer Circuits Thread (2023).
- Arthur Conmy et al. “Towards automated circuit discovery for mechanistic interpretability”. In: arXiv preprint arXiv:2304.14997 (2023).
- Nelson Elhage et al. “A mathematical framework for transformer circuits”. In: Transformer Circuits Thread 1 (2021).
- Nelson Elhage et al. “Toy models of superposition”. In: arXiv preprint arXiv:2209.10652 (2022).
- Kevin Wang et al. “Interpretability in the wild: a circuit for indirect object identification in GPT-2 small”. In: arXiv preprint arXiv:2211.00593 (2022).
Training Dynamics and Phase Transitions
- Eric J Michaud et al. “The quantization model of neural scaling”. In: arXiv preprint arXiv:2303.13506 (2023).
- Neel Nanda et al. “Progress measures for grokking via mechanistic interpretability”. In: arXiv preprint arXiv:2301.05217 (2023).
- Catherine Olsson et al. “In-context learning and induction heads”. In: arXiv preprint arXiv:2209.11895 (2022).