Papers for Vision Transformers (ViT) and Mechanistic Interpretability

Winter 2023
Advised by Dr. Blake Richards

Here is an incomplete list of papers I found helpful for building context on applying mechanistic interpretability to vision transformers (ViTs).

Vision Transformers

  1. Ibrahim Alabdulmohsin et al. “Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design”. In: arXiv preprint arXiv:2305.13035 (2023).
  2. Timothée Darcet et al. “Vision transformers need registers”. In: arXiv preprint arXiv:2309.16588 (2023).
  3. Alexey Dosovitskiy et al. “An image is worth 16x16 words: Transformers for image recognition at scale”. In: arXiv preprint arXiv:2010.11929 (2020).
  4. Salman Khan et al. “Transformers in vision: A survey”. In: ACM Computing Surveys (CSUR) 54.10s (2022), pp. 1–41.
  5. Muhammad Muzammal Naseer et al. “Intriguing properties of vision transformers”. In: Advances in Neural Information Processing Systems 34 (2021), pp. 23296–23308.
  6. Maithra Raghu et al. “Do vision transformers see like convolutional neural networks?”. In: Advances in Neural Information Processing Systems 34 (2021), pp. 12116–12128.
  7. Daquan Zhou et al. “Understanding the robustness in vision transformers”. In: International Conference on Machine Learning. PMLR. 2022, pp. 27378–27394.

Feature Visualization and Interpretability

  1. Shan Carter et al. “Exploring neural networks with activation atlases”. In: Distill (2019).
  2. Haozhe Chen et al. “Interpreting and Controlling Vision Foundation Models via Text Explanations”. In: arXiv preprint arXiv:2310.10591 (2023).
  3. Yossi Gandelsman, Alexei A Efros, and Jacob Steinhardt. “Interpreting CLIP’s Image Representation via Text-Based Decomposition”. In: arXiv preprint arXiv:2310.05916 (2023).
  4. Robert Geirhos et al. “Don’t trust your eyes: on the (un) reliability of feature visualizations”. In: arXiv preprint arXiv:2306.04719 (2023).

Mechanistic Interpretability

  1. Kumar K Agrawal et al. “α-ReQ: Assessing Representation Quality in Self-Supervised Learning by measuring eigenspectrum decay”. In: Advances in Neural Information Processing Systems 35 (2022), pp. 17626–17638.
  2. Trenton Bricken et al. “Towards monosemanticity: Decomposing language models with dictionary learning”. In: Transformer Circuits Thread (2023).
  3. Arthur Conmy et al. “Towards automated circuit discovery for mechanistic interpretability”. In: arXiv preprint arXiv:2304.14997 (2023).
  4. Nelson Elhage et al. “A mathematical framework for transformer circuits”. In: Transformer Circuits Thread 1 (2021).
  5. Nelson Elhage et al. “Toy models of superposition”. In: arXiv preprint arXiv:2209.10652 (2022).
  6. Kevin Wang et al. “Interpretability in the wild: a circuit for indirect object identification in GPT-2 small”. In: arXiv preprint arXiv:2211.00593 (2022).

Training Dynamics and Phase Transitions

  1. Eric J Michaud et al. “The quantization model of neural scaling”. In: arXiv preprint arXiv:2303.13506 (2023).
  2. Neel Nanda et al. “Progress measures for grokking via mechanistic interpretability”. In: arXiv preprint arXiv:2301.05217 (2023).
  3. Catherine Olsson et al. “In-context learning and induction heads”. In: arXiv preprint arXiv:2209.11895 (2022).
