Publications | Sonia Joseph

2025

CVPR-W

Prisma: An Open Source Toolkit for Mechanistic Interpretability in Vision and Video

Sonia Joseph, Praneet Suresh, Lorenz Hufe, Edward Stevinson, Robert Graham, Yash Vadi, Danilo Bzdok, Sebastian Lapuschkin, Lee Sharkey, and Blake Aaron Richards

In CVPR Workshop on Mechanistic Interpretability for Vision (Spotlight), 2025

Spotlight at CVPR 2025 Workshop on Mechanistic Interpretability for Vision

@inproceedings{joseph2025prisma,
  title = {Prisma: An Open Source Toolkit for Mechanistic Interpretability in Vision and Video},
  author = {Joseph, Sonia and Suresh, Praneet and Hufe, Lorenz and Stevinson, Edward and Graham, Robert and Vadi, Yash and Bzdok, Danilo and Lapuschkin, Sebastian and Sharkey, Lee and Richards, Blake Aaron},
  booktitle = {CVPR Workshop on Mechanistic Interpretability for Vision},
  note = {Spotlight},
  year = {2025},
}

CVPR-W

Steering CLIP’s Vision Transformer with Sparse Autoencoders

Sonia Joseph, Praneet Suresh, Ethan Goldfarb, Lorenz Hufe, Yossi Gandelsman, Robert Graham, Danilo Bzdok, Wojciech Samek, and Blake Aaron Richards

In CVPR Workshop on Mechanistic Interpretability for Vision, 2025

@inproceedings{joseph2025steering,
  title = {Steering CLIP's Vision Transformer with Sparse Autoencoders},
  author = {Joseph, Sonia and Suresh, Praneet and Goldfarb, Ethan and Hufe, Lorenz and Gandelsman, Yossi and Graham, Robert and Bzdok, Danilo and Samek, Wojciech and Richards, Blake Aaron},
  booktitle = {CVPR Workshop on Mechanistic Interpretability for Vision},
  year = {2025},
}

NeurIPS

From Noise to Narrative: Tracing the Origins of Hallucinations in Transformers

Praneet Suresh, Jack Stanley, Sonia Joseph, Luca Scimeca, and Danilo Bzdok

In NeurIPS (Main Conference), 2025

@inproceedings{suresh2025noise,
  title = {From Noise to Narrative: Tracing the Origins of Hallucinations in Transformers},
  author = {Suresh, Praneet and Stanley, Jack and Joseph, Sonia and Scimeca, Luca and Bzdok, Danilo},
  booktitle = {NeurIPS (Main Conference)},
  year = {2025},
}

NeurIPS

Too Late to Recall: Explaining the Two-Hop Problem in Multimodal Knowledge Retrieval

Constantin Venhoff, Ashkan Khakzar, Sonia Joseph, Philip Torr, and Neel Nanda

In NeurIPS (Main Conference); CVPR Workshop on Mechanistic Interpretability for Vision (Spotlight), 2025

Spotlight at CVPR 2025 Workshop on Mechanistic Interpretability for Vision

@inproceedings{venhoff2025too,
  title = {Too Late to Recall: Explaining the Two-Hop Problem in Multimodal Knowledge Retrieval},
  author = {Venhoff, Constantin and Khakzar, Ashkan and Joseph, Sonia and Torr, Philip and Nanda, Neel},
  booktitle = {NeurIPS (Main Conference)},
  note = {Also a Spotlight at the CVPR 2025 Workshop on Mechanistic Interpretability for Vision},
  year = {2025},
}

CVPR-W

Decoding Vision Transformers: The Diffusion Steering Lens

Ryota Takatsuki, Sonia Joseph, Ippei Fujisawa, and Ryota Kanai

In CVPR Workshop on Mechanistic Interpretability for Vision, 2025

@inproceedings{takatsuki2025decoding,
  title = {Decoding Vision Transformers: The Diffusion Steering Lens},
  author = {Takatsuki, Ryota and Joseph, Sonia and Fujisawa, Ippei and Kanai, Ryota},
  booktitle = {CVPR Workshop on Mechanistic Interpretability for Vision},
  year = {2025},
}

CVPR-W

How Visual Representations Map to Language Feature Space in Multimodal LLMs

Constantin Venhoff, Ashkan Khakzar, Sonia Joseph, Philip Torr, and Neel Nanda

In CVPR Workshop on Explainable AI for Computer Vision, 2025

@inproceedings{venhoff2025visual,
  title = {How Visual Representations Map to Language Feature Space in Multimodal LLMs},
  author = {Venhoff, Constantin and Khakzar, Ashkan and Joseph, Sonia and Torr, Philip and Nanda, Neel},
  booktitle = {CVPR Workshop on Explainable AI for Computer Vision},
  year = {2025},
}

Under Review

Into the Rabbit Hull: From Task-Relevant Concepts in DINO to Minkowski Geometry

Thomas Fel, Binxu Wang, Michael A Lepori, Matthew Kowal, Andrew Lee, Randall Balestriero, Sonia Joseph, Ekdeep S Lubana, Talia Konkle, Demba Ba, and Martin Wattenberg

Under review, 2025

@unpublished{fel2025into,
  title = {Into the Rabbit Hull: From Task-Relevant Concepts in DINO to Minkowski Geometry},
  author = {Fel, Thomas and Wang, Binxu and Lepori, Michael A and Kowal, Matthew and Lee, Andrew and Balestriero, Randall and Joseph, Sonia and Lubana, Ekdeep S and Konkle, Talia and Ba, Demba and Wattenberg, Martin},
  note = {Under review},
  year = {2025},
}

Under Review

Sparse CLIP: Co-Optimizing Interpretability and Performance in Contrastive Learning

Chuan Qin, Constantin Venhoff, Sonia Joseph, Fanyi Xiao, and Stefan Scherer

Under review, 2025

@unpublished{qin2025sparse,
  title = {Sparse CLIP: Co-Optimizing Interpretability and Performance in Contrastive Learning},
  author = {Qin, Chuan and Venhoff, Constantin and Joseph, Sonia and Xiao, Fanyi and Scherer, Stefan},
  note = {Under review},
  year = {2025},
}

2024

ICML-W

Interpretability in Action: Exploratory Analysis of VPT, a Minecraft Agent

Karolis Jucys, George Adamopoulos, Mehrab Hamidi, Stephanie Milani, Mohammad Reza Samsami, Artem Zholus, Sonia Joseph, Blake Richards, Irina Rish, and Özgür Şimşek

In ICML Workshop on Mechanistic Interpretability, 2024

@inproceedings{jucys2024interpretability,
  title = {Interpretability in Action: Exploratory Analysis of VPT, a Minecraft Agent},
  author = {Jucys, Karolis and Adamopoulos, George and Hamidi, Mehrab and Milani, Stephanie and Samsami, Mohammad Reza and Zholus, Artem and Joseph, Sonia and Richards, Blake and Rish, Irina and Şimşek, Özgür},
  booktitle = {ICML Workshop on Mechanistic Interpretability},
  year = {2024},
}

2023

NeurIPS-W

Mining the Diamond Miner: Mechanistic Interpretability on the Video PreTraining Agent

Sonia Joseph, Artem Zholus, Mohammad Reza Samsami, and Blake A Richards

In NeurIPS Workshop on Attributing Model Behavior at Scale, 2023

@inproceedings{joseph2023mining,
  title = {Mining the Diamond Miner: Mechanistic Interpretability on the Video PreTraining Agent},
  author = {Joseph, Sonia and Zholus, Artem and Samsami, Mohammad Reza and Richards, Blake A},
  booktitle = {NeurIPS Workshop on Attributing Model Behavior at Scale},
  year = {2023},
}

NeurIPS-W

On the Information Geometry of Vision Transformers

Sonia Joseph, Kumar Krishna Agrawal, Arna Ghosh, and Blake Aaron Richards

In NeurIPS Workshop on Symmetry and Geometry in Neural Representations, 2023

@inproceedings{joseph2023information,
  title = {On the Information Geometry of Vision Transformers},
  author = {Joseph, Sonia and Agrawal, Kumar Krishna and Ghosh, Arna and Richards, Blake Aaron},
  booktitle = {NeurIPS Workshop on Symmetry and Geometry in Neural Representations},
  year = {2023},
}

GitHub

ViT-Prisma: A Mechanistic Interpretability Library for Vision Transformers

Sonia Joseph

GitHub repository, 2023
