I’m a second-year PhD student at the University of Sheffield, researching language model interpretability. I study the internals of neural networks to understand how they work. I’m fortunate to be supervised by Xingyi Song and Kalina Bontcheva. I’ve also collaborated with Wes Gurnee and Neel Nanda from Google DeepMind, studying how LLMs use specialised neurons to regulate their uncertainty.

Prior to pursuing my doctoral degree, I studied English Literature (BA) at the University of Cambridge. You can learn about some of my favourite books and poems here.

Research Interests

I think inspecting the internals of AI models can help us understand their capabilities and limitations. Lately, I’ve been using mechanistic interpretability methods to study uncertainty and confidence in LLMs, with the goal of improving model calibration, monitoring, and reliability.

Selected Publications

  • Confidence Regulation Neurons in Language Models
    Alessandro Stolfo*, Ben Wu*, Wes Gurnee, Yonatan Belinkov, Xingyi Song, Mrinmaya Sachan, Neel Nanda
    NeurIPS 2024 [arXiv]
  • Navigating Prompt Complexity for Zero-Shot Classification: A Study of Large Language Models in Computational Social Science
    Yida Mu, Ben P Wu, William Thorne, Ambrose Robinson, Nikolaos Aletras, Carolina Scarton, Kalina Bontcheva, Xingyi Song
    LREC-COLING 2024 [Paper] [arXiv]
  • Overview of the CLEF-2024 CheckThat! Lab Task 6 on Robustness of Credibility Assessment with Adversarial Examples (InCrediblAE)
    Piotr Przybyła, Ben Wu, Alexander Shvets, Yida Mu, Kim Cheng Sheang, Xingyi Song, Horacio Saggion
    CLEF 2024 [Paper]
  • Don’t Waste a Single Annotation: Improving Single-Label Classifiers Through Soft Labels
    Ben Wu, Yue Li, Yida Mu, Carolina Scarton, Kalina Bontcheva, Xingyi Song
    EMNLP Findings 2023 [Paper] [arXiv]
  • SheffieldVeraAI at SemEval-2023 Task 3: Mono and Multilingual Approaches for News Genre, Topic and Persuasion Technique Classification
    Ben Wu*, Olesya Razuvayevskaya*, Freddy Heppell*, João A Leite*, Carolina Scarton, Kalina Bontcheva, Xingyi Song
    SemEval 2023 [Paper] [arXiv]

For a full list of publications, please refer to my Google Scholar.

Resources for Mechanistic Interpretability

  • If you’re interested in learning more about mechanistic interpretability, Callum McDougall’s ARENA tutorials are a great place to start. You get hands-on experience through a set of coding problems (with hints!) that begin with implementing a transformer from scratch and progress to replicating key interpretability papers and gaining proficiency with libraries such as TransformerLens, NNsight and SAELens (see the short sketch after this list).
  • Neel Nanda has tons of useful material on his website.
  • The Anthropic team release monthly updates via their Circuits Thread.
  • As a counterbalance, I’d also recommend reading Stephen Casper’s sequence of essays. They offer a good critique of mech interp’s current failings as a field, highlighting a tendency to cherry-pick results, a lack of rigorous evaluation, a lack of practical applications, and non-competitiveness with non-interp methods.
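
As a taste of what those exercises involve, here is a minimal sketch of loading a model and caching its activations with TransformerLens. It assumes transformer-lens is installed (pip install transformer-lens); the model, prompt, and layer index are arbitrary illustrative choices rather than anything taken from ARENA.

    # Minimal sketch: load a hooked model and inspect its cached activations.
    # Assumes `pip install transformer-lens`; model, prompt, and layer are arbitrary.
    from transformer_lens import HookedTransformer

    model = HookedTransformer.from_pretrained("gpt2")

    # Run a prompt and cache every intermediate activation.
    logits, cache = model.run_with_cache("The Eiffel Tower is in")

    # Inspect, e.g., the residual stream after layer 0: shape [batch, seq, d_model].
    print(cache["resid_post", 0].shape)

    # Decode the most likely next token from the final-position logits.
    print(model.to_string(int(logits[0, -1].argmax())))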

  • I’d highly recommend applying to MATS if you want to contribute to technical AI safety research. I participated in the Winter 2023-24 program and found it tremendously beneficial: the research environment is really productive and exciting, and you get mentored by experienced researchers. Similar programs like LASR and SPAR also exist, but I don’t have first-hand experience with them, so I can’t comment.

Contact

Academic email: bpwu1@sheffield.ac.uk
Personal email: 12benwu@gmail.com