I’m a second-year PhD student at the University of Sheffield, researching language model interpretability. I study the internals of neural networks to understand how they work. I’m fortunate to be supervised by Xingyi Song and Kalina Bontcheva. I’ve also collaborated with Wes Gurnee and Neel Nanda from Google DeepMind, studying how LLMs use specialised neurons to regulate their uncertainty.

Prior to pursuing my doctoral degree, I studied English Literature (BA) at the University of Cambridge. You can learn about some of my favourite books and poems here.

Research Interests

I think inspecting the internals of AI models can help us understand their capabilities and limitations. Lately, I’ve been using mechanistic interpretability methods to study uncertainty and confidence in LLMs, with the goal of improving model calibration, monitoring, and reliability.

Selected Publications

  • Confidence Regulation Neurons in Language Models
    Alessandro Stolfo*, Ben Wu*, Wes Gurnee, Yonatan Belinkov, Xingyi Song, Mrinmaya Sachan, Neel Nanda
    NeurIPS 2024 [arXiv]
  • Navigating Prompt Complexity for Zero-Shot Classification: A Study of Large Language Models in Computational Social Science
    Yida Mu, Ben P Wu, William Thorne, Ambrose Robinson, Nikolaos Aletras, Carolina Scarton, Kalina Bontcheva, Xingyi Song
    LREC-COLING 2024 [Paper] [arXiv]
  • Overview of the CLEF-2024 CheckThat! Lab Task 6 on Robustness of Credibility Assessment with Adversarial Examples (InCrediblAE)
    Piotr Przybyła, Ben Wu, Alexander Shvets, Yida Mu, Kim Cheng Sheang, Xingyi Song, Horacio Saggion
    CLEF 2024 [Paper]
  • Don’t Waste a Single Annotation: Improving Single-Label Classifiers Through Soft Labels
    Ben Wu, Yue Li, Yida Mu, Carolina Scarton, Kalina Bontcheva, Xingyi Song
    EMNLP Findings 2023 [Paper] [arXiv]
  • SheffieldVeraAI at SemEval-2023 Task 3: Mono and Multilingual Approaches for News Genre, Topic and Persuasion Technique Classification
    Ben Wu*, Olesya Razuvayevskaya*, Freddy Heppell*, João A Leite*, Carolina Scarton, Kalina Bontcheva, Xingyi Song
    SemEval 2023 [Paper] [arXiv]

For a full list of publications, please refer to my Google Scholar.

Resources for Mechanistic Interpretability

  • If you’re interested in learning more about mechanistic interpretability, Callum McDougall’s ARENA tutorials are a great place to start. You get hands-on experience through a set of coding problems (with hints!) that begin with implementing a transformer from scratch and progress to replicating key interpretability papers and gaining proficiency with libraries such as TransformerLens, NNsight and SAELens (see the short sketch after this list).
  • Neel Nanda has tons of useful material on his website.
  • The Anthropic team release monthly updates via their Circuits Thread.
  • As a counterbalance, I’d also recommend reading Stephen Casper’s sequence of essays. They offer a good critique of mech interp’s current failings as a field, highlighting a tendency to cherry-pick results, a lack of rigorous evaluation, a lack of practical applications, and non-competitiveness with non-interp methods.
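
As a taste of what those exercises involve, here is a minimal sketch of loading a model and caching its activations with TransformerLens. It assumes transformer-lens is installed (pip install transformer-lens); the model, prompt, and layer index are arbitrary illustrative choices rather than anything taken from ARENA.

    # Minimal sketch: load a hooked model and inspect its cached activations.
    # Assumes `pip install transformer-lens`; model, prompt, and layer are arbitrary.
    from transformer_lens import HookedTransformer

    model = HookedTransformer.from_pretrained("gpt2")

    # Run a prompt and cache every intermediate activation.
    logits, cache = model.run_with_cache("The Eiffel Tower is in")

    # Inspect, e.g., the residual stream after layer 0: shape [batch, seq, d_model].
    print(cache["resid_post", 0].shape)

    # Decode the most likely next token from the final-position logits.
    print(model.to_string(int(logits[0, -1].argmax())))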

  • I’d highly recommend applying to MATS if you want to contribute to technical AI safety research. I participated in the Winter 2023-24 program and found it tremendously beneficial: the research environment is really productive and exciting, and you get mentored by experienced researchers. Similar programs like LASR and SPAR also exist, but I don’t have first-hand experience with them, so I can’t comment.

Contact

Academic email: bpwu1@sheffield.ac.uk
Personal email: 12benwu@gmail.com