Understanding the Inner Workings of Large Language Models (May 28, 2025)
Recent advances in AI interpretability are shedding light on how large language models work, revealing both their capabilities and potential biases.
Alignment Auditing: Uncovering Hidden Objectives in Language Models (March 21, 2025)
Anthropic researchers explore alignment auditing, a process for investigating hidden objectives in language models, using a blind auditing game and a range of interpretability techniques.