Research — LAThomson

My work

… in order of recency…

Prism: Automating Science-of-Evals Research for AI Safety

[2026] (supervised by [Victoria Krakovna])

An agent scaffold designed to automate science-of-evals research work.

The scaffold is built on top of [Inspect AI], [Claude Code], and the [Claude Agent SDK]. It targets work on understanding eval dynamics and explaining model behaviours with respect to their environment.

Icon for Prism: a prism separating a white light ray into different colours that form nodes, some of which are filled as if selected

A Framework for Eval Awareness

[2026] (supervised by [Victoria Krakovna])

Setting out a conceptual framework for key research directions in evaluation awareness.

This [blog post] came out of the first three weeks of MATS 9.0, during which I developed this framing to help me identify promising and neglected research directions for mitigating eval gaming.

Figure from A Framework for Eval Awareness: a table categorising the different ways a model might behave in an evaluation

Agentic Monitoring for AI Control

[2025] (supervised by [Tyler Tracy])

An initial investigation into how we might empower trusted monitors with agency.

See my [blog post] for an introduction to the research direction alongside some initial results and discussion.

Figure from Agentic Monitoring for AI Control: a hand-drawn image of three different monitoring setups and how they might differ

Cooperation and Control in Markov Delegation Games

[2025] (supervised by [Lewis Hammond] and [Oly Sourbut])

Formalising cooperation and control as two key dimensions along which multi-agent delegation games can produce bad outcomes for humans.

This was carried out as part of my Master’s year at Oxford; see my [report]. [Note: I’d be keen to finish this work some day and turn it into a workshop paper!]

Figure from Cooperation and Control in Markov Delegation Games: a complicated game-theoretic theorem about Markov Delegation Games

Model Models: Simulating a Trusted Monitor

[2025]

Can an untrusted model predict how a trusted monitor will score its solutions?

This was part of the Apart Research [AI Control Hackathon] in March 2025; see the [project page], which contains a report and the codebase.

Title of paper reads Model Models: Simulating a Trusted Monitor

Games for AI Control

[2024–25] (in collaboration with [Charlie Griffin]; supervised by [Alessandro Abate] and [Buck Shlegeris])

Introducing a game-theoretic model for AI Control settings.

See the [paper] and [blog post].

Figure from Games for AI Control: two plots side by side showing Pareto frontiers for control protocols in two different settings

Towards shutdownable agents via stochastic choice

[2024] (supervised by [Elliott Thornley])

Working on a proposal to solve the corrigibility problem by training agents to have incomplete preferences.

I briefly worked on this project through the [Future Impact Group] programme; see the resulting [paper], which was accepted to [TAIS 2025].

Figure from Towards shutdownable agents via stochastic choice: a gridworld showing an agent with branching paths towards coins of different values and a 'shutdown delay' button

Tall Tales at Different Scales: Evaluating Scaling Trends For Deception In Language Models

[2023] (supervised by [Francis Rhys Ward])

Investigating the extent to which belief consistency and deceptive behaviour scale with model size.

I began working on this project through the AI Safety Hub Labs programme (now [LASR Labs]). See our [blog post] and a [follow-up paper].

Figure from Tall Tales at Different Scales: a scatter plot showing a trend of increasing belief consistency as models scale