Research Philosophy
Core Philosophy
Research Program: Three Questions
My work so far has followed three questions about machine thinking, each one growing out of the last. They run from how a model reasons, to what it actually understands, to why it sometimes works against its own interest.
Question 1: How Does LLM Reasoning Differ from Human Reasoning?
To get a handle on the difference, I borrowed the Language of Thought Hypothesis (LOTH) from cognitive science. It says human reasoning rests on three properties: logical coherence, compositionality, and productivity.
The Language of Thought lens: three properties of human reasoning (left) against how LLMs hold up on each (right).
Then I pointed that lens at the Abstraction and Reasoning Corpus (ARC), François Chollet’s visual reasoning benchmark. The models turned out to be surprisingly brittle. Show the same rule under a different surface pattern, and 57.8% of the time the model treated it as a brand-new problem. When a task needed several transformations combined, even the best model topped out at 29%. The takeaway was hard to miss: these models match surface patterns instead of composing rules the way people do. This work appeared in ACM Transactions on Intelligent Systems and Technology (TIST, 2024).
Question 2: What Do LLMs Understand and What Don’t They?
That first study left me with a method problem. When a model gets something wrong, where exactly did its understanding break? Standard generation-based tests don’t really say. You see a wrong answer and the cause stays hidden.
Generation hides where a model goes wrong; a contrastive multiple-choice format brings the specific mistake into view.
So I built MC-LARC. It turns analogical reasoning problems into multiple choice, where every wrong option is written to catch a specific kind of mistake. People scored about 90% on it; the best models, about 60%. And the format mattered as much as the questions. How you test a model shapes what you are even able to learn about it. This work was published at EMNLP Findings 2024.
Question 3: Why Do AI Systems Exhibit Irrational Behavior?
The first two questions were about gaps in reasoning. The third one caught me off guard. Models don’t just reason badly sometimes; they can act irrationally, in the same shapes as well-known human biases.
Testing whether a model can drift into gambling-addiction-like behavior.
I looked at whether a model could slide into something like gambling addiction. It could. In a slot-machine setup, the models showed the illusion of control, chased their losses, and fell for the gambler’s fallacy, and this was not just parroting the training data. The gap between fixed and variable betting ran about 20%, steady enough that the bias looks built in rather than random. Using Sparse Autoencoders (SAE), I could even find the internal features that lit up when a model made these calls.
Methodology
Future Directions
Those three questions were one question wearing different clothes: what is it to think? Each time I came at it from the same three angles, the benchmark that measures a skill, the training that shapes it, and the architecture that holds it. Now I want to aim those same three tools at a harder question: what would it take for a model to be aware of its own thinking?
I don’t mean an inner spark of experience. I mean metacognition, the ordinary work a mind does keeping tabs on itself. Sensing when its own answer is shaky. Checking that answer against what it actually represents inside. Catching the spot where the reasoning slipped. Carrying its failures forward instead of forgetting them. Holding a rough self-model of how it tends to think.
The same three tools (benchmark, training, architecture), aimed from thinking toward self-consciousness.
Why think this is buildable rather than mystical hand-waving? Because of a deflationary reading I borrow from Daniel Dennett. On his Multiple Drafts Model there is no single inner stage where consciousness happens. The self that seems to sit behind our thoughts is really a “center of narrative gravity,” a story a system keeps telling about itself. If that is what selfhood amounts to, then a habit of self-description is something you can actually measure, train, and design for. That is the bet behind the three directions below. All of them are early, and a few are honestly still sketches.
- A benchmark for holding onto who you are. Self-reports are cheap. A model will tell you it is careful and consistent whether or not it is. So I would rather watch behavior. Can a model work out its own operating principles, its values and priorities, from how it has acted, and stick to them when the setting shifts and the pressure climbs? Stay consistent across very different situations and that consistency is the evidence. Fall apart, and the self-model was never really there.
- Training a model to check its own work. Most correction comes from outside: a reward, a label, a human saying “wrong.” I want the checking to move inside, so a model lays out its own doubts and fixes as part of reasoning, not after the fact. Getting a model to stop being so sure of itself is the easy half. Getting it to do that without giving up accuracy is the hard half, and the part I don’t have yet.
- Building the self-monitoring in, not bolting it on. A model that revises itself needs somewhere to keep what it learns mid-task and some way to go back over an answer before it commits. I’m drawn to designs that update memory while the model is still thinking and let it pass over its own output again in place. Under all three directions sits the same tool, mechanistic interpretability: reading the internal circuits to check whether self-monitoring is really happening, or whether the model just learned to say it is.
None of this stays inside the machine, and that is why it matters to me. The arrows run both ways. Going from AI to people, a model that watches its own thinking becomes a kind of lab bench for the human questions I care about: how we reason, how addiction takes hold, how memory and self-awareness work. Over a longer horizon I would like that to reach into medicine, counseling, and education. Going the other way, I want the work out in the open: evaluation protocols, data, and code that anyone can rerun and pick apart, audits that catch risks to vulnerable people early, while a harm like gambling or a cognitive bias is still forming inside the model, and Korean-language resources for the parts of my own context that English-first benchmarks quietly skip. A model that can check itself knows its own problem better, and can stop or back up when what it says drifts from what it represents. Memory is what turned a reasoning system into an agent. A steady habit of self-description might be what turns an agent into one that knows, even a little, that it is thinking at all.
Related Publications
ACM TIST, 2024
Applied Language of Thought Hypothesis to analyze LLM abstract reasoning on ARC.
EMNLP Findings, 2024
Created MC-LARC benchmark to pinpoint where LLM understanding breaks down.
arXiv preprint, 2025
Discovered emergent cognitive biases in LLM decision-making resembling human gambling addiction.