By: Thomas Stahura
When I talk about “AI alignment,” I’m not talking about some diagonal line that relates intelligence to compute. No, what I’m talking about is the strangely old philosophical problem of how to get increasingly powerful artificial intelligences to do what we actually want, rather than what we merely say. Or worse, what we think we want.
I shouldn't have to explain why alignment is so important since these AIs aren't just playing Go anymore; they're deciding who gets parole, filtering your social media feed, diagnosing your illnesses, teaching your kids, and driving multi-ton vehicles down the highway.
Not to mention the money involved. It’s estimated that OpenAI, DeepMind, and Anthropic each average around $10 million annually in AI safety spending (roughly 1% of their compute). Safe Superintelligence (SSI), the company founded by ex-OpenAI Chief Scientist Ilya Sutskever, recently raised $3 billion.
But all the money in the world won’t help if we don’t even know what “alignment” really means. Thankfully, I took an intro to modern philosophy class last year, only to spend half the semester learning ancient philosophy.
Turns out philosophy, like most things, is understood through contrast. And if you want to understand the problem of AI alignment, you’d better start with the old philosophers, because they were wrestling with the problem of learning and the definition of knowledge long before anyone dreamed of gradient descent.
Around 369 BCE, in the Theaetetus, Plato entertains the idea that knowledge is justified true belief. Suppose you believe that the sun will rise tomorrow. This belief is true, and you can justify it by appealing to the laws of astronomy and your past experience of the sun rising every day. On this account, your belief counts as knowledge because it is true, you believe it, and you have a reasoned account of it. Now, if you’re building an AI, you might think: “Great! Let’s equip it with reason, program it to have justified true beliefs, and we’re done.” But, as usual, things aren’t so simple.
Because, in 1963, philosopher Edmund Gettier comes along and throws a wrench in everything. He presents little puzzles in which someone holds a belief that is true and justified, yet intuitively does not seem to possess knowledge. For example, imagine you glance at a clock that, unbeknownst to you, stopped exactly 12 hours ago, so by coincidence it displays the correct time. You form the belief that it is 2:00, which happens to be right, and your belief is justified because you trust the clock. Yet most would agree you do not truly “know” the time, since your justification rests on faulty evidence. That’s a Gettier problem: a justified belief can turn out true merely by luck, and luck isn’t knowledge. Now, if you’re trying to align an AI with human values, you’d better hope it doesn’t get “lucky” in the Gettier sense: generating the right thing for the wrong reasons, or worse, generating the wrong thing for reasons that look right on paper.
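To make that concrete in machine-learning terms, here’s a toy sketch in Python (my own illustration, using the stock “cows on grass” shortcut-learning setup, not anything from a real system): a classifier whose perfect track record “justifies” its answers, even though the justification tracks a spurious cue rather than the thing we care about.

```python
# A toy "Gettier case" for a model: right answers, justified by a flawless
# track record, but for the wrong reason. Hypothetical data: every training
# cow happens to stand on grass, every camel on sand.

training_set = [
    {"animal": "cow",   "background": "grass"},
    {"animal": "cow",   "background": "grass"},
    {"animal": "camel", "background": "sand"},
    {"animal": "camel", "background": "sand"},
]

test_set = [
    {"animal": "cow",   "background": "sand"},   # a cow on a beach
    {"animal": "camel", "background": "grass"},  # a camel in a field
]

def shortcut_classifier(example):
    # The "justification" is the background, not the animal.
    return "cow" if example["background"] == "grass" else "camel"

def accuracy(dataset):
    return sum(shortcut_classifier(x) == x["animal"] for x in dataset) / len(dataset)

print(f"training accuracy: {accuracy(training_set):.0%}")  # 100% -- right, but by luck
print(f"test accuracy:     {accuracy(test_set):.0%}")      # 0% -- the luck runs out
```

The model’s beliefs were justified and, on the training data, true. They just weren’t knowledge, and the moment the world stopped cooperating, neither was the model.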
And then, just when you think you’ve got a handle on things, along come the postmodernists. Postmodernism is marked by skepticism, including skepticism toward the idea that knowledge must fit any strict formula like justified true belief. Instead, postmodernists argue that what counts as knowledge is often shaped by language, culture, and power, and that our understanding is always partial and constructed rather than absolute.
Now, let’s dig into this language thing a bit more. Think about Derrida, who points out that language isn’t some crystal-clear window onto reality. Words don’t just stand for things. They stand in for things, usually things that aren’t even there. That’s the whole point, right? I can talk about a cat without dragging one into the room. Language works because of absence, because of gaps. And meaning isn’t fixed by what some speaker intended. For example, you write an email and then get run over by a self-driving Tesla. The recipient can still read the email even though your intentions are now… well, irrelevant.
More importantly, Derrida, following folks like Nietzsche, makes us suspicious of interpretation itself. He argues there’s no final, correct interpretation of anything: not the Bible, not Plato, not the U.S. Constitution, and certainly not some vague instruction like OpenAI’s “ensure AGI benefits all of humanity.” Trying to pin down meaning is like trying to nail Jell-O to the wall. Philosophical language, the very stuff we use to talk about high-minded ideas like justice, truth, and marketing material, is drenched in metaphor.
As the philosopher Rick Roderick put it:
“Is the word 'word' a word? No, because I have mentioned it and not used it. It has now become a token of a word... What I am trying to say here is that words are not things. That the attempt that philosophers have made to hook words to the world has failed but it’s no cause for anyone to think we are not talking about anything. See this doesn’t make the world disappear, it just makes language into the muddy, material, somewhat confused practice that it actually is.”
So, how the hell are we supposed to translate our messy, metaphorical, interpretation-laden language into the cold, hard logic of model weights without losing everything important, or worse, encoding the hidden biases and power plays embedded in our own mythology? You tell an AI “be fair,” and what does that mean? Fair according to who? Based on what metaphors? It’s not just that the AI might misunderstand; it’s that language itself is built on misunderstanding, on the impossibility of ever saying exactly what you mean and knowing it’s been received as you intended.
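You can see the problem even when people try to be precise about it. Here’s a toy sketch (the numbers and the two definitions, demographic parity and equal opportunity, are my own choices for illustration, not a claim about any real system): the same loan-approval decisions can pass one formal definition of “fair” and fail another.

```python
# Toy illustration: "be fair" according to which definition?
# Hypothetical loan decisions as (group, actually_repaid, approved) tuples.
# All numbers are made up for the sake of the example.
records = [
    ("A", 1, 1), ("A", 1, 1), ("A", 0, 0), ("A", 0, 0),
    ("B", 1, 1), ("B", 1, 1), ("B", 1, 0), ("B", 0, 0),
]

def approval_rate(group):
    # Demographic parity compares this across groups.
    rows = [r for r in records if r[0] == group]
    return sum(approved for _, _, approved in rows) / len(rows)

def true_positive_rate(group):
    # Equal opportunity compares this across groups: the approval rate
    # among people who actually would have repaid.
    rows = [r for r in records if r[0] == group and r[1] == 1]
    return sum(approved for _, _, approved in rows) / len(rows)

for g in ("A", "B"):
    print(f"group {g}: approval rate {approval_rate(g):.2f}, "
          f"true positive rate {true_positive_rate(g):.2f}")

# group A: approval rate 0.50, true positive rate 1.00
# group B: approval rate 0.50, true positive rate 0.67
# Demographic parity says this model is fair; equal opportunity says it isn't.
# Both are respectable formalizations of "fair," and they disagree.
```

Tweak the numbers and the verdicts flip. The instruction “be fair” never told us which formula we meant, and no amount of compute resolves that for us.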
So here’s the punchline: AI alignment is not a technical problem; it’s a philosophical and political one. It’s about who gets to decide what “alignment” even means, whose values get encoded, and who gets left out. It’s about the power to define the good, and the danger that our creations will reflect not our best selves but our resentments and contradictions.
I’m optimistic, though. While big tech tries to cook up some universal recipe for ‘aligned AI’, probably based on whatever focus-group data it collected this quarter, there’s another game in town: open source, which promises everyone their own perfectly loyal digital butler.
It’s almost comical: OpenAI, after years of being “open” in name only, is finally tossing a model over the wall for the public to play with (if you have a GPU and an internet connection, that is). People will align models to do stupid, dangerous, or just plain weird things. But maybe, just maybe, letting individuals wrestle with aligning models to their own contradictory values is better than having one monolithic, corporate-approved ‘goodness.’
If language is inherently collaborative, if interpretation is endless, if values are masks for power, then maybe distributing the alignment problem is the only way to avoid the dystopia of a single, centrally enforced ‘truth.’ It embraces the uncertainty Roderick talked about, instead of pretending we can solve it with a bigger transformer or a better mission statement. I believe that if we embrace the uncertainty and the collaborative potential of language, we can build not just smarter machines but a slightly wiser, more self-aware humanity to guide them.