AI and the Proxy War
What the Paris sewers can teach us about assessment
The word proxy has been on an interesting etymological journey. By the time it entered English in the fifteenth century it was to do with ‘management’ or ‘administration’, and connected to the Roman role of procurator, but its earliest Latin roots - pro (on behalf of) and curare (to care for or to attend to) - perhaps suggest a more human, relational meaning, and that slight tension has persisted in the word.
By the 1600s, in English, it had come to mean the person who carried the authority on your behalf. At its best it meant someone who cared for your interests when you couldn’t be present yourself. There is an implicit trust built into the word.
That trust was most visible in the proxy marriages that were once reasonably common amongst the European aristocracy. Marie-Antoinette had a proxy marriage before she officially married the Dauphin Louis. In 1490, Maximilian of Habsburg was represented at his marriage to Anne of Brittany by his friend Wolfgang von Polheim. Bizarrely, Polheim even went to bed with Anne but wore a full suit of armour covering all but his right leg and hand. I’m not quite sure why his left leg posed more of a risk than his right but still.
Over time, the word’s meaning shifted away from the sense of trusted human representation and became more abstract and impersonal. In the 1920s it took on a statistical meaning and referred to a measurable variable that stands in for one that cannot be measured directly. The overall wellbeing of a nation’s citizens is very hard to measure so we use GDP per capita as a proxy.
By the 1950s, the word had acquired a sense of concealment: a proxy war is a conflict stoked by but not directly involving the major combatants. In the 1980s, computing gave us the proxy server, a device that sits between you and the wider internet and handles your request so the destination never sees you directly. A word that originated from a sense of care and faithfulness has become a screen that hides the real actor.
Although we don’t use the word too often, the education system is built on proxies. The uncomfortable question we have to face up to is:
what happens when AI reveals that many of them stopped faithfully representing anything long ago?
The essay is perhaps the proxy that is most challenged by new technology. For years, we have used it as a way of measuring the cognitive processes that are otherwise invisible - reading comprehension, analytical thinking, imagination - but it no longer works that way because AI can generate a final product that bypasses the thought process entirely. Formulaic formats like the five-paragraph essay, for example, are particularly vulnerable to AI; there are so many examples on the web that it literally learnt to write them at its motherboard’s knee. AI severs the link between the product and the underlying competencies.
Homework is a proxy for independent learning. Of course, this has always been a bit fragile and you never really know which parent has ‘cast an eye over it’ before hand-in. My favourite story of this is when Ian McEwan helped his son with an A-Level essay about his own novel Enduring Love - and only got a C. Ian McEwan seems very mild-mannered but I know of many parents whose personal chagrin has led them to angrily challenge a homework grade ‘on behalf’ of their children. With Gemini, you no longer need to bother your parents.
The exam grade is a proxy for competence in a subject. Aggregated exam grades are a proxy for the quality of a school. League tables built on those grades are a proxy for the suitability of a school for your child. We make one of the most consequential decisions in our children’s lives based on a proxy of a proxy of a proxy, each layer making the screen more opaque and the thing we really want to know less visible.
Having built entire systems on these proxies, it’s uncomfortable to confront the possibility that they may not be entirely fit for purpose and, instead, we deflect. We build better tools for marking essays instead of asking whether the essay is still measuring what we think it measures. Hence, the proliferation of AI-powered marking platforms which promise consistency, speed and, yes, even a free Sunday evening. They are impressive pieces of engineering - but I fear they might be solving a problem that is quietly changing shape around them.
It wouldn’t be the first time we have persisted with a brilliantly engineered solution to a problem that has long since ceased to be a problem.
In 1866, the Paris poste pneumatique was created as a way of relieving overloaded telegraph lines. Users wrote their messages on pale blue forms - petits bleus - and dropped them into dedicated mailboxes on the street. Propelled by compressed air, the capsules whizzed through metal tubes laid in the Paris sewers at up to 40 km/h. At the receiving end, couriers called tubistes completed the delivery, at first on foot, then by bicycle and finally by motorcycle. By 1934 there were 427 kilometres of pneumatic pipes connecting 130 offices and in 1945, the number of messages peaked at 30 million. In ‘La Recherche’ Proust describes the joy of receiving a pneu and for John Steinbeck it was emblematic of the city:
It is a wonderful system. It has a sound and a smell and a speed that belongs only to Paris.
(Sweet Thursday, 1954)
Engineers continued to improve and refine the system but over time, it became more expensive to maintain and operate and it was finally closed in 1984.
Yes, that’s right. 1984.
Telephones had been ubiquitous since the early twentieth century. Telex was introduced in the 1930s. Faxes became widespread in the late 70s.
Extraordinarily, the government pneumatic lines remained in use until 2004, by which point mobile phones were practically universal, the World Wide Web was a decade old, SMS messaging was old hat and the BlackBerry had been launched. Never let it be said that government institutions are slow to adapt and change.
Importantly, it didn’t fail because it didn’t work. It failed because the relationship between the infrastructure and the purpose it served had quietly collapsed. Nobody noticed because the capsules kept moving.
AI marking tools are the poste pneumatique de nos jours: fast and brilliantly engineered, but no longer solving quite the right problem. Of course, they will be around for a while; they will be developed and we will continue to use them because we are as wedded to essays as the Parisiens were to les petits bleus, but we might do better to focus instead on the real issue.
And that real issue is not that our proxies have been disrupted by AI. It’s that many of them had already drifted loose from the things they were supposed to measure - and we hadn’t noticed because everything still seemed to be moving. In 1975, the British economist Charles Goodhart gave this problem a name.
Strictly, he didn’t give it a name. He wrote it as a footnote to a dull monetary policy report and it was given his name. Nonetheless, it states:
Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.
The anthropologist Marilyn Strathern re-framed it in 1997 and made it much more catchy. She said:
When a measure becomes a target, it ceases to be a good measure.
Proxies such as essays and exam grades worked as proxies because they broadly correlated with the things we were interested in but couldn’t directly measure. When we turned the proxies into targets, Goodhart’s Law kicked in and, rather than the essay being the means of observing understanding, it became the goal of the educational process. People started to optimise for that goal and it ceased to be a reliable proxy.
Campbell’s Law is closely related to Goodhart’s and, in 1976, Donald Campbell applied his principle directly to education:
Achievement tests may well be valuable indicators of general school achievement under conditions of normal teaching aimed at general competence. But when test scores become the goal of the teaching process, they both lose their value as indicators of educational status and distort the educational process in undesirable ways. (Similar biases of course surround the use of objective tests in courses or as entrance examinations.)
I know. He needs Marilyn Strathern to work her magic again.
Nevertheless, we’ve known about this for a long time. AI doesn’t create the Goodhart problem; it completes it. It represents the logical endpoint of years of separation between the proxy and its invisible partner. If the essay, rather than the understanding, has become the target, then AI just does what any rational engineer would do and produces the target output as efficiently as possible. It doesn’t need to understand anything because that’s not what’s being measured.
When a student gets AI to write an essay for them, they are not cheating; they are being entirely rational and efficient.
There’s an elegant symmetry with a teacher then using a marking platform to mark that essay. Both bypass the thinking process and go straight to the final product. When you mark an essay, or anything that is supposed to represent a student’s personal expression, the grade is the least valuable part of the exercise. The nuanced, contextualised understanding of their current abilities is what you’re after; that’s how you work out what to do next.
The canisters might be whizzing around Paris but it’s pointless if no-one is reading the messages.
The pneumatique has another vital lesson for us. The engineers who maintained it understood the infrastructure of delivery: the tubes, the compressed air, the capsules. They didn’t have to think about why people were writing to each other in the first place. The mechanical system was about moving messages around, not understanding them. But what we should care about is the impulse to communicate that caused people to put those messages into the tubes in the first place.
We’ve done a similar thing in education. Our assessment infrastructure focuses entirely on the products - the essays, the grades - and doesn’t pay any attention to the processes that generate them. We don’t consider the drafting, the re-reading, the false starts, the struggle with a sentence that won’t turn out right. These are just hiccups in the means of production that students have to get over to produce the thing that matters.
But, in fact, these struggles are precisely what matters. By focusing exclusively on assessing proxies, we have blinded ourselves to the place where the learning happens. And it’s human learning that we should care about.
I’ve written before about desirable difficulties and a fascinating study was published last week by the Wharton School, University of Pennsylvania comparing the gains made by chess students with on-demand access to an AI tutor to those who had limited access only. Both groups made gains but the group with free access made less than half the gains of the other group, suggesting that they were missing out on the cognitive struggle which led to real, long-term progress. Even more strikingly, the paper suggested that even when students understood that over-using AI assistance was damaging their progress, they couldn’t self-regulate and fell into using it more and more.
This is not an argument against AI-assisted learning but it is a reminder that we have to design systems that don’t remove the difficulty. For years, we have treated task completion as a proxy for learning; we were missing the point that the struggle to complete the task was actually the learning mechanism itself.
Robert Bjork and others have written extensively about the ‘fluency illusion’ where students (and teachers) tend to mistake ease of processing for genuine understanding. Equally, we tend to interpret effort as a sign of poor learning. When my younger son got a very good set of exam results I lost count of the people, teachers and students alike, who asked me if he was ‘clever or just worked hard’. The word ‘just’ does a lot of heavy lifting in that phrase, suggesting that getting good grades through hard work was somehow a lesser achievement. My answer that maybe he was clever because he worked hard didn’t really wash; the bias is very deep-rooted.
Of course, the effect of this prejudice is that it drives a preference for low-effort, quickly completed tasks which are superficially satisfying and there is a clear risk that AI will supercharge the bias. If it’s polished, fluent output generated at great speed that you’re after, then Claude’s your guy. Students are understandably reluctant to challenge AI output simply because it is much more fluent than anything they can produce. The proxy of ‘writes clearly’ has never been more unreliable as a measure of ‘thinks clearly’.
I wrote last week about reading on screens compared to reading on paper and suggested that there are times when you need your students to slow down and genuinely engage with the complexities. We can think of reading a text as a proxy for engaging with its ideas but the ease with which AI can generate summaries casts doubt on that too. The temptation to skim a summary rather than plough through the 700-odd pages of David Copperfield is inevitably too much for many.
The problem is that reading is not just an information-delivery mechanism; it is a form of cognitive exercise. The friction of navigating a complex text, holding multiple narrative threads in your head and even just remembering who said what to whom was all doing important pedagogical work that we never explicitly valued or assessed.
Maybe it will help us if we return to an earlier meaning of proxy and the idea of faithful representation. Maybe we need to narrow the gap between the measure and the thing we are trying to measure. If it is in the process that the learning lies, then maybe that’s what we need to measure.
Teachers have always known that the quality of a student’s question is a very good informal proxy for their understanding. To ask a good question, you have to have a sophisticated understanding. AI makes this visible in a new way because interacting with AI requires good questions in order to produce good results. All the signs are that AI amplifies existing expertise. The more you already know about a subject, the better the questions you ask, the more critically you evaluate the output, and the further you can push your understanding. For those without that foundation, the same tools deliver fluency without comprehension.
A study at William and Mary University in 2024 found that 63% of student prompts to an AI teaching assistant were ‘unsatisfactory’ and that many simply pasted in their assignment questions without further information. Many expressed dissatisfaction with the tool’s ability to handle complex tasks but this was generally a reflection of their inability to formulate the right questions rather than the capacity of the tool. The study didn’t reveal a limitation of AI; it revealed a limitation of students’ thinking, made newly visible.
Grading the conversation a student has with AI - the quality of their prompts, their follow-up questions, the points where they challenge or redirect the output - tells us far more about their understanding than the polished essay that emerges at the end. A student who asks a vague question and accepts the first response wholesale is doing something fundamentally different from a student who interrogates, refines and pushes back. The prompt history is not a proxy for thinking. It is thinking, made visible in a way that a finished essay never was.
Talking to students about their work - what they found difficult, where they changed direction, what they still aren’t sure about - gets us closer still. A five-minute conversation reveals more about a student’s understanding than a thousand words of polished prose. It is also, not coincidentally, much harder to fake.
These approaches share something important. They narrow the gap between the measure and the thing we’re trying to measure. They don’t eliminate the proxy problem entirely - no assessment can - but they minimise it, because the assessment is no longer a product that stands in for thinking. It is the thinking.
If we don’t work towards those aims, we will be drawn into a proxy war where AI detection tools are weaponised and teachers and students are fighting over the wrong thing. The real conflict isn’t about who wrote the essay; it’s about whether the essay was ever measuring what we thought it was. We need to stop fighting over the proxy and start asking what it was supposed to represent. And then find better ways of seeing the thing itself.
The tubes are still there, buried under the streets of Paris, carrying nothing. For decades, engineers maintained them at growing expense long after the messages had found other routes. We’re in danger of doing the same thing, pouring resources into assessment infrastructure whose connection to learning has quietly collapsed. We can keep polishing the brass canisters and upgrading the compressors, but it’s time to stop being impressed by the speed of the tubes and start reading the messages.




