Bring out your thoughts
I woke up this morning with “Introducing AI 2027” in my inbox from ACX, which has now been sitting in my brain and slowly digesting like an appropriately enormous burrito.
It had a few references to specific AI work that was particularly interesting to me, especially “neuralese”, but mostly it sat in a space I’ve already been slowly building and marinating in my head, which is something like: the limit on recursive AI takeoff right now is basically being able to “close the loop” and kick off training-evaluation-design-retraining cycles with less and less human time spent on the evaluation and design steps. And while the first set of design cycles LLMs set for themselves will probably be pretty dumb, they won’t have the experience I had of struggling with my physical health and relationships to the point that I couldn’t really work for a full calendar year.
I do still have some skepticism, mostly of the form that the biggest blockers are in the “evaluation” step rather than the “design” step: not just alignment evaluation, but data quality evaluation. In particular, if all of the LLM benchmarks we have are actually very badly structured, then going through development loops without humans will result in LLMs that aren’t actually any better at the tasks humans care about.
This could be a way that one lab gets a more decisive strategic advantage? But I’m not really in this for the forecasting. I’m writing because I want to keep moving, keep thinking, and put myself in a position to help with any aspect of “the problem” that I can help with.
From that perspective, the detailed scenario is most important to me because it presents a heuristic for determining what endeavors might be actually helpful for alignment, versus what “alignment-coded” thoughts I have would actually just turn into capabilities research.
So I’m going to list some of the ideas that have been bouncing around my head, and how I think progress on those ideas would tweak the scenarios presented in “AI 2027.”
Automated Chain of Thought via output markdown
Training on changelists
Schooling-based evaluation benchmarks
Automated Chain of Thought via output markdown
The basic idea here is to instruct an LLM to do output via a form of markdown, where only specifically selected text is viewed as “spoken response.” This is trying to recover the function that humans have of multiple channels of input, output, and self-input. So a typical exchange might look like:
<system> you are a helpful, harmless, honest assistant </system>
<assistant> An assistant would introduce themself <speak>Hello! What can I assist you with today?</speak> I don't have any instructions yet so I will wait for a response </assistant>
<user> I'd like your help with ...
This is similar to the notion of a “scratchpad” and seemed to me (at least naively) to be mostly a capability-enhancing project, though scratchpads have also shown promise as interpretability tools. It also seems quite possible that something like this is already being used, in some sense, in current models; I’m not sure how I would tell if it were.
But I now think this is perhaps most promising as a philosophy for training with thought processes in place while automatically stripping those thought processes out when scoring any standard benchmark, so that the LLMs aren’t experiencing “thought policing,” which could corrupt the data and push models toward faster and more entrenched misalignment and dishonesty. In essence, this is an attempt to recover as much of the value of “neuralese” as possible without giving networks a secret language to communicate with themselves in.
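To make the evaluation-side filtering concrete, here is a minimal sketch in Python of what I mean by stripping the thought channel before scoring. The tag format follows the example above; the function name and regex are my own placeholders rather than anything from an existing evaluation harness:

import re

# Everything outside <speak> tags is treated as private thought and dropped
# before a benchmark grader ever sees the output.
SPEAK_PATTERN = re.compile(r"<speak>(.*?)</speak>", re.DOTALL)

def spoken_only(raw_output: str) -> str:
    """Return only the text the model chose to 'speak', joined in order."""
    return " ".join(segment.strip() for segment in SPEAK_PATTERN.findall(raw_output))

raw = ("An assistant would answer concisely <speak>Paris is the capital of "
       "France.</speak> I will wait for a follow-up question")
print(spoken_only(raw))  # prints: Paris is the capital of France.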
Training on changelists
While this has been tried unsuccessfully, I can imagine many dimensions on which you could add more dakka, and I’ll get into them a little bit here after re-stating the idea.
The idea is to better track the human experience of writing by changing the dimension of progression from “typographical” to “temporal.” Specifically, all of the tokens that come out of an LLM need to come out in order, one at a time. Humans also have an experience that is in-order, one-thing-at-a-time (though it certainly isn’t tokens). But the experience of a human reading a text is very often not each word in the text one at a time (think of any time you have stopped reading something to think it over, like right now when I instructed you to do so), and the experience of a human writing a text is even further removed.

One fairly simple and extremely common method of writing is to write a draft, then re-read the draft and make changes as you read. This is a very natural construction in code; the fundamental structure of Git (and of GitHub on top of it) isn’t projects built as files, it’s lists of changes to files. Each code change isn’t writing the whole codebase from the ground up; it’s identifying files, line numbers, and added and removed characters to make a slightly different overall picture.
Language writing can be similar. And modern office software like Google Docs and Microsoft Word keeps version histories that retain every iteration of a document, which could be an untapped source of training data.
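As a sketch of what that training data could look like (using Python’s standard-library difflib; the function name and toy drafts are mine, and a real pipeline would need much more care about edit granularity), each pair of consecutive versions becomes a (previous draft, diff) training example:

import difflib

# Turn a document's version history into (previous draft, diff) training pairs:
# the model sees the earlier draft and is trained to emit the edit,
# not to rewrite the whole text from scratch.
def diff_training_examples(versions: list[str]) -> list[tuple[str, str]]:
    examples = []
    for before, after in zip(versions, versions[1:]):
        diff = "\n".join(difflib.unified_diff(
            before.splitlines(), after.splitlines(),
            fromfile="draft", tofile="revision", lineterm="",
        ))
        examples.append((before, diff))
    return examples

drafts = [
    "Hello wrold.\nThis is a frist draft.",
    "Hello world.\nThis is a second draft.",
]
for previous, target_diff in diff_training_examples(drafts):
    print(target_diff)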
I mostly picture this together with the previous idea of markdown output, so that the actual generation in practice is something like:
<system> you are a helpful, harmless, honest software engineer </system>
<engineer> a software engineer would create a bare-bones prototype first
<code>
NEWFILE(main.py)
if __name__ == "__main__":
    print("Hello world")
</code>
next run the first draft through code review
<command> git pull... </command>
now I want to adjust my program so that it can also...
</engineer>
This method is really coming from anthropomorphizing the LLM, but since humans are time-linear and not typography-linear, this seems like (a) an easier learning problem (since writing perfect first drafts is harder than drafting and editing), (b) a more honest representation of the human experience, which could help with alignment via ?????, and (c) a method that would be very useful for projects in practice.
As noted above, simply fine-tuning smaller models on small changelist datasets results in worse performance, so it may not be possible to do this retraining without significantly more compute resources. However, that result came from only a 2048-token context length, a small amount of single-domain data, and models of 6B parameters or smaller, which makes me skeptical that it generalizes. The context length seems like the biggest limitation of the diff model, since generating small diffs over a large context seems like the most interesting problem for larger models.
All of that said, I don’t have a strong intuition for whether ideas that work well for large models will show promise in smaller models; this may be fundamental to the field (especially when large models can cost $millions to train fully).
And aside from that, I don’t see how this method would be helpful for understanding or demonstrating misalignment in the 2027 scenario.
Schooling-based evaluation benchmarks
The third idea that I have enough detail in my head to write down is, effectively, to use school curricula and existing tests and answer keys as an evaluation benchmark for LLMs. This is inspired mostly by my discovery that LLMs are not reliable at even basic grammar tasks such as identifying every verb in a sentence along with its subject and object.
I see a lot of hype about LLM benchmarks built on extremely difficult problems that can only be solved correctly by specialized scientists, but for any software engineering application it’s much more important to have reliable results than slightly more insightful ones. I would rather see LLMs pushed to score 99%+ on elementary school vocabulary and grammar tests than see them bump their graduate-level exam scores from 84% to 85%.
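To illustrate the kind of benchmark I mean, here is a toy sketch; the items, the query_model callable, and the 99% bar are all placeholders of my own, not an existing evaluation:

# Score a model against an answer-keyed elementary grammar quiz, and treat
# anything below a high reliability bar as a failure, rather than chasing
# marginal gains on expert-level exams.
# (A real benchmark would need hundreds of vetted items for a 99% bar to mean anything.)
GRAMMAR_ITEMS = [
    ("Name the verb in: 'The dog chased the ball.'", "chased"),
    ("Name the subject in: 'The dog chased the ball.'", "the dog"),
    ("Name the object in: 'The dog chased the ball.'", "the ball"),
]

def passes_reliability_bar(query_model, items=GRAMMAR_ITEMS, threshold=0.99) -> bool:
    correct = sum(
        query_model(prompt).strip().lower() == answer
        for prompt, answer in items
    )
    accuracy = correct / len(items)
    print(f"accuracy: {accuracy:.3f}")
    return accuracy >= threshold

The point is less the exact items than scoring strictly against an answer key a human teacher has already vetted.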
I originally investigated this as a way of looking into fact-checking, because I think having reliable and convincing notions of truth that are grounded in observation and validation is important. I still believe that, and having more robust tools for explanation and grounding in earlier models would help with some 2027 scenarios, such as helping Agent-3 monitor a misaligned Agent-4.
However, I think my intuition is pulling me toward this being more useful as a way of understanding model sandbagging, and as a way of building public consensus around the use of AI. I think (public) capabilities discussions often get bogged down because people’s experiences with LLMs are so varied, because LLMs are so sensitive to inputs and unreliable at certain basic tasks. Increasing this reliability should both reduce the noise inherent in evaluating sandbagging and shore up performance on basic tasks, which might make LLMs better able to take on certain kinds of work.
All of this feels better in my head than the persuasive case is coming out on the page, so I think I need to update a little bit away from this being a productive line of attack. I also feel like I haven’t gotten to the root of it, so it’s something to get back to later.