Token LLMs
LLMs operate by reading in text and generating output text, predicting what would plausibly come next. Recently, Anthropic announced their Mythos model, which is capable of performing security audits on codebases to find vulnerabilities. It may be that Anthropic simply discovered their language-based AI was good at security auditing, but there should be a larger investment in training AI to ingest something more structured and information-dense than human language.
LLMs probably have some kind of thinking process that is similar to how you and I think. They input text, run some internal process that can operate on abstract ideas, and output text. That is the basic process I use when responding to someone's question. The specifics of LLM thought are probably different, but if it is at all similar to a human's thought process, then the way LLMs ingest information about software and software security is hugely inefficient.
Vulnerability research consists of reading a lot of code to build up a mental model of how all the parts fit together, then determining whether some invariant is broken somewhere. Many substeps in that process have already been solved to some extent, and there are tools and algorithms that provide extra information to make each step easier.
Programs are written as text, but you can only reason about them after lexing and parsing them into an abstract syntax tree. Programmers do this automatically while reading code. Poorly written programs may be very difficult to parse into their corresponding abstract syntax trees. Lexing and parsing code is a solved problem, which removes the need for an LLM to perform the same task. LLMs that focus on software tasks should support ingesting at least lexed programs, and ideally some form of syntax tree. Properly parsed input is one fewer task that an LLM can hallucinate or get slightly wrong, which should lead to much better output on software-related tasks.
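To make the "solved problem" concrete: Python's standard library already exposes both steps. The `tokenize` module produces the lexed token stream and `ast.parse` produces the syntax tree, so either representation could be handed to a model instead of raw characters. The `area` function below is just a stand-in example.

```python
import ast
import io
import tokenize

source = "def area(r):\n    return 3.14159 * r * r\n"

# Lexing: turn raw text into a stream of (token type, text) pairs,
# dropping pure-layout tokens for readability.
tokens = [
    (tokenize.tok_name[tok.type], tok.string)
    for tok in tokenize.generate_tokens(io.StringIO(source).readline)
    if tok.type not in (tokenize.NL, tokenize.NEWLINE, tokenize.ENDMARKER)
]

# Parsing: build the abstract syntax tree directly.
tree = ast.parse(source)
func = tree.body[0]

print(tokens[:4])  # [('NAME', 'def'), ('NAME', 'area'), ('OP', '('), ('NAME', 'r')]
print(type(func).__name__, func.name)  # FunctionDef area
```

Either the token stream or a serialized form of the tree (`ast.dump(tree)`) is unambiguous in a way the raw text is not: there is no whitespace, comment, or formatting noise left for the model to re-derive.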
Static analyzers also provide rich information about the semantics of software and the properties of functions and variables: what set of values a variable can take at different points in a program, which variables are live, which functions call which others, as well as whatever linting information is available. Not all of the information static analyzers deduce will be 100% precise, but the extra information gives LLMs a better starting point for their own analysis.
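As a minimal sketch of one such analysis, the snippet below walks a Python syntax tree and extracts a function-level call graph (caller name to callee names). The `helper`/`main` source is a made-up example; a real analyzer would also resolve methods, imports, and indirect calls, but even this crude graph is the kind of metadata that could accompany code into a model's context.

```python
import ast
from collections import defaultdict

source = """
def helper(x):
    return x + 1

def main(values):
    total = 0
    for v in values:
        total += helper(v)
    return total
"""

# Map each function definition to the set of names it calls directly.
call_graph = defaultdict(set)
tree = ast.parse(source)

for func in [n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)]:
    for node in ast.walk(func):
        # Only count simple calls like helper(v); attribute calls
        # (obj.method()) would need extra resolution logic.
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            call_graph[func.name].add(node.func.id)

print(dict(call_graph))  # {'main': {'helper'}}
```

Production tools (compiler frontends, abstract interpreters, linters) compute far richer facts the same way, by traversing the parsed representation rather than the text.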
LLMs have proven to be invaluable tools, but there seems to be a lot of efficiency and effectiveness left on the table by having them work solely on text-based representations of programs. Tokenizing programs differently and providing extra metadata about them seems like an obvious research area for making LLMs more effective at programming tasks.