How a "smart coach" helps language models switch between text and code
Large language models have made a name for themselves as masters of reading, writing, and navigating the intricate world of language. Hand them a complex passage or an open-ended question, and they’ll usually dazzle you with convincing, context-aware answers. But put them in front of a math problem or ask them to figure out a logical puzzle, and their confidence wavers—sometimes even basic calculations trip them up.
These models are naturals at textual reasoning, but that skillset doesn’t always cut it for problems that require precision, logic, or calculation. Sure, LLMs are better than ever at churning out code, but writing code doesn’t always mean they really understand when or how it should be used to truly solve a task. Even when they do spit out code, it can miss the mark—sometimes it’s imperfect, other times just plain inefficient.
This curious gap caught the attention of a team at MIT. Their question: What if, instead of leaving LLMs to figure things out alone, we gave them a bit of coaching? That train of thought led to the development of CodeSteer, a lightweight digital assistant that acts like a coach on the sidelines. Its job? To nudge LLMs towards the right method—whether that’s regular text or a chunk of code—depending on the task at hand.
CodeSteer is deliberately small and nimble. Rather than tinkering with the heart of advanced models like GPT-4, the researchers chose to keep things modular. The assistant examines the problem, reviews how the LLM handled it, and then suggests whether to continue reasoning with words or to switch to generating code. It sticks with the model, prompting it step by step, until a correct solution emerges.
The results so far are impressive. LLMs, with CodeSteer’s guidance, show real gains in areas like solving math equations, filling out Sudoku grids, and even thinking through spatial-reasoning challenges. These models saw accuracy improvements of more than 30 percent—a leap largely thanks to CodeSteer’s ability to call out habitual LLM “laziness.” Left unaided, LLMs tend to reach for the shortest or most convenient solution, which isn’t always right. CodeSteer urges them to take the scenic (and correct) route, comparing answers with symbolic checkers and running its own verifications to make sure the code really works.
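The steering-and-verification loop described above can be sketched in a few lines. Everything here is illustrative: the `mock_llm`, the multiplication task, and the function names are hypothetical stand-ins for the real system, which re-prompts a frozen LLM and checks candidate answers with symbolic verifiers.

```python
# Illustrative sketch of a CodeSteer-style steering loop. The "LLM" below is a
# toy stand-in: sloppy when reasoning in text mode, exact when it runs code.

def mock_llm(problem: str, mode: str) -> str:
    """Hypothetical frozen LLM answering a 'a*b' multiplication problem."""
    a, b = [int(x) for x in problem.split("*")]
    if mode == "text":
        return str(a * b + 1)  # plausible-looking but wrong textual answer
    return str(a * b)          # code mode: the product is computed exactly

def symbolic_check(problem: str, answer: str) -> bool:
    """Independent symbolic checker: verify the candidate answer."""
    a, b = [int(x) for x in problem.split("*")]
    return int(answer) == a * b

def steer(problem: str, max_rounds: int = 4) -> str:
    """Re-prompt the model, switching modes until the checker accepts."""
    mode = "text"  # start from the LLM's default textual reasoning
    answer = mock_llm(problem, mode)
    for _ in range(max_rounds):
        if symbolic_check(problem, answer):
            return answer
        # Nudge the model toward the other reasoning mode and retry.
        mode = "code" if mode == "text" else "text"
        answer = mock_llm(problem, mode)
    return answer  # hand back the last attempt if no round passed

print(steer("127*391"))  # text attempt fails the check; code mode succeeds
```

The point of the sketch is the division of labor: the steering logic never edits the model itself, it only decides which mode to prompt next and accepts an answer only after an external check passes.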
Of course, building and testing something like CodeSteer required plenty of data—so MIT’s team set out to create their own. They assembled SymBench, a diverse collection of 37 symbolic tasks drawn from math, spatial reasoning, and optimization. Armed with this new testbed, CodeSteer didn’t just keep up with the competition—it crushed it, boosting average problem-solving precision from just over 53 percent to more than 86 percent, outperforming nine other methods.
Perhaps the most promising feature of CodeSteer is its subtlety. It leaves the big LLMs untouched, acting as a refined guide rather than an overhaul. This means even smaller models, with CodeSteer in their corner, can tackle specialized challenges that often stump much larger, “smarter” models.
“Our method uses an LLM’s own capabilities,” says Yongchao Chen, the project’s lead author. By helping the model know when—and how—to code, rather than just relying on its “raw” abilities, even already-strong LLMs can get dramatically better. And the approach isn’t just academic: picture it helping robots pick their way across tricky ground, or lending a hand to untangle complex global supply chains.
Looking ahead, the MIT team wants to speed up CodeSteer and, possibly, merge the coaching into a single model—no separate assistant required. The work has already sparked a buzz in the field, with experts from both Google Cloud AI and DeepMind praising CodeSteer’s cleverness and potential to help AI “agents” work better together. Supported by the Office of Naval Research and the MIT-IBM Watson AI Lab, this research is set to take center stage at the International Conference on Machine Learning.
For more details, read the full story at MIT News.