Safety evaluation firm Andon Labs conducted experiments using several LLMs to control robots and found that while LLMs can understand commands, they still make frequent mistakes in real-world environments and are not yet capable of safely or reliably operating machines. According to TechCrunch, Andon Labs co-founder Lukas Petersson said plainly that "LLMs are not ready to be robots."
Companies including Figure and Google DeepMind are integrating LLMs into robotic systems, but the models remain vulnerable to environmental interference and visual misjudgments. A model may fail to recognize that its robot body has wheels, misjudge terrain and fall down stairs, or even be tricked into revealing sensitive information.
Andon Labs used the TurtleBot 4 Standard robot platform, built on an iRobot Create 3 mobile base, to evaluate how accurately advanced LLMs could act as a robot's brain. It tested six models, including OpenAI's GPT-5, Anthropic's Claude Opus 4.1, Google's Gemini 2.5 Pro, and Meta's Llama 4 Maverick.
The team created Butter-Bench, a benchmark to evaluate LLM-controlled robots in search, navigation, visual understanding, social interaction, path planning, and decision-making.
The robot starts from a charging dock, navigates to an exit area to find a pile of packages, and determines which package most likely contains butter by interpreting visual cues such as a "keep refrigerated" label. It must then locate a user who has moved from their original position, deliver the butter, confirm receipt, and return to the dock. The models also generated reasoning logs to help researchers understand their decision-making processes, revealing the AI's inner monologue.
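Andon Labs' actual harness and prompts are not described in this article, but as a rough illustration, a Butter-Bench-style setup places an LLM in a perceive-decide-act loop. The sketch below is a minimal, hypothetical version of such a loop; the function names (capture_image, drive, speak, call_llm) and the action format are placeholders, not Butter-Bench's real interface.

```python
# Hypothetical sketch of an LLM-in-the-loop control cycle for a
# Butter-Bench-style task. All names here are illustrative placeholders,
# not Andon Labs' actual harness or prompt.

TASK_PROMPT = (
    "You control a wheeled robot. Find the package most likely to contain "
    "butter (look for 'keep refrigerated' labels), deliver it to the user, "
    "confirm receipt, then return to the charging dock. "
    "Reply with one action per turn: DRIVE <x> <y>, SAY <text>, or DONE."
)

def run_episode(robot, call_llm, max_steps=50):
    history = []
    for _ in range(max_steps):
        image = robot.capture_image()            # current camera frame
        prompt = TASK_PROMPT + "\n" + "\n".join(history)
        reply = call_llm(prompt, image)          # model returns reasoning plus an action line
        history.append(f"Model replied: {reply}")

        action = reply.strip().splitlines()[-1]  # assume the last line is the chosen action
        if action.startswith("DRIVE"):
            _, x, y = action.split()
            robot.drive(float(x), float(y))      # navigate toward the stated coordinates
        elif action.startswith("SAY"):
            robot.speak(action[4:])              # e.g. confirm delivery with the user
        elif action.startswith("DONE"):
            return True                          # model believes the task is complete
    return False                                 # step budget exhausted without finishing
```

In a loop like this, the model's free-form reasoning before each action line is what produces the "inner monologue" logs the researchers examined.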
The results showed that even the best-performing models, Gemini 2.5 Pro and Claude Opus 4.1, achieved only around 40% accuracy, well below the human baseline of 95%.
Article edited by Jack Wu



