Many LLM researchers, including me, and much of the general audience have been very excited recently about the idea of “Digital Artificial General Intelligence” (Digital AGI), the kind we have seen in all those science-fiction movies: a superhuman brain that is all-knowing and all-capable in the digital world. Once connected to the Internet, it can make as much impact on human society as it likes and replace any work that can be done remotely (essentially serving as a billion-dollar company on its own). Especially with the recent rapid developments in Large Language Models (LLMs) and Reinforcement Learning (RL), it seems that we are only a few miles away from this accomplishment, probably the most important one in human history. However, amid all the hype around AGI, I think it is important that we take a more realistic perspective on what we expect AI to do for us. While this might be a little on the pessimistic side, I would not call it a bitter lesson. Rather, it is an exciting lesson, because human creativity will still play an important role on the path ahead.
I have been working on multi-modal agents and RL for quite a while (with leading works like DigiRL, DigiQ, and PAE), since the days when AI models were barely able to recognize the buttons on GUI interfaces (e.g., being able to click and open the Chrome app from the home screen). We had the ambition that AI could be trained to do whatever we can do with a computer, as long as it can use the computer in the same way we humans do. And when I say in the same way we humans do, I literally mean the same way: seeing the screenshot, touching a specific part of the screen to click on a link, scrolling down to see more content, and so on.
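To make this interaction modality concrete, here is a minimal sketch of what such a "human-like" GUI agent loop looks like. This is not the actual DigiRL/DigiQ/PAE code; the `env` and `policy` interfaces and the action names are hypothetical placeholders, chosen only to illustrate that the agent sees raw pixels and emits the same low-level actions a person would.

```python
from dataclasses import dataclass
from typing import Literal, Optional

# Hypothetical action space: only what a human finger or mouse could do.
@dataclass
class GUIAction:
    kind: Literal["click", "scroll", "type", "done"]
    x: Optional[float] = None   # normalized screen coordinates in [0, 1]
    y: Optional[float] = None
    text: Optional[str] = None  # text to type, if kind == "type"

def run_episode(env, policy, max_steps: int = 30) -> bool:
    """Roll out one task. `env` and `policy` are placeholders:
    env.reset()/env.step() return a raw screenshot (pixels), and
    policy(screenshot, instruction) returns a GUIAction."""
    screenshot, instruction = env.reset()
    for _ in range(max_steps):
        action = policy(screenshot, instruction)  # e.g. "click the Chrome icon"
        if action.kind == "done":
            break
        screenshot = env.step(action)             # environment renders the next screen
    return env.task_succeeded()                   # sparse, end-of-episode signal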
When I started working on this a year ago, around the time o1 was released, I thought the necessary technologies, LLMs, RL, and self-improvement, were already in place, and that getting there was mostly a matter of more engineering. However, I was surprised to find how little progress has been made in that regard over the last year (despite the optimistic progress on math and coding). For example, WebArena was released in July 2023, and if you work on LLMs you know how ancient a two-year-old paper is (the first GPT-4 was released two years ago, and at that time one of the main AGI players, xAI, was not even founded). Yet on the leaderboard as of April 2025, the best models still only achieve around 60% accuracy (including OpenAI Operator and Claude Computer Use). And the crazy thing is, as shown below, the tasks on WebArena are only common, everyday GUI navigation tasks that are not supposed to be hard at all: around 50% of the tasks can be finished within 10 clicks, and 90% within 20 clicks. Think about what 60% accuracy means: every 10 times we ask the AI to cancel order 307 for us, 6 times it does what we want, and 4 times it either cancels the wrong order or tells us it has canceled the order while the order is still there! If that is the case, I would rather take 2 minutes to cancel the order myself, instead of having to double-check whether the order has actually been canceled or not. So what's wrong with multimodal agents? In this blog I will analyze what's wrong from both a technical perspective and an economic perspective, and what the right thing to build might be under such reasoning.
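To see why ~60% accuracy falls below the bar for usefulness, here is a back-of-the-envelope calculation; the time estimates are my own illustrative assumptions, not numbers from WebArena or any agent product. Once you account for verifying the agent's work and redoing failed attempts, delegating the task barely beats doing it yourself.

```python
# Illustrative numbers (assumptions, not measured data):
p_success = 0.6   # agent finishes the task correctly
t_manual = 2.0    # minutes to just cancel the order yourself
t_verify = 1.0    # minutes to double-check what the agent actually did

# If the agent fails, you still pay the verification cost and then redo the task manually.
expected_delegate = t_verify + (1 - p_success) * t_manual
print(f"do it yourself:    {t_manual:.1f} min")
print(f"delegate + verify: {expected_delegate:.1f} min")
# -> 2.0 min vs 1.8 min: the agent saves almost nothing, and with a lower
#    success rate or a higher verification cost it becomes strictly worse.
```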
This blog would not have been possible without the inspiration I drew from conversations with top-notch RL/LLM researchers at ICLR 2025. They all found my conclusion surprising but, at the same time, my arguments convincing.
With the emergence of reasoning capabilities in LLMs such as o1 and R1, the application of RL to LLMs seems to have suddenly become the hottest topic of 2025, and every LLM group I know, both in academia and in industry, has some project that involves the use of RL for LLMs. Ideas are being scooped on a weekly basis. The field has become so competitive that researchers do not even have time to write a paper and have to release a Notion page first before they get scooped (e.g., PRIME and DeepScaleR, both of which are very good papers).
However, because I am affiliated with both an RL lab and an LLM group, I have had the chance to observe the distinct reactions to this progress from the RL community and the LLM community.
For many researchers who have historically worked on LLMs and are not familiar with the deep RL literature, the application of RL to LLMs seems like a magic trick that improves the fundamental capabilities of LLMs based entirely on synthetic reasoning traces generated by the models themselves. This is a particularly valuable free lunch considering