Harsh Trivedi, a doctoral researcher at Stony Brook University, was recently invited to present his work at the White House. His project, AppWorld, promises to revolutionize how we automate daily tasks in our digital lives.
Stony Brook, NY, Oct 19, 2024 - Stony Brook University Professor Niranjan Balasubramanian and Ph.D. researcher Harsh Trivedi were invited to represent the university at the White House Office of Science and Technology Policy (OSTP) launch event to celebrate the first allocation of the NAIRR Pilot.
Their project was one of only 35 nationally to be recognized by the National Science Foundation (NSF) and the Department of Energy (DOE) through the National Artificial Intelligence Research Resource (NAIRR) Pilot.
Harsh Trivedi presenting at the White House
The NAIRR Pilot aims to connect U.S. researchers and educators to the computational, data, and training resources needed to advance AI research. The selected projects span critical areas such as deepfake detection, AI safety, and medical diagnoses. Harsh’s project, AppWorld, addresses a dire need. Most daily tasks in our digital lives involve a series of complex activities over multiple mobile applications, as well as mindful reasoning and decision-making.
For example, even the mundane task of ordering groceries for a shared household turns into a complex task over multiple apps: finding a grocery list in a notepad app, checking your roommates’ requests on a messaging app, and ordering on a grocery app. Automating this task is not as easy as it may seem. It requires opening multiple applications, understanding how to interact with their interfaces, communicating the necessary information, and reasoning and making decisions in real time, while also avoiding undesirable outcomes, like ordering an expensive grocery item that no one in the house will end up using.
The tools we use for such tasks today are not up to the mark. To fill this gap, Trivedi and his team invented AppWorld.
AppWorld
AppWorld is a simulated world, with 9 simulated apps, about 100 fictitious users, a large database, and over 450 APIs, where autonomous agents executing digital tasks can operate without any real-world consequences. The team built this environment to set a new benchmark for the tools we can use to automate daily tasks, and assembled a suite of 750 complex tasks of the kind we face in everyday life, ranging from “Return my last Amazon ordered shirt & buy it in one size larger” to “Play my Spotify playlist with enough songs for the workout today.”
Then, the team evaluated various existing LLMs, including GPT-4 (OpenAI, 2023), LLaMA3 (Meta AI, 2024), and DeepSeekCoder (Guo et al., 2024), paired with agent methods such as ReAct and Plan-and-Execute. They found that even the best approach (GPT-4o with ReAct) achieves goal completion scores of only 48.8 and 30.2 on AppWorld’s normal and challenge test sets, respectively.
“Designing AppWorld was a substantial engineering effort, written by authors with many years of NLP and software development experience,” Trivedi noted. “Our careful quality control measures have resulted in a high-quality simulator and benchmark that present a new challenge for the NLP community, and will substantially advance research on LLM-based autonomous agents.”
Trivedi’s advisor, Stony Brook University Professor Niranjan Balasubramanian, adds, “This recognition underscores the importance of Harsh’s work. AppWorld represents a significant step forward in our ability to develop and test AI agents capable of understanding context and handling complex, real-world tasks.”
As Trivedi and his team continue to refine and expand AppWorld, they hope to see it adopted widely in the AI research community. “Our goal is to improve AppWorld further, by establishing new benchmarks, evolving our agents, and studying how they might perform in the real world.”
With the support of the NAIRR Pilot and the recognition from the broader AI research community, Stony Brook University’s AI innovations are poised to play a significant role in shaping the future of AI technology.
As these advanced AI agents become more capable of handling complex, real-world problems, they will have the potential to transform how we interact with digital systems in our daily lives.
Communications Assistant
Ankita Nagpal