What's Missing From LLM Chatbots: A Sense of Purpose
LLM-based chatbots’ capabilities have been advancing every month. These improvements are mostly measured by benchmarks like MMLU, HumanEval, and MATH (e.g. sonnet 3.5, gpt-4o). However, as these measures get more and more saturated, is user experience increasing in proportion to these scores? If we envision a future of human-AI collaboration rather than AI replacing humans, the current ways of measuring dialogue systems may be insufficient because they measure in a non-interactive fashion. Why does purposeful dialogue matter? Purposeful dialogue refers to a multi-round user-chatbot conversation that centers around a goal or intention. The goal could range from a generic one like “harmless and helpful” to more specific roles like “travel planning agent”, “psycho-therapist” or “customer service bot.” Travel planning is a simple, illustrative example. Our preferences, fellow travelers’ preference, and all the complexities of real-world situations make transmitting all information in one pass way too costly. However, if multiple back-and-forth exchanges of information are allowed, only important information gets selectively exchanged. Negotiation theory offers an analogy of this—iterative bargaining yields better outcomes than a take-it-or-leave-it offer. In fact, sharing information is only one aspect of dialogue. In Terry Winograd’s words: “All language use can be thought of as a way of activating procedures within the hearer.” We can think of each utterance as a deliberate action that one party takes to alter the world model of the other. What if both parties have more complicated, even hidden goals? In this way, purposeful dialogue provides us with a way of formulating human-AI interactions as a collaborative game, where the goal of chatbot is to help humans achieve certain goals. This might seem like an unnecessary complexity that is only a concern for academics. However, purposeful dialogue could be beneficial even for the most hard-nosed, product-oriented research direction like code generation. Existing coding benchmarks mostly measure performances in a one-pass generation setting; however, for AI to automate solving ordinary Github issues (like in SWE-bench), it’s unlikely to be achieved by a single action—the AI needs to communicate back and forth with human software engineers to make sure it understands the correct requirements, ask for missing documentation and data, and even ask humans to give it a hand if needed. In a similar vein to pair programming, this could reduce the defects of code but without the burden of increasing man-hours. Moreover, with the introduction of turn-taking, many new possibilities can be unlocked. As interactions become long-term and memory is built, the chatbot can gradually update user profiles. It can also adapt to their preferences. Imagine a personal assistant (e.g., IVA, Siri) that, through daily interaction, learns your preferences and intentions. It can read your resources of new information automatically (e.g., twitter, arxiv, Slack, NYT) and provide you with a morning news summary according to your preferences. It can draft emails for you and keep improving by learning from your edits. In a nutshell, meaningful interactions between people rarely begin with complete strangers and conclude in just one exchange. Humans naturally interact with…

