Are you happy?
Voice assistants are the user interfaces of our future. Voice user interfaces (VUIs) allow us to communicate in one of the most natural ways possible—talking, just as we would with any of our friends or colleagues. While VUIs once felt like a novelty and weren’t very robust (we can all remember the days when the response to just about any command was “Do you want me to search the web?”), today they’ve integrated into our routines, helping us check the weather, play music, and get directions. They’re even in our workplaces.
As the popularity of voice assistants grows, so do users’ expectations for seamless, human-like interactions. How do we evaluate if the assistants we build are satisfying our users? What does it mean for a user to be “happy,” and how do you even measure happiness?
On the MindMeld team, we’ve been investigating how to evaluate users’ experiences when talking with Webex Assistant, the first conversational assistant for the enterprise space. We answer the question “are users happy?” by developing a quantitative framework to address a historically qualitative user experience problem. We hope this investigation sparks interest in how we can think about measuring user experience online and in real-time.
Why is evaluating user satisfaction so difficult?
Evaluating user experience with artificial intelligence (AI) products is difficult for a few reasons: AI systems are probabilistic in nature, natural language is variable, and user satisfaction is subjective. We expect the system’s output to vary since AI is non-deterministic; deciding whether a given result is expected variation or an error is one layer of difficulty. Not only can the output of the system vary, but the input from the user varies, too. People speak in different ways, using different words and sentence structures, so the system must be robust to a wide range of natural language variation. Finally, how users are feeling is hard to understand, especially when you can’t ask them, and even harder to quantify. What satisfies a user can differ from individual to individual, and one user’s feelings can change based on context and expectations.
Previous research on understanding users’ experiences falls into two main categories: accuracy-based approaches and user studies & surveys. Accuracy-based approaches use accuracy as a proxy for satisfaction: if the system has high accuracy, the user is likely satisfied. However, these approaches often rely on single utterances and can miss the overall experience of a conversation. For example, if a user says “next page” and the system identifies the query correctly and moves to the next page, that single turn counts as a success. But if the user says “next page” several times, it might indicate that they haven’t been able to find what they were looking for, and that frustration wouldn’t be captured by single-utterance accuracies. User studies & surveys provide a great window into user behavior, but longitudinal studies are costly in terms of recruiting participants and the time spent on qualitative & quantitative data analysis, which makes this approach much harder, if not impossible, to use at scale. User studies also don’t use real-time user data and take place in artificial lab settings, which might differ from real users’ experiences.
We try to take the best of each of these approaches, while addressing some of their shortcomings. We want to create a system that captures real users’ judgements, focuses on the larger user experience at the level of the conversation, and uses real-time data, so our approach can be scalable and automatic.
At a high-level, our framework:
1. Captures interactions at the level of the conversation
2. Automatically classifies conversations based on their conversational goal and conversational end state
3. Automatically assigns conversations a satisfaction label
The first challenge we tackle is how to capture users’ interactions with Webex Assistant. This is especially important for conversational AI, where analysis can happen at many different levels of an interaction. We choose to focus on conversations. We define a conversation as an interaction initiated by a user or the assistant, which can have single or multiple turns, and contains one conversational goal.
To capture each conversation, we introduce a common ID to thread that conversation from beginning to end. We log an event, called the “trigger,” each time a conversation is initiated. The trigger event includes the conversation’s unique ID and the goal of that conversation. For us, conversational goals most closely map to use cases, like calling a person or joining a meeting. Any queries the user says that move them towards the completion of the goal of the use case count as part of that conversation.
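Concretely, this threading scheme can be sketched as follows. The class, field, and event names here are hypothetical, not Webex Assistant’s real logging schema:

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Conversation:
    """One user-assistant conversation, threaded by a shared ID."""
    goal: str                                   # e.g. "call_person"
    conversation_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    turns: list = field(default_factory=list)
    end_state: Optional[str] = None

    def log_trigger(self) -> dict:
        # The "trigger" event marks the start of the conversation
        # and carries the ID every later event will reuse.
        return {"event": "trigger", "id": self.conversation_id, "goal": self.goal}

    def log_turn(self, query: str, response: str) -> None:
        # Every turn is tagged with the same conversation ID.
        self.turns.append({"id": self.conversation_id,
                           "query": query, "response": response})

    def log_end(self, end_state: str) -> dict:
        # The final event reuses the trigger's ID, closing the thread.
        self.end_state = end_state
        return {"event": "end", "id": self.conversation_id, "state": end_state}

convo = Conversation(goal="call_person")
convo.log_trigger()
convo.log_turn("call Chelsea", "Calling Chelsea Miller.")
convo.log_end("fulfilled")
```

Because the trigger and the end event share one ID, any downstream job can group the turn events between them back into a single conversation.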
The image below shows an example of what we consider a conversation. Here, we’ll take a look at the “call person” use case, whose conversational goal is to successfully call the intended person.
We capture all the turns taken between the user & the assistant. When the conversation ends, we log the final state with the same ID as the trigger event. Our common ID allows us to follow the course of the conversation as it unfolded in real-time. In our example, we would log all these queries as part of one conversation with the conversational goal of “Call Person.”
Conversational end states
Retrospective analysis of historical data uncovered patterns in how users’ conversations with Webex Assistant end. After manual analysis, we settled on four categories to capture conversational end states. Here are the categories we use to automatically classify conversations by their end states:
Fulfilled: the assistant successfully fulfills the user’s goal
Error: the assistant fails to fulfill the user’s goal
Exited: the user abandons the conversation or cancels
Modified: the user decides to change part of the request or restart the conversation
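As a sketch, this classification can be expressed as a small rule over a conversation’s final logged event. The signal names (`cancelled`, `restarted`, `goal_completed`) are illustrative assumptions, not the system’s actual fields:

```python
END_STATES = ("fulfilled", "error", "exited", "modified")

def classify_end_state(final_event: dict) -> str:
    """Map a conversation's final event to one of the four end states.

    The keys checked here are hypothetical signals; a real pipeline would
    derive them from whatever the assistant logs at conversation end.
    """
    if final_event.get("cancelled"):        # user abandoned or said "cancel"
        return "exited"
    if final_event.get("restarted"):        # user changed or restarted the request
        return "modified"
    if final_event.get("goal_completed"):   # e.g. the call actually connected
        return "fulfilled"
    return "error"                          # assistant failed to fulfill the goal
```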
Here are examples from the “call person” use case:
Now that the conversation has been captured from beginning to end and it contains a label for the conversational end state, the next step is to automatically assign a satisfaction label. The goal of the satisfaction label is to capture how users might feel after having a conversation. We wanted these labels to be user-friendly: high-level enough to understand at a glance, but granular enough to capture meaningful distinctions between users’ experiences. We use the following satisfaction labels:
Happy: the user’s goal was successfully fulfilled
Sad: the user’s goal was not met
Friction: the user’s expectations were not met
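One plausible way to assign these labels automatically is a fixed mapping from end state to satisfaction label, inferred from the definitions above; the team’s actual assignment rules may be richer than this:

```python
# Hypothetical end-state -> satisfaction mapping, inferred from the label
# definitions in this post; not the team's exact production rules.
SATISFACTION = {
    "fulfilled": "happy",    # goal successfully met
    "error": "sad",          # goal not met
    "exited": "friction",    # user gave up: expectations not met
    "modified": "friction",  # user had to rephrase or restart
}

def satisfaction_label(end_state: str) -> str:
    """Assign a satisfaction label to a classified conversation."""
    return SATISFACTION[end_state]
```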
Again, here are some examples from the “call person” use case:
Results of our framework
To recap our framework, we’ve threaded utterances to create conversations, automatically classified those conversations by their end state, and then automatically assigned a satisfaction label to understand users’ experiences.
Now, we’ll take a look at the results of this framework and what it helps us do. We’ll consider data from a full week of user interactions with Webex Assistant for our three most popular use cases: “call person,” “join a meeting,” and “join a Personal Room” (data not representative of real Webex Assistant performance).
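As an illustration of the aggregation step, here is a minimal sketch that turns labeled conversations into per-use-case satisfaction shares. The records are toy data, not real Webex Assistant numbers:

```python
from collections import Counter, defaultdict

# Toy records standing in for a week of labeled conversations.
conversations = [
    {"use_case": "call_person",  "label": "happy"},
    {"use_case": "call_person",  "label": "happy"},
    {"use_case": "call_person",  "label": "friction"},
    {"use_case": "join_meeting", "label": "happy"},
    {"use_case": "join_meeting", "label": "sad"},
]

def satisfaction_by_use_case(records):
    """Share of happy/sad/friction conversations per use case."""
    counts = defaultdict(Counter)
    for r in records:
        counts[r["use_case"]][r["label"]] += 1
    # Convert raw counts to shares so use cases of different volume compare.
    return {
        use_case: {label: n / sum(c.values()) for label, n in c.items()}
        for use_case, c in counts.items()
    }

shares = satisfaction_by_use_case(conversations)
```

Feeding a dashboard with shares rather than raw counts is what lets a low-volume use case’s “friction” spike stand out next to a high-volume one.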
At a glance, we can see how users feel when using Webex Assistant features—which are driving user happiness, and which might be causing the most difficulty. If “call person” shows a spike in “friction,” for example, we could investigate just those conversations. We might find that all the “friction” conversations happened on devices in Thailand, where users natively speak Thai, not English. We might hypothesize that the assistant had difficulty understanding Thai names.
Knowing how well each feature is performing in real-time allows our product team to track the real-time satisfaction of each feature and quickly identify & investigate issues. Depicted in a live dashboard, these valuable insights help us ask the right questions and directly impact the product roadmap.
Verifying our framework
We felt confident that these labels captured user satisfaction since we based them on user data, but we didn’t stop there. To be sure that the predictions we make about users’ experiences actually capture how users feel, we asked human annotators to label the data using the same satisfaction labels that our system uses: “happy,” “sad,” and “friction.” Annotators were instructed to put themselves in the shoes of the user and ask themselves, “how would I feel after this interaction?”
Agreement between the human-labeled and system-labeled data was substantial (75% raw agreement, Cohen’s κ = 0.66). This comparison gives us confidence that our framework predicts user satisfaction that is consistent with what real users actually feel.
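For readers who want to reproduce this kind of comparison, raw agreement and Cohen’s kappa can be computed directly. This is the standard textbook formula, not the team’s internal tooling:

```python
from collections import Counter

def agreement_and_kappa(human, system):
    """Raw percent agreement and Cohen's kappa between two label sequences."""
    assert len(human) == len(system) and human
    n = len(human)
    # Observed agreement: fraction of items both raters labeled identically.
    p_o = sum(h == s for h, s in zip(human, system)) / n
    # Expected chance agreement, from each rater's marginal label distribution.
    human_counts, system_counts = Counter(human), Counter(system)
    p_e = sum((human_counts[lbl] / n) * (system_counts[lbl] / n)
              for lbl in set(human) | set(system))
    # Kappa corrects observed agreement for agreement expected by chance.
    return p_o, (p_o - p_e) / (1 - p_e)
```

On a conventional scale (Landis & Koch), κ between 0.61 and 0.80 is read as “substantial” agreement, which is where the reported 0.66 falls.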
With this approach, we’re able to quickly get a snapshot of what users experience when using different Webex Assistant features. We’ve taken something as subjective as users’ happiness and broken it down into quantitative labels that can be automatically applied based on the conversation, end state, and conversational goal. We offer this as an attempt to think about how we can quantify user experience and shift the mindset of understanding users’ happiness to include quantitative methods.
About the authors
Chelsea Miller is a Data Analyst on the Webex Intelligence team at Cisco Systems. She holds a Master’s in Linguistics from UCSC, where she conducted research investigating how humans process language in real-time. Post-grad, her interest in how we understand each other, specifically how machines do (or don’t!), led her to work on conversational assistants. As a Data Analyst, she tackles problems like conversational analysis, how to securely & representatively collect quality language data, and voice interface usability.
Maryam Tabatabaeian is a Data Scientist working with the Webex Intelligence team at Cisco Systems. Before joining Cisco, she finished her Ph.D. in Cognitive Science, doing research on how humans make decisions. Her passion for studying human behavior, along with knowledge in data science, made research on voice assistants a fascinating area to her. She now designs metrics and data models for evaluating user interactions with voice assistants.
Click here to learn more about the offerings from Webex and to sign up for a free account.